What Managers May Not Know About Root Cause Analysis (RCA)
Guest post by Mark Latino
If managers knew what the overall power of a well-supported Root Cause Analysis (RCA) effort meant for their bottom line, they would be breaking down doors to implement the process.
Unfortunately, this is often not the case, so this paper is an attempt to educate such individuals about the characteristics of an effective RCA methodology. The paper focuses on three aspects of RCA:
- What is RCA?
- What it takes to implement an effective RCA process as a way of conducting business rather than a finite ‘program’ that will eventually end
- How does RCA contribute to a company’s bottom-line?
RCA is a Reliability tool designed to clear obstacles that hinder continuous improvement. It’s meant to move an organization from incremental advancements (improvements) to quantum leaps forward. A well-managed RCA initiative is linked to the larger focus of overall asset reliability, which consists of equipment, process and human reliability.
Many managers feel that if the equipment runs (when the power is switched on), they have “Reliability.” But the equipment may run well only because the company is working on it more often than it should. This is not true Reliability. Excessive Preventive Maintenance (PM) is expensive and negatively affects bottom-line profits. Long-term success can only be accomplished when all three (3) areas of equipment, process and human Reliability are working in unison.
Equipment Reliability has improved primarily because of technology (not people). Equipment today is designed and manufactured to a high precision standard with the use of computers and robotics. Processes are more reliable than ever before because software has become the decision-maker that keeps processes within their range of highest quality throughput, using the least amount of energy and materials. Once again, technology is the driver that makes this possible. The bad news is that the equipment manufacturer, as well as the software developer, can sell their products to anyone they choose, including their customers’ competitors.
Technology alone provides no competitive advantage. The advantage lies in how effectively a company manages the human asset at the interface of technological advancement. Effective human asset management can’t be bought off the shelf; it must come from within the organization. The mistake many organizations make is treating the human as the least valued asset when funding cuts are initiated: staff reductions are often the first place examined for cost reduction.
Holistic RCA is an effective tool for learning about human decision errors as well as the systems in place, or not in place, that affect human behavior.
RCA needs accurate data to draw from when conducting an analysis. Reliability activities produce such data through Condition-Based Monitoring (CBM) as well as PM activities. Other data sources are needed too, but Reliability data is excellent for getting an analysis moving in the right direction.
What Is RCA?
A myth held by many managers is that all RCA methods are the same, when in fact they are NOT. The range of what managers accept as an effective RCA method is vast. Many RCA techniques place little to no emphasis on establishing all the possible ways a problem can occur, while others expand the user’s overall effectiveness with pre-built logic templates that help ensure all the possibilities of occurrence are discovered. Verification of each possibility can also range from “someone said it happened” (weak verification) to reconstruction and testing of each possibility (strong verification).
RCA methods can be shallow or robust; it depends on what management wants to accomplish. The intent of “true” Root Cause Analysis is to eliminate the possibility of recurrence. For this to take place, the methodology must start from a problem definition that is accurate and factual. Possible ways the problem can occur must be identified, and each possibility must be verified as true (did happen) or false (did not happen) using sound evidence (not hearsay). You simply follow the verified facts to uncover the root causes (intentionally plural) of the incident.
Root causes of a robust process fall into three categories:
- Physical Root Causes (Observable consequences) are the tangible failure roots, like a fatigued component, an inhaled substance, a corroded pipe, etc.
- The Human Root Causes (Decisions) are the inappropriate human interventions, such as misaligning the equipment, installing the wrong material for the service, forgetting to open a valve, etc. The Human roots are errors in decision making that led to the physical consequences that surfaced.
- The Latent Root Causes (Organizational systems influencing decisions) are the systems that drive human behavior, such as an inadequate torque wrench calibration system, a weak and rarely enforced accountability system, less-than-adequate training, etc. Latent root causes are systems that affect the human decision-making process.
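As a rough illustration of the verification logic and the three categories described above, the following Python sketch models each possibility as a node that is verified true, verified false, or not yet tested, and collects root causes only along verified-true paths. The class name `Hypothesis`, its fields, the level labels, and the sample tree are hypothetical and chosen purely for illustration; they do not represent any particular RCA tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Hypothesis:
    description: str
    level: str  # illustrative labels: "mode", "physical", "human", "latent"
    verified: Optional[bool] = None  # True = did happen, False = did not, None = untested
    children: List["Hypothesis"] = field(default_factory=list)

def verified_roots(node):
    """Return (level, description) for every cause on a verified-true path."""
    if node.verified is not True:
        return []  # false or untested branches are never followed
    found = [] if node.level == "mode" else [(node.level, node.description)]
    for child in node.children:
        found.extend(verified_roots(child))
    return found

def unverified(node):
    """List possibilities that still lack evidence."""
    pending = [node.description] if node.verified is None else []
    if node.verified is not False:  # no need to look below a disproven branch
        for child in node.children:
            pending.extend(unverified(child))
    return pending

# Hypothetical example tree (made-up incident, for illustration only)
tree = Hypothesis("pipe leak", "mode", True, [
    Hypothesis("corroded pipe", "physical", True, [
        Hypothesis("wrong material installed for the service", "human", True, [
            Hypothesis("inadequate training", "latent", True),
            Hypothesis("weak accountability system", "latent", None),
        ]),
    ]),
    Hypothesis("fatigued component", "physical", False),
])

roots = verified_roots(tree)
pending = unverified(tree)
print(roots)
print(pending)
```

In this sketch the analysis would not be considered complete while `pending` is non-empty, which mirrors the requirement that every possibility be verified as true or false with evidence.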
Shallow analysis is often used for minor incidents or meeting minimum reporting requirements for regulatory agencies. Shallow analysis comes into focus when the purpose of the RCA process is not clearly defined. If employees are conducting an RCA to meet some type of paperwork requirements, it will most likely be shallow (meeting minimum requirements only) and its quality adversely impacted due to time pressures.
Some companies have time requirements for the completion of an RCA. These time constraints trigger the use of a streamlined RCA process that often does not uncover the true root causes.
One example is the 5-Why methodology, which prompts the practitioner to ask “why did the problem occur” five (5) times, with the fifth answer considered THE root cause. This may meet some regulatory requirements, but it is not capable of eliminating the mechanism behind an undesirable event. Most companies use the 5-Why method for less serious incidents like first aid cases, near misses, minor process leaks, etc. They use a more stringent or robust method for major losses, serious injuries, environmental excursions, etc.
In most cases there will be many root causes and many paths that led to an undesirable event. Also, because an analysis should never end with the Physical root causes, it must be developed further to determine the part the human played in the event, as well as how the system(s) failed to alter the decision-making behavior. Although the 5-Why method is often accepted in manufacturing because it is simple to use and doesn’t take much time to complete, it is ineffective for solving complex problems because a problem or incident is unlikely to have only one (1) root cause.
The problem with this type of methodology is that it starts with an opinion and usually ends with an opinion. If the purpose of the RCA is to eliminate problems, the 5-Why method will not be effective. There may be other contributors that are not considered with this method, or worse, the first answer (opinion) may be incorrect, making all the following answers incorrect. When this occurs, the corrective actions implemented will not eliminate recurrence.
Other RCA methodologies are not as shallow, but they are constrictive in the sense that they require the investigator to pick from a pre-selected list of possible outcomes. These types of RCAs assume all possible outcomes have already been discovered. This forces the investigator to make a selection, even if there are other possibilities on the table that are not covered in the selection list. If we are time pressured in these situations, we just want to get to the next step and we are likely to select anything to get there (more than likely the infamous ‘other’ selection).
Many managers underestimate the amount of support needed for a successful RCA system. The paradigm of, “Send the candidates to RCA training and they will solve problems,” rarely works.
The RCA infrastructure is often not well thought out and when practitioners encounter obstacles, they are not able to complete their RCA successfully. This usually results in abandonment of the practitioner’s internal drive to execute the process correctly.
Newly trained RCA analysts often return to their respective jobs with the expectation they are capable of conducting a successful RCA when asked. It does not quite work this way.
There are common barriers encountered by newly trained analysts. The student/analyst:
- does not perform an RCA right away (it may be months later) and forgets how to perform one as it was taught.
- will get the problem definition wrong (use hypotheses as factual modes) and end up with a disconnected analysis.
- will not know whether a “trigger” has been reached until weeks later, when the event data is no longer available (problems cannot be solved without data).
- will ask to have evidence analyzed by a third party and there will be no budgeted funding available.
- will have a deadline for completing the RCA (some as short as 48 hours) and may cut corners to meet it.
- has no experience performing RCAs (the student does not know what success looks like).
- has no mentor available to review analyses and give feedback.
- is not able to implement recommendations (low priority in work order system).
- is not able to track the results of implemented recommendations.
Management can increase the success rate of RCAs by making sure the infrastructure is in place before the students receive training. The basic needs for RCA success are:
- provide performance criteria
- provide reasonable time for analysis
- process the recommendations
- remove barriers
- provide technical support
- provide skill-based training
- provide IT support
- create committed RCA teams
- provide effective leading and lagging metrics and a means to track implemented solutions
What Success Looks Like:
When a student returns from training and has become skilled through practice, they will possess the tools and ability to perform an Opportunity Analysis (OA) as well as a Root Cause Analysis.
Opportunity Analysis is a tool designed to proactively uncover the most significant items/issues (normally those 20% of the problems causing 80% of the losses). These become the best candidates for RCA because the analysis allows management to know what the payback is for each item/issue when eliminated. The items/issues are sorted from the highest annual losses to the lowest. These significant items/issues should be pursued with the RCA tool and a “whatever it takes” attitude.
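The ranking step described above can be sketched in a few lines of Python. This is a minimal illustration, assuming annual loss is estimated as occurrences per year multiplied by cost per occurrence; the function name `opportunity_analysis`, the 80% threshold parameter, and all event names and figures below are hypothetical, not data from any real facility.

```python
def opportunity_analysis(events, threshold=0.80):
    """Rank events by estimated annual loss and pick the 'significant few'.

    events: list of (name, occurrences_per_year, cost_per_occurrence).
    Returns (ranked, significant), where significant is the shortest prefix
    of the ranking whose cumulative loss reaches the threshold share.
    """
    ranked = sorted(
        ((name, freq * cost) for name, freq, cost in events),
        key=lambda item: item[1],
        reverse=True,  # highest annual loss first
    )
    total = sum(loss for _, loss in ranked)
    significant, running = [], 0.0
    for name, loss in ranked:
        if total == 0 or running / total >= threshold:
            break
        significant.append((name, loss))
        running += loss
    return ranked, significant

# Hypothetical inputs for illustration only
events = [
    ("pump seal leak", 12, 5_000),
    ("compressor trip", 3, 500_000),
    ("conveyor jam", 50, 400),
    ("valve left closed", 4, 30_000),
    ("instrument drift", 10, 1_000),
]

ranked, significant = opportunity_analysis(events)
for name, loss in ranked:
    print(f"{name}: {loss}")
print("RCA candidates:", [name for name, _ in significant])
```

With these made-up numbers a single event accounts for most of the annual loss, so it alone would be the first RCA candidate; real data will usually spread the losses across several items.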
As the problems are solved and the recommendations are implemented and tracked, the facility should see significant gains in uptime, fewer safety incidents, higher product quality, and lower operating costs.
Successful Case Study (Application, not theory):
An example of the kind of success I am talking about comes from a plant where I once worked in a Reliability role. This plant produced sulfuric acid as one of its products. A local refinery was our customer, and the company would buy the refinery’s spent acid. Once the spent acid was delivered, the site would incinerate it using two burner boilers. Sulfur dioxide (SO2) and sulfur trioxide (SO3) gases were stripped off and used to make various strengths of sulfuric acid (H2SO4).
The burner boilers were supposed to have preventive maintenance performed every three years. In recent years (the last 3 years) the burners had been experiencing failures two or three times a year. When demand was low, no one seemed too concerned about this problem. However, as demand went up and refinery upgrades produced more and more spent acid, the boilers’ unreliability was fast becoming an expensive issue.
To increase incineration efficiency the burner boiler fuel was changed from pyrites ore to molten sulfur. This change did help the efficiency, but other problems were created as a result of the change.
The boilers started to experience unexpected wall tube failures. Inside the boilers was a wall embedded with tubes that circulate water to cool the gases as they rise over the wall. Wall tube failures were responsible every time the boilers failed over the past four years. An unexpected boiler failure is costly, as it takes about two to three weeks to complete all the repairs.
The mission or goal of the RCA was to eliminate the wall tube failure mechanism.
A preliminary plan of action was adopted by the RCA team and was immediately initiated by gathering paper data such as maintenance histories, operating procedures, material specifications, interviews with operations and maintenance, and inspections of past tube failures. Anything that could be gathered and studied while the boiler was in operation was done. The data collection would continue during the next burner boiler outage to secure specific positional and parts data from the most recent outage.
It didn’t take very long before we were faced with another unexpected boiler tube failure. The plan of action was to systematically collect failure data as the wall was disassembled. This was completed and many samples were sent to the corporate laboratories for testing.
Wall tube temperature tests confirmed that a dew point existed inside a section of the wall, causing sulfuric acid to form and attack the tubes.
The solution was to move the tubes out of the wall by moving the wall eighteen inches in front of the tubes. This would still provide the cooling needed and eliminate the dew point, thereby removing the means to create acid.
There was never another wall tube failure for the remaining three years I worked in that facility.
Eliminating the failure mechanism eliminated the need for unexpected repairs as many as eight times a year, saving the company over six million dollars in unnecessary repair expense, savings that went straight to the bottom line. The uptime increased significantly and was stable, allowing more predictable production performance.
Can managers afford NOT to thoroughly understand the potential of effective RCA?
If you need help convincing leadership of the realistic ROIs that can be obtained via an effective RCA system, feel free to use this RCA ROI CALCULATOR. You put in the numbers and determine whether it’s worth the investment!
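For readers who want a feel for the arithmetic, here is a minimal Python sketch using the conventional ROI definition, (savings minus cost) divided by cost. The inputs and formula of the calculator mentioned above are not described in this article, so the function name `rca_roi`, its parameters, and the sample figures are assumptions made only for illustration.

```python
def rca_roi(annual_savings, program_cost):
    """Conventional ROI: net gain divided by cost (1.0 means a 100% return)."""
    if program_cost <= 0:
        raise ValueError("program_cost must be positive")
    return (annual_savings - program_cost) / program_cost

# Hypothetical figures: training, analyst time and lab work versus avoided losses
roi = rca_roi(annual_savings=600_000, program_cost=150_000)
print(f"ROI: {roi:.0%}")
```

A fuller estimate would also account for the time it takes to implement recommendations and for losses that recur before the fix is in place; the sketch leaves those out for simplicity.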
About the Author
Mark Latino is President of Reliability Center, Inc. (RCI). Mark came to RCI after 19 years in corporate America, during which he acquired a wealth of reliability, maintenance, and manufacturing experience. Mark worked for Weyerhaeuser Corporation in a production role during the early stages of his career. He was an active part of Allied Chemical Corporation’s (now Honeywell) Reliability Strive for Excellence initiative, which started in the 1970s to define, understand, document, and live the Reliability culture, until he left in 1986. Mark spent 10 years with Philip Morris, primarily in a production capacity and later in a Reliability engineering role. Mark is a graduate of Old Dominion University (ODU) and holds a BS Degree in Business Management that focused on Production & Operations Management.