If managers knew what a well-supported Root Cause Analysis (RCA) effort could mean for their bottom-line, they would be breaking down doors to implement the process.
Unfortunately, this is often not the case, so this paper is an attempt to educate such individuals about the characteristics of an effective RCA methodology. The paper focuses on the three aspects of RCA we believe leadership teams need to understand the most.
Table of Contents
Chapter 1: Communicate RCA Effectively
Chapter 2: Make RCA a Continuous Way of Conducting Business
Chapter 3: Show RCA’s Impact to the Bottom-Line
Chapter 1: Communicate RCA Effectively
RCA is a Reliability tool designed to clear obstacles that hinder continuous improvement. It’s meant to move an organization from incremental advancements (improvements) to quantum leaps forward. A well-managed RCA initiative is linked to the larger focus of overall asset reliability. Overall asset reliability comprises equipment, process and human reliability.
Many managers feel that if the equipment runs (when the power is switched on), they have “Reliability.” However, the equipment may run well only because the company is working on it more often than it should. This is not true Reliability. Excessive Preventive Maintenance (PM) is expensive and negatively affects bottom-line profits. Long-term success can only be accomplished when all three (3) areas of equipment, process and human Reliability are working in unison.
Equipment Reliability has improved primarily due to technology (not people). Equipment today is designed and manufactured to a high precision standard with the use of computers and robotics. Processes are more reliable than ever before because software has become the decision-maker that keeps processes within their range of highest quality throughput, using the least amount of energy and materials. Once again, technology is the driver that makes this possible. The bad news is that the equipment manufacturer, as well as the software developer, can sell their products to anyone they choose, including their customers’ competitors.
There is no competitive advantage with technology alone. Competitive advantage lies in how effectively a company manages the human asset at the interface of technological advancement. Effective human asset management can’t be bought off the shelf; it must come from within the organization. The mistake many organizations make is treating the human as the least valued asset when funding cuts are initiated. Staff reductions are often the first place examined for cost reduction.
Holistic RCA is an effective tool for learning about human decision errors as well as the systems in place, or not in place, that affect human behavior.
RCA needs accurate data to draw from when conducting an analysis. Reliability activities produce such data through Condition-Based Monitoring (CBM) as well as PM activities. Other data sources are needed too, but Reliability data is excellent for getting an analysis moving in the right direction.
A myth held by many managers is that RCA methods are all the same, when in fact they are NOT. The range of what managers believe to be an effective RCA method is vast. Many RCA techniques place little to no emphasis on establishing all the possible ways a problem can occur, while others expand the user’s overall effectiveness by having pre-built logic templates to ensure all the possibilities of occurrence are discovered. Verification of each possibility can also range from “someone said it happened” (weak verification) to reconstruction and testing of each possibility (strong verification).
RCA methods can be shallow or they can be robust; it depends on what management wants to accomplish. The intent of “true” Root Cause Analysis is to eliminate the possibility of recurrence. For this to take place, the methodology used must have a problem definition that is accurate and factual. Possible ways the problem can occur must be identified, and each possibility must be verified as true (did happen) or false (did not happen) using sound evidence (not hearsay). You simply follow the verified facts to uncover the root causes (intentionally plural) of the incident.
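The idea of following only verified facts can be illustrated with a minimal Python sketch. It is not a definitive implementation of any particular RCA method: the class names, the three evidence grades, and the example event are all assumptions made for illustration. Each possibility is recorded as true or false together with the strength of its evidence, and only chains supported by something stronger than hearsay are followed.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Evidence(Enum):
    # Illustrative evidence grades (assumed, not from any standard).
    HEARSAY = 1   # "someone said it happened" (weak verification)
    RECORDS = 2   # maintenance histories, procedures, logs
    TESTED = 3    # reconstruction and testing (strong verification)


@dataclass
class Hypothesis:
    description: str
    evidence: Optional[Evidence] = None
    verified_true: Optional[bool] = None  # None until verified
    children: List["Hypothesis"] = field(default_factory=list)


def verified_paths(node: Hypothesis,
                   min_evidence: Evidence = Evidence.RECORDS,
                   path: Tuple[str, ...] = ()):
    """Yield each chain of possibilities verified true with sound evidence."""
    sound = (node.verified_true is True
             and node.evidence is not None
             and node.evidence.value >= min_evidence.value)
    if not sound:
        return  # unverified, disproven, or hearsay only: do not follow
    path = path + (node.description,)
    found_deeper = False
    for child in node.children:
        for deeper in verified_paths(child, min_evidence, path):
            found_deeper = True
            yield deeper
    if not found_deeper:
        yield path  # deepest verified point on this branch


# Hypothetical event, used only to exercise the sketch.
root = Hypothesis("Bearing failed", Evidence.TESTED, True, [
    Hypothesis("Fatigued component", Evidence.TESTED, True, [
        Hypothesis("Equipment misaligned", Evidence.RECORDS, True, [
            Hypothesis("Training less than adequate", Evidence.RECORDS, True),
            Hypothesis("Accountability rarely enforced", Evidence.HEARSAY, True),
        ]),
    ]),
    Hypothesis("Corroded component", Evidence.TESTED, False),
])

paths = list(verified_paths(root))
for p in paths:
    print(" -> ".join(p))
```

In this made-up example the corroded-component branch is dropped because it was verified false, and the accountability branch is dropped because it rests on hearsay alone. It would return to the tree only after stronger evidence was gathered.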
Root Causes of a Robust Process Fall into 3 Categories:
1. Physical Root Causes
These are Observable Consequences with tangible failure roots, like a fatigued component, an inhaled substance, a corroded pipe, etc.
2. Human Root Causes
These are Human Decisions and are often considered the inappropriate human intervention(s), such as misaligning the equipment, installing the wrong material for the service, forgetting to open a valve, etc. The Human roots are errors in decision-making that led to the physical consequences that surfaced.
3. Latent Root Causes
These are Organizational Systems that drive human behavior, such as an inadequate torque wrench calibration system, a weak and rarely enforced accountability system, less than adequate training, etc. Latent root causes are systems that affect the human decision-making process.
Shallow Cause Vs Root Cause Analysis:
Shallow Cause Analysis is often used for minor incidents or meeting minimum reporting requirements for regulatory agencies. Shallow analysis comes into focus when the purpose of the RCA process is not clearly defined. If employees are conducting an RCA to meet some type of paperwork requirements, it will most likely be shallow (meeting minimum requirements only) and its quality adversely impacted due to time pressures.
Some companies have time requirements for the completion of an RCA. These time constraints trigger the use of a streamlined RCA process that often does not uncover the true root causes.
One example is the 5-Why methodology, which prompts the practitioner to ask “why did the problem occur” five (5) times, with the fifth answer being considered THE root cause. This may meet some regulatory requirements, but it will not be capable of eliminating an undesirable event mechanism. Most companies use the 5-Why method for less serious incidents like first aid, near miss, minor process leak, etc. They use a more stringent or robust method for major losses, serious injury, environmental excursions, etc.
In most cases there will be many root causes and many paths that led to an undesirable event. Also, because an analysis should never end with the Physical root causes, it must be developed further to determine the part the human played in the event, as well as how the system(s) failed to alter the decision-making behavior. Although the 5-Why method is often accepted in manufacturing because it is simple to use and doesn’t take much time to complete, it is ineffective for solving complex problems because a problem or incident is unlikely to have only one (1) root cause.
The problem with this type of methodology is that it starts with an opinion and usually ends with an opinion. If the purpose of the RCA is to eliminate problems, the 5-Why method will not be effective. There may be other contributors that are not considered with this method, or worse, the first answer (opinion) is incorrect, making all the following answers incorrect. When this occurs, the corrective actions implemented will not eliminate recurrence.
Other RCA methodologies are not as shallow, but they are constrictive in the sense that they require the investigator to pick from a pre-selected list of possible outcomes. These types of RCAs assume all possible outcomes have already been discovered. This forces the investigator to make a selection, even if there are other possibilities on the table that are not covered in the selection list. If we are pressed for time in these situations, we just want to get to the next step, and we are likely to select anything to get there (more than likely the infamous ‘other’ selection).
Chapter 2: Make RCA a Continuous Way of Conducting Business
Many managers underestimate the amount of support needed for a successful RCA system. The paradigm of “Send the candidates to RCA training and they will solve problems” rarely works.
The RCA infrastructure is often not well thought out, and when practitioners encounter obstacles, they are not able to complete their RCA successfully. This usually results in the practitioner abandoning the internal drive to execute the process correctly.
Newly trained RCA analysts often return to their respective jobs with the expectation they are capable of conducting a successful RCA when asked. It does not quite work this way.
There are common barriers encountered by newly trained analysts.
The Student/Analyst Often:
● Does not perform an RCA right away (it may be months later) and forgets how to perform an RCA as it was taught.
● Will get the problem definition wrong (use hypotheses as factual modes) and end up with a disconnected analysis.
● Will not know whether a “trigger” has been reached until weeks later, when the event data is no longer available (cannot solve problems without data).
● Will ask to have evidence analyzed by a third party and there will be no budgeted funding available.
● Will have a deadline for completing the RCA (some as short as 48 hours) and may cut corners to meet it.
● Has no experience performing RCAs (student does not know what success looks like).
● Has no mentor available to review analyses and give feedback.
● Is not able to implement recommendations (low priority in work order system).
● Is not able to track the results of implemented recommendations.
Management can increase the success rate of RCAs by making sure the infrastructure is in place before the students receive training.
The Basic Needs for RCA Success Are:
● Provide performance criteria
● Provide reasonable time for analysis
● Process the recommendations in a timely manner
● Remove barriers
● Provide technical support
● Provide skill-based training
● Provide IT support
● Create committed RCA teams
● Provide effective leading and lagging metrics and a means to track implemented solutions
What Success Looks Like:
When a student returns from training and has become skilled through practice, they will possess the tools and ability to perform an Opportunity Analysis (OA) as well as a Root Cause Analysis.
Opportunity Analysis is a tool designed to proactively uncover the most significant items/issues (normally those 20% of the problems causing 80% of the losses). These become the best candidates for RCA because the analysis allows management to know what the payback is for each item/issue when eliminated. The items/issues are sorted from the highest annual losses to the lowest. These significant items/issues should be attacked with the RCA tool and a “whatever it takes” attitude.
As the problems are solved and the recommendations are implemented and tracked, the facility should see significant gains in uptime, fewer safety incidents, higher-quality product, and lower operating costs.
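The ranking step of an Opportunity Analysis can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a prescribed OA procedure: the function name, the event names, and every frequency and cost figure below are invented for illustration. Annual loss is assumed here to be occurrences per year multiplied by cost per occurrence.

```python
def opportunity_analysis(events, loss_share=0.80):
    """Rank events by annual loss and pick the significant few.

    events: iterable of (name, occurrences_per_year, loss_per_occurrence).
    Returns (ranked, significant), where ranked is every event sorted
    from highest to lowest annual loss and significant is the shortest
    leading run of events that covers at least loss_share of total loss.
    """
    ranked = sorted(
        ((name, freq * cost) for name, freq, cost in events),
        key=lambda item: item[1],
        reverse=True,
    )
    total = sum(loss for _, loss in ranked)
    significant, running = [], 0.0
    for name, loss in ranked:
        if running >= loss_share * total:
            break
        significant.append((name, loss))
        running += loss
    return ranked, significant


# Hypothetical data: (event, occurrences per year, loss per occurrence).
events = [
    ("Pump seal leak", 12, 5_000),
    ("Compressor trip", 4, 200_000),
    ("Instrument drift", 20, 1_500),
    ("Heat exchanger fouling", 2, 150_000),
    ("Valve packing leak", 8, 2_500),
    ("Conveyor jam", 50, 400),
]

ranked, significant = opportunity_analysis(events)
for name, loss in significant:
    print(f"{name}: {loss:,} per year")
```

With these made-up numbers, two of the six events account for more than 80% of the annual losses, so they would be the first candidates for RCA. Note that the most frequent event (the conveyor jam) ranks near the bottom once cost per occurrence is taken into account.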
Chapter 3: Show RCA’s Impact to the Bottom-Line
Successful Case Study (Application, not theory):
An example of the kind of success I am talking about comes from a plant where I once worked in a Reliability role. This plant produced sulfuric acid as one of its products. A local refinery was our customer, and the company would buy the refinery’s spent acid. Once the spent acid was delivered, the site would incinerate it using two burner boilers. Sulfur dioxide (SO2) and sulfur trioxide (SO3) gases were stripped off and used to make various strengths of sulfuric acid (H2SO4).
The burner boilers were supposed to have preventative maintenance performed every three years. In recent years (the last 3 years) the burners had been experiencing failures two and three times a year. When demand was low, no one seemed too concerned about this problem. However, as demand went up and refinery upgrades were producing more and more spent acid, the boilers’ unreliability was fast becoming an expensive issue.
To increase incineration efficiency the burner boiler fuel was changed from pyrites ore to molten sulfur. This change did help the efficiency, but other problems were created as a result of the change.
The boilers started to experience unexpected wall tube failures. In the boilers there was a wall embedded with tubes that circulated water to cool the gases as they rose over the wall. Wall tube failures had been responsible every time the boilers failed over the past four years. An unexpected boiler failure is costly, as it takes about two to three weeks to complete all the repairs.
The Goal of the RCA was to Eliminate the Wall Tube Failure Mechanism.
A preliminary plan of action was adopted by the RCA team and initiated immediately by gathering paper data such as maintenance histories, operating procedures, material specifications, interviews with operations and maintenance, and inspections of past tube failures. Anything that could be gathered and studied while the boiler was in operation was collected. The data collection would continue during the next burner boiler outage to secure specific positional and parts data from that outage.
It didn’t take very long before we were faced with another unexpected boiler tube failure. The plan of action was to systematically collect failure data as the wall was disassembled. This was completed and many samples were sent to the corporate laboratories for testing.
Wall tube temperature tests confirmed that a dew point existed inside a section of the wall, causing sulfuric acid to form and attack the tubes.
The solution was to move the tubes out of the wall by moving the wall eighteen inches in front of the tubes. This would still provide the cooling needed and eliminate the dew point, thereby removing the means to create acid.
There was Never Another Wall Tube Failure for the Remaining Three Years I Worked in that Facility.
The elimination of the failure mechanism eliminated the need for unexpected repairs as many as eight times a year, saving the company over six million dollars of unnecessary repair expense that went straight to the bottom-line. The uptime increased significantly and was stable, allowing more predictable production performance.
Can Managers Afford NOT to Thoroughly Understand the Potential of Effective RCA?
When management considers whether to purchase Root Cause Analysis (RCA) during these times, we can immediately see this would normally be viewed as not “necessary” by those who typically review the budget for cost cutting purposes. This is often because the reviewer likely does not understand what RCA is, and as a result, they group it into a commodity category as either ‘Training’ and/or ‘Software’. Once viewed as an unnecessary commodity, it dies in the review process.
There is no rocket science behind the calculation that determines margin: revenue – expenses = profit/(loss). In good times when we can sell whatever we offer, we control margin on the revenue side by finding ways to produce more product with the same fixed assets. In bad times when we cannot sell what we are capable of making, we control margin by cutting expenses.
However, unexpected expenses arise from unexpected failures. In many budgets we account for these unexpected failures, to a certain degree, by embedding them in ambiguous categories on the financials under ‘General’, ‘Routine’ and ‘Other’. These are essentially slush funds to cover unexpected occurrences.
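The margin arithmetic above can be made concrete by pulling unexpected failure costs out of the general expense line. The following Python sketch uses entirely hypothetical figures and an assumed function name. It only illustrates that, with revenue and planned expenses held constant, every dollar of unexpected failure cost eliminated shows up as a dollar of profit.

```python
def profit(revenue, planned_expenses, unexpected_failure_costs=0):
    """Margin as revenue minus expenses, with failure costs split out."""
    return revenue - (planned_expenses + unexpected_failure_costs)


# Hypothetical annual figures, for illustration only.
revenue = 10_000_000
planned_expenses = 7_000_000

baseline = profit(revenue, planned_expenses, unexpected_failure_costs=1_000_000)
after_rca = profit(revenue, planned_expenses, unexpected_failure_costs=200_000)
gain = after_rca - baseline

print(f"Baseline profit:  {baseline:,}")
print(f"Profit after RCA: {after_rca:,}")
print(f"Bottom-line gain: {gain:,}")
```

In this made-up case the gain equals the 800,000 reduction in unexpected failure costs, without selling one additional unit of product.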
As we mentioned earlier, staff reductions are often the first cost-cutting measure, and those who remain after we reduce head count are more prone to commit human error. This leads to increased unexpected failures, which in turn adds unexpected costs to the short-term financials.
More sophisticated and complex working environments understand that RCA (when used properly) is not a commodity at all, but rather a necessity to fight this cycle. This is a basic principle of Reliability Engineering. A good RCA system (not just a task) will provide an organization the methodology and tools to proactively identify the Significant Few failures.
These are the 20% or Less of the Failure Events Costing an Organization 80% or More of their Losses.
If we view RCA as an expense, we will not identify these opportunities, and they shall remain buried in the rubble of the financial reports. Opportunities remain spread across numerous financial categories, and unless situations are corrected, those departments will continue to absorb these costs masked under categories such as “routine” (unexpected costs).
If we view RCA as an investment, we will use the creativity and innovation of our employees to root out these opportunities, properly analyze them and prevent them from happening again. Overworked and understaffed organizations will be relieved of the burden of having to address unnecessary failures. Perhaps the most important realization would be the true impact to the bottom-line financials due to the drastic reduction of unexpected failures. The initial cost of any RCA training or software investment is but a fraction of these significant cost reductions.