
Many companies adopt root cause failure analysis (RCFA) and then drop it. They use it for a while and get no benefit. RCFA is often ineffective when used to solve individual problems. But when it is used to find the systemic causes of problems and to improve business systems, it gives a large payback for the effort.
Keywords: root cause failure analysis, business process improvement
At an international enterprise asset management and maintenance conference in 2008, the speaker asked the 240 assembled delegates to raise their hands if they had received Root Cause Failure Analysis (RCFA) training. In the audience, 220 hands went up. The speaker then asked those people whose companies still used RCFA to leave their hands up. Every hand went down.
What is wrong with Root Cause Failure Analysis? On the evidence of the impromptu sampling at the conference, it seems that companies do not consider it worth using. Yet world leaders in industry such as DuPont Chemicals, General Electric, Toyota and other notable businesses credit part of their operating success to using RCFA. If RCFA made such an important difference to these companies, then there is nothing seriously wrong with the methodology itself. What needs to be investigated is not the method but the reasons why most companies do not get the improvements they want from it.
The Purpose of Root Cause Failure Analysis
The diagram in Figure 1 shows when and why RCFA is used. It is intended to address and solve any failure – both specific failures and systemic business failures. Figure 2 identifies the place of RCA in incident and problem management processes. Root cause analysis is applied in both proactive and reactive situations to identify and address trouble. (By the way, Root Cause Analysis (RCA) and RCFA are the same method. The ‘F’ implies equipment failure, while RCA encompasses all failures, but the methodology is identical.)
Industrial Accident Triangle and Equipment Failure Triangle
In 1931 H.W. Heinrich developed the accident triangle after analysing industrial accident data, and forever changed the world of safety management. Figure 3 is the updated safety pyramid, following work in 1969 by Frank E. Bird Jr., the Director of Engineering Services for the Insurance Company of North America [1]. The triangle tantalisingly implies a relationship between the number of incidents and the number of serious injuries. The implication is that reducing the large base of incidents will reduce the number of serious injuries.
Over the intervening years companies proactively focused on reducing the number of hazards that could lead to incidents. There was value in the approach and great reductions in industrial safety incidents occurred, but serious incidents did not fall in equal proportion. Evidence has accumulated that the accident triangle’s implication of a direct causal connection between the number of hazards and the possibility of serious accidents is not accurate. Though an association exists, it seems that incidents and serious accidents have different causes [2]. Nonetheless, useful safety improvement definitely occurs when the chance of danger is reduced, and companies continue working to prevent hazards.
Figure 4, from Winston Ledet of The Manufacturing Game, shows a failure triangle for industrial equipment. It also tantalisingly hints at a relationship between the number of defects in plant and equipment and the likelihood of a serious operational failure. The model is particularly appealing to those of us who have worked with industrial equipment maintenance, as evidence from failures seen over the years with our own eyes supports the failure triangle model. The strength of causality from defect through to serious failure is not known to the Author.
The apparent similarity between industrial safety accidents and industrial equipment failures is also enticing. In both cases a wide base of uncontrolled risk eventually leads to a disaster. Whether applied to man or machine, the message in both triangles is the same: small problems left neglected provide opportunity for trouble to arise later. The triangles provide a supporting premise for the tongue-in-cheek Murphy’s Law, “Anything that can go wrong will go wrong” [3], by recognising that there are many opportunities available for failure to initiate. The similarity between the two triangles also raises the possibility that the principles used to reduce safety accidents also apply to reducing equipment failures.
For both industrial accidents and equipment failures, RCA is used to investigate noteworthy disasters with the aim of pinpointing their cause (or causes). Once the causes and effects in the event tree are identified, changes are made to prevent loss incidents recurring. Solutions are chosen, singly or in combination, from the hierarchy of control: engineering the hazard out, segregating the event so that it cannot produce a disaster, developing improved procedures and training, and/or providing additional personal protection.
Why Companies Give Up on RCFA
The problem for RCA when investigating individual equipment failures is apparent from the failure triangle – a huge number of causes could have participated in the final loss event. Figure 5 highlights that the 20,000 defects in the failure triangle base could have arisen anywhere during the equipment life-cycle, little of which can be controlled in the operations phase of life. Finding exactly the defect(s) that started an incident, and identifying all combinations of cause-effect events progressing to the final disaster, is a task in which mistakes are easily and unwittingly made. Even if a root cause defect is removed, 19,999 defects remain to cause unending problems. It is terribly demoralising to contemplate.
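The arithmetic behind that discouragement can be sketched in a few lines of Python. The 20,000 figure is the base of the failure triangle quoted above; the function name remaining_defects is a hypothetical helper introduced only for this illustration, and the sketch assumes every defect counts equally.

```python
def remaining_defects(total_defects: int, removed: int) -> tuple[int, float]:
    """Return the defects left and the fraction of the original base they represent."""
    if not 0 <= removed <= total_defects:
        raise ValueError("removed must be between 0 and total_defects")
    left = total_defects - removed
    return left, left / total_defects


# One RCA that removes a single root-cause defect from the triangle base.
left, fraction = remaining_defects(20_000, 1)
print(f"{left} defects remain ({fraction:.3%} of the base)")
```

Seen this way, a one-defect-at-a-time approach barely changes the size of the base, which is the point the failure triangle makes.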
The vast number of possible cause-effect paths, and the near impossibility of preventing new problems being created by others throughout the life-cycle, eventually makes companies give up on RCA. They try RCA, but the amount of work required, the slow progress and the mountain of remaining problems dishearten people, and they start putting their time into finding other solutions.
RCA is time and resource hungry, and users are easily fooled by coincidence, misunderstandings and personal bias. That makes RCA a poor method for a business to use to solve individual equipment failures. What RCA does do well is quickly identify business system failures. RCA works badly for solving individual problems, but it works brilliantly for showing up black holes in business processes.
An example will help explain the dilemma of solving single problems with RCA and show its great worth in detecting business system black holes and procedural failures. Figure 6 is a drawing of the valving to a gas analyser that controlled product quality in a petrochemical facility. During project work Valve 1 was shut instead of Valve 2 and the analyser was accidentally isolated. Invalid measurements disrupted production for three hours until the problem was discovered. The RCA that followed consisted of a meeting of busy Operations personnel. They traced the cause to the person who shut the valve not knowing which valve to shut. To address the problem they distributed a ruling that “Only the Plant Operator is allowed to shut valves.” In reality nothing changed, because the existing rule was already that only plant operators were allowed to operate valves. The chance of the event repeating remained as large as ever.
Fortunately the incident was used as an exercise in an RCA training course at their site, and it was re-examined in greater detail by a mixed team of cross-functional experts. For this small problem, twenty-two (22) causes over the life-cycle were found to have played a role in the failure.

It was established that the person who shut the wrong valve was drawn into a trap set up ten years earlier by the people who installed the analyser. When the analyser was installed it should have been connected directly to the main header with dedicated piping. But the side branch with Valve 1 provided easy isolation without stopping production. Hence the quick fix was to leave Valve 1 in the line and fit a tee to the analyser with a new valve, Valve 2, installed next to Valve 1 for isolation. Ten years later Valve 1 was mistakenly shut, which brought the operation to a halt and destroyed three hours’ worth of costly production.
The RCA found 21 contributing causes of the failure and one main cause – the sample point being connected to the wrong place. No one will fix 22 causes of a problem one by one. It is impossible to do so during the operational phase of the life-cycle. Like all of us, the RCA team chose the simple option in the circumstances – tie an engraved tag to each valve explaining its purpose. They did not fix the root cause, but they probably stopped a repeat of the failure because the event path was broken by the addition of new information at a decision point. If the only outcome of this RCA was two new valve tags to protect against the wrong closure of two valves, it would be seen as wasteful effort with little benefit.
Fixing one problem is the least valuable use of RCA. You get maximum protection for the business by taking every RCA solution company-wide. If engraved information tags were fitted on all valves throughout the operation the chance of wrongly closing any valves would be greatly diminished. Now one RCA improves the entire business. One business-wide change removes dozens of future failures, maybe hundreds, and will deliver savings and improved safety for the life of the plant.
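The difference between a local fix and a company-wide fix can be made concrete with a rough sketch. Every number below is a made-up placeholder, not data from the plant in the example: the valve count, the yearly chances of a wrong closure and the helper names are all assumptions chosen only to show the shape of the comparison under a simple linear model.

```python
# All figures are hypothetical placeholders, not measurements from the article.
TOTAL_VALVES = 500      # assumed number of valves on the site
BASE_CHANCE = 0.01      # assumed yearly chance an untagged valve is wrongly closed
TAGGED_CHANCE = 0.001   # assumed yearly chance once a valve carries an engraved tag


def expected_wrong_closures(valves: int, yearly_chance: float) -> float:
    """Expected wrong closures per year for a group of valves (simple linear model)."""
    return valves * yearly_chance


def site_expectation(tagged_valves: int) -> float:
    """Expected wrong closures per year across the site for a given number of tagged valves."""
    untagged = TOTAL_VALVES - tagged_valves
    return (expected_wrong_closures(untagged, BASE_CHANCE)
            + expected_wrong_closures(tagged_valves, TAGGED_CHANCE))


no_fix = site_expectation(0)                # nothing tagged
local_fix = site_expectation(2)             # only Valve 1 and Valve 2 tagged
site_wide = site_expectation(TOTAL_VALVES)  # every valve tagged

print(f"no fix: {no_fix:.3f}, local fix: {local_fix:.3f}, site-wide: {site_wide:.3f}")
```

Whatever values are substituted, tagging two valves moves the site total only slightly, while applying the same fix to every valve scales the benefit with the size of the operation.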
You get the full power of RCA when you improve your business processes with what you learn from each single failure. If we use RCA to solve one cause-effect path we may be lucky and fix one problem for the moment. Even if successful, it still leaves all the other possible causes of the failure untouched, and so our problems continue. If instead we fix the business processes that allowed the risk to arise, we fix the problem and we reduce the possibility of similar circumstances occurring throughout the business. Now one RCA investigation improves the entire business forever. This approach to RCA is the most valuable. Use each failure to solve the systemic problems that allowed it. It will not take many RCAs before you see marked improvement in your operation’s performance. Figure 7 points you to the most effective way to use RCA. Propagate the learning by fixing the business systems that the RCA shows to be failure-causing.
We need to give RCA a new purpose if we are to use it effectively in industry. We must refocus our aim for RCA on business-wide improvement and not single problem-solving. A problem-solving focus will keep you immersed in problems forever, to the point that people give up on RCA because it does not stop problems. If instead RCA is used to fix business processes, you get rapid success for the effort because you improve your business systems. The improvements identified by each RCA will flow throughout your business and touch all parts of it. The success rapidly accumulates into higher equipment reliability and greater operating profits.
Best regards to you,
Mike Sondalini
[1] Geller, E. Scott, ‘Psychology of Safety Handbook’, Edition 2, CRC Press, 2001
[2] John Booth Davies, John Davies, Alastair Ross, Brendan Wallace, Linda Wright, ‘Safety Management’, Taylor and Francis, 2003
[3] http://en.wikipedia.org/wiki/Murphy’s_law