RCFA and 5-Whys Tips for Successful Use

When you do a Root Cause Failure Analysis (RCFA) or a 5-Why there is no promise that you will find the true root cause and fix your problem. Investigating the cause of a failure is fraught with traps, such as wrong assumptions, insufficient evidence, misinterpreted evidence, misunderstanding, personal bias and second-guessing. There are issues you need to be aware of that affect the RCFA and 5-Why methods, and there are good practices you can adopt to improve your chance of a successful analysis of equipment failures.

Keywords: root cause failure analysis, 5-Why analysis

The life of a failure incident starts at some time and some place in the past. Other than by ‘Acts of God’, industrial accidents and equipment failures are not accidental; they are caused either by human-initiated events – lifeless objects do not make choices or action decisions – or by natural physics and bioscience, like corrosion and decay. Studies of safety incidents find that they happen because a series of circumstances and occurrences across time merge to culminate in the final failure [1]. There is never just one cause of a failure. It is almost a lie to call an investigation into a failure a Root Cause Failure Analysis – it is more truthful to call it a Random Causes Failure Analysis. Figure 1 points out the great difficulty of ever finding the root cause(s) of any incident.

Figure 1 – Failure Causes Can Start Anywhere

We know that we humans are imperfect. We are limited by the capabilities and capacities of our body and brain designs [2]. Our muscles tire, we need sleep, our language talents vary, and we differ in mathematical abilities, as we do in dozens of other attributes and skills. A downside of our humanness is that we make errors (the many upsides include our amazing creativity and innovation). We can make mistakes at any time. Figure 2 [3] lists typical human error rates across a range of activities. It shows how often our frailties start failures and disasters; it tells an interesting story of what it means to be human. Human error is unavoidable; it is impossible to stop. But that does not mean it must lead to failure.

Figure 2 – Human Error Varies According to the Task Complexity and Situational Stress

Note the list of task types in the table under the ‘Complicated, non-routine task’ heading. That is where most engineering and maintenance work activities sit; they are complicated technical tasks not done often. Their human error rates are massive – at least one error in every ten opportunities to make an error – and they get worse when stress is added. Human error is the single biggest reason that companies have poor plant and equipment reliability [4]. Your plant and equipment are fine; they are failed by poor business processes that allow humans to break them. Machines fail because company managers do not foresee the effects of human error and human factors and do not protect the company from our inbuilt limitations, thus ensuring that failure and disaster will eventually occur.

We make matters far worse by designing our machines and business processes to be easily failed by human error. We build them as series configurations of parts and tasks, and consequently introduce the problem shown in Figure 3 countless times in our machines and across our companies. Fortunately, the human error rate table also tells us exactly what to do. Note how the sigma quality improves as a task becomes simpler and the work is less complicated. You reduce human error by making a job’s design simple (then simpler), by removing complication, by removing uncertainty, by directing decisions, and by removing causes of physical and mental stress. Everything that you can do to reduce human factor problems will let people do better quality work.

Figure 3 – The Danger of Series Arrangement Designs

As the number of parts in a machine increases, so does the chance of failure, because the series arrangements grow longer and more parts become available to fail – there are more things to go wrong. Similarly, when business processes have many tasks you provide many opportunities for failure to occur from human error. You will have a constant stream of disasters arriving simply because the probability of failure from countless opportunities is so heavily weighted against you. These never-ending problems eventually burn people out, all because of the stress and fatigue caused by poorly designed series processes throughout our companies and machinery.
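The effect of a long series arrangement can be illustrated with a few lines of arithmetic. The sketch below is not taken from Figure 3; it assumes that every step succeeds or fails independently of the others, and it uses an assumed 0.9 chance of doing each task correctly (one error in ten opportunities, the rate quoted above for complicated, non-routine tasks). The function name `series_reliability` is made up for this illustration.

```python
from math import prod


def series_reliability(step_reliabilities):
    """Chance that every step in a series arrangement succeeds.

    Assumes the steps are independent, so the chances simply multiply.
    """
    return prod(step_reliabilities)


# Assumed figure for illustration only: each task is done right 9 times in 10.
TASK_RELIABILITY = 0.9

for n_tasks in (1, 5, 10, 20):
    chance = series_reliability([TASK_RELIABILITY] * n_tasks)
    print(f"{n_tasks:2d} tasks in series: chance of no error = {chance:.3f}")
```

Under these assumptions, ten such tasks in a row leave only about a 35% chance that the whole job is done without error, which is why shortening the series or making each task simpler has such a large effect.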

When failures happen, as they inevitably must if people are involved, it is difficult to identify the true cause(s) because many contributing errors will have occurred across the life-cycle of the failed item. In Figure 4 the pump-set fault tree shows that a centrifugal pump can be failed by 553 possible causes. If you did an RCFA on a pump-set breakdown you would have to consider which of the 553 causes occurred to the pump under investigation. Most businesses could never provide the time necessary to conduct that RCFA. Instead, to shorten the RCFA, we seek the obvious causes and factors and discard those events considered impossible or too remote. This means that, because of process complexity, many RCFAs inevitably come up with the wrong cause and fix the wrong issue, even though we may be convinced that we have found the problem.

Figure 4 – What Caused the Pump Set Failure if there are 553 Ways to Fail a Pump Set?

Use a Consistent and Comprehensive RCFA Process

We can reduce the number of failed RCFAs if we have a robust RCFA process that every investigative team religiously follows and if we have irrefutable evidence from the failure incident. Figure 5 makes the point that it is the evidence from failed parts that makes clear which of the many possible and diverging paths to the equipment failure caused the incident. If there is no indisputable evidence from a failure incident, then stop the RCFA immediately. Don’t let people waste their time debating opinions that can never be proven and possibly go on to cause pointless grief to others.

Every company that uses RCFA needs a documented process for how its teams run RCFAs. The procedure will detail how evidence is collected and protected, the team members’ selection process, the responsibilities of the facilitator, and the investigative tools and analysis methods to use, with examples of best-practice usage. It will provide pro-forma documents, forms and agendas. It will contain criteria to track and monitor the progress of the RCFA, and it will clearly indicate what expenditures the team is allowed in its efforts to find the truth, along with guidance on other issues affecting the success of the RCFA.

Figure 5 – Only Indisputable Evidence is Acceptable in an RCFA

Use well-respected investigative and analysis methods when doing an RCFA. There are many Total Quality Control and Six Sigma techniques that can be applied to analyse events and historic data. Figure 6 indicates some of the common ones that are easy to use.

Most importantly, the RCFA must force the team to look far wider for contributing causes than human behaviour normally encourages. We all make assumptions based on what we think we know and believe what our limited human senses ‘tell’ us. This is an important reason why a documented RCFA procedure must be followed – to ensure the team does not fall into the trap of taking a blinkered view from the start. The serial nature of our machinery and business process designs means there will be numerous life-cycle factors to consider, some stretching back to conception.

Tools to expand perspectives and de-blinker RCFA team members’ minds include flow charting the intended design and its behaviour, like that shown in Figure 7 for an overflowing tank, and using fishbone diagrams to identify possible influences from key factors such as measurement, method, machinery, people, materials and environment. The team must apply these tools at the start if a robust and comprehensive investigation is to have any chance of occurring.

When the evidence from the plant and equipment is confusing, or the failure mechanisms involved are poorly understood, it may prove beneficial to conduct a Failure Mode and Effects Analysis (FMEA) on the individual parts involved in or affected by the failure, to deeply understand the underlying Physics of Failure effects and consequences (i.e. the forces, loads and stresses acting on parts and their effects). Questions about the physical and scientific mechanisms involved with the failure will naturally arise during the FMEA. These questions can then be answered using the evidence available coupled with sound engineering reasoning and materials testing.

Figure 6 – Contents and Coverage of the RCFA Process

Figure 7 – Start with a Flow Chart of the Failed Process Design to See Risks and Complexity

Figure 8 – Cause-and-effect Diagram Construction with Failure-Sequence Phases

Start from Certain Facts when Building a Cause and Effect Tree

RCFA has the crazy intention of identifying all possible failure paths and then, by using the evidence from the incident, pinpointing the path that caused the failure. The complexity of business processes and unidentifiable influences across life-cycles makes this a difficult requirement to meet on even simple failures and virtually impossible on disasters. Imagine trying to identify all 553 ways the pump set in Figure 4 could fail. It would be a huge amount of work that people could never do well. Then you would need solid evidence at every step in the cause-effect tree to isolate the true failure cause(s) out of the 553 possibilities.

Knowing that the design of our machines and businesses easily leads the RCFA investigation astray, the cause-effect diagram that the team constructs needs to have a structure that ‘forces’ them to work from known, indisputable evidence back to what may have occurred at the root(s) of the incident.

Figure 8 recommends that the first phase of an RCFA or 5-Why consider only scientific facts from the evidence to start the cause-effect tree. For example, in Figure 11, the cause-effect tree for the roof collapse from vehicle impact shown in Figure 10 starts from the scientific explanation – the roof fell because cement between the column and foundation sheared, not because the trailer hit the roof. A team may never get to the real root cause, but starting with the scientific causes-and-effects means the RCFA can always come up with solutions to stop or lessen the consequences of a failure. In this case the use of brick columns with cement joints meant there was no resistance to the tilting caused by the roof moving under the impact. Knowing that, the team can at least propose better choices of construction materials and structural designs that will be more robust in such situations.

Figure 9 – Proving the Actual Failure-Sequence of an Event

Figure 10 – The Roof Collapsed because the Columns Fell, Not because the Trailer Hit the Roof

Figure 11 – Start with the Scientific Sequence of Events

If an indisputable scientific explanation cannot be found the RCFA team should consider stopping, because they have only speculation and opinion to work with, which is likely to send the investigation astray and never find the whole truth. Once indisputable physics explains the science of a failure we then try to identify the sequence of physical actions that created the opportunity for failure. Sure evidence is necessary to confirm our suppositions. The next phase of the fault tree is to find which business systems failed to stop the cascading events. Lastly, we come to latency, which is the inner beliefs, values and norms of the people and organisations involved across the life-cycle of the incident. You may need to go back decades to understand the views and attitudes of people and company culture.

The actual failure path(s) needs to be proven true. That is only possible if there is unquestionable evidence for each cause-effect step, which becomes less likely to exist as the fault tree ‘grows’ towards its roots. The ‘incident actions’ and ‘latent causes’ phases, where people need to tell the absolute truth about themselves and others, are often short of tangible proof.
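The rule that a failure path is proven only as far as the evidence reaches can be sketched as a small tree structure. This is an illustration under assumptions, not the layout of Figures 8 or 11: the class `Cause`, the function `proven_paths`, the phase labels and the evidence strings are all hypothetical, and the unproven "driver misjudged the clearance" step is invented purely to show where a chain stops.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cause:
    description: str
    phase: str                 # e.g. "science", "incident action", "latent"
    evidence: str = ""         # empty means no indisputable evidence yet
    causes: list[Cause] = field(default_factory=list)


def proven_paths(node, path=()):
    """Yield each chain, starting at the failure, in which every step has evidence.

    A chain ends at the last step that is backed by evidence; anything
    beyond that point is treated as speculation and left out.
    """
    if not node.evidence:
        return
    path = path + (node.description,)
    proven_children = [c for c in node.causes if c.evidence]
    if not proven_children:
        yield path
    for child in proven_children:
        yield from proven_paths(child, path)


# Hypothetical tree loosely based on the roof collapse example.
roof = Cause("Roof collapsed", "science", "collapsed roof (illustrative)", [
    Cause("Cement joint at column base sheared", "science",
          "sheared joint (illustrative)", [
        Cause("Roof moved when the trailer hit it", "incident action",
              "impact marks (illustrative)", [
            # Invented step with no evidence: the chain stops before it.
            Cause("Driver misjudged the clearance", "incident action"),
        ]),
    ]),
])

for chain in proven_paths(roof):
    print(" <- ".join(chain))
```

In this made-up tree the proven chain stops at the roof moving under impact, which matches the point above: tangible proof tends to run out as the tree grows towards the actions and latent causes.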

Using 5-Why Methodology Rightly

The 5-Why methodology is well structured for confirming a failure path once a cause-and-effect tree is drawn. It is a poor method for identifying the cause-and-effect tree. It is doubtful that simply by asking ‘why’ five times you can find the root cause of an incident with a high degree of certainty. ‘5-Why’ is just a tag to name the method; it may take three, seven, or ten ‘whys’ to get to what may be a speculative root. Just because you can answer a ‘why’ question does not prove the answer is right. This is the great trap with using 5-Why; people think they will unearth the full truth with the methodology. As soon as a fault tree splits into contributing causes the 5-Why method fails as a robust, stand-alone analysis tool. But when used to confirm the failure path from the presence of real evidence, as shown in Figure 9, the method is universally useful.

If 5-Why is used, you need to include a means to test each cause-and-effect step and prove the answer to the ‘why’ question with facts. This is the purpose of the 3W2H set of additional questions – With what, When, Where, How, and How much – that need to be used in combination with the 5-Why method.
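One way to picture the 3W2H test is as a record kept for each ‘why’ step, with a check for questions that still lack a factual answer. The sketch below is an assumption-laden illustration, not the form in Figure 14: the names `WhyStep`, `unanswered` and `first_unproven` are made up, the sample entries are placeholders, and treating a step as supported only when all five questions are answered is a stricter rule than the text itself states.

```python
from dataclasses import dataclass


@dataclass
class WhyStep:
    why: str             # the 'why' question asked
    answer: str          # the proposed cause
    with_what: str = ""  # the five 3W2H answers; empty means not yet established
    when: str = ""
    where: str = ""
    how: str = ""
    how_much: str = ""

    def unanswered(self):
        """Names of the 3W2H questions that still lack a factual answer."""
        checks = {
            "With what": self.with_what,
            "When": self.when,
            "Where": self.where,
            "How": self.how,
            "How much": self.how_much,
        }
        return [name for name, value in checks.items() if not value.strip()]


def first_unproven(steps):
    """Index of the first step with 3W2H gaps, or None if every step is supported."""
    for index, step in enumerate(steps):
        if step.unanswered():
            return index
    return None


# Placeholder entries for illustration only; not the content of any real form.
steps = [
    WhyStep("Why was the delivery late?", "placeholder answer A",
            with_what="record 1", when="date 1", where="place 1",
            how="mechanism 1", how_much="amount 1"),
    WhyStep("Why did A happen?", "placeholder answer B", when="date 2"),
]

gap = first_unproven(steps)
if gap is not None:
    print(f"Step {gap + 1} still needs: {', '.join(steps[gap].unanswered())}")
```

A chain checked this way is only confirmed up to the step before the first gap; later answers remain suppositions until the missing facts are found.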

Figure 12 – Why-Tree of a Despatch Process Failure

Figure 13 – Seeking Understanding of Incident Latency Drivers

Figures 12 and 13 show a simple cause-and-effect tree that runs from the physical evidence to the latent causes of an incident.

Figure 14 – A 5-Why Record Form Must Show Sure Cause-Effect Evidence

Figure 14 uses a 5-Why Table to confirm the failure path with factual evidence. The failure was a late delivery to a client who invoked a $25,000 penalty clause. The RCFA team was charged with understanding what happened and why, and with preventing the problem in future. 5-Why was used to confirm the fault tree, not to develop it.

RCFA Does Not Solve Problems

Companies expect RCFA to solve their problems, but that is an impossible expectation. The output of every RCFA or 5-Why is a report. They only produce paper. They do not solve or stop the actual failure. Future failures can only be stopped or lessened by implementing the changes recommended by the RCFA or 5-Why. You must take the ideas from the investigation and do them in the real world. The written recommendations start the improvement process, but to cause them to happen they need a separate project that the organisation funds and implements.

The function of RCFA and 5-Why is to come up with answers; it does not include implementing them. RCFA stops once the report is presented. After the report is delivered, other business processes must take the recommendations to completion. Otherwise, there will be plenty of RCFA reports produced by teams, but nothing will change to improve the organisation. Doing the RCFA is the easy 20% of improving a business process. The hard yards come after the report.

Figure 15 – Implement RCFA Outcomes using Change Management and Project Methodology

The process that a company uses to implement RCFA recommendations needs to be identified in the RCFA Procedure document so everyone knows what will happen to the RCFA output. The RCFA recommendations need to be taken into a project management and change management process that covers the requirements shown in Figure 15.

RCFA and 5-Why methodology can help improve organisations if people care to know the truth and then act appropriately to resolve the ‘human element’ issues and remove the ‘black-holes’ in their business processes that draw their people into certain failure.

Mike Sondalini

[1] Hopkins, Andrew, ‘Safety, Culture and Risk – the organisational causes of disasters’, Foreword by James Reason, CCH Australia, 2005

[2] Gladwell, Malcolm, ‘Blink, the power of thinking without thinking’, Back Bay Books, 2005

[3] Smith, David J., ‘Reliability, Maintainability and Risk’, Appendix 6, Seventh Edition, Elsevier – Butterworth Heinemann

[4] Barringer, H. Paul, P.E., ‘Use Crow-AMSAA Reliability Growth Plots To Forecast Future System Failures’, Barringer and Associates, Humble, TX, USA, www.barringer1.com
