When facilitating a Root Cause Analysis (RCA), the questioning process will make or break the effectiveness of the entire analysis. When we hear of the 5-Whys as a valid RCA approach, is simply asking ‘Why?’ five times good enough... or IS IT JUST OK?
Think about it this way: if I asked you ‘How could’ the crime have occurred versus ‘Why’ the crime occurred, would your answers be different?
I am going to take a very basic (101) case study and format it using a logic tree (graphical expression of cause-and-effect logic). As we are guided through this mental process we will discuss the differences between asking ‘How Can?’ and ‘Why?’.
Let’s say we have an Unexpected Shutdown (Event) of some process line in our facility. We know from our inspection of the failure scene and disciplined data collection efforts that a bearing failed in a critical pump (Mode). We have the physical failed bearing, so we know this to be a fact. What we don’t know at this point is how the bearing failed. So this begins our exploratory process. As Figure 1 shows, we have had seals and shafts fail on this same pump in the past (as our maintenance histories would show), but in this occurrence it is the bearings we are interested in exploring.
Think of this Top Box as the equivalent of the ‘crime scene’. Picture the yellow crime scene tape around these blocks, and that everything inside it is a fact because it is visible and it exists. Now we start to explore how these facts came to be.
Keep in mind that because we are dealing with visible evidence, we are dealing with the available physics of this particular bad outcome. So our question would become:
How could we have had a Critical Pump Bearing Failure resulting in an Unexpected Shutdown?
Before you start generating possibilities (hypotheses) in your head, I want you to consider what would happen if, when exploring the physics of a failure, I changed the question to only ‘Why?’ If I were to ask you ‘Why did the critical pump bearing fail resulting in an unexpected shutdown?’, would your answers be different?
The likely response is YES, because asking ‘Why’ when exploring the physics of a failure presupposes that we know the singular answer. I say singular because, as expressed in the very commonly used, traditional 5-Whys approach, the use of ‘Why’ evokes a singular and thus linear response. Unfortunately, most failures do not occur linearly. Most undesirable outcomes have parallel paths to failure that converge on any given day to produce the outcomes we see. From a technical standpoint, a linear 5-Whys tree cannot capture those parallel paths to failure.
So bear with me as we continue our logic tree, asking the ‘How Could?’ question during the exploration of the physics of the failure. Keep in mind that a Logic Tree is just a graphical reconstruction of the logic that led to a bad outcome in this case. During this reconstruction, the RCA team is simply moving backwards in time and coming up with viable hypotheses that answer the ‘How Can’ questions. Using this tool is a means of getting your RCA SMEs literally on the same page and agreeing on the logic that is supported by preserved evidence.
So in this case when I ask ‘How can a bearing fail?’ what would be our potential answers (hypotheses)? I’ve been doing this work for 34 years now and I can tell you the standard things I hear:
- Too much lubrication
- Too little lubrication
- Contaminated lubrication
- Installed wrong
- Wrong bearing
- Manufacturer’s defect
- and many more possibilities!
However, these answers jump to conclusions way too early. By using the logic tree, we are working backwards in short increments of time. Picture yourself as the bearing (as in the movie Caddyshack, where Chevy Chase’s character advises his young caddie to ‘be the ball’ :-). If I am the bearing, what just happened to me?
There are only four (4) ways in which such components can fail:
Note: For a ton of job aids that will get into the weeds on these failure mechanisms, please scroll through my LI blog page and look for pics of broken parts 🙂
With all of the possibilities that you were thinking in your head when I asked the initial question, wouldn’t they all eventually cause one or more of these failure patterns to occur? See what I am getting at when I talk about How Can v Why?
So our revised Logic Tree would now look like Figure 2:
Now this is where our proper evidence collection efforts will pay off. In our hypothetical case here, we turned the bearing over to our metallurgical group and their analysis determined that the failure patterns were due to Fatigue. This is important because what if we thought it was Overload and followed that path? If I asked ‘How could I have Overload v Fatigue’, my answers would be very different. So it is extremely important to validate our hypotheses using sound evidence.
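For readers who think in code, here is a minimal sketch of that validation step. It is only an illustration: the `Status` and `Hypothesis` names and the `paths_to_pursue` helper are hypothetical, only the two mechanisms named above are modeled, and the 'ruled out' status for Overload is my assumption for the example. The point is that a hypothesis earns a follow-up ‘How could?’ question only once evidence has verified it.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    UNTESTED = "untested"
    VERIFIED = "verified"
    RULED_OUT = "ruled out"


@dataclass
class Hypothesis:
    label: str
    status: Status = Status.UNTESTED
    evidence: str = ""  # what supports or refutes this hypothesis
    children: list[Hypothesis] = field(default_factory=list)


def paths_to_pursue(node: Hypothesis) -> list[str]:
    """Return the labels of child hypotheses that evidence has verified."""
    return [c.label for c in node.children if c.status is Status.VERIFIED]


# Only Fatigue and Overload are modeled; the Overload status is an
# assumption made for illustration.
bearing_failure = Hypothesis(
    "Critical pump bearing failure",
    status=Status.VERIFIED,
    evidence="Failed bearing recovered from the pump",
    children=[
        Hypothesis("Fatigue", Status.VERIFIED, "Metallurgical analysis"),
        Hypothesis("Overload", Status.RULED_OUT, "Not supported by the analysis"),
    ],
)

# Only verified hypotheses get a follow-up question.
for label in paths_to_pursue(bearing_failure):
    print(f"How could we have had {label}?")
```

Had Overload been marked verified instead, the next question (and every block beneath it) would have been different, which is exactly why the evidence matters.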
So now we know our bearing failed due to Fatigue. What’s our next question? How could our critical bearing have failed due to Fatigue? What are our possibilities (hypotheses)? Remember, move backwards in small increments of time as you visualize this happening.
In Figure 3, our RCA team comes up with Thermal Fatigue v Mechanical Fatigue.
Our metallurgical report concludes there is evidence of Mechanical Fatigue. Remember again, evidence is key because the answers to ‘How can we have Mechanical v Thermal Fatigue?’ are very different. So we continue on with ‘How could we have had mechanical fatigue of the critical bearing?’
Our RCA SMEs conclude the only possibility they can think of is related to conditions with High Vibration. So in Figure 4, we update our logic tree accordingly. We review our maintenance and reliability histories in this case and we find that we have had vibration-related issues with this pump. So the questioning continues with ‘How could we have had excessive vibration on this critical bearing?’ Our SMEs come to the rescue again with all the possibilities they can think of and conclude: Misalignment, Imbalance, Looseness and/or Resonance.
A review of past vibration records and PM histories shows that misalignment is apparent and there are no indicators of imbalance and/or resonance. By now you know the next question, ‘How could we have had misalignment causing high vibration?’, and so on up the logic tree.
We explore the possibilities that the pump was either misaligned during installation or became misaligned during operation. Again we review the evidence we collected and determine there have been issues with this pump since it was installed. There was not a period where vibration levels were normal and then at some point in operation became abnormal. Another way to test this hypothesis is to watch the mechanic who aligned this pump and see if they are doing it properly. If a mechanic knows you are watching them and still does not align properly, chances are they do not know how to align properly in the first place.
In this case our evidence concludes it was not aligned properly from the beginning. So we update our logic tree accordingly as seen in Figure 5.
IMPORTANT: Notice at this level that we have placed a blue circle on this block and labeled it an HR. This stands for Human Root. This is where a choice was made, a decision. In this case, the person aligning chose to align in that fashion. This is a critical juncture in the logic tree reconstruction, where it is now switching from the physics of the failure (consequences of decisions) to focusing on the rationale of the decision itself. This is the conversion from the hard technologies (the physical sciences) to the soft technologies (the social sciences). IT IS AT THIS POINT WHERE WE SWITCH THE QUESTIONING TO ‘WHY’.
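The switch in questioning can be written down as a simple rule. The sketch below is purely illustrative: the `Kind` and `Block` names and the exact question wording are hypothetical, chosen only to show that the type of block determines which question comes next.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    PHYSICAL = "physical"      # physics of the failure (consequences of decisions)
    HUMAN_ROOT = "human root"  # the HR block, where a choice was made


@dataclass
class Block:
    label: str
    kind: Kind


def next_question(block: Block) -> str:
    """Ask 'How could?' for physical blocks and 'Why?' once a decision is reached."""
    if block.kind is Kind.PHYSICAL:
        return f"How could this have occurred: {block.label}?"
    return f"Why was this choice made: {block.label}?"


print(next_question(Block("High vibration", Kind.PHYSICAL)))
print(next_question(Block("Pump not aligned properly at installation", Kind.HUMAN_ROOT)))
```

Everything above the Human Root is explored with ‘How could?’, and everything from the Human Root down is explored with ‘Why?’.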
Now I am going to get between the ears of the decision-maker and try to put myself in their spot, at that time. Why would this mechanic not have aligned properly at that time? Do you see where we have left the physical sciences and are now looking at why people make the decisions they do (that lead to physical consequences we can see)? In our case, our evidence and interviews point to two possibilities: the mechanic had inadequate tools to align properly and/or they simply did not know how to align properly. So we will update our logic tree as displayed in Figure 6.
Evidence collected from inspection of alignment tools/technologies, procedures/systems and interviews reveals the tools being used were inadequate (outdated and worn) and the mechanics were not trained properly. The trained mechanics retired and those taking over their tasks had just watched them do it a few times. So it was basically on-the-job training (OJT) and not formal, hands-on vendor training in proper practices. In this case, since both are true, both are followed!
We continue, ‘Why didn’t the mechanics have the proper tools and training to do the job right?’
For the first leg, our review of related systems finds there was no requirement for an annual review of the alignment tools and technologies to ensure they were fit for service as well as still applicable (not replaced by newer technologies).
So at this point we will put a red circle around this hypothesis proven to be true and call it a Latent Root Cause (or Systems Roots related). This is an actionable cause at this point where it would be easy and inexpensive to write a corrective action. Figure 7 represents our updated logic tree.
Let’s move on to our final leg and explore why the mechanic did not know how to align properly. In our discussions with our RCA SMEs, a review of our evidence and a review of our interviews, we find Lack of Adequate Training, Lack of Adequate Management Oversight and Time Pressure on the mechanic. Training is a typical RCA recommendation and not the most effective one, as it addresses the individual and not the system. However, that doesn’t negate the need to upgrade the skills of staff.
The important one here, and one we should all ask ourselves under these conditions, is ‘Why was someone permitted to be in a position they were not qualified to be in?’ That is a management oversight issue and should be explored. This happens every day, everywhere and in every industry! This type of oversight is critical to Reliability in the field. This is also important because if normalization of deviance is permitted to occur, it increases the risk of an unexpected failure.
When our personnel start to take shortcuts (due to time pressures) and nothing bad happens, they have deviated from a designed standard and set a new norm (practice). By taking such shortcuts with no bad consequences, they have lowered the standards…they are Forgetting to be Afraid!!
Reading this logic tree backwards simply tells an evidence-based story that starts with the impact of deficient systems on decision-making, and then demonstrates the physical consequences of such decisions.
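To make 'reading backwards' concrete, here is a rough sketch that stores only the verified blocks from this case study and walks each leg from its deepest cause back up to the Event. The `TREE` dictionary and the `stories` function are hypothetical names, the labels are paraphrased, and treating the three findings on the training leg as the deepest causes is my simplification for the example.

```python
# Verified blocks only, stored as effect -> list of causes.
TREE = {
    "Unexpected shutdown": ["Critical pump bearing failure"],
    "Critical pump bearing failure": ["Fatigue"],
    "Fatigue": ["Mechanical fatigue"],
    "Mechanical fatigue": ["High vibration"],
    "High vibration": ["Misalignment"],
    "Misalignment": ["Misaligned at installation"],  # Human Root
    "Misaligned at installation": [
        "Inadequate alignment tools",
        "Mechanic did not know how to align",
    ],
    "Inadequate alignment tools": ["No annual review of alignment tools"],
    "Mechanic did not know how to align": [
        "Lack of adequate training",
        "Lack of adequate management oversight",
        "Time pressure",
    ],
}


def stories(tree, top):
    """Yield one path per deepest cause, ordered from that cause up to the top event."""
    def walk(node, path):
        path = path + [node]
        causes = tree.get(node, [])
        if not causes:  # nothing deeper recorded: this is where a story starts
            yield list(reversed(path))
        for cause in causes:
            yield from walk(cause, path)

    yield from walk(top, [])


all_stories = list(stories(TREE, "Unexpected shutdown"))
for story in all_stories:
    print(" -> ".join(story))
```

Notice that more than one story comes out. Those are the parallel paths that a single linear chain of ‘Why?’ questions would not capture.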
There is a bigger picture of how to use this structured logic to create AI knowledge bases but that is another paper:-).
Can you now see:
- how systems influence decisions (social sciences) and
- how choices/decisions trigger physical and observable consequences (physical sciences)
- how asking ‘how can’ is more appropriate to explore the physics of failure and how asking ‘why’ is more appropriate when exploring the human mind?
I have embedded many links in this article that will get more into the weeds on these issues. I invite you to visit our RESOURCES web page for free access to our articles library and video case study library, as well as free tools and job aids.
This was a lengthy one but it should be an easy read. Thanks for your interest and patience!