RCA: Going From Good to Great

Last Verified September 4, 2023

I’ve been in the Reliability and RCA space for 38 years now (yes, I’m old 😊), but recently I’ve had a major change in perspective. For 37 of those years my family owned and ran a business (Reliability Center, Inc.) that offered training, consulting, and software in the Root Cause Analysis (RCA) space. We developed the PROACT® RCA Methodology & Software, which has been adopted by many Fortune 500 and Global 1000 companies. However, as we (my brothers, sisters, and I) approached retirement age, we enacted our succession plans and sold RCI in 2019.

NOW comes the perspective change: I am no longer an RCA provider beholden to a proprietary brand. I am now an RCA consumer with deep domain knowledge of the core principles of effective RCA. In this paper, I would like to remove the RCA provider brand labels and delve into ‘What makes any RCA effort good versus great?’ When we remove the labels and look at any investigative occupation, all the steps are basically the same. So, let’s explore together!

While I could write a chapter on each of these line items (and we have done so in our latest book, Root Cause Analysis: Improving Performance for Bottom-Line Results [5th Ed.] – see above), I’ll address them briefly in this article.

table listing differences between traditional and Best Practices for RCA — **Figure 1:** Moving From ‘Good’ to ‘Great’ RCA

1. RCA is a Task versus a System.

Far too often when I assess existing RCA efforts, the tool used (not necessarily the methodology) is accepted as the ‘RCA’. For instance, once someone creates a 5-Whys analysis, fishbone diagram, or causal factor/logic tree with a bunch of sticky notes (and perhaps a minimal report), that’s the RCA.

When viewing RCA as a ‘system’ we must take into consideration:

A. Analyzing the right undesirable outcomes, not ALL of them

B. Putting effective management systems in place to support a true RCA system

C. Developing data collection strategies to replace ‘hearsay’ as a validation technique

D. Building Human and Organizational Performance (HOP) principles into the RCA approach

E. Rejecting Shallow Cause Analyses, when they don’t measure up to best practice RCA standards

F. Incorporating Defect Elimination (DE) strategies to capture the chronic failures

G. Aggregating RCA knowledge into a single, sharable knowledge management database

H. Employing AI technologies to extract and share experience and expertise related to past, current, and future RCAs

I. Developing and executing effective corrective actions

J. Correlating RCA performance to corporate dashboards

K. Integrating RCA efforts into work execution systems (e.g., CMMS, APM, SAP)

L. And MUCH more.

Holistic RCA is NOT a task, it is a strategic, integrated system!!

2. RCA Focus on INTENT, Not Outcomes

Far too often we see hindsight RCA focus on the bad outcomes of decisions made. In such cases, the conclusions are predominantly to blame the decision-maker and discipline them. Additionally, the ineffective, go-to appeasement solutions are to retrain the individual, employ punitive measures (e.g., time off without pay), and/or change a procedure/policy. As the Hierarchy of Interventional Effectiveness (see Figure 2) shows, these are the least effective types of corrective actions.

Graphic showing less effective (education & training, rules & policies, Reminders, checklists & double checks) to More effective (Simplification & standardization, Automation & computerization, and Forcing functions) — **Figure 2:** The Hierarchy of Interventional Effectiveness

When we stop an RCA at a decision-maker and assign blame, then that says we are not interested in WHY the decision made sense at the time, to that person. In the interest of true RCA, we cannot prevent the recurrence of something, if we don’t understand the flawed management systems (latent root causes) that adversely influenced the decision-maker.

It is the understanding of the ‘sensemaking’ of the decision-maker, the systems on which they relied, and the relationships between those systems (their interdependencies) that will determine the effectiveness of an RCA.

3. All RCA Tools are NOT created Equal

While I could go on for days expanding on this statement, I’ll use a simple graphic to make my point. We’ll just contrast the commonly used 5-Whys tool to the Causal Factor/Logic Tree tool (I use causal factor and logic tree interchangeably) in Figure 3.

graphical comparison between 5-Why and Logic Tree / Causal factor tree — **Figure 3:** 5-Whys vs Causal Factor/Logic Trees

Are the two tools equal in technical capability? 5-Whys assume that failure is linear, therefore the logic is always sequential. For those of us in the real world, we know failure is not linear. Things happen in parallel (most of the time), to combine and converge for an undesirable outcome.

Linear logic tools will often result in blaming someone or stopping short and just replacing a failed part. By asking only ‘WHY’, we connote we only want a singular answer and an opinion.

Is there a difference between asking ‘WHY’ and asking ‘HOW COULD’? I’ll end this section with a question: Is there a difference between WHY a crime occurs, and HOW COULD a crime occur 😊? All RCA tools are NOT created equal.

4. Holistic RCA Analyzes Way More than Just Broken Parts

What’s the difference between RCFA and RCA? Does anyone know why we took the ‘F’ out of RCFA? This is because RCFA stood for Root Cause Failure Analysis. This was often interpreted as only using your preferred tool to understand the physics of a failure at a component level. So many viewed RCFA as being like a metallurgical analysis of a failed part.

Getting the ‘F’ out and getting back to RCA broadened the interpretation and understanding of the acronym. To me, RCA now encompasses all undesirable outcomes, not just mechanical and electrical failures. It includes undesirable outcomes from operations, safety, quality, administration, and any other department.

RCFA (F in blue) vs RCA — **Figure 4:** Getting the ‘F’ Out

5. Moving from ‘People are the Problem’ to ‘People & Systems are the Solutions’

IMHO, any ‘RCA’ ending in blame (outside of sabotage, which is malice with intent) is a Shallow Cause Analysis, not a true RCA. If we are stopping our RCAs short and concluding with either replacing parts/vendors and/or blaming decision-makers, WE ARE NOT DOING TRUE RCA.

As discussed earlier, if we are focusing on ‘blame’, then our analysis is focusing on the outcome of the decision, and not the ‘intent’ of the decision. Our RCAs should drill down past the decision-makers to truly understand WHY, on that day and at that time, they felt the decision they made was the right one. When we truly delve into understanding human reasoning, we will be on the pathway to doing legitimate RCAs.

A group that heavily influenced the human reliability side of my RCA perspective is the Community of Human and Organizational Learning. If you’re into holistic RCA, I highly recommend that you engage with this Community to better understand the human and systems contributions to undesirable outcomes.

Graphic showing difference from Shallow Cause Analysis (How can?) with focus on physical roots, to RCA (Why?) with Latent Roots — **Figure 5:** Expression of Root Cause Analysis vs Shallow Cause Analysis

6. Understanding the Difference Between ‘Work as Imagined (WAI)’ vs ‘Work as Done (WAD)’

We write our procedures in accordance with the way we view that work SHOULD be done. So, this is work as we imagine it should be. However, when we step out of the ivory tower and into reality on the floor, we experience a new world of how work is really done! The difference between these two worlds really demonstrates our human resilience in adapting to our constantly changing work environments. This WAI vs WAD distinction is largely based on the academic works of Dr. Erik Hollnagel, and has been expanded on by many others in the progressive Safety Differently movement.

Work as Imagined vs Work as Done, with quote, "Workers are masters of the Work as Done." — **Figure 6:** The Blue Line

A graphic you will commonly see referenced in these Safety spaces is this one (Figure 6), referred to as the ‘blue line’. Since most of our risk assessments are based on the black line, when we work in our realities (the blue line), we are often deviating from our intended path (the black line). This is where our resilience comes in and we are adaptive to the complexities of our working environments.

Holistic RCAs take this into account. They seek to identify how and why our resilience had to be employed, and to address any system deficiencies that need to be evaluated. This feedback loop then ensures the black line is constantly being updated, making the working environment safer and more productive.

7. All RCA’s Don’t NEED to be Reactive

You can probably relate: most of the time, a formal RCA has to be triggered. This means there’s some kind of matrix (see a sample in Figure 7) somewhere that states that if we have $XXXXX of loss per incident and/or XX minutes/hours of unexpected downtime over XX days, an RCA will be required. Usually these thresholds are staggered, so that the higher the severity of the outcome, the more depth the RCA requires.

Sample RCA Trigger Heat Matrix with regions from lower left, green, to upper right (yellow, orange, and red) — **Figure 7: Sample RCA Trigger Heat Matrix**
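A staggered trigger matrix like this can be expressed as a small lookup. The sketch below is only an illustration: the tier names and every dollar and downtime threshold are hypothetical placeholders (the article deliberately leaves them as XX), and it checks a single incident rather than downtime accumulated over a window of days.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggerTier:
    name: str               # depth of analysis this tier calls for
    min_loss_usd: float     # loss per incident at or above which the tier applies
    min_downtime_hr: float  # unexpected downtime at or above which the tier applies


# Hypothetical tiers, ordered from most to least severe.
# Replace the names and numbers with your own site's matrix.
TIERS = [
    TriggerTier("full team RCA", 100_000, 24),
    TriggerTier("facilitated RCA", 25_000, 8),
    TriggerTier("short-form analysis", 5_000, 1),
]


def required_rca_depth(loss_usd: float, downtime_hr: float) -> str:
    """Return the most severe tier whose loss OR downtime threshold is met."""
    for tier in TIERS:
        if loss_usd >= tier.min_loss_usd or downtime_hr >= tier.min_downtime_hr:
            return tier.name
    return "no formal RCA triggered"


print(required_rca_depth(loss_usd=30_000, downtime_hr=0.5))
print(required_rca_depth(loss_usd=100, downtime_hr=0.1))
```

Note what the second call shows: a small, frequent event never reaches any tier on its own, which is exactly the loophole discussed next.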

While this is common and necessary to address the sporadic/reactive incidents, it has loopholes that are antithetical to the proactive premise of a holistic Reliability approach.

For instance, in the description provided, the thresholds are based on a single ‘incident’. What about all those chronic failures that prevent people from doing their jobs the best they can? They often happen every hour, shift, day, week, etc. On their individual occurrence, they don’t rise to the level of any of those thresholds. For this reason, I would suggest adding a proactive threshold that triggers a formal RCA based on the annual cost of a chronic failure. This calculation would simply be:

‘Frequency/Yr x (Lost Production $/Incident + Manpower $/Incident + Material $/Incident) = Total Annual Loss’.
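As a quick worked example of the formula above, here is a minimal sketch. The function name and all of the input numbers are invented for illustration: a nuisance stoppage that happens 2,100 times a year and costs $150 in lost production, $40 in manpower, and $10 in material each time.

```python
def total_annual_loss(frequency_per_year, lost_production_per_incident,
                      manpower_per_incident, material_per_incident):
    """Frequency/Yr x (Lost Production + Manpower + Material, all in $/Incident)."""
    cost_per_incident = (lost_production_per_incident
                         + manpower_per_incident
                         + material_per_incident)
    return frequency_per_year * cost_per_incident


# Illustrative numbers only: 2,100 occurrences/yr at $200 per occurrence.
print(total_annual_loss(2100, 150, 40, 10))  # 420000
```

At $200 per occurrence, this event would never trip a per-incident threshold, yet its annual cost is well into six figures.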

Also, to be truly proactive, we should have a threshold that looks at unacceptable risks versus just consequences (when those risks have materialized). There is no reason we shouldn’t be doing a Pareto Split (80/20) on our risk assessments (e.g., APM, FMEAs). If we separate the 20% of the potential failure modes (the Significant Few) that account for 80% of the potential risk, we should do an RCA on why the risks are so high for the Significant Few. Figure 8 is a sample of this output courtesy of our friends at ITUS Digital.

We typically don’t do this because there is no regulatory driver to make us do it. But it makes perfect sense from a Reliability standpoint to do this…because it’s the right thing to do. Try that and even make it a leading metric that you track for the effectiveness of your overall RCA effort 😊.

**Figure 8: ITUS Proactive Data Mining Example**
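The Pareto Split described in this section can be sketched in a few lines. This is a generic illustration, not how ITUS or any APM product computes it; the failure modes and risk scores are made up, and it assumes each failure mode already carries a single numeric risk score (for example, an RPN from an FMEA).

```python
def significant_few(risks, share=0.80):
    """Return the highest-risk failure modes that together reach `share` of total risk.

    `risks` maps a failure-mode name to a non-negative risk score.
    """
    total = sum(risks.values())
    if total <= 0:
        return []
    selected, running = [], 0.0
    # Walk the failure modes from highest to lowest risk.
    for mode, risk in sorted(risks.items(), key=lambda kv: kv[1], reverse=True):
        if running >= share * total:
            break
        selected.append(mode)
        running += risk
    return selected


# Hypothetical risk scores for five failure modes (total = 100).
risks = {
    "seal leak": 55,
    "bearing wear": 30,
    "misalignment": 8,
    "loose guard": 4,
    "faded label": 3,
}
print(significant_few(risks))  # ['seal leak', 'bearing wear']
```

Here two of five failure modes carry 85% of the scored risk, so those two would be the candidates for a proactive RCA.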

8. RCA and Knowledge Management

Most perceive traditional RCA to only be a tool to analyze incidents. While it is admirable to use RCA for this purpose, if it’s one’s only purpose, they are significantly underutilizing its potential.

If we only use RCA to solve one problem at a time, and we record the results separately, there is no means to centralize all the RCA data into one location. Under such conditions we cannot:

1. search such a common database for trends,

2. offer up suggestions to future analysts based on past results (using AI tools which are much more accessible and easier to use today),

3. create learning libraries/templates of successful cases to be used in our training systems,

4. collaborate live with sister facilities around the world.

Without a means of aggregating such knowledge, it will likely result in excessive RCA re-work and a huge, missed opportunity to educate our people and prevent recurrence. People will continue to analyze similar events at other locations, simply because they are not aware they have been analyzed before.

I used to poll my conference audiences about what they felt the average cost was to conduct a formal, triggered RCA. Take into consideration labor $ for those involved on the team, any 3rd parties used (lawyers, consultants, labs, etc.), real estate costs for use of conference rooms, travel and meals (if applicable), and any other such costs.

The average I heard was $25,000/RCA!!! Consider this number when you want to look at the annual costs of RCA re-work in any given company.

Note: Oftentimes RCA teams do not have access to Subject Matter Experts (SMEs). With the easy accessibility of ChatGPT these days, RCA team members can easily pose their RCA questions and often receive an instant and impressive response. There is a lot of refining that needs to be done with ChatGPT as it stands today (but it also has come a long way). We just need to recognize it may not be totally accurate or comprehensive. But it is a great starting point on which we can build for our RCAs and our Knowledge Management Libraries.

One of the highest-level purposes of doing effective RCAs should be to share/educate others on what has been learned. In order to do that, we must first aggregate all of the RCA data below (at a minimum) into a single database:

1. Problem statement

2. Timeline

3. Team members

4. Annual costs & impacts

5. Locations and specific equipment involved

6. Graphical reconstruction/visualization (logic tree for example)

7. Verification logs (evidence to back up hypotheses)

8. Corrective actions tracking (assignments, due dates and completion dates)

9. Measuring of corrective action and overall RCA effectiveness

10. Identified cause categories for trending
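To make the list above concrete, here is one possible shape for a single record in such a database, plus a trivial trend query across records. It is a minimal sketch: every field and function name is my own assumption rather than the schema of PROACT or any other RCA product, and a real system would sit on a proper database instead of in-memory objects.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class CorrectiveAction:
    description: str
    assignee: str
    due: date
    completed: Optional[date] = None  # None while the action is still open


@dataclass
class RCARecord:
    problem_statement: str
    timeline: List[str]
    team_members: List[str]
    annual_cost_usd: float
    location: str
    equipment: List[str]
    logic_tree_ref: str  # pointer to the stored logic tree or other visualization
    verification_log: List[str] = field(default_factory=list)
    corrective_actions: List[CorrectiveAction] = field(default_factory=list)
    effectiveness_notes: str = ""
    cause_categories: List[str] = field(default_factory=list)


def cause_trends(records):
    """Count how often each cause category appears across all RCA records."""
    return Counter(cat for rec in records for cat in rec.cause_categories)


def open_actions(records):
    """List corrective actions that have not yet been completed."""
    return [a for rec in records for a in rec.corrective_actions if a.completed is None]
```

Once every analysis lands in the same structure, questions like "which cause categories keep recurring across sites?" become a one-line query instead of a search through separate reports.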

In Conclusion

It has been an interesting and fun exploration, as an RCA consumer, to view RCA solely as a field of study, versus through the prism of just a branded product/approach. While I know most all of my former competitors (and many of us are still great friends), I also know that those who have been in this business a long time, have done so for a reason.

The reason is whatever they’re doing, is working for their clients…pure and simple. While our approaches and tools may differ, they are working for everyone’s respective customers and that’s what matters.

Remember, the steps of any investigative process are essentially the same, so the differences have to be made up by how we treat our customers and ‘walk the talk’ with unconditionally supporting them.

People in the RCA business, like any other, survive by ensuring their clients are successful. In the end, people buy people, not products and approaches. Those who treat people like transactions just to get their money will not survive, as they are not building lasting relationships to ensure their customers’ success. Good customers end up being great friends.

As a reality check though, “An analysis is only as good as the analyst!” So, you can buy all the fancy tools with the bells and whistles you want, but if you don’t have a competent analyst, the tool/approach is useless.

In parting, here is a twist on an old but timeless Deming quote:

“We NEVER seem to have the time and budget to do things right, but we ALWAYS seem to have the time and budget to do them again.”

Let’s just do it right the first time (as quoted by Phil Crosby)😊!

About the Author: Bob Latino

Bob Latino is currently a Principal of Prelical Solutions, LLC. Bob was the CEO of the Reliability Center, Inc. (RCI), until its acquisition in 2019. The Latino family had founded, directed, and owned RCI since 1972.

He is an internationally recognized author, trainer, software developer, lecturer, and practitioner of best practices in all aspects of a holistic Root Cause Analysis (RCA) system.

Bob has been facilitating RCA analyses with his clientele around the world for over 38 years. He has taught the PROACT® RCA Methodology and associated software solutions to well over 10k students in 25+ countries. He is author or co-author of ten (10) books related to RCA, FMEA and/or Human Error Reduction and over 100 articles and papers on the same topics in Manufacturing and Healthcare.

He currently serves on the Board of Directors for CHOLearning (Community of Human and Organizational Learning) and is a proud member of the UpKeep Ambassador Council (https://www.onupkeep.com/).

Bob is also Series Editor for CRC Press/Taylor & Francis (www.taylorandfrancis.com). His Series is entitled, ‘Reliability, Maintenance, and Safety Engineering: A Practical View of Getting Work Done Effectively’…always seeking authors!

blatino@prelical.com