Is There a Direct Correlation Between Reliability & Safety?

“Assumption 1: Safety is increased by increasing system or component reliability. If components or systems do not fail, then accidents will not occur. (p. 7)

This assumption is one of the most pervasive in engineering and other fields. The problem is that it is not true.

Safety is a system property, not a component property, and must be controlled at the system level, not the component level.

New Assumption 1: High reliability is neither necessary nor sufficient for safety. (p.13)”

These statements were excerpted from Nancy Leveson’s “Engineering a Safer World”.

This contradicts the common belief that there is a direct correlation between Safety and Reliability. Having worked in the Reliability field for 30+ years, I personally believe there is a correlation between Reliability and Safety. But I would assert that it is not a direct correlation.

This is because a reliable operation can still be unsafe, and a safe operation can be unreliable.

But I firmly believe (and have experienced) that a reliable operation is inherently a safer operation than an unreliable one. In a reliable operation, there are fewer stops and starts and fewer unexpected situations that deviate from the control systems in place, so it stands to reason there is less need to quickly correct a deviation from a standard.

The graphic above is from a thought-provoking post by Dustin Etchison (cited beneath the graphic) that expresses a correlation between Safety and Reliability.

I expanded on this at length in a paper entitled “Do Human Performance Learning Teams Make RCA Obsolete?” In it, I was addressing the following Leveson statement:

“The basic Domino Model (she is referring to what she calls event trees or chains, reinvented by James Reason 20 years ago) is inadequate for complex systems and other models were developed, but the assumption there is a single or root cause of an accident unfortunately persists as does the ideas of dominoes (or layers of Swiss cheese), and chains of failures, each directly causing or leading to the next one in the chain. It also lives on in the emphasis on human error in identifying accident causes. (p.15).”

I believe the Safety world has an inaccurate current-day view of ‘RCA’ in general and therefore treats all RCA as a commodity equivalent to the limited capabilities of the 5-Whys (which is linear and identifies a single root cause). I believe how well we truly solve failures (losses resulting from deviations from an acceptable standard) has a direct impact on the safety of our workforce.

Contrary to popular belief, true RCA does NOT stop at blaming someone (based on a decision resulting in a bad outcome), but seeks to understand the reasoning behind their decision (their intent) at the time. Delving into a person’s intent for their decision will often involve uncovering flawed organizational systems, restraining paradigms, cultural norms and other sociotechnical influences.

Does anyone have valid data that supports/contradicts a direct correlation between Reliability and Safety? Comments either way on this correlation?

UPDATE (11.27.17): Thanks for the overwhelming response to this post. The COMMENTS from the experts in both fields have provided a learning experience for all. This obviously demonstrates the great interest in the linkage between Reliability and Safety.

If I can make an attempt to summarize the vast number of comments, it would appear that most feel strongly there is a definite correlation between Reliability and Safety, but we understand there is not likely a direct correlation. This is because we know we can have a reliable operation that is unsafe, and vice-versa.

We also strongly believe that when we experience unexpected conditions (upsets), we test the boundaries of our safety controls and are at higher risk of experiencing a safety incident. Steady state (reliable) operations are typically less prone to such elevated risks.
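The distinction drawn above, between a correlation and a direct (one-to-one) correlation, can be illustrated with a small calculation. The sketch below is only an illustration: the function name `pearson_r` and both data series are hypothetical and made up for this example, not taken from any plant study mentioned in this post. It computes the Pearson correlation coefficient between yearly unscheduled downtime and injury rate.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL numbers, invented purely for illustration (one value per year).
unscheduled_downtime_pct = [12.0, 10.5, 9.0, 9.5, 7.0, 6.0, 6.5, 4.0]
injury_rate = [4.1, 3.2, 3.6, 2.9, 2.5, 2.8, 1.9, 1.7]

r = pearson_r(unscheduled_downtime_pct, injury_rate)
print(f"Pearson r = {r:.2f}")
```

With made-up data like this, the coefficient comes out positive but below 1: the two measures tend to move together, yet some years have lower downtime and a higher injury rate than the year before. That is the sense in which a correlation can exist without being direct. A coefficient like this also says nothing by itself about cause and effect.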

I would like to thank Robert Kalwarosky for posting Ron Moore’s article, ‘A Reliable Plant is a Safe Plant, is a Cost Effective Plant’. This is the only paper I have seen thus far that is based on studies conducted at actual, specific plant operations, over a designated period of time. The focus of the studies mentioned is to draw the links between Reliability, Safety and Costs. Here are a few of Ron’s conclusions:

Given this is not a new topic, and one that seems obvious to Reliability practitioners, I am surprised that more such studies have not been conducted or at least presented in this thread. They may be out there, but we just haven’t seen them yet!

UPDATE (11.28.17): Below are comments from Ron Moore, in response to the ‘Assumptions’ stated above. Ron permitted me to use his comments for this purpose.

“Bob, below are my initial thoughts. I haven’t read her (Leveson) paper, so these are based on the quotes. I’m also assuming you’ve read the paper ‘A reliable plant is a safe plant is a cost effective plant.’ The data I shared in the paper is only a fraction of what I have, but at present it’s all consistent with what I’ve shown in the paper. In fact I have six different sets of data from different companies demonstrating that as OEE improves, injury rate declines. I have three sets of data relating reactive maintenance and injuries, along with PM/PdM and injuries.

My initial comments on the quotes you’ve provided from Ms. Leveson are provided below.

“Assumption 1: Safety is increased by increasing system or component reliability. If components or systems do not fail, then accidents will not occur. (p. 7) This assumption is one of the most pervasive in engineering and other fields. The problem is that it is not true. Safety is a system property, not a component property, and must be controlled at the system level, not the component level.”

This appears to be an incorrect interpretation or characterization of the data. My data says that safety is improved by improving system reliability (and by inference component reliability). If you reduce the failures, both component and system level, you reduce the exposure to the risk of injury and therefore the probability of injury. However, I agree that it does not mean that accidents will not occur, since accidents are caused by any number of variables, some of which are not controlled by reliability excellence. I also agree that safety is a system property, not a component property, and must be controlled at the system level.

In my view, one of the best, if not the best, measure for reliability is OEE/AU, a system level measure. Reliability isn’t just about maintenance, but her statements/assumptions seem to imply that it is. Indeed, my data says that maintenance typically only controls some 10% of the loss of production capacity captured in the OEE measure. Moreover, reliability is driven by our practices in design, procurement, stores, installation and startup, operation and maintenance, all of which contribute positively or negatively to system level reliability (not just equipment or components). Reducing the number of defects in these practices, both within each function and cooperatively as a team, will improve reliability and reduce the risk of injury, while reducing costs and environmental incidents.

“New Assumption 1: High reliability is neither necessary nor sufficient for safety. (p.13)”

I think this is a really bad assumption, even risky. It’s perplexing why anyone would say this. Why wouldn’t you want high reliability, particularly if it reduces risk – risk of injury, risk of high costs, and risk of environmental incidents? This assumption may depend on her definition or view of reliability being driven by maintenance. Reliability should not be driven by maintenance. Maintenance is a support function to the overall plant and production process.

I have data on manufacturing businesses that improve safety without commensurate improvement in reliability. However, they reach a point where additional improvements in safety do not appear to be achievable, because the system has reached a statistically stable state. For example, you can improve safety by improved personal behavior – wear your PPE, do your lock-out/tag-out properly, etc. However, once you do this exceptionally well, you have to reduce the exposure to the risk of injury, that is, you have to improve process reliability (not just equipment) to achieve further gains.”

Additional note from Ron Moore: “I scanned through Chapter 2 of Leveson’s book, and we may be talking in two different languages, or perhaps same language, but different dialects. I agree with her that safety and reliability are different properties, and that a reliable system can be unsafe, and a safe system can be unreliable. She gives good examples.

When I talk about reliability, I’m generally not using the standard definition that she repeats in the book. I’m thinking about the ability of the business (the system in this case) – the refinery, steel mill, chemical plant to be able to deliver its product in a timely, cost effective, and safe manner.

And what I’ve observed in the data is that when the OEE improves, safety improves; when reactive maintenance is reduced, OEE improves and safety improves (but reactive events are not typically caused by maintenance, only performed by them); as practices in operations and maintenance improve, OEE and costs improve; and so on.

OEE graphic provided courtesy of Bruce Hawkins (How Reliability Impacts Shareholder Value, presentation at SMRP Symposium; Bruce Hawkins, Dir. of Technical Excellence, Emerson Operational Certainty).

A caution I would insert here is that correlation is not necessarily cause and effect. Anyway, the examples she uses (the ones I read) are what I think of as sub-systems, and from that context I can see her point, and agree. Moreover, she does make a good point about the use of FMEA and the like. It’s really hard to capture all the complexity in a large system (a plant or combination of plants and other functions in a business) using those techniques.

I’ve said for many years that leadership, culture, teamwork, employee engagement are more important than any particular analysis tool, but that the tools are important for engaging people in solving problems. ‘What Tool? When?’ provides my thoughts on this.”

I would like to thank Ron Moore for allowing me to post his position on this very important topic.

UPDATE (6.26.18): I recently presented at the SMRP Symposium in Memphis and connected with my old friend Ramesh Gulati. He kindly provided me with 12 years of additional field data from the Arnold Engineering Development Complex (AEDC) that further supports the conclusions of Ron Moore’s data described above. This graph shows that a decrease in injury rates correlates with a decrease in PM backlogs and Unscheduled Downtime.

Thanks to my friends in the Reliability and Safety communities for sharing and contributing your experience in this post.

UPDATE (2.12.19): This article, ‘Reliability and Safety Inseparable’, was published in Efficient Plant Magazine by Klaus Blache (director of the Reliability & Maintainability Center at the Univ. of Tennessee, and a research professor in the College of Engineering. Contact him at kblache@utk.edu).

From my studies, top-quartile companies (low in reactive maintenance) spent 23% of their time finding issues with predictive technologies and condition-based monitoring. This does not include preparing the work orders to fix what was found. Top-quartile-company employee engagement (suggestions per employee):

• had a 27% better safety performance (OSHA recordable-incident rate) than the average of the remaining facilities

• recorded a 14% better OSHA recordable-incident rate than the lower 75% of companies.

It’s this instilled process of root-cause analysis that drives ongoing improvement.

This latest data supports the correlation between Reliability and Safety.

About Robert (Bob) J. Latino
