How Systems & Reliability Engineers Apply Redundancy to Facilities and Critical Infrastructure

Redundancy in facilities and critical infrastructure is often misunderstood as simply having two of something. However, redundancy is a sophisticated strategy used by systems and reliability engineers to minimize failures and ensure continuous operation. It is one of several approaches to preventing system failures and comes with several key tradeoffs. This article examines four key aspects, or the four horsemen, of redundancy and why it is so important for facilities and critical infrastructure.

Reliability and Redundancy

Reliability is the probability that an item will perform its intended function for a specified interval under stated conditions. [MIL-STD-721C (1981)]

Redundancy is the existence of more than one means for accomplishing a given function. Each means of accomplishing the function need not necessarily be identical. [MIL-STD-721C (1981)]
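
To connect the two definitions, here is a minimal sketch of the textbook calculation for a redundant group that works as long as at least one of its means works. It assumes the failures are statistically independent, which real equipment often violates. The function name and the example numbers are hypothetical, chosen only for illustration.

```python
def parallel_reliability(unit_reliabilities):
    """Reliability of a redundant group that works if at least one unit works.

    Assumes unit failures are statistically independent.
    """
    prob_all_fail = 1.0
    for r in unit_reliabilities:
        if not 0.0 <= r <= 1.0:
            raise ValueError("each reliability must be between 0 and 1")
        prob_all_fail *= (1.0 - r)
    return 1.0 - prob_all_fail


# Hypothetical example: two non-identical means of doing the same job,
# one with reliability 0.90 and one with reliability 0.80.
print(round(parallel_reliability([0.90, 0.80]), 4))
```

Note that the units do not have to be identical for the calculation to apply, which matches the definition above.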

Two General Approaches to Failure Prevention

As we evaluate and communicate redundancy, it’s important to remember that there are two general approaches to preventing system failures.

Fault Tolerance

A fault tolerance approach commonly requires designing a redundant system so that a fault does not result in a failure. Fault detection and switching are as important as having two pieces of equipment.

Fault Avoidance

A fault avoidance approach requires a system to be designed with sufficiently reliable components to guarantee minimum system reliability. Techniques include using heavier structural members, more reliable components, and de-rating.
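
As a rough illustration of why fault avoidance pushes designers toward more reliable components, the sketch below treats a system as a simple series chain, where every component must work. The series model, the function names, and the numbers are assumptions for illustration only and are not taken from any particular standard.

```python
import math


def series_reliability(component_reliabilities):
    """Reliability of a chain in which every component must work.

    Assumes component failures are statistically independent.
    """
    return math.prod(component_reliabilities)


def required_component_reliability(system_target, n_components):
    """Reliability each of n identical series components needs to meet a target."""
    if n_components < 1:
        raise ValueError("n_components must be at least 1")
    return system_target ** (1.0 / n_components)


# Hypothetical example: ten identical components in series,
# with a minimum system reliability target of 0.95.
needed = required_component_reliability(0.95, 10)
print(f"each component needs a reliability of about {needed:.4f}")
print(f"check: {series_reliability([needed] * 10):.4f}")
```

With no redundancy to fall back on, each component has to be noticeably more reliable than the system target.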

The Most Common Mistake

The most common mistake is assuming that redundancy is in place without testing or otherwise validating it. This is common across all disciplines. Are you sure that the switching works?

The Second Most Common Mistake

Another common mistake is not doing preventative maintenance when we believe redundant equipment is present. After all, using redundancy (fault tolerance) as a form of system failure prevention means we accept individual equipment failures. The problem is that both duty and standby must be operable, and we don’t usually know that until we actually put the redundant unit in service. Are you continuing to do preventative maintenance on your systems that have redundancy?
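
The hidden-failure problem can be sketched with a simple model. It assumes the standby unit fails at a constant rate while idle and that a failure is only discovered (and fixed) at the next test or maintenance check. The model, the function name, and the failure rate are assumptions for illustration, not measured values.

```python
import math


def average_standby_unavailability(failure_rate_per_hour, test_interval_hours):
    """Average fraction of time an idle standby unit is failed without anyone knowing.

    Assumes a constant failure rate and that failures are found only at each test.
    """
    lt = failure_rate_per_hour * test_interval_hours
    if lt < 0:
        raise ValueError("failure rate and test interval must be non-negative")
    if lt == 0:
        return 0.0
    # Average of 1 - exp(-rate * t) over one test interval.
    return 1.0 + math.expm1(-lt) / lt


# Hypothetical standby unit with one hidden failure per 10,000 idle hours.
rate = 1e-4
for label, hours in [("monthly", 730), ("yearly", 8760)]:
    print(label, round(average_standby_unavailability(rate, hours), 3))
```

Under these assumptions, stretching the check from monthly to yearly makes the standby far more likely to be unavailable on the day it is needed.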

The Four Horsemen of Redundancy

I am not sure how I was first exposed to the “four horsemen.” It may have been in church. The Four Horsemen of the Apocalypse are described biblically in the Book of Revelation of the New Testament. They symbolize different aspects of the end times:

Conquest, who rides a white horse and carries a bow.
War, who rides a red horse and wields a large sword.
Famine, who rides a black horse and holds a pair of scales.
Death, who rides a pale horse and is followed by Hades.

My first exposure could have been to country music. The term “Four Horsemen” refers to The Highwaymen, a supergroup composed of four legendary artists known for pioneering the outlaw country subgenre: Johnny Cash, Waylon Jennings, Willie Nelson, and Kris Kristofferson.

Growing up in Charlotte, NC, it could have been from professional wrestling. The four horsemen were “The Nature Boy” Ric Flair, the Anderson Brothers (Arn and Ole), and Tully Blanchard. They dominated the ring and gave some of the most entertaining interviews.

One thing learned from all versions of the Four Horsemen is that no single horseman is more important (or more deadly) than the other three. This is also important for understanding and communicating redundancy.

The First Horseman of Redundancy: Complexity

Something is complex if it has many parts. Remember, complex does not mean complicated, which means difficult. We add more parts to a system when we adopt a fault tolerance (redundancy) approach. The more parts a system has, the more unexpected interactions it can produce.

Complex systems are harder to understand and verify than simple ones. Extra elements associated with redundancy invariably require further ‘managerial’ systems to determine, indicate, and/or mediate failures.

“The differences in the AK-47 and M-16 are a classic example of where the latter’s complexity makes it less reliable (dependable) in certain environments.”

We usually think of too little redundancy as making a system more fragile and less reliable. In reality, too little or too much redundancy can make a system more fragile and unreliable by making it too complex.
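
One way to see how too much redundancy can hurt is a toy model in which every added unit also needs a ‘managerial’ element (detection, indication, or switching) that must itself work. The model, the function names, and the numbers below are hypothetical and deliberately simplified.

```python
def redundant_group_reliability(n_units, unit_reliability, manager_reliability):
    """Toy model: n parallel units plus one 'managerial' element per added unit.

    Assumes independent failures and that every managerial element must work.
    """
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    at_least_one_unit_works = 1.0 - (1.0 - unit_reliability) ** n_units
    all_managers_work = manager_reliability ** (n_units - 1)
    return all_managers_work * at_least_one_unit_works


def best_unit_count(unit_reliability, manager_reliability, max_units=6):
    """Number of units (1..max_units) that maximizes the toy-model reliability."""
    return max(
        range(1, max_units + 1),
        key=lambda n: redundant_group_reliability(
            n, unit_reliability, manager_reliability
        ),
    )


# Hypothetical numbers: units at 0.90, each managerial element at 0.99.
for n in range(1, 5):
    print(n, round(redundant_group_reliability(n, 0.90, 0.99), 4))
```

In this sketch the second unit helps, while the third and fourth make the system less reliable because the added managerial elements cost more than the extra units gain.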

The Second Horseman of Redundancy: Independence

By definition, redundancy is the existence of more than one means for accomplishing a given function. Each means of accomplishing the function need not necessarily be identical.

In other words, your boat does not have to have two battery-powered bilge pumps for the system to be redundant. One battery-operated bilge pump with a hand bilge pump as the backup will suffice. Of course, it also depends on each pump’s ability to perform the same function reliably.

The biggest disadvantage of having independence in a redundant system is that there are two different pieces of equipment (and their subsystems) to understand and manage. That’s why most organizations use identical units for redundancy and, maybe worse, renew and replace them at the same time.

The major disadvantage of not having independence is that identical equipment will likely wear in similar ways and fail at similar times. Front-line staff and their management often overlook this aspect. Maybe worse, reliability calculations for redundancy assume independence, and most engineers do not understand the implications.
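
To illustrate what the independence assumption hides, the sketch below uses a simple beta-factor style approximation, in which a fraction (beta) of each unit’s failures comes from a cause shared by both units. This is one common way to model common-cause failures and is an assumption here, not the author’s method. The function name and the numbers are hypothetical.

```python
def pair_failure_probability(unit_failure_probability, beta):
    """Approximate failure probability of a redundant pair of identical units.

    beta is the assumed fraction of unit failures that are common cause.
    Uses the rare-event approximation: both independent parts fail,
    or the shared cause takes out both units at once.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be between 0 and 1")
    independent_part = (1.0 - beta) * unit_failure_probability
    common_cause_part = beta * unit_failure_probability
    return independent_part ** 2 + common_cause_part


# Hypothetical unit failure probability of 0.10.
print(round(pair_failure_probability(0.10, 0.0), 4))  # fully independent units
print(round(pair_failure_probability(0.10, 0.1), 4))  # 10% common-cause failures
```

Even a modest common-cause fraction nearly doubles the pair’s failure probability in this example, which is why a calculation that assumes independence can be optimistic.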

The Third Horseman of Redundancy: Propagation

Predicting how things will fail is often difficult, especially in service. An unexpected failure mode or effect in an upstream system may unexpectedly impact the performance of downstream systems. The unexpected catastrophic failure of an upstream system may wipe out a downstream system.

“A classic example of propagation is TWA Flight 800 (Boeing 747), which suddenly and tragically exploded near Long Island. The FBI initially suspected sabotage or a missile strike. Subsequent investigation concluded that the root cause was a spark caused by (poorly understood) corrosion of the aircraft’s aging wiring. The spark from the wiring failure ignited volatile fuel vapors in the center wing fuel tank. The plane exploded suddenly and catastrophically.”

Most fatal accidents involve unanticipated chains of failures, where the failure of one element propagates to others in what the US National Transportation Safety Board (NTSB) calls a ‘cascade.’ In systems engineering, we often call how one system impacts the failure of another a “System of Systems” issue.

In process systems, a subsystem often fails due to the partial or full failure of upstream subsystems. The cascading failures occur despite the downstream subsystems’ “redundancy.” Process systems also fail when an adjacent system places unusual moisture or heat on another system despite the adjacent system’s “redundancy.”
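
A small dependency sketch shows how a cascade can bypass downstream redundancy. The subsystem names and the dependency structure are hypothetical. Each subsystem lists groups of alternatives it needs, and it fails once every alternative in any one group has failed.

```python
def failed_after_cascade(requires, initial_failures):
    """Return the set of subsystems that end up failed after a cascade.

    requires maps a subsystem to a list of groups; the subsystem needs
    at least one working member from every group.
    """
    failed = set(initial_failures)
    changed = True
    while changed:
        changed = False
        for subsystem, groups in requires.items():
            if subsystem in failed:
                continue
            for alternatives in groups:
                if all(a in failed for a in alternatives):
                    failed.add(subsystem)
                    changed = True
                    break
    return failed


# Hypothetical process train: two redundant pumps share one upstream screen.
requires = {
    "intake_screen": [],
    "pump_a": [["intake_screen"]],
    "pump_b": [["intake_screen"]],
    "clarifier": [["pump_a", "pump_b"]],  # either pump is enough
}

print(sorted(failed_after_cascade(requires, {"pump_a"})))
print(sorted(failed_after_cascade(requires, {"intake_screen"})))
```

Losing one pump is tolerated, but losing the shared upstream screen takes out both pumps and everything downstream, despite the pumps being “redundant.”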

The Fourth Horseman of Redundancy: Human Error

Redundant systems require people to build and operate them. Isolated and redundant elements are often linked by the people who operate and maintain them. Even in airplanes with multiple levels of redundancy, the pilot serves as the ultimate (and final) redundant element to prevent a failure.

Failures are not passive events when they occur. Humans frequently instigate actions. Technological failures, sometimes of redundant equipment, open a window for human error.

“Redundancy, or fault tolerance, often depends on a human switching from a failed piece of equipment to a standby unit. The human error associated with ‘switching’ usually defines a system’s reliability.”
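
The effect of switching errors can be sketched with a simple duty and standby model. It assumes the standby is only needed when the duty unit fails, and that a person must then complete the switch successfully. The function name, the model, and the numbers are hypothetical and ignore timing effects.

```python
def standby_system_reliability(duty_reliability, standby_reliability,
                               switch_error_probability):
    """Reliability of a duty/standby pair with imperfect manual switching.

    The system works if the duty unit works, or if it fails, the switch
    is made without error, and the standby unit then works.
    """
    successful_switch = 1.0 - switch_error_probability
    return duty_reliability + (
        (1.0 - duty_reliability) * successful_switch * standby_reliability
    )


# Hypothetical units at 0.90 each, with increasing chance of a switching error.
for error in (0.0, 0.1, 0.5):
    print(error, round(standby_system_reliability(0.90, 0.90, error), 4))
```

As the chance of a switching error grows, the system reliability slides back toward that of a single unit, no matter how good the standby equipment is.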

What’s more, failure analysis frequently and incorrectly blames a front-line human for an incident rather than the overall management system. When it comes to redundancy, management systems are especially relevant because faults are tolerated and often hidden in the short term.

Communicating Using the Four Horsemen of Redundancy

Using the Four Horsemen analogy reminds us to communicate that redundancy is multi-faceted. It also reminds me that all four horsemen are important. With the country music and wrestling comparisons, I easily slip into the trap of thinking that Kris Kristofferson and Tully Blanchard were not as important as the other three. However, in the biblical comparison, Death would be analogous to human error. Forget about human performance, and the gates of hell are not far behind.

Applying Redundancy to Facilities and Critical Infrastructure

Redundancy in facilities and critical infrastructure is often misunderstood as simply having two of something. However, redundancy is a sophisticated strategy designed to minimize failures and ensure continuous operation. It is just one of several approaches to preventing system failures and comes with several key tradeoffs. This article looked at four key aspects, or the four horsemen, of redundancy.

This article was first posted on www.jdsolomonsolutions.com.

JD Solomon is the founder of JD Solomon, Inc., the creator of the FINESSE fishbone diagram®, and the co-creator of the SOAP criticality method©. He is the author of Communicating Reliability, Risk & Resiliency to Decision Makers: How to Get Your Boss’s Boss to Understand and Facilitating with FINESSE: A Guide to Successful Business Solutions.