A guest post by Andrew O’Connor, of Relken Engineering Pty Ltd
Common Cause Failure (CCF) is one of the reasons why a classical reliability model of your system may dangerously underestimate the risk of failure. It directly undermines the benefit of redundancy by creating a single point of failure. In fact, studies have shown that CCF events may contribute between 20% and 80% of the unavailability of safety systems within nuclear reactors [Werner 1994]. This post will “Describe this type of failure (also known as common cause mode failure) and how it affects design for reliability. (Understand)” [CRE BOK III.A.4].
Example. Consider a system which is required to provide power to a safety critical item. The system consists of two backup generators in parallel. Only one generator is required to power the safety critical item, so the second is redundant and is there in case the first fails. Assume each generator has a failure-to-start probability of P(A) = P(B) = 0.0049 per demand [Vesely et al. 1994].
Assuming the two generators fail independently, the system fails only if both fail to start, so the probability of system failure is P(S) = P(A).P(B) = 2.4E-5 (on demand).
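The 2.4E-5 figure can be reproduced with a minimal Python sketch. The variable names are illustrative and not from the original post; the only inputs are the per-demand failure probabilities quoted above and the assumption that the two generators fail independently.

```python
# Failure-to-start probability per demand for each generator,
# as quoted in the post [Vesely et al. 1994].
p_a = 0.0049
p_b = 0.0049

# A 1-out-of-2 parallel system fails on demand only if both generators fail.
# Multiplying the two probabilities assumes the failures are independent.
p_system_independent = p_a * p_b

print(f"P(S), independent failures: {p_system_independent:.2e}")
```

The multiplication is only valid under the independence assumption, which is exactly what a common cause failure violates.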
Definition.
In simple terms, a CCF is the failure of multiple components from a shared event which has been transmitted through a coupling factor.
A formal definition is provided by NUREG/CR-5485 [Mosleh et al. 1998]:
A CCF event consists of component failures that meet four criteria:
- two or more individual components fail or are degraded, including failures during demand, in-service testing, or deficiencies that would have resulted in a failure if a demand signal had been received;
- components fail within a selected period of time such that success of the PRA mission would be uncertain;
- component failures result from a single shared cause and coupling mechanism; and
- a component failure occurs within the established component boundary.
It is generally accepted that Common Cause Failures do not include multiple component failures that result from a functional dependency which would already be modeled in a traditional fault tree or system reliability model. Instead, the concept recognizes that, particularly in redundant systems, a dependency exists between components that were manufactured by the same company, are maintained by the same person, or sit in the same location.
Using our example, a classic Common Cause Failure would be two generators maintained by the same person, who was following an incorrect maintenance procedure. If the first generator fails because of this mistake, it is highly likely that the second will also fail. The system reliability figure obtained above assumed that the failures of the two generators were independent. Unfortunately, our coupling factor (the maintenance person and the maintenance procedure) means that generators A and B are dependent on each other.
Other examples include a manufacturing defect at a parts supplier that results in defective air cleaners being installed on both generators, or a flood at the location that houses both generators.
Modeling CCF Events.
I’ll quickly show how CCF is treated in a fault tree to illustrate the magnitude of its effect on a system. To account for the CCF dependency between generators A and B, each basic failure event is divided into a CCF element (X_AB)** and an independent element (Ai, Bi).
**Note: there are numerous methods for estimating P(X_AB); see NUREG/CR-5485 or email me for a more comprehensive list.
P(X_AB) = 1.55E-4 [Wierman et al. 2007, p.78]
P(Ai) = P(Bi) = 4.745E-3
P(S) = P(X_AB) + P(Ai).P(Bi) – P(X_AB).P(Ai).P(Bi)
Using these figures, the revised probability of system failure can be calculated to be 1.77E-4 (on demand). The original estimate understated the probability of system failure by a factor of 7.4. WOW.
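The revised figure and the factor of 7.4 can be checked with a short Python sketch. The function and variable names are hypothetical, chosen for illustration; the input values and the P(S) expression are the ones given in this post.

```python
def system_failure_probability(p_ccf: float, p_ind_a: float, p_ind_b: float) -> float:
    """P(S) = P(X_AB) + P(Ai).P(Bi) - P(X_AB).P(Ai).P(Bi).

    The system fails if the common cause event occurs, or if both
    generators fail independently (union of two independent events).
    """
    p_both_independent = p_ind_a * p_ind_b
    return p_ccf + p_both_independent - p_ccf * p_both_independent


# Values quoted in the post.
p_total = 0.0049   # P(A) = P(B), total failure-to-start probability per demand
p_ccf = 1.55e-4    # P(X_AB), common cause element
p_ind = 4.745e-3   # P(Ai) = P(Bi), independent element

p_with_ccf = system_failure_probability(p_ccf, p_ind, p_ind)
p_without_ccf = p_total * p_total   # classical estimate assuming independence
underestimate_factor = p_with_ccf / p_without_ccf

print(f"P(S) with CCF:        {p_with_ccf:.3e}")
print(f"P(S) without CCF:     {p_without_ccf:.3e}")
print(f"Underestimate factor: {underestimate_factor:.1f}")
```

Note that the quoted values satisfy P(Ai) = P(A) - P(X_AB), and that P(X_AB) alone accounts for most of the revised system failure probability, which is why the independent-only estimate is so optimistic.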
So why is Common Cause Failure treated differently from any other dependency we would model in a fault tree? There are numerous reasons, including:
- The set of events which could be a common cause event is so vast that it would be impossible to include each as a discrete event in a reliability model, so these events are grouped together and modeled as a single type of event.
- Common cause failure events are so infrequent that it’s unlikely for exactly the same event to occur again, so in order to make estimates from historical events, the events are grouped into a single classification to create an empirical estimate of their effect.
It is important to recognize that CCF events are the ‘catch all’ for dependencies which are not explicitly modeled in your reliability model.
Consider the recent failure at the Fukushima Nuclear Power Plant in Japan. Despite multiple redundant methods of providing power to the plant in order to cool the reactors in an emergency, a single event caused all redundant systems to fail at once. Is this classified as a Common Cause Failure? The failure has the properties of a Common Cause Failure, but in terms of reliability modeling an earthquake event was already explicitly included within the model, so the event would not be included in quantifying the nuclear reactor CCF event probability.
Protecting Against Common Cause Failures
To defend your system against common cause failure, you need to either reduce your coupling factors or reduce the probability of the causes. To read more on this, I’ll refer you to NUREG/CR-5485.
Wrap Up
Common Cause Failure modeling is often limited to the nuclear industry and NASA. However, considering and protecting against Common Cause Failure is relevant to all systems. If you have any experience dealing with CCF, please comment below.
Mosleh, A., Rasmuson, D. & Marshall, F., 1998. Guidelines on Modeling Common Cause Failures in Probabilistic Risk Assessments, Washington DC: U.S. Nuclear Regulatory Commission. NUREG/CR-5485
Vesely, W.E., Uryasev, S.P. & Samanta, P.K., 1994. Failure of emergency diesel generators: a population analysis using empirical Bayes methods. Reliability Engineering & System Safety, 46(3), 221-229.
Werner, W., 1994. Results of recent risk studies in France, Germany, Japan, Sweden and the United States, Paris: OECD Nuclear Energy Agency. NEA/CSNI/R(1994)10