
Definition of Reliability
The concept of Reliability is often misused, misunderstood, and misinterpreted. In its academic sense, Reliability is defined as the probability that a system will perform its intended function for a specified mission time and under specified process conditions. In essence, then, it is a probability.
The time variable is crucial in calculating the Reliability of a system. Reliability is the Probability of Success, that is, 1 minus the Probability of Failure:
R(t) = 1 – F(t) (where F = Failure Probability)
The Link between Reliability and Safety
The basis of the Reliability model is a statistical distribution. An example of the model construction is illustrated in Diagram 1 below. A number n of centrifugal pumps are run on a test bench until they fail. Each interval to failure is recorded and, once all the failure records are collected, a normalized frequency graph can be constructed. This frequency graph (an approximation of the Probability Density Function) helps define the statistical distribution that best represents the life cycle of a typical pump in the population. By accumulating the distribution up to a defined mission time, the probability of failure by that mission time can be derived.
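The pump test described above can be sketched in a few lines of Python. This is only a minimal illustration: the hours-to-failure values are hypothetical, and the function name empirical_reliability is an assumption made for this example. It uses the raw failure records directly (the empirical distribution) instead of a fitted statistical distribution, which a real analysis would normally prefer.

```python
def empirical_reliability(failure_times, mission_time):
    """Estimate R(t) as the fraction of tested units still running at mission_time."""
    if not failure_times:
        raise ValueError("at least one failure time is required")
    # Empirical F(t): share of units that failed at or before the mission time
    failed = sum(1 for t in failure_times if t <= mission_time)
    failure_probability = failed / len(failure_times)
    # R(t) = 1 - F(t)
    return 1.0 - failure_probability


# Hypothetical hours-to-failure for n = 10 pumps on a test bench
hours_to_failure = [1200, 1850, 2300, 2750, 3100, 3400, 3900, 4500, 5200, 6100]

r_3000 = empirical_reliability(hours_to_failure, 3000)
print(f"R(3000 h) = {r_3000:.2f}")
```

With these made-up records, four of the ten pumps have failed by 3000 hours, so the estimated Reliability for a 3000-hour mission is 0.60.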
Generally speaking, Reliability deals with Probabilities of Failure or the absence thereof. Avoiding failures is essential for the safety of workers in the field, and minimizing the Probability of any Failure is key to a safer environment. However, we first need to measure the Reliability, or the Probability of Failure, of a system in order to know where we stand.
Two examples using Reliability calculations for safety purposes are provided below. There are many more.
Using Reliability Calculations for Safety Risk Analysis
Risk is a product of the probability of failure (F) and the associated consequence (C).
Risk = F x C
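As a small worked illustration of this product, the sketch below multiplies a probability of failure by a monetary consequence. Both input values and the function name risk are hypothetical, chosen only to show the arithmetic.

```python
def risk(failure_probability, consequence):
    """Risk = F x C, with F a probability between 0 and 1."""
    if not 0.0 <= failure_probability <= 1.0:
        raise ValueError("failure_probability must be between 0 and 1")
    return failure_probability * consequence


# Hypothetical inputs: 2% chance of failure over the period considered,
# and a consequence valued at 500,000 (in any currency)
failure_probability = 0.02
consequence_cost = 500_000

expected_loss = risk(failure_probability, consequence_cost)
print(f"Risk = {expected_loss:,.0f}")
```

The result is an expected loss over the period considered, which is one reason a purely monetary reading becomes uncomfortable once the consequence involves people.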
The risk value is typically a number, often a monetary amount. In the case of safety, however, and sadly when it comes to fatalities in particular, it is difficult to put a monetary value on human life. This is why Risk Matrices come into play: the risk value is read from an established corporate risk matrix, such as the one shown in Diagram 2 below.
In the Diagram 2 matrix, the Safety Risk for a Turnaround Event is established. The generic matrix is applied to the tasks deemed hazardous in the turnaround, for example vessel entry or welding next to a source of flammable material.
Using industry records and building statistical models, one can establish the probabilities in the first column.
Using Crow-AMSAA Plots to Evaluate Safety Performance
Crow-AMSAA Plots go by a variety of names, such as Reliability Growth Plots or Duane Plots. The term “Crow” comes from Dr Larry H. Crow, who built on the methodology pioneered by James T. Duane in the early 1960s (1). Crow successfully applied the method at the US Army Materiel Systems Analysis Activity (AMSAA). The technique has since found a large number of new applications in the field of Reliability Engineering.
In practical terms, the Crow-AMSAA technique most commonly involves plotting cumulative failures against cumulative time on a log-log scale, which results in straight-line plots (2). The slope of the line (the Beta value) indicates whether failure occurrences are improving (Beta below 1), constant (Beta equal to 1), or deteriorating (Beta above 1). Because the plots are straight lines, future failures can be forecast. In plain words: based on the current trend, when is the next failure expected to occur? The method handles mixed failure modes, so it is suitable for the complex nature of the generating units.
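The straight-line fit described above can be sketched as follows. This is a minimal illustration that assumes the usual power-law form N(t) = lambda x t^Beta and estimates the line by ordinary least squares on the log-log points; maximum likelihood estimators are also commonly used and can give somewhat different values. The event times and the function names fit_crow_amsaa and time_of_failure are hypothetical.

```python
import math


def fit_crow_amsaa(cumulative_times):
    """Fit log N = log(lam) + beta * log(t) by least squares.

    cumulative_times: cumulative time at which the 1st, 2nd, ... event occurred.
    Returns (lam, beta) for the model N(t) = lam * t**beta.
    """
    n = len(cumulative_times)
    if n < 2:
        raise ValueError("at least two events are needed to fit a line")
    xs = [math.log(t) for t in cumulative_times]
    ys = [math.log(i) for i in range(1, n + 1)]  # cumulative event count
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta = sxy / sxx                            # slope of the log-log line
    lam = math.exp(y_mean - beta * x_mean)      # intercept, back-transformed
    return lam, beta


def time_of_failure(k, lam, beta):
    """Cumulative time at which the fitted line reaches k events."""
    return (k / lam) ** (1.0 / beta)


# Hypothetical cumulative days at which each event was recorded
event_days = [30, 75, 140, 210, 300, 390, 500]

lam, beta = fit_crow_amsaa(event_days)
next_event_day = time_of_failure(len(event_days) + 1, lam, beta)
print(f"Beta = {beta:.2f}, next event expected around day {next_event_day:.0f}")
```

Extrapolating the fitted line to the next event count is what turns the plot into a forecast; the forecast is only as good as the assumption that the current trend continues.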
In the example below, we look at Lost Time Injury (LTI) records in relation to a Turnaround exercise in an industrial plant. A Turnaround typically involves a higher-than-usual number of workers on the industrial site doing non-routine jobs, often with high safety risks. Additionally, these temporary workers are often not familiar with the safety standards that the plant would usually hold. This leads to a higher-than-usual number of injuries.
The Crow-AMSAA plot in Graph 1 below shows a sharp inflection in LTI events as soon as the Turnaround starts. The Beta value, which represents the slope of the line, goes from less than 1 to around 5, indicating a sharp increase in the rate of Lost Time Injury events. The MTBF value (here, the mean time between LTI events) also shows that the time between events is divided by 10 compared with normal operation. Once the Turnaround is over, the safety performance trend goes back to normal.
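The comparison of time between events can be illustrated with a simple calculation. The sketch below uses hypothetical event days, not the data behind Graph 1, and a plain average interval per phase in place of the MTBF read from the Crow-AMSAA fit; the function name mean_time_between_events is an assumption made for this example.

```python
def mean_time_between_events(event_days):
    """Average interval between consecutive events within one phase."""
    if len(event_days) < 2:
        raise ValueError("at least two events are needed")
    ordered = sorted(event_days)
    return (ordered[-1] - ordered[0]) / (len(ordered) - 1)


# Hypothetical days on which Lost Time Injuries were recorded
normal_operation_days = [40, 130, 250, 340, 460]
turnaround_days = [470, 482, 490, 503, 510]

mtbe_normal = mean_time_between_events(normal_operation_days)
mtbe_turnaround = mean_time_between_events(turnaround_days)
ratio = mtbe_normal / mtbe_turnaround

print(f"Normal operation: {mtbe_normal:.1f} days between events")
print(f"Turnaround:       {mtbe_turnaround:.1f} days between events")
print(f"Time between events divided by {ratio:.1f}")
```

With these made-up records, the average interval drops from 105 days to 10 days, a ratio of about 10, which is the order of magnitude described above.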
In essence, the Turnaround event comes with a clear deterioration in Safety performance. The question here is: what can be done to improve this poor performance at the next Turnaround?
The data used to build this Crow-AMSAA plot is provided here.
Conclusion
Reliability Engineering concepts can definitely help with evaluating safety performance and, eventually, improving it. The examples above provide two types of indicators. The Risk Analysis is a leading indicator, with which the organisation prepares itself for an upcoming event, whereas the Crow-AMSAA plot is more of a lagging indicator: the Turnaround is over, and we use the records to evaluate performance and, if mandated, make improvements.
References
1 – Nigel Comerford, Crow/AMSAA Reliability Growth Plots, 16th Annual Conference 2005 – Rotorua.
2 – H. Paul Barringer, 2003, Predict Future Failures From Your Maintenance Records, Maintenance Engineering Society of Australia Speaker Tours 2003.