Why Reliability Needs Risk Management to Succeed
Guest Post by John Ayers (first posted on CERM ® RISK INSIGHTS – reposted here with permission)
Most of my career was spent in the Department of Defense (DOD) industry. Many of the programs I worked on included a fairly difficult reliability requirement. I was taught that reliability is designed into a system. I learned that a reliability requirement is verified by analysis. But for the system reliability to succeed, you also need to consider the manufacturing and installation of the system. This is where risk management comes into play to ensure system reliability requirements are met. This paper explains why.
A reliability requirement for a DOD development project is typically defined as availability. Availability is the probability that a system will work as required during the period of a mission. The mission could be the 18-hour span of an aircraft flight. The mission period could also be the 3- to 15-month span of a military deployment. Availability accounts for non-operational periods associated with reliability, maintenance, and logistics.
There are three qualifications that need to be met for a system to be available:
- Functioning: the system is not out of service for repairs or inspections
- Functioning under normal conditions: the system operates in an ideal setting at the expected rate
- Functioning when needed and operational at any time…
Reliability Design Analysis
The starting point of a reliability analysis is to create a model. The model includes every major component in the system as well as their reliability terms. The major terms are:
- Mean Time Between Failure (MTBF) is the average operating time between failures of a repairable product. It is the inverse of the failure rate, which is often quoted as the number of failures per million hours.
- Mean Time To Repair (MTTR) is the time needed to repair a failed hardware module. In an operational system, repair generally means replacing a failed hardware part.
- Mean Time To Failure (MTTF) is a basic measure of reliability for non-repairable systems. It is the mean time expected until the first failure of a piece of equipment.
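As a rough illustration of how these terms feed an availability figure, the sketch below uses the standard steady-state (inherent) availability relation, MTBF / (MTBF + MTTR), and converts an MTBF into failures per million hours. It assumes a constant failure rate and ignores logistics delays. The function names and the component values are hypothetical and do not come from any program described here.

```python
def failure_rate_per_million_hours(mtbf_hours: float) -> float:
    """Failures per 10^6 operating hours, assuming a constant failure rate."""
    return 1_000_000 / mtbf_hours


def inherent_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from MTBF and MTTR (logistics delays ignored)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical component: fails on average every 5,000 hours, takes 4 hours to repair.
rate = failure_rate_per_million_hours(5000.0)
avail = inherent_availability(5000.0, 4.0)
print(f"{rate:.1f} failures per million hours")  # 200.0 failures per million hours
print(f"availability = {avail:.5f}")             # availability = 0.99920
```

A shorter repair time raises availability even when MTBF is unchanged, which is why both terms appear in the model.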
The model is run numerous times. Each iteration involves making changes to the model, such as adding redundancy, eliminating single points of failure, and altering reliability terms, until the output of the model shows the availability requirement is met. The reliability requirements for the major components flow out of the model.
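The iteration described above can be sketched in a few lines. This is a simplified illustration, not the tool used on any real program: it assumes a series system in which every block must work, independent failures, identical redundant units with perfect switchover, and a rule that adds one redundant unit to the weakest block on each pass. All names and numbers are hypothetical.

```python
from math import prod


def block_availability(unit_availability: float, units: int) -> float:
    """Availability of identical units in parallel (any one working is enough)."""
    return 1.0 - (1.0 - unit_availability) ** units


def system_availability(blocks: dict) -> float:
    """Series system: the product of the block availabilities."""
    return prod(block_availability(a, n) for a, n in blocks.values())


def add_redundancy_until_met(blocks: dict, required: float, max_iterations: int = 20) -> dict:
    """Re-run the model, adding a redundant unit to the weakest block each pass."""
    blocks = dict(blocks)  # work on a copy so the caller's model is unchanged
    for _ in range(max_iterations):
        if system_availability(blocks) >= required:
            return blocks
        weakest = min(blocks, key=lambda name: block_availability(*blocks[name]))
        availability, units = blocks[weakest]
        blocks[weakest] = (availability, units + 1)
    raise ValueError("requirement not met within the iteration limit")


# Hypothetical model: block name -> (availability of one unit, number of units)
model = {"antenna": (0.999, 1), "power": (0.95, 1), "processor": (0.99, 1)}
result = add_redundancy_until_met(model, required=0.995)
print(result)
print(f"system availability = {system_availability(result):.4f}")
```

With these made-up numbers the loop duplicates the power and processor blocks and leaves the antenna as a single unit. The per-block unit counts and availabilities that come out of the loop are the kind of component-level requirements that flow out of the model.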
At this point, there is an analysis that shows the availability requirement has been met. But this is just a paper verification of meeting the requirement. We still have to manufacture, assemble, install, and test the system, and things can go wrong, as my example shows.
I was the lead for a red team review (independent review) of a preliminary design review (PDR) for a Radome. A Radome is a fabric dome that covers an antenna or radar system to protect it from the environment. The Radome had to survive a 200-mile-per-hour wind, a very extreme requirement.
The PDR did not go well. The availability analysis did not show the requirement was met. It was based on using the best known fabric for the Radome. A new fabric invention would be needed to meet the requirement. The red team failed the review, which meant the design team had to go back to the drawing board. The red team was disbanded. About a year and a half later, I heard that the Radome (which had been installed) ruptured during acceptance test. The investigation of the failure revealed that the rupture occurred because the Radome was over-tested (too much pressure) and was not fabricated properly. In other words, the failure was caused by mistakes made during the manufacturing and installation phases. I never found out how they met the availability requirement. My guess is they reduced the requirement.
How Risk Management Could Have Helped
If a risk assessment had been conducted for the manufacturing phase of the project, it most likely would have identified a number of risks associated with the fabrication of the Radome. The risks would have been analyzed, handled, and monitored/controlled. The same can be said for the installation phase. Assuming the mitigation plans had prevented the fabrication and over-testing mistakes, the reliability analysis could have been verified.
I will never know if the availability requirement would have been met when tested. The analysis showed it was met, but due to manufacturing and installation mistakes it was not proven. I think the main lesson learned is to always conduct a risk assessment for all phases of a project (in this case, the Radome). This includes doing one for the design assumptions as well. Many times, a bad design assumption causes project failure.
Reliability needs risk management to be successful. I think the example shows that. I have many more examples like it for a later time.
John earned a BS in Mechanical Engineering and MS in Engineering Management from Northeastern University. He has extensive experience with commercial and DOD companies. He is a member of PMI (Project Management Institute). John has managed numerous large, highly technical development programs worth in excess of $100M. He has extensive domestic and foreign subcontract management experience. John has held a number of positions over his career including: Director of Programs; Director of Operations; Program Manager; Project Engineer; Engineering Manager; and Design Engineer. He has experience with: design; manufacturing; test; integration; subcontract management; contracts; project management; risk management; and quality control. John is a certified six sigma specialist, and certified to level 2 EVM (earned value management). https://projectriskmanagement.info/
If you want to be a successful project manager, you may want to review the framework and cornerstones in my book. The book is innovative and includes unique knowledge, explanations and examples of the four cornerstones of project risk management. It explains how the four cornerstones are integrated together to effectively manage the known and unknown risks on your project.