Golden Rules for Becoming a Leader in Reliability

Forming an ideal systems approach to designing new systems involves developing paradigms, standards, and design process models for developers to follow in their future design efforts. These paradigms are called “words of wisdom” or golden rules [1]. They become the guiding lights for your product management needs.

The following paradigms are the most important criteria for designing for very high reliability.

Paradigm 1: Always aim for Zero Failures.
Paradigm 2: Be courageous and “Just say no.”
Paradigm 3: Spend significant effort on systems requirements analysis.
Paradigm 4: If the solution costs too much money, develop a cheaper solution.
Paradigm 5: Design for Zero Defects in production, maintenance and repair.
Paradigm 6: Design for Prognostics and Health Monitoring (PHM) to minimize the number of surprise disastrous events or preventable mishaps.
Paradigm 7: Always analyze structure and architecture for reliability of complex software systems.
Paradigm 8: Taking no action is usually not an acceptable option.
Paradigm 9: If you stop using wrong practices, you are likely to discover the right practices.

Always Aim for Zero Failures

My books Design for Reliability [2] and Preventing Medical Device Recalls [3] cover several examples of teams achieving over 1000% return on investment when they use creative thinking. This approach not only accomplishes zero failures but does so at zero net cost, because the return on investment is so high. An example is a shaft design for a heavy-duty truck transmission. The shaft, which often failed within 2-3 years, was redesigned for zero failures over 20 years by changing the heating and cooling cycles in the heat-treating process in manufacturing. The cost was insignificant!
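The return figures quoted above follow the usual return-on-investment arithmetic, which can be sketched in a few lines. This is a minimal illustration: the function name `roi_percent` and the dollar figures are hypothetical and are not taken from the shaft example or from the cited books.

```python
def roi_percent(savings: float, investment: float) -> float:
    """Return on investment as a percentage: (savings - investment) / investment * 100."""
    if investment <= 0:
        raise ValueError("investment must be positive")
    return (savings - investment) / investment * 100.0


# Hypothetical figures for illustration only:
# a process change costing 5,000 that avoids 60,000 in failure costs.
print(roi_percent(savings=60_000, investment=5_000))  # 1100.0
```

When the investment is small relative to the avoided failure cost, as in a heat-treatment change, the percentage grows very quickly, which is why such fixes can be described as having zero net cost.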

Be Courageous and “Just Say No”

Paradigm 2 is to be courageous and “Just say no” to those who want to rush designs through the design review process without exercising due diligence and without taking steps to prevent catastrophic events. Say “No” at certain times during the system development design process to prevent possible future failures as they are discovered. Many organizations have a final design review, often called the Critical Design Review. This is the last chance to speak up. A very important heuristic to remember is “Be courageous and just say no.” The context here is that if the final design is presented with known and unknown reliability design issues, and everyone votes “yes” to the design approval without seriously challenging it, then your answer should be “no.” Why? Because there are almost always unresolved problems lingering in the minds of the team members, but they don’t speak up. They probably think that it is too late to interfere, or they want to go along with the groupthink, where everyone thinks alike.

Spend Significant Effort on Systems Requirements Analysis

Most system failures originate from bad requirements in specifications. The sources of most requirement failures are incomplete, ambiguous, and poorly defined device specifications. Designers leave out as many as 60% of the good requirements. Worse, systems are steadily becoming more complex, with too many interactions. The omissions result in expensive engineering changes made later, one at a time, a pattern called “scope creep.” Often robust changes cannot be made because the project is already delayed and there may not be resources for implementing new features. Look particularly hard for missing functions in the specifications.

If the Solution Costs Too Much Money, Develop a Cheaper Solution

The example of the NASA Challenger accident shows that, at very little cost, the seal could have been designed to fly at any temperature instead of only above the threshold of 40°F. The cost of installing a wire to heat the seal was probably less than $1000, while the accident cost millions of dollars and seven lives. The lesson to be learned is that a cross‐functional team should not accept any design without challenging it. Apparently this design (not to fly below 40°F) was not challenged enough.

Design for Zero Defects in Production, Maintenance and Repair

Philip Crosby, the former Senior Vice President (VP) at ITT and author of the famous book Quality Is Free, pioneered the zero defects standard. Crosby considered “zero defects” the only standard you needed. This applies even more to safety. It is a practice that aims to prevent defects and errors and to do the right things right the first time.

Design for Prognostics and Health Monitoring (PHM) to Minimize the Number of Surprise Disastrous Events or Preventable Mishaps

PHM technology and its application enhance system reliability, efficiency, availability, and effectiveness. In complex systems such as telecommunications and aerospace systems, most system failures are caused by fundamental limitations in design strength, mechanical degradation mechanisms, and too many software interactions. There is a need for innovative solutions for discovering hidden problems, which usually turn up in rare events as probabilistic, nondeterministic faults. Prognostic solutions for predictive maintenance, built on embedded sensors for health monitoring and predictive analytics within embedded processors, are an enabler for system reliability.
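One simple form of predictive analytics is to fit a trend to a monitored degradation signal and estimate when it will reach a limit. The sketch below assumes a roughly linear degradation and uses an ordinary least-squares line; the function names, the vibration readings, and the limit value are hypothetical, and a real PHM system would likely use richer models and sensor fusion.

```python
from typing import Optional, Sequence, Tuple


def fit_line(times: Sequence[float], values: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares straight-line fit; returns (slope, intercept)."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired samples")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("time stamps must not all be equal")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def hours_until_limit(
    times: Sequence[float], values: Sequence[float], limit: float
) -> Optional[float]:
    """Estimate hours from the last sample until the fitted trend reaches `limit`.

    Returns None when the trend is flat or moving away from the limit
    (assumes the monitored value degrades by increasing).
    """
    slope, intercept = fit_line(times, values)
    if slope <= 0:
        return None
    crossing_time = (limit - intercept) / slope
    return max(0.0, crossing_time - times[-1])


# Hypothetical vibration readings (mm/s) sampled every 100 operating hours.
hours = [0.0, 100.0, 200.0, 300.0]
vibration = [1.0, 1.5, 2.0, 2.5]
remaining = hours_until_limit(hours, vibration, limit=4.0)
print(f"Estimated hours until limit: {remaining:.0f}")  # about 300 for this data
```

Scheduling maintenance before the estimated crossing time turns what would have been a surprise failure into a planned event.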

Always Analyze Structure and Architecture for Safety of Complex Systems

Today's systems are connected directly and indirectly to many other systems. In large systems such as weapon systems, safety is linked to other systems, vehicles, soldiers, and satellites. There is an almost unlimited number of possible interactions. Tweaking in one place is bound to create some change in another place. As a result, all latent hazards are impossible to predict with high confidence. Unfortunately, by the time reliability engineers get involved, the structure and architecture are usually already chosen. Therefore, early involvement is critically important when dealing with large, complex systems.
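A rough count shows how quickly possible interactions grow with the number of connected elements. The sketch below is a simplification: it treats an interaction as any unordered group of two or more elements, which is an assumption made for illustration and not a definition from the text. The function names are hypothetical.

```python
from math import comb


def pairwise_interactions(n: int) -> int:
    """Number of distinct pairs among n elements: n * (n - 1) / 2."""
    return comb(n, 2)


def multiway_interactions(n: int) -> int:
    """Number of groups of two or more elements: 2**n - n - 1."""
    return 2 ** n - n - 1


for n in (5, 10, 20):
    print(n, pairwise_interactions(n), multiway_interactions(n))
# 5 10 26
# 10 45 1013
# 20 190 1048555
```

Even at 20 elements the number of possible groupings exceeds a million, which is why exhaustive hazard prediction is impractical and architectural choices made early matter so much.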

Develop a Comprehensive Safety Training Program to Include Handling of Systems by Operators and Maintainers

Development of a complete safety training program for certifying the operators and maintainers requires not only recognizing the components and subsystems but also understanding the total system. Many safety training programs are focused only on the subsystem training. When this occurs, it means that the certification of the person operating or maintaining the equipment is limited. The operating and maintenance personnel may therefore not realize that the total system can be affected by mistakes and hidden hazards.

Taking No Action Is Usually Not an Acceptable Option

Sometimes, teams cannot come up with a viable solution. They take no action, or postpone the action in the hope that the problem will be solved over time. This may be done out of denial, out of fear of taking action, or out of fear of upsetting superiors in management. Meanwhile, a product with a potential fault finds its way to the market. Whatever the case, all stakeholders, including the customer, become victims. The goal is to prevent customers from becoming victims and casualties of poor design practices.

If You Stop Using Wrong Practices, You Are Likely to Discover the Right Practices

The cause of many product recalls is insufficient knowledge of what needs to be done before the product development begins. Encourage everyone on a design team to identify wrong things. Ask every team member to identify at least five wrong things. They are always able to do so. With a good leader, this can be done with each team member. Then, create a plan to stop working on the wrong things, and replace them with the right things. In one case, employees came up with 22 solutions for a single design problem. About five of them had more than a 600% ROI. Almost always, the right things seem to just appear by themselves.

REFERENCES
[1] Gullo, Louis, and Dixon, Jack, Design for Safety, Wiley, 2018
[2] Raheja, Dev, and Gullo, Louis, Design for Reliability, Wiley, 2012
[3] Raheja, Dev, Preventing Medical Device Recalls, Taylor & Francis, 2011