One of the primary questions we answer as reliability engineers is:
How long will it last?
Reliability prediction is a forecast that attempts to quantify the time to failure, the expected future failure rate, warranty claims, or required spare parts.
We need to know as we make decisions today about the design or purchase.
The best prediction is done after everything has failed. Ship all your products and track the actual failures over time. Unfortunately, this tells us what happened only after it has happened.
We need the information to make decisions so we can change the design, stock more spares, or set customer expectations appropriately.
Methods of Prediction and Costs
Forecasting the future is difficult. Engineers have taken available information and knowledge along with a range of techniques to create reliability predictions.
Methods range from a simple guess based on engineering judgment to detailed physics of failure modeling for specific failure mechanisms and environments.
Early in the concept phase of any design, we may ask, “Will this last long enough?” Based on engineering judgment we select basic design elements, materials, and architecture.
Largely based on engineering judgment, a simple guess helps to uncover basic information about the use environment and customer expectations, along with basic technology capabilities.
A simple guess has a large degree of uncertainty. It is the fastest (just the duration of a considered opinion) and least expensive prediction to create.
Parts Count Prediction
In the 1970s and 1980s, HP and other organizations would diagnose product failures to the component level. They then kept track of component-specific failure rates and the types of applications or environments.
These databases of failure information provided a viable means to estimate future designs.
Large databases such as MIL-HDBK-217 or Bellcore (now Telcordia) collected failure information across broad groups of products and environments.
Most organizations no longer conduct detailed failure analysis, opting instead to quickly replace or repair the product for the customer, so the source of information for these databases has diminished.
The basic idea is that each component contributes a possible failure, so adding the component failure rates gives an estimate of the product failure rate. This approach is full of inaccuracies and faulty assumptions.
Unless you use an accurate database, the results are not much better than a simple guess, yet they do assist in identifying potential reliability issues in a design.
To conduct the study, one needs access to an appropriate failure rate database, vendor data to fill in missing values, and a little time to tally the failure rates. With a reasonable database, it may take one or two days to create a prediction.
It is a small investment that yields little improvement in prediction accuracy.
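As a rough illustration of the tally, the sketch below sums component failure rates and converts the total to an MTBF and a reliability value. It assumes constant failure rates and a series structure (any single component failure fails the product), which are the simplifying assumptions the method rests on. The part names, quantities, and FIT values are hypothetical and do not come from any database.

```python
import math

# Hypothetical bill of materials: (part, quantity, failure rate in FIT).
# 1 FIT = 1 failure per 1e9 device-hours. All values are illustrative only.
BOM = [
    ("ceramic capacitor", 40, 0.5),
    ("resistor", 60, 0.2),
    ("microcontroller", 1, 25.0),
    ("electrolytic capacitor", 4, 15.0),
    ("connector", 3, 5.0),
]

def parts_count_fit(bom):
    """Sum quantity * failure rate over all parts (series, constant-rate assumption)."""
    return sum(qty * fit for _, qty, fit in bom)

def mtbf_hours(total_fit):
    """Mean time between failures implied by a constant failure rate in FIT."""
    return 1e9 / total_fit

def reliability(total_fit, hours):
    """R(t) = exp(-lambda * t) for a constant failure rate."""
    return math.exp(-total_fit * 1e-9 * hours)

total = parts_count_fit(BOM)
print(f"Total failure rate: {total:.1f} FIT")
print(f"MTBF: {mtbf_hours(total):,.0f} hours")
print(f"Reliability at 5 years (43,800 h): {reliability(total, 43_800):.4f}")
```

Sorting the parts by their contribution to the total is often the more useful output, since it points to the components worth a closer look.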
Similar Product History
As organizations stopped conducting detailed failure analysis on every product failure, they did continue to track product performance.
Since most products are variations of existing products, the prediction approach of using similar products as a foundation, then supplementing with other prediction methods for the new elements, is a reasonable approach.
For example, if a new design includes the existing power supply of an existing product, we can use the field failure rate information for that power supply in the new product.
One must consider the power supply environment and load in the new design, and whether any changes in conditions affect its reliability performance. It is a good approach to narrow down the areas needing a detailed analysis.
Of course, this is only viable if you have the information on previous products and sufficient failure rate detail to subsystems. For completely new designs it is not possible.
This approach is comparable in cost to a parts count prediction and generally has better accuracy. Like a parts count approach, it may take one or two days to track down and tally the prediction.
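A minimal sketch of reusing field history for a carried-over subsystem follows. It assumes a constant failure rate and that the new product stresses the subsystem the same way the old one did, which is exactly the assumption to check before trusting the number. The failure counts, fleet sizes, and operating hours are hypothetical.

```python
def field_failure_rate(failures, units, hours_per_unit):
    """Point estimate of failures per unit-hour from field history."""
    return failures / (units * hours_per_unit)

def expected_failures(rate, units, hours_per_unit):
    """Expected number of failures for a fleet, given a constant failure rate."""
    return rate * units * hours_per_unit

# Hypothetical field history of the existing power supply.
rate = field_failure_rate(failures=18, units=12_000, hours_per_unit=6_000)

# Hypothetical first-year fleet of the new product using the same power supply.
first_year = expected_failures(rate, units=5_000, hours_per_unit=8_760)

print(f"Field failure rate: {rate:.2e} failures per unit-hour")
print(f"Expected power supply failures in year one: {first_year:.1f}")
```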
Weakest Element Estimate
Years ago I learned about a tape backup product. Like any electromechanical system, it has many possible failure mechanisms.
If it is assembled, transported, and installed correctly, the dominant failure mechanism is read/write head wear due to tape abrasion. When the head wears down enough, the device can no longer read or write.
Each foot of tape dragged across the head creates a predictable amount of wear, and the organization accurately predicted the time to failure based on the amount of use (i.e., feet of tape moved across the head).
Field data on similar models verified that the number one reason for product failure was head wear. The design team optimized the wear and performance, yet head wear remained the primary failure mechanism for the product.
In this kind of situation, the team has the benefit of being able to focus on one mechanism to create a product prediction. The other elements of the system just have to last longer than the head.
This approach to prediction reduces the amount of work and experimentation and provides an accurate life estimate.
Of course, changes to the materials, tape, speed, tension, and other variables affecting the head wear will change the relationship. Plus, the team must maintain due diligence with the reliability of other elements to avoid creating a new weakest link.
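The wear-based estimate can be sketched as a simple linear model: a fixed amount of wear per foot of tape, and failure when the total reaches an allowable limit. The linear form follows the description above, while the wear limit, wear rate, and daily usage below are hypothetical values chosen only to show the arithmetic.

```python
def feet_to_failure(allowable_wear_um, wear_per_foot_um):
    """Feet of tape that can pass the head before wear reaches the limit."""
    return allowable_wear_um / wear_per_foot_um

def years_to_failure(feet_limit, feet_per_day):
    """Convert a tape-footage limit into calendar years for a given usage rate."""
    return feet_limit / feet_per_day / 365.0

# Hypothetical values: 20 micrometers of allowable head wear,
# 2e-6 micrometers of wear per foot of tape, 5,000 feet of tape per day.
limit_feet = feet_to_failure(allowable_wear_um=20.0, wear_per_foot_um=2e-6)
life_years = years_to_failure(limit_feet, feet_per_day=5_000)

print(f"Tape footage to failure: {limit_feet:,.0f} feet")
print(f"Estimated head life: {life_years:.1f} years")
```

Because usage is an input, the same model gives different life estimates for a lightly used and a heavily used drive, which is useful when setting customer expectations.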
Reliability Block Diagram
Block diagrams are an organizational tool for considering contributions to the system reliability. Appearing like an organization chart, each block is a subsystem or element of a product.
Let’s say we have a desktop computer as the product. The top block is the system, and below it are five blocks that represent the power supply, hard drive, motherboard, display, and keyboard.
There are block diagram structures for series, parallel, and complex systems. Each block includes the reliability of that element. Depending on the reliability-wise structure, the calculations to determine the system reliability differ.
While not a prediction method on its own, an established block diagram provides the ability to compare design options (say two hard drives with different expected reliability performance) and their impact on the system reliability.
In one regard it is like the parts count method with the added benefit of being able to account for parallel reliability structures.
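For the desktop computer example, the series and parallel calculations can be sketched as follows. The block reliabilities are hypothetical, failures are assumed independent, and the parallel case assumes either hard drive alone is enough to keep the system working.

```python
from math import prod

def series(reliabilities):
    """All blocks must work: R = product of the block reliabilities."""
    return prod(reliabilities)

def parallel(reliabilities):
    """At least one block must work: R = 1 - product of the block unreliabilities."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

# Hypothetical one-year reliabilities for each block.
blocks = {
    "power supply": 0.99,
    "hard drive": 0.95,
    "motherboard": 0.98,
    "display": 0.97,
    "keyboard": 0.995,
}

baseline = series(blocks.values())

# Option A: swap in a more reliable (hypothetical) hard drive.
option_a = series({**blocks, "hard drive": 0.98}.values())

# Option B: two of the original hard drives in a redundant (parallel) pair.
mirrored = parallel([blocks["hard drive"], blocks["hard drive"]])
option_b = series({**blocks, "hard drive": mirrored}.values())

print(f"Baseline system reliability: {baseline:.4f}")
print(f"Option A (better drive):     {option_a:.4f}")
print(f"Option B (mirrored drives):  {option_b:.4f}")
```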
Life testing takes many forms, and it is beyond the scope of this article to describe them all. You can find books on accelerated life testing that provide detailed guidance.
As a prediction method, conducting an experiment is expensive and, when done well, accurate. The basic requirement is to know the failure mechanisms of interest and how the applied stress relates between experimental and use conditions.
Errors or poor assumptions can make this method very inaccurate, so take care when designing, conducting and analyzing reliability experiments or tests.
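One common way to relate test stress to use conditions, for failure mechanisms driven by temperature, is the Arrhenius model. The article does not name a specific model, so treat the sketch below as one possible choice. The activation energy and temperatures are hypothetical, and a wrong activation energy is precisely the kind of poor assumption that makes the result inaccurate.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K

def arrhenius_af(activation_energy_ev, use_temp_c, test_temp_c):
    """Acceleration factor of the test temperature relative to the use temperature."""
    use_k = use_temp_c + 273.15
    test_k = test_temp_c + 273.15
    return math.exp(
        (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / use_k - 1.0 / test_k)
    )

# Hypothetical values: 0.7 eV activation energy, 40 C in use, 85 C in test.
af = arrhenius_af(0.7, use_temp_c=40.0, test_temp_c=85.0)
test_hours = 1_000

print(f"Acceleration factor: {af:.1f}")
print(f"{test_hours} test hours represent about {af * test_hours:,.0f} use hours")
```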
Physics of Failure Modeling
Research and modeling have enabled the characterization of failure mechanisms in great detail. While not every possible failure mechanism is covered, many are well known and have a physics of failure (PoF) model.
Simple models include solder joint fatigue formulas based on the Coffin-Manson relationship. Complex models may require finite element tools to model completely.
Technical literature provides details for PoF models. If no model exists, you may need to conduct experiments to fully characterize the relationship between stress and life performance.
Once you establish a PoF model, you have the ability to consider changes in the environment, stress load, structure, material set, or dimensions (variables affecting the failure mechanism) to create reliability predictions.
While PoF models take time and are expensive to create, they provide more flexibility and accuracy than the other methods.
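As an illustration of a simple PoF model, the sketch below uses a simplified Coffin-Manson form for thermal cycling, in which cycles to failure scale with the temperature swing raised to a negative exponent. Fuller variants add terms such as cycle frequency and maximum temperature, which are ignored here. The exponent and all temperatures and cycle counts are hypothetical placeholders; real values come from the literature or from your own experiments for the specific solder and package.

```python
def coffin_manson_af(delta_t_test, delta_t_use, exponent):
    """Acceleration factor for thermal cycling: (dT_test / dT_use) ** n."""
    return (delta_t_test / delta_t_use) ** exponent

def use_cycles_to_failure(test_cycles_to_failure, af):
    """Translate cycles to failure observed in test into cycles at use conditions."""
    return test_cycles_to_failure * af

# Hypothetical values: 100 C swing in test, 40 C swing in use, exponent n = 2.0.
af = coffin_manson_af(delta_t_test=100.0, delta_t_use=40.0, exponent=2.0)
cycles = use_cycles_to_failure(1_000, af)

print(f"Acceleration factor: {af:.2f}")
print(f"Estimated use cycles to failure: {cycles:,.0f}")
```

Changing the use temperature swing and rerunning the calculation shows how a model like this supports "what if" questions about a new environment.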
Sources of Value
We create reliability predictions to support decisions. Predictions can help to understand:
- Is the product reliable enough?
- Is the product going to meet our reliability targets?
- How many spares will we need over the next year?
Predictions also help with many other specific questions when we’re not able to wait for actual results to occur.
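The spares question can be sketched with a Poisson model: estimate the expected number of failures over the year, then find the smallest stock that covers demand with a chosen confidence. This assumes a constant failure rate, independent failures, and no resupply during the year. The fleet size, failure rate, and confidence level are hypothetical.

```python
import math

def expected_failures(units, failure_rate_per_hour, hours):
    """Expected number of failures for a fleet with a constant failure rate."""
    return units * failure_rate_per_hour * hours

def spares_needed(mean_failures, confidence=0.95):
    """Smallest stock s such that P(failures <= s) >= confidence (Poisson model)."""
    term = math.exp(-mean_failures)  # P(X = 0)
    cdf = term
    stock = 0
    while cdf < confidence:
        stock += 1
        term *= mean_failures / stock  # P(X = stock) from P(X = stock - 1)
        cdf += term
    return stock

# Hypothetical fleet: 500 units, 2e-6 failures per hour, one year of operation.
mean = expected_failures(units=500, failure_rate_per_hour=2e-6, hours=8_760)
stock = spares_needed(mean, confidence=0.95)

print(f"Expected failures over the year: {mean:.2f}")
print(f"Spares to stock for 95% coverage: {stock}")
```

Stocking only the expected number of failures leaves a substantial chance of running out, which is why the confidence level matters as much as the failure rate.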
Making Informed Decisions
Consider the simple example of comparing two vendors of hard drives. An accurate reliability prediction for your application enables selecting the most cost-effective and reliable product to meet your business goals.
Without a prediction, we may select a hard drive that fails too early and limits the product’s reliability. Or we may select one that is too reliable and expensive for our application.
Identifying and Improving Design Weaknesses
During product or system design, the ability to identify the elements that will lead to failures (weakest links) permits us to focus resources on those areas for improvement.
Without knowing where to focus we may have more failures than anticipated along with the remorse of “if we only knew”.
Allocating Resources Appropriately
Beyond focusing efforts for reliability improvements, predictions also divert resources away from elements that are very reliable already.
Prediction also focuses product testing on the failure mechanisms most likely to limit product reliability, and may limit the costs of prototypes and testing facilities.
Using reliability block diagrams and similar models allows the balancing of reliability with component costs to optimize reliability at minimum cost.
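One way to picture this balancing is a small search over component options: for each block, list the candidate parts with their predicted reliability and cost, then pick the cheapest combination that still meets the system target. The sketch assumes a series system with independent failures, and every reliability, cost, and target below is hypothetical.

```python
from itertools import product
from math import prod

# Hypothetical options per block: (predicted reliability, unit cost).
OPTIONS = {
    "power supply": [(0.990, 30.0), (0.998, 45.0)],
    "hard drive": [(0.950, 50.0), (0.980, 80.0)],
    "motherboard": [(0.980, 90.0), (0.995, 120.0)],
}

def cheapest_design(options, target):
    """Return (choices, cost, reliability) of the cheapest series design meeting target.

    Returns None when no combination reaches the target.
    """
    names = list(options)
    best = None
    for combo in product(*(options[name] for name in names)):
        system_reliability = prod(rel for rel, _ in combo)
        cost = sum(price for _, price in combo)
        if system_reliability >= target and (best is None or cost < best[1]):
            best = (dict(zip(names, combo)), cost, system_reliability)
    return best

best = cheapest_design(OPTIONS, target=0.95)
if best is None:
    print("No combination meets the target")
else:
    choices, cost, system_reliability = best
    for name, (rel, price) in choices.items():
        print(f"{name}: reliability {rel}, cost {price}")
    print(f"Total cost {cost}, system reliability {system_reliability:.4f}")
```

Exhaustive search is fine for a handful of blocks; a larger design would need a smarter search, but the trade-off being made is the same.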
An accurate prediction is useful inside the company to forecast warranty and repair costs. Outside the company, predictions are useful to customers:
- Making purchase decisions
- Planning larger systems using the product reliability information for modeling
- Estimating total cost of ownership, including spares, downtime and maintenance costs
Everything from the data sheet, to supporting white papers on reliability prediction claims, to warranty policies, to customer perception of the brand promise affects the value of reliability predictions.
Finally, creating a prediction and comparing it to field reliability performance allows the organization to improve next time, first by using the field data for similar products, and then by refining any models created for reliability predictions.