Futility of Using MTBF to Design an ALT
Let’s say we want to characterize the reliability performance of a vendor’s device. We’re considering including the device in our system if, and only if, it will survive 5 years reasonably well.
The vendor’s data sheet lists an MTBF value of 200,000 hours. A call to the vendor and a search of their site don’t reveal any additional reliability information. MTBF is all we have.
We don’t trust it. Which is wise.
Now we want to run an ALT to estimate a time to failure distribution for the device. The intent is to use an acceleration model to accelerate the testing and a time to failure model to adjust to our various expected use conditions.
Given the device, a small interface module with a few buttons, electronics, a display and enclosure, and the data sheet with MTBF, how can we design a meaningful ALT?
What to Measure
The data sheet, and the functions of our system that rely on this device, define a range of possible elements to measure. We could measure display brightness, button functionality, response times, life of the electronics, etc.
Before selecting what to measure in the ALT, we need to stop and ask what will limit the life of the device in our application? The provided reliability information doesn’t say. It just says the device has a suspiciously round number MTBF value of 200k hours.
An FMEA, risk analysis, or discussion with the development engineers may narrow down which elements of the device will likely fail first. If time and resources permit, running a HALT to find weaknesses (identify failure mechanisms) may be in order. Again, just having MTBF doesn’t help.
Which Stress to Apply
Knowing the failure mechanism most likely to cause the device to fail is an essential first step in selecting the appropriate stress (temperature, vibration, power cycling, etc.) to accelerate that failure mechanism.
Not every failure mechanism responds to an increase in temperature. Applying the wrong stress will lead to poor results.
The data sheet might list some environmental or operating limits (power, voltage, temperature, etc.). Those may be clues to which stresses are worth exploring for how they lead to failures.
As when determining what to measure, we need to sort out which stress, or stresses, provide a means to accelerate the failure mechanism of interest.
Acceleration Model
Let’s say we estimate a rubber seal around the display is likely to fail and could be accelerated using higher temperatures.
Instead of the normal operating temperature of 25°C, let’s raise it to 50°C. Ok, so? How much acceleration does that change in temperature cause? That is why we need an acceleration model.
The temperature increase might increase the rate of the chemical reaction between the material and oxygen, and we can use the Arrhenius model if we know or can estimate the activation energy.
Or, the temperature increase may increase the compression of the seal creating a mechanical deformation and damage over time. Here I’m not sure what model to use, yet the Arrhenius model would likely not be useful.
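For the oxidation case, a minimal sketch of the Arrhenius acceleration factor between 25°C and 50°C follows. The activation energies are hypothetical placeholders, not values for any real seal material, and the function name is only for illustration.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K

def arrhenius_acceleration_factor(ea_ev, use_temp_c, stress_temp_c):
    """Acceleration factor between use and stress temperatures (Arrhenius)."""
    use_k = use_temp_c + 273.15        # convert Celsius to kelvin
    stress_k = stress_temp_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / use_k - 1.0 / stress_k))

# Hypothetical activation energies; the real value depends on the seal material
# and on the specific chemical reaction driving the degradation.
for ea in (0.5, 0.7, 1.0):
    af = arrhenius_acceleration_factor(ea, use_temp_c=25.0, stress_temp_c=50.0)
    print(f"Ea = {ea:.1f} eV -> acceleration factor of about {af:.1f}")
```

Across these assumed activation energies the factor spans roughly 4x to 20x, which is why a guess at the activation energy is a weak basis for planning test duration.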
Of course, knowing MTBF provides no information on failure mechanisms other than to suggest the failures are repairable to keep the system running.
Time to Failure Model
Given MTBF we may assume the system has a constant failure rate, or not. Remember all life distributions have a mean value. Knowing the MTBF value doesn’t automatically imply a constant failure rate.
Therefore, if we assume an exponential distribution describes the time to failure pattern, we may be wrong, and most likely would be wrong.
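To illustrate how little the mean alone says, here is a rough sketch that holds the mean at 200,000 hours and varies the shape of a Weibull distribution. The Weibull choice, the shape values, and the 8,760 hours per year of continuous operation are all assumptions made for illustration.

```python
import math

MTBF_HOURS = 200_000.0        # from the vendor's data sheet
MISSION_HOURS = 5 * 8760.0    # assumption: 5 years of continuous operation

def weibull_reliability_given_mean(t, mean, beta):
    """R(t) for a Weibull distribution with the given mean and shape beta."""
    eta = mean / math.gamma(1.0 + 1.0 / beta)  # scale that reproduces the mean
    return math.exp(-((t / eta) ** beta))

# Hypothetical shapes: 0.5 (early failures), 1.0 (exponential), 3.0 (wear out)
for beta in (0.5, 1.0, 3.0):
    r = weibull_reliability_given_mean(MISSION_HOURS, MTBF_HOURS, beta)
    print(f"beta = {beta}: 5-year reliability = {r:.3f}")
```

Under these assumptions the same MTBF corresponds to 5-year survival anywhere from about one half to over 99 percent, depending only on the shape.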
Is the failure arrival rate decreasing or increasing? Knowing only MTBF, we can’t tell.
Knowing the failure mechanism and how an appropriate stress changes the failure rate is a great start. The design of the ALT includes sample sizes and how and when to make measurements. Knowing the expected pattern of failures given our samples allows us to monitor for failures at appropriate times.
Knowing the inverse of the average failure rate doesn’t really help us know when to expect failures to occur. This hampers our ability to design an efficient ALT.
Problems with MTBF Based Reliability Testing Formulas
An astute reader would probably wonder why we’re not using either time or failure truncated test planning and analysis. We have MTBF and that is all we need to design such life tests.
Well, the MTBF value is given and defines the testing. It doesn’t allow us to estimate the time to failure distribution. It may reveal if a system has poorer reliability than expected, yet not if it is better. Nor does such testing permit evaluation or understanding of the pattern of failures.
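As a sketch of how the MTBF value alone defines a time truncated test, the zero-failure case under an assumed exponential distribution reduces to total unit-hours = -MTBF × ln(1 - confidence). The 90% confidence level and the 20-unit sample size below are assumptions for illustration.

```python
import math

def zero_failure_test_hours(mtbf_hours, confidence):
    """Total unit-hours with zero failures needed to demonstrate mtbf_hours
    as a lower bound at the given confidence (exponential assumption)."""
    return -mtbf_hours * math.log(1.0 - confidence)

total = zero_failure_test_hours(200_000.0, confidence=0.90)  # assumed confidence target
units = 20                                                   # hypothetical sample size
print(f"Total unit-hours required: {total:,.0f}")
print(f"Hours per unit with {units} units: {total / units:,.0f}")
```

The plan depends only on the claimed MTBF and the confidence level. Nothing in it refers to a failure mechanism or to the shape of the time to failure distribution.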
MTBF based testing also assumes a constant failure rate. This means running 1,000 units for 20 hours or 20 units for 1,000 hours gives the same result. If the failure mechanism is wear out or a chemical degradation, then we are more likely to have failures in the units that run longer, and no or few failures in the group that runs for only a few hours.
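The two test plans can be compared with a short sketch. Both accumulate 20,000 unit-hours. The wear-out parameters (Weibull shape of 3 and a characteristic life of 2,000 hours under test conditions) are hypothetical values chosen only to make the contrast visible.

```python
import math

def expected_failures(units, hours, eta, beta):
    """Expected failures among `units` run for `hours` each (Weibull, no replacement)."""
    return units * (1.0 - math.exp(-((hours / eta) ** beta)))

plans = {"1,000 units x 20 h": (1000, 20.0), "20 units x 1,000 h": (20, 1000.0)}

# Exponential case: beta = 1 with the data sheet MTBF as the scale.
# Wear-out case: beta = 3 and eta = 2,000 h are hypothetical illustration values.
models = {"exponential": (200_000.0, 1.0), "wear out": (2_000.0, 3.0)}

for label, (eta, beta) in models.items():
    for plan, (units, hours) in plans.items():
        n = expected_failures(units, hours, eta, beta)
        print(f"{label:12s} {plan}: {n:.3f} expected failures")
```

With a constant failure rate the two plans are nearly indistinguishable. With the assumed wear-out behavior, essentially all the expected failures fall in the group that runs for 1,000 hours.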
This approach is only appropriate if you know, without doubt, the dominant failure mechanism is best described by an exponential distribution and has an equal chance of failure every single hour of operation. If this is not a certainty, then running 20 or 1,000 units till you have sufficient failures to estimate the time to failure distribution is prudent.
Summary
Running an ALT is expensive. Let’s get the design of the ALT right. That starts by ignoring MTBF claims by vendors, and getting to know the failure mechanisms.
Tim Gaens says
Next question would be:
How many MTBF did you prove with your ALT?
Fred says
Not sure I understand the question, Tim. Given that ALT’s tend to examine wear out type failure mechanisms, MTBF would not be a suitable metric to use.
Tim Gaens says
Sorry Fred,
Was just playing the manager role here.
I’m still having a hard time getting people away from MTBF.
I agree on the article. Thanks for sharing.
For a manager it is easier to understand MTBF, it simplifies stuff, but it is wrong.
I’d like to see more examples of how it should be done as common practice.
e.g. in your article there are still a lot of assumptions to be made for the Acceleration Model (and they probably always need to be made, after FMEA, product history, field data from similar products, …)
e.g. should we ask our suppliers for failure mechanisms? (and a failure rate for each mechanism?)
Fred says
No worries Tim, yes ask for failure mechanisms and models (not failure rates) to estimate failure rates given your particular set of environmental and use stresses. cheers, Fred
Tim Gaens says
Do you know component suppliers that can provide this?
Tim Gaens says
Rephrase, “willing to” provide this.
Fred says
Hi Tim,
Over the years I’ve worked with many vendors that can and did supply detailed failure mechanism and associated models. Fans, bearings, memory, IGBTs, etc.
If you don’t ask, you will probably only get MTBF… so ask.
Cheers,
Fred
Larry George says
Thanks to Fred for raising a valid question. Here is a starting suggestion.
Want to list failure modes in order of criticality? Not RPNs or other armchair guesses.
Use library of failure rates from MIL-HDBK-217F or MIL-HDBK-217G (George, from your field data) plus library of failure mode probabilities from MIL-HDBK-338B or your own experience? Combine it all with FMERD (workbook), “Failure Modes and Effects Reliability Diagnostics” latest revision May 2016. It ranks alternative failure modes on the basis of criticality, using your field experience if available supplemented with handbook values.
Available from pstlarry@yahoo.com.