5 Clues Using MTBF is Not Helping
Have you ever heard the claim that “We use MTBF, as it’s working just fine”?
They may be profitable and successful in the marketplace. Is MTBF serving them well?
One way to help the folks claiming MTBF is all right is to illustrate how a better reliability metric may improve on MTBF. Asking a few questions may find the inevitable chink in the MTBF armor.
- Any early failures due to supplier or assembly error?
- Any tests designed to check how long the system will last?
- Ever notice maintenance seems to improve or degrade system performance?
- Did you fix the time frame for the calculation of MTBF?
- Do you plot and track the month to month changes in MTBF?
The key is to find indications that they need to understand the changing nature of the underlying failure rates for their product or system. If the product or system only experienced failures that followed an exponential distribution pattern, all of the above questions would have a negative answer. MTBF, begrudgingly, may work.
Unfortunately, products, components, and systems rarely, if ever, have a time to failure distribution that is exponential. The arrival of failures tends to either taper off over time or increase over time. Failures rarely, and likely never, actually arrive at a steady pace.
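As a minimal sketch of what tapering off or increasing looks like, the Weibull hazard function gives the instantaneous failure rate for a shape parameter beta and scale parameter eta. The parameter values below are hypothetical and chosen only to show the three cases: beta below 1 (decreasing rate), beta equal to 1 (the constant rate the exponential assumes), and beta above 1 (increasing rate).

```python
def weibull_hazard(t, beta, eta):
    """Instantaneous failure rate h(t) for a Weibull distribution.

    beta is the shape parameter, eta the scale (characteristic life).
    """
    return (beta / eta) * (t / eta) ** (beta - 1)


eta = 1000.0  # hypothetical characteristic life, in hours

for beta in (0.5, 1.0, 2.0):
    # Evaluate the failure rate early, mid-life, and late.
    rates = [weibull_hazard(t, beta, eta) for t in (100.0, 500.0, 2000.0)]
    print(f"beta={beta}: {rates}")
```

Only the beta = 1 case produces the same rate at every age, which is the single situation a lone MTBF value can describe.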
Check the Assumption with Your Data
I once heard the claim that the time to failure distribution was always best described by the exponential distribution. So I asked what evidence they had to support that claim.
They always assumed an exponential distribution, thus only collected the count of the number of failures and the tally of total operating time to calculate MTTF or MTBF. Thus all their data ‘supported’ using MTBF. It was all they ever measured. Thus there was no evidence, in their minds, that they needed anything other than MTBF.
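To see why a count and a tally cannot challenge the assumption, consider a small sketch with two made-up sets of failure times. Both data sets are hypothetical, and the sketch assumes every unit ran until it failed, so total operating time is just the sum of the failure times.

```python
def mtbf(total_operating_hours, failure_count):
    """MTBF as it is commonly computed: total operating time over failures."""
    return total_operating_hours / failure_count


# Hypothetical failure times in hours (all units run to failure).
mostly_early = [50, 100, 150, 200, 4500]    # failures cluster early
clustered_late = [900, 950, 1000, 1050, 1100]  # failures cluster near 1000 h

for name, hours in (("mostly early", mostly_early),
                    ("clustered late", clustered_late)):
    print(name, mtbf(sum(hours), len(hours)))
```

Both sets have five failures and 5,000 total hours, so they yield the same MTBF even though the failure patterns are nothing alike. Once the individual times are discarded, that difference cannot be recovered.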
This group routinely scrubbed the data to remove any vestige of time to failure information. They deemed it unnecessary to collect start times or time of failure information. A count was good enough.
Early Life Failures
Then back to the questions above. They regularly worked with suppliers and on their design to eliminate early life failures. Resolving issues that cause failures is a good thing. Yet failing to recognize the decreasing failure rate, and simply adjusting MTBF each time another failure occurred, didn’t seem odd to them.
Later I discovered they didn’t count failures in the first month of product operation as a failure for the purpose of calculating MTBF. Those ‘quality issues’, even though they occurred when the customer was using the product, were not ‘in the useful life’ region thus not appropriate to use for MTBF calculations.
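The effect of that exclusion can be sketched with a few lines of arithmetic. The failure times, fleet operating hours, and the 30-day cutoff below are all hypothetical; the source does not give the team's actual numbers or their exact definition of the first month.

```python
FIRST_MONTH_HOURS = 30 * 24  # assumed cutoff: 30 days of continuous operation


def mtbf(total_operating_hours, failure_count):
    """Total operating time divided by the number of counted failures."""
    return total_operating_hours / failure_count


# Hypothetical field data: hours in service when each failure occurred.
failure_hours = [40, 120, 300, 650, 2200, 5100, 8800]
total_operating_hours = 100_000  # hypothetical fleet total

all_failures = len(failure_hours)
counted_failures = sum(1 for h in failure_hours if h > FIRST_MONTH_HOURS)

mtbf_all = mtbf(total_operating_hours, all_failures)
mtbf_reported = mtbf(total_operating_hours, counted_failures)

print("failures customers experienced:", all_failures)
print("failures counted for MTBF:", counted_failures)
print("MTBF with all failures:", round(mtbf_all))
print("MTBF with first month excluded:", round(mtbf_reported))
```

Dropping the early failures leaves the customer experience unchanged while the reported MTBF more than doubles in this made-up case.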
Onset of Wear Out or End of Life Estimates
I asked how long their product would last and how they knew. Did the product have some form of wear out or degradation that we should understand to plan for replacements or repairs?
Yes, it did. The team regularly conducted accelerated life testing on two different elements of the product that would limit the life of the product. In both cases, the specific failure mechanism could feasibly occur after a month or two of operation in some circumstances.
In most situations, these failure mechanisms would not noticeably increase the probability of failure till closer to five years of use, at which point the failure rate would continue to rise slowly. It was dependent on the use conditions and frequency of use. This only meant the ‘useful life’ period for each customer was different. If the product failed after two months, the useful life period was only two months. If it worked for five years before failing, the useful life period was much longer.
So I asked about the useful life, and they showed me the MTBF value again.
Instead of rationalizing your metric, find one that is useful.
If your organization uses MTBF or your suppliers insist on using MTBF, ask them the questions above. Point out that the use of MTBF is not serving them. No amount of fiddling, adjusting, or accommodating the changing nature of failure arrivals will reveal useful information while using MTBF. Those efforts are a futile attempt to glean a bit of information from a faulty measure.
Instead, use reliability directly. Use non-parametric or parametric models that permit the modeling of changing failure rates over time. If nothing else, plot the time to failure data in a histogram (simple probability density function plot). Fit the time to failure data to a Weibull and check if the slope (beta) is 1 or not. Check the assumption of the data being well described by the exponential distribution.
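One way to check the slope, sketched here with only the Python standard library, is median rank regression: the same straight-line fit used on a Weibull probability plot. This is one of several fitting methods, not the only one. The sketch assumes complete data (every unit ran to failure, no censoring) and uses Benard's approximation for the median ranks. The function name and the failure times are hypothetical.

```python
import math


def fit_weibull_mrr(failure_times):
    """Estimate Weibull shape (beta) and scale (eta) by median rank regression.

    Assumes complete data: every unit ran to failure, with no censoring.
    """
    times = sorted(failure_times)
    n = len(times)
    if n < 2:
        raise ValueError("need at least two failure times")

    xs, ys = [], []
    for i, t in enumerate(times, start=1):
        median_rank = (i - 0.3) / (n + 0.4)  # Benard's approximation
        xs.append(math.log(t))
        ys.append(math.log(-math.log(1.0 - median_rank)))

    # Ordinary least squares of y on x: y = beta * x - beta * ln(eta)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta = sxy / sxx
    intercept = y_mean - beta * x_mean
    eta = math.exp(-intercept / beta)
    return beta, eta


# Hypothetical time to failure data, in hours.
hours_to_failure = [310, 640, 820, 1010, 1180, 1350, 1560, 1800]
beta, eta = fit_weibull_mrr(hours_to_failure)
print(f"beta (slope) = {beta:.2f}, eta (characteristic life) = {eta:.0f} h")
```

A beta well below 1 points to a decreasing failure rate, and a beta well above 1 points to wear out. Only a beta close to 1 is consistent with the exponential assumption behind MTBF. With a small sample the estimate is uncertain, so it is worth looking at the plotted points as well as the fitted number.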
Let your data speak to you, and make sense of it as you work to understand it.