What to Expect from MTBF
What do we really want?
When using the term MTBF, many believe they are talking about the reliability of a device or system. In that view, a high MTBF number means a reliable item.
What we really want is the device to work over some duration without failure (or with few failures). It should perform a function as expected in the desired environment.
A lot can go wrong.
MTBF is often used as a measure of reliability. It’s easy to calculate. Total time divided by the number of failures.
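As a minimal sketch of that calculation, assume a hypothetical fleet of 10 units that each ran 1,000 hours, with 4 failures in total. The numbers and the function name `mtbf` are illustrative only.

```python
def mtbf(total_operating_hours: float, failures: int) -> float:
    """Return total operating time divided by the number of failures."""
    if failures <= 0:
        raise ValueError("MTBF is undefined without at least one failure")
    return total_operating_hours / failures


# Hypothetical example: 10 units x 1,000 hours each, 4 failures observed.
total_hours = 10 * 1_000
print(mtbf(total_hours, 4))  # 2500.0
```

Note that the result says nothing about when those four failures happened.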
Often what we want is an accurate description of reliability over some period of time. Say, over the warranty period. Or over a six-month operating period. Or, when planning preventative maintenance.
We are interested in how the failure rate changes over time. Is the item wearing out or not?
Gap between what we want and MTBF
MTBF provides an average value. Statisticians call it the expected value. You may call it the mean value. It’s the center of mass of the data. It is not the middle unless the data is symmetrical about the mean (like a normal distribution).
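For time to failure data with a constant failure rate (the exponential distribution that a lone MTBF value implies), the average is the point in time when about 63% of the units have failed, or when there is a 63% chance that any one unit has failed.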
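The 63% figure comes from the exponential distribution, the model a lone MTBF value implies. A quick check, assuming exponentially distributed times to failure (the function name and the MTBF value are illustrative):

```python
import math


def fraction_failed(t: float, mtbf: float) -> float:
    """Exponential CDF: expected fraction of units failed by time t."""
    return 1.0 - math.exp(-t / mtbf)


# At t equal to the MTBF the result is 1 - e^-1, whatever the MTBF value is.
print(round(fraction_failed(2_500, 2_500), 3))  # 0.632
```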
I’m not interested, generally, in this number. I want to know when 1% will fail, or when the chance of failure is approaching 5%. I want information about failure fractions far smaller than over half, so I can make decisions that will maintain a low failure rate.
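If the constant failure rate assumption does hold, the time at which a small fraction has failed can be backed out of MTBF. A sketch under that assumption, with a hypothetical MTBF of 50,000 hours and an illustrative function name:

```python
import math


def time_to_fraction_failed(p: float, mtbf: float) -> float:
    """Time by which a fraction p of units is expected to fail (exponential model)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be between 0 and 1")
    return -mtbf * math.log(1.0 - p)


mtbf_hours = 50_000  # hypothetical value
for p in (0.01, 0.05):
    print(f"{p:.0%} failed by about {time_to_fraction_failed(p, mtbf_hours):,.0f} hours")
```

With these numbers, 1% of units are expected to have failed after roughly 500 hours, about three weeks of continuous operation, even though the MTBF is 50,000 hours. If the failure rate is not constant, this calculation does not apply.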
Annualized failure rate isn’t any better. It’s just the inverse of MTBF with an adjustment to be the failure rate over a year. Again, it talks about an average, with little information about when the first few failures are expected.
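For reference, a sketch of the usual conversion, assuming continuous operation (8,760 hours per year) and a constant failure rate. The simple ratio is the inverse of MTBF scaled to a year; the exponential form is the probability that a single unit fails within the year. The MTBF value and function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 8_760  # assumes 24/7 operation


def afr_simple(mtbf_hours: float) -> float:
    """Expected failures per unit-year: hours in a year divided by MTBF."""
    return HOURS_PER_YEAR / mtbf_hours


def afr_exponential(mtbf_hours: float) -> float:
    """Probability that one unit fails within a year (exponential model)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


mtbf_hours = 50_000  # hypothetical value
print(f"{afr_simple(mtbf_hours):.1%}")       # 17.5%
print(f"{afr_exponential(mtbf_hours):.1%}")  # 16.1%
```

Either way, the result is one number for the whole year, with no hint of whether the failures cluster early or late.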
Understanding is a Gap too
Another gap between what we want and what we get with MTBF is the common understanding of MTBF. Most do not understand what it means.
When I ask about the reliability of a product, I want to know the chance it will successfully operate in my environment for a specific duration. I do not want a grand average that doesn’t include duration.
When I ask others what MTBF is, I get a very wide range of responses: from a failure-free period of time, to the time until 50% have failed, to its ‘reliability’, meaning very few failures over the stated MTBF time.
Even fewer can tell me how to estimate the reliability (chance of successful operation) over a one year period based on MTBF. And still fewer will wonder if MTBF even applies to the device.
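For the record, the estimate is short when a constant failure rate is assumed: reliability over a duration t is exp(-t / MTBF). A sketch with hypothetical MTBF values and an illustrative function name:

```python
import math


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of surviving t_hours without failure (exponential model)."""
    return math.exp(-t_hours / mtbf_hours)


one_year = 8_760  # hours, assuming continuous operation
for mtbf_hours in (8_760, 50_000, 500_000):  # hypothetical MTBF values
    print(f"MTBF {mtbf_hours:>7,} h -> R(1 year) = {reliability(one_year, mtbf_hours):.3f}")
```

The first case is the one that surprises people: an item with an MTBF of one year has only about a 37% chance of running the full year without failure.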
If the chance of failure (failure rate) changes over time, as with something that wears out, say bearings or connectors, then MTBF isn’t very useful. MTBF alone assumes that at every moment there is an equal likelihood of failure. The chances do not change with time.
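To see the difference, here is a sketch using the Weibull distribution, chosen here only for illustration. A shape parameter of 1 gives the constant failure rate that MTBF assumes; a shape above 1 gives a rising rate, as with wear out. The shape and scale values are hypothetical.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


scale = 10_000  # hypothetical characteristic life, hours
for t in (1_000, 5_000, 10_000):
    constant = weibull_hazard(t, shape=1.0, scale=scale)  # no wear out
    wearing = weibull_hazard(t, shape=3.0, scale=scale)   # wear out
    print(f"t={t:>6,} h  constant={constant:.2e}  wearing={wearing:.2e}")
```

The wear out item starts far below the constant rate and ends at three times it, a pattern a single MTBF value cannot express.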
If we expect to understand the reliability behavior of a product over time, then MTBF is neither enough information nor the right information to give us what we want.
If I recall, MTBF only exists when there is a constant failure rate. If the failure rate isn’t constant, you have to use a distribution to model the probability of failure.
In real world systems, the issue generally isn’t one of hardware reliability, but outages and accidents. Outages and accidents generally happen because a failure happens (by wearout, overstress, human error or some other means), and the strategies used to manage failures don’t work. A reliability engineer’s job is to make a system robust in the face of failures, which are preventable, but unavoidable.
Fred Schenkelberg says
Exactly right! imho.
Thanks for the comment.