Perils of MTBF
Every organization talks about product reliability in some manner. Sometimes our customers provide explicit reliability requirements. Sometimes our customers have an expected metric to report reliability expectations. Our industry may have a ‘standard’ means to discuss reliability. Or, we have a local ‘tradition’.
One of the most common is MTBF.
MTBF, or Mean Time Between Failure, and the many variations of this term have one thing in common. It is the most misunderstood four letter acronym in engineering. For the purpose of this discussion I am using MTBF, and most of the comments apply equally to MTTF, MTBUR, etc.
During a presentation on this subject to a group of reliability professionals, I asked if anyone in the room had encountered trouble with MTBF. Nearly every person of the over 100 in attendance quickly raised their hand. We spent the next hour sharing horror stories resulting from the misuse of MTBF.
What is MTBF?
Technically, MTBF (MTTF actually, more on that later) is an unbiased estimator of the exponential distribution parameter, theta. This follows from how we calculate the value from either test or field data.
Suppose we have 50 units that all run for 100 hours, and right at the end of 100 hours one of the units fails. We can calculate the MTBF as follows. First determine the total hours all the units operated. That’s easy, 50 units times 100 hours is 5,000 hours. Then divide the total operating hours by the number of failures. In this simple example, that is one, for a resulting MTBF = 5,000/1 = 5,000 hours.
Note: if we had 100 units run for 50 hours and had one failure at the end of 50 hours, the result is the same. Or, if one unit runs for 5,000 hours before failing. Or, 5,000 units each running for one hour, then one fails. Weird, right?
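The calculation above can be sketched in a few lines of Python. This is only an illustration of the simple point estimate described in the text; the function name is my own and the four scenarios are the ones listed above, each with a single failure.

```python
def mtbf_point_estimate(total_unit_hours, failures):
    """Simple MTBF estimate: total operating hours divided by failures."""
    if failures <= 0:
        raise ValueError("this simple estimate needs at least one failure")
    return total_unit_hours / failures

# (units, hours each) for the four scenarios in the text, one failure in each
scenarios = [(50, 100), (100, 50), (1, 5000), (5000, 1)]
estimates = [mtbf_point_estimate(units * hours, 1) for units, hours in scenarios]

for (units, hours), estimate in zip(scenarios, estimates):
    print(f"{units} units x {hours} h, 1 failure -> MTBF = {estimate:.0f} h")
```

Every scenario gives the same value because the estimate only sees total hours and total failures, not how the hours were accumulated.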
Well, not so strange if the underlying failure mechanism has an equal chance of causing a failure every hour (or moment). If the chance of failure is constant, or we say the hazard rate or failure rate is constant, then the above method to estimate MTBF is valid.
There are better ways to estimate MTBF when the assumption of a constant failure rate is not true. Yet, most often MTBF is calculated as described above.
Light Bulbs & Smoke Detectors
How often do you change incandescent light bulbs? Randomly, right? When a bulb burns out (why does it always seem to be the hardest to reach and least available bulb?) you find a spare bulb and replace the burned out one. Do you then think about changing the rest of the similar light bulbs in the house? Probably not.
Note: Light bulbs really do not follow the exponential distribution for time to failure. Bill Meeker did the experiments and conveyed this information to me after reviewing this site. I’ll update this note once I find a good example. In the meantime, let’s assume light bulbs follow a random pattern with respect to time to failure.
Incandescent light bulbs tend to follow the exponential life distribution. (This is not actually true, yet in my experience and limited data the time-to-failure distribution in my home is close enough.) And as such there is no rationale to conduct preventative maintenance. The memoryless feature of the distribution suggests the new bulb has exactly the same chance of failure in the next hour as the existing working light bulb. So there is no time or cost benefit to the preventative replacement.
Note: Talking to Professor Bill Meeker, I discovered that he had a group of students run a life test on incandescent light bulbs. They found that the life distribution is actually best modeled by a normal distribution, not exponential. Bill recommended that we think of the failure rate of a ceramic mug – which typically fails by dropping to the floor and shattering or being struck and chipped. The failure rate may well be exponential in that case.
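The memoryless feature mentioned above can be checked numerically. The sketch below assumes an exponential life distribution with a hypothetical MTBF of 1,000 hours (the value is mine, chosen only for illustration) and compares the chance of surviving one more hour for a brand new item and for one that has already run 5,000 hours.

```python
import math

def reliability(t, mtbf):
    """Exponential reliability function R(t) = exp(-t / MTBF)."""
    return math.exp(-t / mtbf)

def conditional_survival(age, extra, mtbf):
    """P(survive to age + extra, given survival to age)."""
    return reliability(age + extra, mtbf) / reliability(age, mtbf)

mtbf = 1000.0  # hypothetical value, in hours
new_item = conditional_survival(0.0, 1.0, mtbf)
old_item = conditional_survival(5000.0, 1.0, mtbf)

print(f"new item survives the next hour:     {new_item:.6f}")
print(f"5,000 h item survives the next hour: {old_item:.6f}")
```

The two probabilities match, which is why replacing a working item early buys nothing under this assumption. With a wear-out distribution the older item would have the lower value.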
Now, if your community is like mine, you receive annual reminders to change the batteries in your smoke detectors. Those 9V batteries do tend to wear out. Ignoring the preventative maintenance leads to middle of the night low battery power ‘beeps’ from the smoke detector.
This then leads to the annual effort to change all of the 9V smoke detector batteries. I’ve seen the same behavior in office buildings using fluorescent tube lighting. The maintenance crew tend to replace entire banks of tubes. When queried, I learned of their experience. “When one goes, then all will fail soon after. So, while we have the ladder out, we just replace them all.”
A few common ‘issues’
In support of the statement, ‘worst four letter acronym’ consider each element of the four letters.
M – stands for Mean
Speaking statistically, this is the expected value or the first moment of the distribution. Each distribution has a mean value.
The issue stems, in my opinion, from those undergraduate statistics classes most would rather forget. The normal (Gaussian) distribution dominated those lectures. Many sections and test questions started with the phrase
“Assuming a normal distribution….”
It was drilled into our engineering minds. The learned response was ‘mean’ is ‘average’ is the 50th percentile of a normal distribution. One half of values are above and one half are below.
Therein dwells the root of a mistaken understanding of MTBF. Not all distributions have the same properties concerning mean values, which was most likely not mentioned during the undergraduate statistics course. For example, the exponential distribution has an expected value or mean that falls at the 63.2nd percentile. About one third (36.79%) of values are above the mean and about two thirds (63.21%) are below.
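A quick way to see where the mean sits is to evaluate the exponential cumulative distribution at the mean itself. The snippet below is a small sketch using a mean life of 100 hours as an example value; it also computes the median, which is the 50th percentile many of us wrongly associate with the mean.

```python
import math

def fraction_failed_by(t, mean_life):
    """Exponential CDF: F(t) = 1 - exp(-t / mean)."""
    return 1.0 - math.exp(-t / mean_life)

mean_life = 100.0                                   # example value, in hours
at_mean = fraction_failed_by(mean_life, mean_life)  # fraction failed by the mean
median_life = mean_life * math.log(2)               # time when half have failed

print(f"fraction failed by the mean: {at_mean:.4f}")
print(f"median life: {median_life:.1f} h")
```

For the exponential distribution the median is only about 69% of the mean, so the point where half the units have failed comes well before the MTBF value.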
Let’s assume we have 1000 light bulbs with an MTBF of 100 hours. How many will still be working at the end of 100 hours of operation?
Each hour each light bulb has a 1 in 100 chance of failing. Therefore we lose about 10 the first hour.
This is as expected if using the reliability function of the exponential distribution.
If we run the time out a little further the plot shows what we commonly call the exponential decay. The chance of failure each hour for each light bulb is the same. It just takes more time to have the same number of failures. In the first hour of the experiment with 1000 light bulbs, 10 failed (1000 x 1/100 = 10 failures in one hour). When there are only 500 light bulbs remaining, it takes two hours to incur 10 failures (500 x 1/100 = 5 failures in one hour).
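The light bulb example can be reproduced with the exponential reliability function. This is a minimal sketch using the numbers from the text (1000 bulbs, 100 hour MTBF); the function names are my own, and the per-hour failure count uses the same "working bulbs times 1/MTBF" approximation as the text.

```python
import math

MTBF_HOURS = 100.0
START_COUNT = 1000

def expected_survivors(hours):
    """Expected working bulbs after `hours`, from R(t) = exp(-t / MTBF)."""
    return START_COUNT * math.exp(-hours / MTBF_HOURS)

def expected_failures_next_hour(working):
    """Approximate failures in the next hour: working bulbs times 1/MTBF."""
    return working / MTBF_HOURS

for t in (0, 1, 50, 100, 200, 300):
    working = expected_survivors(t)
    rate = expected_failures_next_hour(working)
    print(f"t = {t:3d} h: {working:6.1f} working, about {rate:.1f} failures in the next hour")
```

At the 100 hour mark only about 368 of the 1000 bulbs are expected to still be working, which answers the question posed above.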
T – stands for Time
Hours, cycles, years, pages and many more ways of counting some form of use are common. Recall that the MTBF is the inverse of failure rate. The failure rate units are the number of failures per unit time. Inverting this gives us units of time (hours, cycles, years, …) per failure.
I am not sure why (tend to think it was a marketing decision) someone decided to invert the negative connotation of ‘failures/hour’ into the positive sounding ‘hours/failure’.
Therein clicks another issue with MTBF. The units of MTBF, often in hours, are often confused with clock or calendar time. It really is a confusing unit of measure to convey the probability of failure. Instead of stating a light bulb has a 0.01 chance of failure per hour of operation, our dislike for numbers between 0 and 1 (recall probability and stats classes!) is avoided by inverting the failure rate. Now it reads 100 hours MTBF. Sounds much better.
B – stands for Between (or Before?)
Either way, between or before, when linked with the rest of the acronym it conveys a failure free period. It would have been better to state MTF, Mean Time of Failures. While that suggestion isn’t really that good, the idea of a failure free period is not part of the definition.
I heard one design team manager explain MTBF as the time to expect from one failure to the next. The time between failures. So, once a failure occurs, we have the MTBF hours before we would expect the next failure.
MTTF, the closely associated metric, uses To instead of Between, and creates the same confusion. With To, Before or Between, about two thirds of the light bulbs will have failed by the 100 hour mark.
When the MTBF value is very large, say 1 million hours, it may seem like a failure free period is occurring. Not really, it is just that the probability of failure is very small, a 1 in a million chance of failure per hour. Running a test of 10 light bulbs for 1000 hours with an actual 1 million hour MTBF would most likely result in ZERO failures. The test accumulates 10,000 bulb-hours, so the expected number of failures is 10,000/1,000,000 = 0.01, and it may take an average of about one hundred runs of the test for a single bulb to fail.
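The arithmetic for the 10 bulb test is easy to check. The sketch below assumes a constant failure rate and uses the values from the text; the variable names are my own.

```python
import math

mtbf = 1_000_000.0   # hours
units = 10
test_hours = 1_000.0

# Chance that one bulb fails during the test
p_unit_fails = 1.0 - math.exp(-test_hours / mtbf)

# Expected number of failed bulbs in one run of the test
expected_failures = units * p_unit_fails

# Chance that a run of the test sees at least one failure
p_any_failure = 1.0 - math.exp(-units * test_hours / mtbf)

print(f"chance one bulb fails:          {p_unit_fails:.5f}")
print(f"expected failures per test run: {expected_failures:.5f}")
print(f"chance of any failure per run:  {p_any_failure:.5f}")
print(f"average runs until a failure:   {1.0 / p_any_failure:.1f}")
```

A test that almost always ends with zero failures says little about a failure free period. It only reflects how few unit-hours were accumulated relative to the MTBF.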
F – stands for Failure
Who defines this in your organization? Do your customers return the product and they are classified No Trouble Found? In a classical sense a product failure is when the product does not meet stated performance specifications. Yet, customers will return products that fail to meet their expectations and it still creates warranty expenses.
Many forms of product testing apply only one form of stress, which promotes only a subset of all possible failure mechanisms. Basing an MTBF calculation on a single stress test, while possibly accurate enough, often misses important life-cycle conditions, stresses and failure mechanisms.
The simple issue here is the internal definition of failure, which may well be different from your customer’s definition. Be clear and concise, plus open to new definitions of failure. The internal definition is generally limited to product specifications.
History of Use
Karl Pearson first mentioned the ‘negative exponential distribution’ in 1895. The exponential distribution has a number of interesting properties, one of which took advantage of the tools available in the 1950’s and 60’s.
Specifically, the ability to add failure rates (the inverse of MTBF is the failure rate). Adding was rather easy at the time using mechanical and later electric adding machines. Using a slide rule and tables for the exponents is cumbersome with possibly 100’s or 1,000’s of calculations.
In 1961, the first issue of the MIL-HDBK-217 detailed how to perform parts count predictions. The method relied on the ability to add failure rates. Work continues to this day to update and revise the methodology. These efforts may take us out of the era of mechanical adders, as today doing complex calculations is as easy as turning the crank.
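To show why adding failure rates was so attractive, here is a minimal sketch of a parts count style calculation. The part names, failure rates and quantities are hypothetical values made up for illustration; they are not taken from MIL-HDBK-217 or any other handbook, and the sketch leaves out the quality and environment factors a real prediction would apply.

```python
# Hypothetical data: part -> (failures per million hours, quantity).
# These numbers are illustrative only, not handbook values.
parts = {
    "resistor": (0.002, 40),
    "capacitor": (0.01, 25),
    "integrated circuit": (0.05, 6),
    "connector": (0.02, 4),
    "cooling fan": (2.0, 1),
}

def parts_count_failure_rate(part_table):
    """Sum rate x quantity, assuming every part has a constant failure rate."""
    return sum(rate * qty for rate, qty in part_table.values())

system_rate_fpmh = parts_count_failure_rate(parts)
system_mtbf_hours = 1_000_000.0 / system_rate_fpmh

print(f"system failure rate: {system_rate_fpmh:.2f} failures per million hours")
print(f"system MTBF:         {system_mtbf_hours:,.0f} hours")
```

The whole prediction is one column of additions and a single division, which suited an adding machine. The cost is that every part, including the fan that wears out, is forced into a constant failure rate.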
Today we have models and distributions for the complex array of failure mechanisms and should take advantage of this knowledge. Limiting the combination of failure rate information to a constant for each component distorts and misleads those attempting to make decisions based on the prediction or data analysis.
Examples of Misuse of MTBF
So, while it is a convenient assumption to say the component, product or system has a constant failure rate, this is often not true. And, this assumption does lead to very poor understanding, modeling and decisions related to real products.
The obvious misuse stems from the various ways individuals misunderstand MTBF. For example, if an electrical engineer believes MTBF to be a failure free period, the components selected will have a significantly less desirable field failure rate than expected.
Another simple issue is the advertising of product or component reliability by simply stating an MTBF value. Without stating the conditions, environment, usage period, and other reliability related bits of information, the reader is left to wonder what the MTBF really means. For a component that has an increasing failure rate over time, like a cooling fan that experiences bearing wear out, the MTBF is a valid approximation of the fan failure rate only over some specific period of time. The fan datasheet often does not state the expected duration over which the constant failure rate applies. If the vendor is designing and evaluating fan life for an expected one year of use, then the life data may actually appear exponential. If the electrical engineer is considering the cooling fan for an application that is to operate for 10 years, then they may be surprised when the product qualification or field performance experiences higher than expected fan failures.
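The fan example can be illustrated with a rough sketch. It assumes, purely for illustration, that fan life follows a Weibull distribution with shape 3 and characteristic life 100,000 hours; both parameters are hypothetical and not from any datasheet. The "apparent" MTBF is what a one year evaluation would report, and the comparison shows what happens when that value is extrapolated to 10 years.

```python
import math

BETA = 3.0             # hypothetical Weibull shape; above 1 means wear out
ETA = 100_000.0        # hypothetical characteristic life, in hours
HOURS_PER_YEAR = 8760.0

def weibull_fraction_failed(t):
    """Weibull CDF with the assumed shape and characteristic life."""
    return 1.0 - math.exp(-((t / ETA) ** BETA))

def exponential_fraction_failed(t, mtbf):
    """Exponential CDF for a given MTBF."""
    return 1.0 - math.exp(-t / mtbf)

# MTBF implied by the first year only: hours divided by the
# cumulative hazard (expected failures per unit) over that year.
one_year = HOURS_PER_YEAR
apparent_mtbf = one_year / ((one_year / ETA) ** BETA)

ten_years = 10 * HOURS_PER_YEAR
predicted = exponential_fraction_failed(ten_years, apparent_mtbf)
actual = weibull_fraction_failed(ten_years)

print(f"apparent MTBF from year one:   {apparent_mtbf:,.0f} h")
print(f"10 year failures, exponential: {predicted:.1%}")
print(f"10 year failures, Weibull:     {actual:.1%}")
```

Under these assumed parameters the one year data suggests an MTBF in the millions of hours and well under 1% failures over 10 years, while the wear out model puts nearly half of the fans failed by then.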
Barry Snider says
The most costly misuse of MTBF is the years of wasted effort collecting failure data on equipment that have no relationship other than equipment type, i.e. pump. I know of reliability engineers at large oil refineries that try to use MTBF to gauge the reliability of 3000+ pumps of all shapes, sizes, functions, applications, and most importantly, failure modes. This is ludicrous.
Fred Schenkelberg says
Totally agree – and you are on to something there – focus on failure mechanisms and I would recommend not using MTBF and specifying reliability completely for each model of pump (or failure mechanism within a reliability model for each type of pump) including probability of success, duration, environment, and function.
Hans Lutsch says
Actually, The Aircraft Reliability industry did lots of work on this over the past ten years through the ATA Spec2000 process (See Spec2000 Reliability Interest Group RIG) where much was done to standardise these metrics and more.
MTTF is now the mean time of the life achieved by the group of assets that failed (it does not include those that are still operating). This is unlike MTBF, which includes the operating hours of the assets that have not yet failed during the time period of analysis, which of course, if used incorrectly, produces an unrealistic MTTF expectation.
MTBF can be a very useful metric if applied and used correctly but like any other tool, if used improperly can cause more grief than good. If enough data points over a large time period are used, Eventually MTBF will settle out pretty close to what MTTF is. If you look at a large air carrier like SWA you will see this happen over about a decade in their data. On the other hand in Plant Maintenance data, it tends not to occur like this due to poor record keeping and/or small data populations.
When making reliability improvements I tend to use Weibull analysis for a better grasp of the real reliability but again, you have to have some good data.
Fred Schenkelberg says
Thanks for the note and I was not aware of the work in the aircraft industry.
If I understand how you defined the calculation of MTTF as not using suspensions – then it is a very conservative value as a representation of the population. I still believe and am trying to show clearly that using a single value, even if used correctly, does not adequately describe the reliability of anything. As a minimum I hope any reporting of MTBF or MTTF includes the duration over which those values are applicable. Better would be to use the appropriate distribution or non-parametric method and not limit the industry to a single value.
Laurence Montrose says
Great idea for a website, but poor implementation.
Two errors in your first description section alone, and more throughout. I agree with the issue, but you should have done your research.
Technically: MTBF is not equal to MTTF. MTBF is only correctly used in scenarios with repair, when it properly includes the admin/logistics/repair/checkout delays before starting an operational phase again. MTTF does not include these additional periods, when used properly. Imprecise use of terms is an artifact of lack of deep understanding, and that is where we fully agree on the root cause of many of these issues.
Even more important is the single greatest error that most people make, equating MTTF with the exponential distribution. MTTF is defined in a precise mathematical manner and has no specific tie to the exponential, Gaussian, Weibull or any other distribution other than each of these has a mean value.
The reason you made the mistake of associating the classical Gaussian distribution of lightbulb lifetimes has to do with the age distribution of your bulbs. After a period of operation, with varying achieved lifetimes, when each failure is followed by repair (i.e. replenishment!), the age distribution starts to be a uniform distribution. With
a) a large enough population of components and
b) many equivalent lifetimes of the system, with
c) a replenishment strategy,
d) regardless of the distribution function of the component,
then the conditional probability of failure of a component in some finite period of time tends to a constant rate. Which looks like an exponential distribution to the person who doesn’t understand why his logistician is specifying replacements at a constant rate and his system is working effectively. This is what Hans was addressing, and his point about poor record keeping confounding the results is well put.
Fred Schenkelberg says
Thanks for reading and providing careful comments. The post is obviously not written for you, yet I do hope you found a few lines of reasoning to help with your explanations and clarifications for those that abuse MTBF.
I have found most confuse the calculations and reporting of MTBF, including those that should know better – thus the idea to just not use MTBF or related measures at all. There are endless ‘rules’ for what is or is not counted. One military manual for the four services listed over 500 ways to count and calculate MT-something.
The lightbulb distribution was done by students of Bill Meeker – they most certainly did the experiment and analysis carefully. No mistakes there, I’m sure.
There is certainly plenty of room for improvement and better implementation of data collection and analysis. Hopefully this site helps to keep that discussion going.
Again thanks for your comments.
Also, if you’d like to write a post for the site – my offer stands that if you generate more than 1200 pageviews in a Friday thru Friday week – I’ll send you a NoMTBF coffee mug. See http://nomtbf.com/2013/05/nomtbf-guest-post-challenge/ for details.
Chet Haibel says
I really like the notion of eradicating the MISUSE of MTBF. But MTBF is useful when used correctly.
The above article PERILS suffers greatly from confusing the failure rate with the hazard rate. An exponential distribution has a constant hazard rate and an exponential failure rate.
Thomas Speidel says
In biostatistics we tend to avoid parametric time to event modelling as was suggested in some posts. Semi-parametric approaches are fairly robust and flexible and one does not have to make unnecessary assumptions on the underlying function which are hard to check. Obviously, I come from a completely different background, so forgive my ignorance in the subject, but why not using hazard ratios to describe failures (whichever way they are defined)?
Fred Schenkelberg says
Works for me. Survival analysis leans a bit more on non-parametric methods which tends to conservative overall. At times adding the distribution provides a bit more power and precision, and as you mention is often difficult to prove.
Dealing with sparse data in any situation is difficult, and doing a simple average and calling it MTBF is not the solution.
Thanks for the comment.
Charles Dibsdale says
In the same vein, why would you use Mean, when Median may be more appropriate? Leaving aside the fact MTBF conveys no information about the spread of data, we can also question the use of mean as a measurement of expectation. If the underlying distribution of continuous values is not symmetric like a pure normal bell curve, then Median should be preferred. Median is also less sensitive to outlying data compared to the mean.
Gary Smith says
Maybe better described by “MTBF caused by a Specific Failure Mechanism”
Put something in place for that Failure Mechanism.
Measure the MTBF caused by that Failure Mechanism and look for change.
Good reporting of failures is essential for correct analysis of MTBF and.
John Amoyaw says
RE:MTBF & MTTF Concept:-
How are you doing? Please kindly explain full concept on the above mentioned subject and the impact on maintainability and reliability?
Senior Maintenance Planner
Fred Schenkelberg says
The article does a pretty good job of explaining why MTBF or MTTF are not useful. Yet, let me summarize. MTBF and the like, being inverse average failure rates, attempt to represent the reliability performance of a product or system. These measures represent reliability performance poorly and allow engineers and managers to make very poor decisions. We should use informative summaries of reliability performance to make decisions.
i’m looking for warranty period calculation (or estimation) from reliability techniques
i want to calculate MTTR of a device and then calculate warranty
can you help me
Fred Schenkelberg says
Sure, I think, if you have existing time to failure data or other information relating to the expected life distribution of the device. What data or information do you have available?
Juan Carlos Mejía says
Excellent post. What do you think about using Crow-AMSAA (TR-652 of US Army) to get the MTBF trend for a system (multiple failure modes)? Weibull (and other distributions) are useful to get MTBF (using gamma distribution) for only 1 failure mode, but when you have multiple failure modes, you need another focus (I use Crow-AMSAA)…
Fred Schenkelberg says
Thanks for the kind words and question, Juan Carlos, much appreciated.
While I really do not like using MTBF the various growth models do help track progress toward a goal. I’d much rather use a system reliability block diagram model and use the appropriate reliability distributions for each block, yet that is not always feasible.
I like the growth models as they help us understand if we’re finding and fixing enough of the reliability issues fast enough.
Dylan Wagner says
I enjoyed reading your site and completely agree with the abuse MTBF has had in various industries. I was employed last year right after college as a reliability engineer and quickly grew to loathe MTBF. I work on calculating failure rates for the U.S. Nuclear Submarines various systems and personally hold proper analytics to be of the utmost importance. I am currently working with my company to fund getting a software to model data using a Weibull Distribution or rather a more appropriate distribution after performing a Goodness of Fit test. My goal is to convince my company that MTBF can no longer be used to provision for spares. I believe using the time at which the reliability is equal to 50% would be much more accurate do you agree?
I would love to win that mug if contest is still valid. Would use it at my desk with pride
Fred Schenkelberg says
Hi Dylan, sorry for the delay with your comment. Thanks for writing. I’m not sure 50% would be more accurate… setting timelines for repair/replacements really depends on the cost of the repair and the cost of the downtime. Higher cost of downtime the earlier one would do the replacement (less probability of failure).
Are there any useful notes for preparing for CRE exam? The question bank is terrible. And the textbook they recommend is awful! Awful textbook! I think its by Mark Durivage or something
Fred Schenkelberg says
check out https://accendoreliability.com/creprep/ and the folks at the Quality Council of Indiana for a suitable reference / study book.
Philip Frohne, CPL says
Love the website. I perpetuated the R formula in my book Quantitative Measurements for Logistics (McGraw-Hill). However, if a person substitutes pi for e, the curve doesn’t really change significantly. So why was e chosen? The curve historically represented the vacuum tubes in 1950s aircraft radios. How did it become the standard R formula?
MTBF in humans is tough to calculate because very few humans become completely not mission capable (NMC) – full MTTF condition, and then recover as MTBF. We must also define what a FMC human is. Life but brain dead? Bedridden? Mentally/physically challenged condition?