A variety of discussion and resources that I’ve found related to No MTBF.
Papers
While researching reliability predictions I ran across this paper that critically examines a few common reliability prediction methods.
Jais, C., Werner, B. & Das, D. (2013). “Reliability predictions – continued reliance on a misleading approach.” 2013 Proceedings Annual Reliability and Maintainability Symposium (RAMS), pp. 1-6.
A draft version of the paper that was eventually published as J. A. Jones & J. A. Hayes, “A comparison of electronic-reliability prediction models”, IEEE Transactions on Reliability, June 1999, Volume 48, Number 2, pp. 127-134. A nice comparison and illustration of the variability to expect when using a parts count prediction method. Graciously provided by Jeff – and it’s in draft form, so please pardon the odd typos etc.
Draft Comparison of Electronic Reliability Prediction Methodologies
Bowles, J. B. and J. G. Dobbins (2004). “Approximate Reliability and Availability Models for High Availability and Fault-tolerant Systems with Repair.” Quality and Reliability Engineering International 20(7): 679-697.
Systems designed for high availability and fault tolerance are often configured as a series combination of redundant subsystems. When a unit of a subsystem fails, the system remains operational while the failed unit is repaired; however, if too many units in a subsystem fail concurrently, the system fails.
Under conditions usually met in practical situations, we show that the reliability and availability of such systems can be accurately modeled by representing each redundant subsystem with a constant, ‘effective’ failure rate equal to the inverse of the subsystem mean-time-to-failure (MTTF). The approximation model is surprisingly accurate, with an error on the order of the square of the ratio mean-time-to-repair to mean-time-to-failure (MTTR/MTTF), and it has wide applicability for commercial, high-availability and fault-tolerant computer systems.
The effective subsystem failure rates can be used to: (1) evaluate the system and subsystem reliability and availability; (2) estimate the system MTTF; and (3) provide a basis for the iterative analysis of large complex systems. Some observations from renewal theory suggest that the approximate models can be used even when the unit failure rates are not constant and when the redundant units are not homogeneous. Copyright © 2004 John Wiley & Sons, Ltd.
Bowles, J. B. (2002). “Commentary-caution: constant failure-rate models may be hazardous to your design.” Reliability, IEEE Transactions on 51(3): 375-377.
This commentary examines the consequences, from a system perspective, of modeling system components using only their failure-rates, viz, the inverse of their MTTF, when, in actuality, they have a lifetime distribution with either an increasing or a decreasing hazard function.
Such models are often used because: they are tractable; they are thought to be “robust;” they depend only on average values; or only small amounts of data are available for calculations. However, the results of such models can greatly understate the system-reliability for some time periods and overstate it for others. Using MTTF as a figure of merit for system-reliability can be especially misleading. Furthermore, the redundancy in a redundant system might provide very little of the reliability improvement predicted by the constant failure-rate model, and series systems might, in fact, be much more reliable than predicted. The overall result is that a constant failure-rate model can give very misleading guidance for system-design.
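A small illustration of the understate/overstate point, not taken from the paper: the sketch below compares a constant failure-rate (exponential) model with a Weibull model that has the same MTTF. The MTTF of 10,000 hours and the shape value of 2 (an increasing hazard) are hypothetical numbers chosen only for illustration.

```python
import math

def exponential_reliability(t, mttf):
    """Reliability at time t under a constant failure rate of 1/MTTF."""
    return math.exp(-t / mttf)

def weibull_reliability(t, beta, eta):
    """Reliability at time t for a Weibull life distribution."""
    return math.exp(-((t / eta) ** beta))

def weibull_eta_for_mttf(mttf, beta):
    """Characteristic life giving the requested MTTF: MTTF = eta * Gamma(1 + 1/beta)."""
    return mttf / math.gamma(1.0 + 1.0 / beta)

MTTF = 10_000.0  # hours, hypothetical
BETA = 2.0       # hypothetical shape; greater than 1 means an increasing hazard
ETA = weibull_eta_for_mttf(MTTF, BETA)

# Same MTTF, two very different reliability curves
for fraction in (0.1, 0.5, 1.0, 2.0):
    t = fraction * MTTF
    r_exp = exponential_reliability(t, MTTF)
    r_wbl = weibull_reliability(t, BETA, ETA)
    print(f"t = {t:7.0f} h   exponential R = {r_exp:.3f}   Weibull R = {r_wbl:.3f}")
```

With these assumed values the exponential model reports lower reliability than the Weibull model early in life and higher reliability late in life, even though both share one MTTF.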
If you have a paper that you’d like to add to this list, please let me know.
Fred: One example I use when lecturing is the life time of a wine glass (or ceramic coffee mug). The probability of failure in the next day is independent of the age. Failure will occur when there is an accident (dropped on the floor or something in the dishwasher hitting it). Bill (email from William Q. Meeker Jr., Oct 26th, 2011)
Presentations
Links on the Topic
Wikipedia does not have a great page, yet gets the idea across.
A Google search on 4/13/09 found over 2 million hits. The following are the best (imho) discussing MTBF, MTTF and their proper use.
Weibull.com – a really good resource for the statistical treatment of many distributions useful in reliability engineering. This page treats MTTF and why it is not a good reliability metric.
FAQ extraction – what is MTBF
RAC Journal 2nd Quarter 2005 lead article is titled Practical Considerations in Calculating Reliability of Fielded Products. An excellent short treatise on how to avoid the ‘issues’ with MTBF.
If you know of other great sites on the topic of MTBF – please let me know. fms@fmsreliability.com
Mark O'Brien says
Howdy,
Great blog, great learning source. In my never humble opinion, I think too much time is spent misusing MTBF and more should be spent preventing and eliminating failures. I like this site because it looks like you agree.
Fred Schenkelberg says
Hi Mark,
I agree and believe we should use tools to make better decisions – not use tools that thwart that effort.
cheers,
Fred
chinareliability@163.com says
Dear Fred:
From my work experience and my preparation as a consultant, I agree with your idea that MTBF is misused and misunderstood. My English is not so good, so I may not catch all of your ideas about MTBF, but I have some questions about it.
1. Assuming the failure rate is constant and the failure distribution is exponential, my current company still uses 1/MTBF as the failure rate. They use techniques to make the MTBF very large, so they get lower failure rates, and they are satisfied with this approach. What do you suggest? How can I show clients that MTBF is being misused?
2. Currently there are still 3 international standards for MTBF prediction, and many companies use these standards to predict. Another way is to do MTBF testing:
they design the test and set the factor values, such as sample size, acceleration mode, confidence level, and testing time. I believe this way can give a reliable MTBF value if the right calculation method is used.
In your opinion, what is the more reliable way to calculate MTBF?
Thank you for your reply by email.
BEST REGARDS
Guo
Fred Schenkelberg says
Hi Guo,
Working with a failure rate based on the assumption of constant failure rate over the time period ignores the falling or rising nature of the failure rate. Plot the data and see if the constant failure rate fits the data. If not, then use another method to determine the failure rate.
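One simple way to do that check, sketched here under stated assumptions, is a cumulative hazard estimate (Nelson-Aalen). If the failure rate really is constant, the cumulative hazard H(t) grows in a straight line through the origin, so H(t)/t stays roughly flat. The failure times and the sample size of 10 below are hypothetical, and the sketch assumes the unfailed units all survive past the last recorded failure.

```python
def cumulative_hazard(failure_times, n_units):
    """Nelson-Aalen cumulative hazard estimate at each failure time.

    Assumes every unfailed unit is still running after the last failure.
    """
    points = []
    total = 0.0
    for i, t in enumerate(sorted(failure_times)):
        at_risk = n_units - i  # units still running just before this failure
        total += 1.0 / at_risk
        points.append((t, total))
    return points

# Hypothetical test: 10 units, 6 failures (hours), 4 units still running
times = [410.0, 620.0, 760.0, 850.0, 930.0, 990.0]
points = cumulative_hazard(times, n_units=10)

for t, h in points:
    # A steadily rising H(t)/t suggests an increasing failure rate
    print(f"t = {t:5.0f} h   H(t) = {h:.3f}   H(t)/t = {h / t:.2e} per hour")
```

For this made-up data H(t)/t climbs with every failure, which is the pattern of a rising failure rate rather than a constant one. Plotting H(t) against t shows the same thing as curvature.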
Failure rates are easier to understand and less prone to the misunderstandings that surround MTBF, so they’re not great, just better.
To determine MTBF (first, don’t determine MTBF; rather, find the reliability (probability of success) at time t), parts count type methods are very poor. If you have nothing else, a parts count gives a very, very rough estimate. If you have field data, use that. If you need to determine life distributions for subsystems or components, then do that. The physics of failure approach is much better, yet not for everyone, as it can be expensive if a model is not available (you then have to create the model).
Just because MTBF is in a standard doesn’t make it useful or appropriate for you and your company.
cheers,
Fred
Vijaymahantesh V Surkod says
Sir, I am relatively new to reliability and I am working on a project to statistically determine the reliability of a power electronic converter system. My question is this: let me say that I have failure data from a constant temperature test (for, let’s say, 1000 hrs) at two elevated temperatures, 1) 125 degrees Celsius and 2) 150 degrees Celsius. Out of 10 samples 6 have failed, and I have the time to failure of each one of them. Now how exactly can I find the activation energy to use in the Arrhenius equation for finding the service life?
And let us say
Tf = time to failure
at 125 degrees Celsius
Tf1 = 345 hrs, Tf2 = 458 hrs, Tf3 = 525 hrs, Tf4 = 674 hrs, Tf5 = 741 hrs, Tf6 = 800 hrs
now at 150 degrees Celsius
Tf1 = 168 hrs, Tf2 = 265 hrs, Tf3 = 315 hrs, Tf4 = 400 hrs, Tf5 = 560 hrs, Tf6 = 700 hrs, Tf7 = 845 hrs
Fred Schenkelberg says
Hi Vijay,
Good question, and it gets to the heart of accelerated life data analysis.
You basically have two sets of data that differ primarily by the temperature applied. The time to failure pattern is likely a function of temperature and you’re assuming the Arrhenius relationship, which includes the activation energy term. You can do a non-linear curve fit to estimate the activation energy, along with the two characteristic life values and the beta (shape), assuming the time to failure distributions fit a Weibull distribution and have similar beta values.
See the work by Nelson, especially his book Accelerated Testing, for details.
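As a rough sketch of the idea, and not the non-linear fit itself, the snippet below uses a simpler stand-in: median rank regression on a Weibull plot with one shared beta and one characteristic life per temperature, then solves the Arrhenius relationship for the activation energy. Several things are assumptions: 10 units on test at each temperature (the question lists seven failures at 150 degrees), unfailed units surviving past the last failure, and Bernard's median rank approximation. Maximum likelihood, as covered by Nelson, handles censored data more rigorously.

```python
import math

BOLTZMANN_EV = 8.617333e-5  # Boltzmann constant in eV/K

def median_rank_points(failure_times, n_units):
    """Weibull plot coordinates (ln t, ln(-ln(1 - F))) using Bernard's approximation."""
    pts = []
    for i, t in enumerate(sorted(failure_times), start=1):
        f = (i - 0.3) / (n_units + 0.4)
        pts.append((math.log(t), math.log(-math.log(1.0 - f))))
    return pts

def fit_common_beta(groups):
    """Least squares with one shared slope (beta) and one intercept per group."""
    sxx = sxy = 0.0
    means = []
    for pts in groups:
        xbar = sum(x for x, _ in pts) / len(pts)
        ybar = sum(y for _, y in pts) / len(pts)
        sxx += sum((x - xbar) ** 2 for x, _ in pts)
        sxy += sum((x - xbar) * (y - ybar) for x, y in pts)
        means.append((xbar, ybar))
    beta = sxy / sxx
    # Each line is y = beta*ln(t) - beta*ln(eta), so ln(eta) = xbar - ybar/beta
    etas = [math.exp(xbar - ybar / beta) for xbar, ybar in means]
    return beta, etas

def activation_energy(eta_low, eta_high, temp_low_c, temp_high_c):
    """Solve eta_low/eta_high = exp(Ea/k * (1/T_low - 1/T_high)) for Ea in eV."""
    t_low = temp_low_c + 273.15
    t_high = temp_high_c + 273.15
    return BOLTZMANN_EV * math.log(eta_low / eta_high) / (1.0 / t_low - 1.0 / t_high)

times_125 = [345, 458, 525, 674, 741, 800]        # hours, from the question
times_150 = [168, 265, 315, 400, 560, 700, 845]   # hours, from the question
N_UNITS = 10  # assumed sample size at each temperature

beta, (eta_125, eta_150) = fit_common_beta([
    median_rank_points(times_125, N_UNITS),
    median_rank_points(times_150, N_UNITS),
])
ea = activation_energy(eta_125, eta_150, 125.0, 150.0)
print(f"beta = {beta:.2f}  eta(125 C) = {eta_125:.0f} h  "
      f"eta(150 C) = {eta_150:.0f} h  Ea = {ea:.2f} eV")
```

With only two temperatures and a handful of failures, treat the resulting activation energy as a very loose estimate. It is also worth fitting each data set on its own first to see whether the two beta values really are similar.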
You could also, and I would suggest it as the right approach, determine the failure mechanism for the failures and talk to a chemist to ascertain the activation energy of the related chemical mechanism that led to the failures. While you can estimate the activation energy term, it’s better to determine it from a chemistry approach. Sometimes the literature reports on the specific failure mechanisms, so you can double check that you are in the right range.
Good luck with the problem.
Cheers,
Fred
Vijaymahantesh V Surkod says
Thank you sir, I appreciate your time. As I have told you, I am working on a POWER ELECTRONIC CONVERTER product, and I need to know the reliability of the product using statistical methods. Now here is my question again: while conducting the TEMPERATURE TEST, is it mandatory to look for a common failure mechanism? I have a power electronic converter product (which is a prototype), and I conduct the temperature test by giving the input to the converter and taking the output. I will take the “time to failure” regardless of what failure occurred (maybe a power device has failed, or there might be a short circuit and a capacitor has failed, etc.). Is it OK for me just to check whether the output is delivering or not, take the time at which the output stops delivering as the time to failure, and do the analysis on those times? Is this approach going to give me the exact results that I want, or is it a wrong approach? If this approach will not give the exact analysis results, then how should I proceed?
Thank you
Fred Schenkelberg says
The issue is that the relationship between temperature and time to failure is likely different for the different failure mechanisms. The two you describe may behave very differently under stress conditions than under use conditions.
Mixing failure mechanisms nearly always degrades a purely statistical fit… you may, or more likely will not, get a reasonably accurate estimate of performance at normal stress.
Do the fit, then change the activation energy just by a few tenths… it will dramatically change the model’s normal condition life estimate. That is why you generally should not fit activation energy from scant data.
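To put numbers on that sensitivity, here is a minimal sketch of the Arrhenius acceleration factor for a few activation energies. The 40 degree Celsius use temperature, the 1000 hour life at stress, and the three activation energy values are all hypothetical, chosen only to show the effect.

```python
import math

BOLTZMANN_EV = 8.617333e-5  # Boltzmann constant in eV/K

def arrhenius_acceleration_factor(ea_ev, use_temp_c, stress_temp_c):
    """Ratio of life at the use temperature to life at the stress temperature."""
    t_use = use_temp_c + 273.15
    t_stress = stress_temp_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_stress))

USE_TEMP_C = 40.0       # hypothetical use condition
STRESS_TEMP_C = 125.0   # test temperature from the discussion
STRESS_LIFE_H = 1000.0  # hypothetical characteristic life at stress

for ea in (0.5, 0.7, 0.9):  # hypothetical activation energies, eV
    af = arrhenius_acceleration_factor(ea, USE_TEMP_C, STRESS_TEMP_C)
    print(f"Ea = {ea:.1f} eV   acceleration factor = {af:8.0f}   "
          f"use life = {af * STRESS_LIFE_H:12.0f} h")
```

Under these assumptions each 0.2 eV step changes the projected use life by roughly a factor of five, so a small error in the fitted activation energy swamps most other sources of uncertainty in the estimate.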
In short, yes, it is important that you understand the failure mechanisms before attempting to fit the data.
Cheers,
Fred
Claire Jones says
Fred,
Thank you for your attempts to educate us. I need you to send me links or URLs for references and/or papers/articles about why MTTF and/or MTBF should not be used for reliability reporting. I think I am beginning to understand but it still seems confusing.
Fred Schenkelberg says
Hi Claire,
A great many of the articles on this site may be suitable to reference. A literature search may reveal a few more articles, yet that may require library or database access.
cheers,
Fred