When Your Supplier Converts Reliability to MTBF
Oh, the trouble that will occur. The mistakes, mishaps and errors and most certainly the inability of the supplier to provide a reliability solution.
If you provide the supplier with a straightforward and complete reliability goal, and they convert it to a single number as an MTBF value, what really could go wrong? Also, why would the supplier degrade the requirement to an MTBF value?
Let’s say you have a piece of equipment that you want to have a supplier design and build for you. It is a complex piece of equipment, yet very little if any of the product is expected to be repairable. Let’s say it’s an electronics box that provides communication capabilities.
A complete reliability requirement may be summarized as:
Com box xyz shall provide communication capabilities (reference spec a for protocol, range, etc.) located in unmanned outdoor and unsheltered environments worldwide (see spec b for weather and use profile details), and do so with 98% reliability over 20 years.
You might add other couplets of probability of success, reliability, and durations as needed. Yet, basically we want this box to work for 20 years with a relatively low chance of a unit failing.
This requirement is clear, measurable, and sufficient for any reliability related requirement.
The easiest, and totally incorrect, way to convert the reliability objective of 98% reliable over 20 years is to ignore the probability part and use the 20 years as the MTBF goal. Under the exponential assumption, reliability at a time equal to the MTBF is exp(-1), so instead of 98% reliable the new target is 36.8% or so. A much easier target to achieve, and still 20 years.
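A quick check of that 36.8% figure, as a minimal sketch assuming the exponential model R(t) = exp(-t / MTBF). The function name is only illustrative.

```python
import math

def exponential_reliability(t, mtbf):
    """Probability of surviving to time t under an assumed constant failure rate."""
    return math.exp(-t / mtbf)

# Treating the 20-year duration as the MTBF drops the probability part of the goal.
r_at_mtbf = exponential_reliability(t=20.0, mtbf=20.0)
print(f"Reliability at 20 years with a 20-year MTBF: {r_at_mtbf:.1%}")  # 36.8%
```

Any time equal to the MTBF gives the same answer, since the ratio t / MTBF is 1.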
The second way is to do a little math with the exponential cumulative distribution function (of course we’ll assume the exponential function applies, as it’s so easy to work with for calculations, predictions, and test planning). We set time to 20 years, with F(20 years) given as 0.02 (2% of units fail over 20 years, or 98% of units do not fail), then solve F(t) = 1 - exp(-t/theta) for theta (or MTBF). We find theta is roughly 990 years, or about 8.7 million hours, so setting MTBF to about 10,000,000 hours, in round numbers, is about right.
Sounds impressive, too, 10 million hours MTBF.
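The arithmetic behind that number can be sketched as follows, again assuming the exponential model. The helper name and the 8,760 hours per year figure (leap days ignored) are my own choices for illustration.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days x 24 hours; leap days ignored

def mtbf_for_target(reliability, duration):
    """Solve R(t) = exp(-t / theta) for theta, in the same units as duration."""
    return -duration / math.log(reliability)

theta_years = mtbf_for_target(reliability=0.98, duration=20.0)
theta_hours = theta_years * HOURS_PER_YEAR
print(f"theta = {theta_years:.0f} years = {theta_hours:,.0f} hours")
```

The result is roughly 990 years, or about 8.7 million hours, which is where the round 10 million hour figure comes from.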
Why Do the Conversion?
Why would anyone convert a reliability of 98% at 20 years into an MTBF value? Ignorance mostly. Here’s what I’ve heard over the years:
- We want to have a single number to represent reliability.
- MTBF is reliability (certainly not the definition of reliability itself?)
- We always assume the exponential.
- MTBF is so much easier to work with.
- We only understand MTBF (really? Let’s test that…. Evil grin)
- Our MTBF value is the same as your goal.
What have you heard? Do any of these make any business or common sense?
Mostly the conversion is meant to simplify calculations concerning reliability. This was standard practice in the 1960s, when calculations relied on manual methods, slide rules, log tables, and mechanical adders. We do not have those limitations today, so what is the need for simplification?
If your vendor does a similar conversion, ask them why and why again until they see the folly in the effort to use MTBF. Remember, if you specify 20 years with 98% reliability, that is what you want. Using MTBF is likely to befuddle the supplier team enough that they, willingly or mistakenly, will not achieve your specification.
What Should You Do Now?
Find a new supplier – while not always possible, the risk to your program has gone up beyond the cost of awarding the work to a new supplier. Seriously, this supplier really isn’t worth your organization’s time.
Double check and challenge the use of MTBF on every element of the program. Very few components or elements of a system actually exhibit a constant failure rate, or exponentially distributed time to failure behavior. Make them show, as they should be able to do anyway, the data supporting the assumed validity of MTBF.
Remind them you are interested in the 2nd percentile of failures being at or above 20 years, not in the point in time when some other fraction of units has failed (if exponential, the MTBF is about the 63rd percentile of failures; with another distribution the percentile will vary and likely not be near 2%).
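To see how the percentile at the mean moves with the distribution, here is a minimal sketch that assumes Weibull lifetimes. The shape values are illustrative only, not data from any product.

```python
import math

def fraction_failed_at_mean(beta):
    """Weibull fraction failed by the mean life, for shape parameter beta.

    The Weibull mean is eta * gamma(1 + 1/beta), so the scale eta cancels
    and the answer depends on beta alone.
    """
    mean_over_eta = math.gamma(1.0 + 1.0 / beta)
    return 1.0 - math.exp(-(mean_over_eta ** beta))

for beta in (0.5, 1.0, 3.0):  # illustrative shapes; beta = 1 is the exponential
    print(f"beta = {beta}: {fraction_failed_at_mean(beta):.1%} failed by the mean life")
```

With beta = 1 this reproduces the roughly 63% figure. None of these shapes puts the mean anywhere near the 2nd percentile.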
Refuse any results based on erroneous assumptions.
Double-check all testing sample size calculations. If they assume the exponential distribution, they can stack sample run times to represent actual time. For example, running two units for 100 hours each would represent 200 hours of run time for a unit, which is only true if the failure mechanisms are actually described by the exponential distribution. Almost never true.
Reliability growth, prediction, and many other reliability calculations are simple only because they rely on the assumed exponential distribution – expunge the use of the assumption.
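A small sketch of why stacking run time depends on the exponential assumption. It assumes Weibull lifetimes, and the 1,000-hour characteristic life is a made-up value for illustration.

```python
import math

def prob_no_failures(units, hours_each, eta, beta):
    """Chance that every unit survives the test, for Weibull(eta, beta) lifetimes."""
    return math.exp(-units * (hours_each / eta) ** beta)

ETA = 1000.0  # hypothetical characteristic life in hours

for beta in (1.0, 3.0):  # beta = 1 is the exponential case, beta = 3 a wear-out case
    two_short = prob_no_failures(units=2, hours_each=100.0, eta=ETA, beta=beta)
    one_long = prob_no_failures(units=1, hours_each=200.0, eta=ETA, beta=beta)
    print(f"beta = {beta}: 2 x 100 h -> {two_short:.4f}, 1 x 200 h -> {one_long:.4f}")
```

With beta = 1 the two test plans are interchangeable. With a wear-out mechanism, the two short runs are less likely to show a failure than the single long run, so the stacked hours overstate what the test has demonstrated.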
And, finally, if you have to work with this team, be prepared to teach, educate and encourage them to actually understand the risks, model the failure mechanisms, evaluate tests that actually have failures, do exploration of designs and prototypes for failures, and actually abandon the shackles of MTBF.
Matt H. says
Fred. Great argument.
My thought on MTBF (before or between) is that it is a great way to measure a product against design and testing goals. You can measure it internally, baseline it, release a product, and see how it behaves outside the development vacuum. But what happens when it doesn’t meet those metrics? Ultimately, to a customer, any reason for downtime or unavailable use of the product means business impact. Understanding what is broken [support framework], what is needed to repair it, and the cost/impact is where I see most companies not investing. MTBF is only one side of the coin; the other is Productive Hours, Non-Productive Hours, and cost.
Fred Schenkelberg says
Hi Matt, thanks for the comment and I do agree what the customer really wants is equipment that just works and doesn’t cost too much to keep operating. Any metric or set of metrics is worthless if the product doesn’t meet the customer’s expectations.
I also believe MTBF is pretty worthless, as it is commonly misunderstood and tends to oversimplify the time to failure information. Alone, MTBF is not able to reveal changes in failure rate over time – which we really need to know. Thus as a metric it conceals the vital information we and our customers need for decision making.
Hi Fred, Matt,
I agree with you on the downsides of using MTBF (I suffer from them almost every day).
But sometimes the requirements you receive from a customer are stated in terms of MTBF for a certain useful life period, where no degradation or life-limited parts are permitted. In addition, verification of these MTBF objectives can be easy for the customers: in-service MTBF can be obtained by counting the number of failures over the cumulative operating time (I am thinking about an aircraft fleet, for example), because it is not usual to have time recording devices to get real time to failure data for each piece of equipment. MTBF can also be used to estimate the unscheduled removal costs of the fleet…
However, as you mention, using the MTBF always pushes you to demonstrate the constant failure rate in the period (sometimes it can be assumed due to the short useful life of the equipment and/or the long observation period over which the data is collected). It simplifies things too much and gives no information about when the equipment is more likely to fail.
In-service data and test data are the key to change “the MTBF mentality” to “Time to failure data analysis mind”.
Fred Schenkelberg says
Very good comment Ricardo, thanks. Yes, we need to shift to a time to failure mentality as it provides us with a clearer picture of reliability performance.
While I have a problem with vendors changing a reliability requirement to MTBF, I have no problem changing a customer’s requirement away from MTBF.
If the requirement is for no degradation or life limiting parts – how would we know that with conventional MTBF based testing? It would be easy to design a test to never see these issues. Plus field data may take years to reveal an issue.
A very good article, and it piques my interest because the company I have joined is set on using MTBF as a measure of the expected time a product is to last without replacing parts. Whilst I accept this may seem like a convenient metric, I have trouble accepting that the measure of reliability can be broken down to one number without the larger context of problem solving through robust design, proper root cause failure analysis, a knowledge of the process, and DoE. Several years ago (well, a few more than that), “Monte Carlo” analysis became the buzzword. This is an algorithm that can be bolted onto many of the reliability software packages to predict failure rates and calculate MTBF. I have never used this technique. But what is the view on this “newer” method of failure rate and MTBF analysis?
Fred Schenkelberg says
Thanks for the comment and good luck with the new organization. IMHO a fancy or ‘new’ way to calculate MTBF results in having MTBF. Not good.
For the many reasons outlined on this site, MTBF really isn’t a good measure for any purpose. Instead use the data or information and Monte Carlo methods when appropriate to estimate the reliability performance of your product over time. For example, how many will survive the warranty period, one year, or a mission?
Instead of getting fancy estimating a bad measure, focus on adding value.
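As a rough sketch of what that could look like, the following Monte Carlo run estimates the fraction of units surviving to a few durations rather than an MTBF. It assumes a unit made of two components in series with Weibull lifetimes. The parameters and function name are hypothetical; real values would come from your own time to failure data.

```python
import random

def simulate_survival(n_units, mission_time, components, seed=1):
    """Estimate the fraction of series systems still working at mission_time.

    components is a list of (eta, beta) Weibull parameters; a unit fails
    when its first component fails.
    """
    rng = random.Random(seed)
    survivors = 0
    for _ in range(n_units):
        first_failure = min(rng.weibullvariate(eta, beta) for eta, beta in components)
        if first_failure > mission_time:
            survivors += 1
    return survivors / n_units

# Hypothetical two-component unit: (characteristic life in years, shape).
parts = [(15.0, 2.5), (40.0, 1.0)]
for years in (1, 5, 10):
    print(f"{years:>2} years: {simulate_survival(100_000, years, parts):.3f} surviving")
```

The output is a survival fraction at each duration of interest, which maps directly onto questions like how many units make it through the warranty period.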