How to Talk About MTBF
Chris and Fred discuss the pointlessness of the MTBF. This comes from a listener who reached out to complain about how lots of industries enforce the MTBF … but why?
Join Chris and Fred as they discuss the MTBF. Seemingly sophisticated engineering industries (like aircraft manufacturing and electronic design) assume that every component that is ever used never gets old, and never gets better. In other words, every failure phenomenon has a constant hazard rate. How do you work with this?
- Just to clarify … The constant hazard rate implies that a 100-year-old component is just as likely to fail TODAY as a brand-new component (provided both are still working). Really?
- How do industries convince themselves that the constant hazard rate is a good model? They do things like pointing to a ‘bathtub curve’ that represents how hazard rates can initially decrease (wear in, as manufacturing defects are weeded out), stay constant (at the bottom of the bathtub curve), and then increase (where things wear out) … and saying that they assume all the manufacturing defects have been removed and wear out can be ‘mitigated’ by maintenance and inspections. Really?
- How do I convince my organization and industry this is not good? There is a huge problem here. For example, virtually every aircraft crash is very well investigated by organizations like the FAA and NTSB and lots of other similar organizations across the world. Virtually every investigation that doesn’t involve human (pilot) error identifies a manufacturing error (like inclusions or cracks on turbine blades) or unmanaged wear out (like insulation degradation on electronic cabling that results in arcs that initiate fire).
- What causes constant hazard rate failures? Randomly occurring, catastrophic external stresses. Think things like the ‘bird strike’ that hit US Airways flight 1549, which made an emergency landing on the Hudson River in New York in 2009. By the way … this was a successful emergency landing where no one died …
- There needs to be a business imperative for this. When the time comes for change, there needs to be a perceived HUGE business benefit that outweighs the perceived personal risk of someone going against the grain and suggesting the MTBF is bad. Perhaps you can make the case that if things go wrong, this assumption of the constant hazard rate could be the ROOT CAUSE of failure.
- And a system that is ‘too complex’ is not an excuse. Why? Because there are lots of ways your system can fail. Which means you have a huge choice of failure mechanisms to choose from if you want to improve reliability. And this choice means you can find the VITAL FEW things that drastically improve reliability.
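To make the ‘100-year-old component’ point above concrete, here is a minimal sketch using the Weibull distribution, which reduces to the exponential (constant hazard rate) model when its shape parameter is 1. The characteristic life, the time window, and the three shape values are made-up numbers chosen only for illustration; they are not from the episode.

```python
import math

def weibull_reliability(t, beta, eta):
    """Probability a unit survives to time t (Weibull shape beta, scale eta)."""
    return math.exp(-((t / eta) ** beta))

def conditional_failure_prob(age, window, beta, eta):
    """P(fails within the next `window` hours | still working at `age`)."""
    survive_to_age = weibull_reliability(age, beta, eta)
    survive_to_end = weibull_reliability(age + window, beta, eta)
    return 1.0 - survive_to_end / survive_to_age

ETA = 10_000.0      # hypothetical characteristic life, hours
WINDOW = 100.0      # hypothetical look-ahead window, hours
AGED = 20_000.0     # hypothetical age of the 'old' unit, hours

cases = [
    ("decreasing hazard (defects)", 0.5),
    ("constant hazard (MTBF model)", 1.0),
    ("increasing hazard (wear out)", 3.0),
]
for label, beta in cases:
    new_unit = conditional_failure_prob(0.0, WINDOW, beta, ETA)
    old_unit = conditional_failure_prob(AGED, WINDOW, beta, ETA)
    print(f"{label:30s} new: {new_unit:.6f}  aged: {old_unit:.6f}")
```

Only the constant hazard case gives the new and the aged unit the same chance of failing in the next window. In the other two cases age matters, which is exactly what the MTBF assumption throws away.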
Enjoy an episode of Speaking of Reliability, where you can join friends as they discuss reliability topics. Join us as we discuss topics ranging from design for reliability techniques to field data analysis approaches.
Gerardo Burciaga says
Thank you, I enjoy listening to your podcasts.
From an R&M perspective, there is nothing better than field data to generate realistic predictions and to capture field failure modes. Field data can be used for Weibull analysis, to determine the Weibull parameters characterizing the failure mode under analysis. Having said that, I disagree that MTBF is pointless. There is a purpose for that metric. It may not be an adequate metric to characterize the reliability of the item of interest, but it gives non-R&M engineers a frame of reference to speak about reliability. I do not disagree that the frame of reference provided by a simple MTBF may not be adequate or correct, but without it, non-R&M engineers would find it too hard to speak about reliability. For R&M engineers an MTBF is not enough, or not even correct in some cases, but that should not be used as a flag by the R&M discipline to prevent industry from talking about reliability in terms of MTBF. We as R&M engineers can pick up the conversation, elevate it, and engage ourselves in the analysis to turn it into something meaningful from an R&M perspective.
There are many reasons why industry keeps using MTBF coupled with an exponential distribution (fixed failure rate). One of those reasons is the simplicity of the math associated with this distribution in comparison with other distributions. Another, less obvious reason is that an exponential distribution provides predictions that are more conservative, that is, more pessimistic, than if we were using other distributions. The R&M engineer should see the MTBF as an opportunity to engage and educate, not as an opportunity to discourage people from talking about reliability.
This discussion reminds me of Plato’s Allegory of the Cave. Only those who have seen the light know the truth and reality. If you want to persuade those who live in the darkness, you have to speak their language.
Christopher Jackson says
Thanks for the feedback Gerardo … I love your reference to Plato’s cave. Why? Because people were chained inside Plato’s cave. But … this is not the case for non-R&M engineers! They can walk outside into the light anytime they want to.
The issue isn’t language or dumbing it down. It is making people want to see the light.
I don’t think the MTBF is pointless. It is really useful for logistics and sparing. But it is not even a reliability metric. It sounds like it is … but it satisfies no contemporary definition of reliability. I can’t agree with your ‘frame of reference’ analogy. What is it that you are referencing? The premise is that everything else besides the MTBF is too hard for non-R&M engineers. And it makes the math too hard. If the math is too hard … they aren’t smart enough to be engineers. It’s just that they don’t want to invest any time into a better understanding of reliability. This often starts with leaders who aren’t genuinely interested in reliability, so it becomes easy for their workforce to pretend they are chained within Plato’s cave.
Organizations that do reliability well have every engineer, designer, and manufacturer embrace a reliability mindset. This doesn’t mean they have to become R&M engineers. Far from it. It is the engineers, designers, and manufacturers who make reliability happen. Or not happen. So these people need to understand why proper reliability metrics are more valuable to them, their customers, their organization, their budget, their schedule and so on.
Just my opinion!
Charles Dibsdale says
I agree with Christopher’s reply to you, but want to make another point. This is 101 statistics. Any citizen who exercises critical thinking should know this, let alone R&M engineers. When is it best to use the “mean” of a distribution of (continuous) data, as opposed to the “median”? Both are measures of “expectation” or “central tendency” of a distribution. The mean and median are the same if a distribution is normal (Gaussian) or otherwise symmetric around the mean. But the mean is more sensitive to outlying or skewed data than the median, so the median is the more robust measure in these cases. If we assume a constant hazard rate, then the underlying distribution is exponential, and this is certainly not symmetric. So even if we were content to quote a single measure of central tendency for a distribution we know is not symmetric, we would not choose the mean; we would use the median. I am not advocating that we use a measure of central tendency as a reliability metric, but the median is more appropriate than the mean. It just illustrates how unsatisfactory MTBF is even from a basic statistical perspective.
I believe the teaching of very basic statistics is woeful, and in a world where statistics and data are presented to us in everyday life (especially by the media and political groups), we should know when statistics are being used appropriately, or not, as responsible citizens.
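The mean-versus-median point in this comment can be checked with a few lines of arithmetic. This is only a sketch under the constant hazard rate assumption, using a made-up MTBF of 1,000 hours. For an exponential distribution the median life is MTBF × ln 2 (about 0.693 × MTBF), and only about 37% of units are still working when the MTBF is reached.

```python
import math

MTBF = 1_000.0  # hypothetical mean time between failures, hours

def reliability(t, mtbf):
    """Exponential (constant hazard rate) survival probability at time t."""
    return math.exp(-t / mtbf)

# Time by which half of the units have failed.
median_life = MTBF * math.log(2)

# Fraction of units still working at the MTBF itself.
surviving_at_mtbf = reliability(MTBF, MTBF)

print(f"MTBF (mean life):       {MTBF:.1f} h")
print(f"Median life:            {median_life:.1f} h")
print(f"Fraction alive at MTBF: {surviving_at_mtbf:.3f}")
```

So even when the exponential model holds, quoting the MTBF suggests a ‘typical’ life that most units never reach.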
Christopher Jackson says
Thanks for the thoughts … I concur!