MTBF in the Age of Physics of Failure
MTBF is the inverse of a failure rate; it is not reliability. Physics of failure (PoF) is a fundamental understanding and modeling of failure mechanisms. It’s the chemistry or physical activity that leads a functional product to fail. PoF is also not reliability.
Both MTBF and PoF have the capability to estimate or describe the time to failure behavior for a product. MTBF requires knowledge of the underlying distribution of the data. PoF requires knowledge of the stresses and their duration to allow a calculation of the expected probability of success over time.
MTBF starts with a point estimate. PoF starts with the relationship of stress on the deterioration or damage to the material. One starts with time to failure data and consolidates it into a single value; the other starts with determining the failure mechanism model.
Does MTBF Have a Role Anymore?
Given the ability to model at the failure mechanism level even for a complex system, is there a need to summarize the time to failure information into a single value?
MTBF was convenient when we had limited computing power and little understanding of failure mechanisms. Today, we can use the time to failure distributions directly. We can accommodate different stresses, different use patterns and thousands of potential failure mechanisms on a laptop computer.
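To illustrate why working with the time to failure distribution directly tells us more than the single value, here is a minimal sketch. It assumes Weibull-distributed lifetimes, and the MTBF of 50,000 hours, the one-year mission time, and the three shape values are hypothetical numbers chosen only for illustration. Each case has exactly the same mean life, yet the probability of surviving the first year differs widely.

```python
import math

def weibull_scale_for_mean(mean_life, beta):
    """Return the Weibull scale (eta) that yields the given mean life for shape beta."""
    # Mean of a Weibull distribution is eta * Gamma(1 + 1/beta).
    return mean_life / math.gamma(1.0 + 1.0 / beta)

def weibull_reliability(t, beta, eta):
    """Probability of surviving to time t: R(t) = exp(-(t/eta)^beta)."""
    return math.exp(-((t / eta) ** beta))

MTBF_HOURS = 50_000.0     # hypothetical mean life, identical in every case
MISSION_HOURS = 8_760.0   # one year of continuous operation

reliability_by_shape = {}
for beta in (0.5, 1.0, 3.0):  # early-life, constant-rate, and wear-out behavior
    eta = weibull_scale_for_mean(MTBF_HOURS, beta)
    reliability_by_shape[beta] = weibull_reliability(MISSION_HOURS, beta, eta)
    print(f"beta={beta}: eta={eta:,.0f} h, R(1 year)={reliability_by_shape[beta]:.3f}")
```

Under these assumptions the one-year reliability runs from roughly 0.55 (beta = 0.5) through about 0.84 (beta = 1) to above 0.99 (beta = 3), all with the same MTBF.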
MTBF has no purpose anymore. MTBF describes something we have, and should have, little interest in knowing.
Sure, PoF modeling takes time and resources to create. Sure, we may need complex mathematical models to adequately describe a failure mechanism. And, we may need to use simulation tools to estimate time to failure across a range of use and environmental conditions. Yet, it provides an estimate of reliability that is not possible using MTBF at any point in the process. PoF provides a means to support design and production decisions, to accommodate the changing nature of failure rates given specific experiences.
When will PoF become dominant?
When will we stop using MTBF? I think the answer to both is about the same time. It is going to happen when we, reliability minded professionals, decide to use the best available methods to create information that supports the many decisions we have to make. PoF will become dominant soon. It provides superior information and superior decisions, thus superior products. The market will eventually decide, and everyone will have to follow. Or, we can decide now to provide our customers reliable products.
We can help PoF become dominant by not waiting for it to become dominant.
WILLIAM THORLAY says
How can we use PoF in an industrial maintenance environment? Do you think we have enough time and resources to do that kind of analysis, considering that we deal daily with hundreds of different components?
Fred Schenkelberg says
Thanks for the comment and good question. Basically, start by asking vendors for the time to failure models for their equipment, making measurements, tracking the time to failure behavior of critical components, etc.
Not any different than product reliability – it may take some work, yet building and using the PoF models provides a significant improvement in your ability to plan and prevent downtime.
Paul Franklin says
I have a slightly different view. Physics of failure models can be useful, especially with new designs because they can tell you something about what to expect. Very often, however, they have to be developed by the consumer, and even then may not tell the whole story.
Take an example: read and write cycles to memory. If you happen to be interested in single bit failures, it is certainly possible to understand the physics of the situation (though it’s also possible just to count read and write cycles without reference to a physical model). Almost surely, the mix of read and write cycles will vary among applications, and it’s even possible that operating environment factors (such as temperature, vibration, etc.) could affect the (potentially time varying) rate of occurrence of failures. Vendors could certainly do this sort of testing, but it comes at a cost.
Product users are typically concerned about life cycle cost, and not the cost of unreliability per se. So there’s at least one practical question: who pays to develop the model? Even if all of this information is available, it’s possible to use a number of techniques to control effects of the failure.
I’ve had good success getting people to think about failure modes qualitatively and the controls and mitigations the design includes. Operations and maintenance staff also usually have a good idea of what really goes wrong, and the rate at which problems occur. They also typically know if things are getting worse.
It’s absolutely true that I don’t have a lot of mathematical precision with this approach. But I do get good results: we think about the failure modes and test that against prior experience; we understand how the design mitigates the failure modes we know (or anticipate); and we are able to trap lessons learned from field experience to drive corrective actions for issues we didn’t anticipate. This drives cost lower.
I’m absolutely with you that any single number hides too much, and that’s as true of summary Weibull statistics as it is with constant failure rate. And using the wrong “number” is dangerous. No one really believes that any product has a constant failure rate with an MTBF that is really 876,000 hours; that’s 100 years! And no one really believes that any product’s useful lifetime is 100 years. People who advance those sorts of notions need to be educated about all kinds of things.
Keep up the good work. There is, as you point out, a lot of it still left to do, and it doesn’t get done by wishing.
Dave Robson says
Regarding PoF and the effort required to analyse a product: it takes no greater effort than MIL217 predictions (which produce a number, with no confidence bounds or distribution) and FMECA.
There is software out there; DfR’s Sherlock and CALCE are the two I know of. But if you’re really keen and you really want to know about PoF, you could buy a decent book such as Steinberg’s Vibration Analysis for Electronic Equipment.
Fred Schenkelberg says
Thanks for adding references and I agree Dave’s book is wonderful.
The technical literature is full of fundamental material, along with many component level experiments, results, and models. It’s a great place to start to find out what is likely to fail and how the failure appears (stresses and symptoms).
Also consider the RIAC site WARP – not sure what it stands for, yet the folks at RIAC are gathering the technical papers that have PoF models. Last time I checked there were a few hundred papers listed.
And consider that an electronic system may have thousands of components (with 50 to 100 unique technologies); PoF is viable when building on the work that has already been accomplished.
John Cloutman says
Hi Fred, I’m often pressed by mechanical engineers to give them a correlation between e.g. High Temp/Humidity exposure of “nnn” hours at “X” degrees C and “Y”% RH, and a “real world” exposure duration on a material like Bayer PC2805 polycarbonate. Can you help me explain that there’s no correlation, as the real world contains a slew of variables like UV and thermal cycling which make correlation of a single steady state and the “real world” impossible? Another example is with UV exposure – it’s as if they want to try to invalidate or question the test results by suggesting that a week at high temperature or a week in a UV chamber is equal to many years in the real world.
Fred Schenkelberg says
Good comment and set of challenges. Many of the standard tests (temp/humidity, thermal cycling, etc.) have such correlations for specific failure mechanisms with specific materials, processes, and designs. The challenge in many cases is connecting the appropriate model to the testing conditions.
Polymers often oxidize, breaking down chains and thus changing material properties, color, etc. There may be, but most likely isn’t, a well known formula to relate T/H to life. If really needed, then conducting your own set of experiments may find the correlation. Creating the correlation is not impossible, yet may be difficult.
Sure, multiple stresses do exist in the real world, yet focus on the failure mechanisms: chain scission in polymers, say, could be caused by temperature, UV or other stresses. The rate of reaction may also differ for each stress or combination of stresses. Often, though, one stress dominates the reaction and allows us to accelerate the reaction in a meaningful way. Not always, thus making accelerated life testing interesting.
On UV exposure: I’ve done a set of experiments looking for color change in the past. This involved chemists, careful color measurements and carefully defined failure definitions (some color changes didn’t much matter to anyone, others were to very poor/bad/unwanted colors). It took time, yet allowed us to initially select color additives that wouldn’t drift to unwanted colors, and to create accelerated life tests to evaluate production and long term performance. The experiment didn’t look at tensile strength, crazing, and the many other ways UV ages polymer systems. We focused on the one criterion that posed a significant risk for the specific project.
Doing a T/H test without a correlation to the expected failure mechanism and a conversion to use conditions is, to me, only going to reveal gross issues that occur quickly in a product. It is a broad sweep of stress to shake out serious or major issues, and not really useful for estimating performance over time. If there is a connection to a specific set of failure mechanisms, then T/H or other testing can provide meaningful information.
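As a rough sketch of what a "conversion to use conditions" can look like when a single thermally driven mechanism dominates, the Arrhenius relationship gives an acceleration factor between test and use temperatures. The activation energy of 0.7 eV, the 85 °C test and 25 °C use temperatures, and the 1,000 hour test duration below are hypothetical values for illustration only; a real value of the activation energy has to come from experiments on the specific material and failure mechanism, and this form ignores humidity, UV, and any other stress.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K

def arrhenius_acceleration_factor(ea_ev, use_temp_c, test_temp_c):
    """Acceleration factor of the test temperature relative to the use temperature."""
    use_k = use_temp_c + 273.15
    test_k = test_temp_c + 273.15
    # AF = exp[(Ea / k) * (1/T_use - 1/T_test)], temperatures in kelvin
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / use_k - 1.0 / test_k))

EA_EV = 0.7          # hypothetical activation energy, not a measured value
TEST_HOURS = 1000.0  # hypothetical test duration

af = arrhenius_acceleration_factor(EA_EV, use_temp_c=25.0, test_temp_c=85.0)
equivalent_use_hours = TEST_HOURS * af
print(f"Acceleration factor: {af:.1f}")
print(f"{TEST_HOURS:.0f} test hours ~ {equivalent_use_hours:,.0f} use hours")
```

The result is very sensitive to the assumed activation energy, which is one reason the test only means something once the failure mechanism and its model are known.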
John, ALT can provide reasonable models, yet you need to fully understand the failure mechanisms involved and how stress level creates change in the failure mechanism behavior.
Not sure my comment helps, so let me know if you have more questions.
Jim Borggren says
I think that there are two problems with Physics of Failure:
1) It is a lot more work (and hence more costly) when compared to derivation of MTBF. Also, you probably need to do an assessment for just about every relevant failure mechanism (which is probably a good thing if you want the right answer, but again quite a bit of work / cost).
2) All the textbooks describe reliability analysis using MTBF and there are published databases of values. There doesn’t seem to be as much reference material available on Physics of Failure – either how to assess it or how to use it. Accessing hundreds of technical papers is not as easy as buying a copy of Reliability, Maintainability and Risk or Practical Reliability Engineering (or one of the other recognised reference books available).
David Robson says
I guess it depends what you’re happy with, Jim.
You spend a lot of time producing nothing, by way of a point estimate without any confidence and with an assumed underlying exponential probability distribution of failure. Nothing could be more wasteful!
There is literature out there which an Engineer can disseminate and use as a poor man’s methodology compared with Sherlock, CALCE etc. You just need to look.
It rather depends on what you expect an Engineer to do. If it’s menial tasks such as inputting component parameters into a spreadsheet and then producing a figure for MTBF, then I see that as a waste. A good Professional Engineer is able to think and develop ideas.
Using PoF affords the Engineer the opportunity to get to know the product design intimately as opposed to blocks with a point estimate.