Language Is the Tragedy of Reliability

I was teaching a class on Reliability 101 a few years ago and it turned out to be one of those great classes where debate and discussion would just pop up all over the place. I frequently start my classes with “If I end up being the only one speaking today, I am going to take that as an indication of complete failure in having engaged you in this material.” So I was loving that this group was starting to debate each other on the material we were covering. I wasn’t even in some of the conversations. This rich environment is where I just blurted out one of my more memorable reliability quotes.

The group was arguing about metrics for reliability: what worked, what didn’t work, which ones missed key information, why this one is no good for that application. I was just listening and all of a sudden heard myself say out loud, “Language is the tragedy of reliability.” Everyone laughed and said that captured what they were arguing about perfectly. A few wrote it down.

These are my two favorite “tragedies”:

Highly Accelerated Life Testing (HALT): Seriously! This is so bad that the originator of it actually had the coconuts to blame his wife for it. Not kidding. He did this and she was in the room. I glanced back at her and she had one of those spouse smiles where her eyes weren’t smiling and you know the car ride going home was going to be a bit rough regardless of the road conditions.

Here is why it’s so bad. Highly Accelerated, ok, I’ll give you that. Life & Testing???? HALT does not measure or predict life. It also is not a test in the sense of a “pass or fail” type of activity. HALT gives up the opportunity to predict failure in a field application for the gain of quickly observing failure modes. That’s its whole thing, why it’s special, what it gave to the reliability world. It’s not a test by any means, it’s an “exploration.” The objective is to investigate and explore how the product fails: what does the failure look like, and what changes it? The failures are the entire point. This has resulted in decades of reliability engineers explaining the intent of HALT over and over and over again.

Mean Time Between Failure (MTBF): My good friend Fred Schenkelberg has created a website dedicated entirely to this miserable metric, www.NOMTBF.com. Mean Time Between Failure for some reason makes everyone think it is an indication of time to failure, i.e. wear-out. So a statement of an MTBF of 100,000 hrs is interpreted as the product will last 100,000 hrs before wearing out. If a constant failure rate is assumed, the actual interpretation of a 100,000 hr MTBF is that 63.2% of the population (that is, 1 − e^−1) will have failed due to random failures by the 100,000 hr mark. No wear-out failures are included in this measurement; items that wear out are just removed from the population and replaced with a new unit. This is why MTBF numbers are so high. Losing nearly two-thirds of the population to random failure is hopefully a long way out. It is also important to note that no individual item is intended to last 100,000 hrs. The MTBF of a human is around 800 years. That statement makes no sense to 90% of the engineering world. Translation: The metric is not doing its job.
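If you want to see the arithmetic behind that interpretation, here is a minimal sketch. It assumes the constant failure rate (exponential) model mentioned above and nothing else; the function names are mine, made up for illustration, and the 800-year human figure is simply the number quoted above plugged into the same formula.

```python
import math


def fraction_failed(t: float, mtbf: float) -> float:
    """Cumulative fraction of the population failed by time t.

    Assumes a constant failure rate (exponential model): F(t) = 1 - exp(-t / MTBF).
    t and mtbf must be in the same units.
    """
    return 1.0 - math.exp(-t / mtbf)


def time_to_fraction(fraction: float, mtbf: float) -> float:
    """Time at which a given fraction has failed, same exponential assumption."""
    return -mtbf * math.log(1.0 - fraction)


mtbf_hours = 100_000.0

# At t = MTBF the failed fraction is 1 - e^-1, regardless of the MTBF value.
print(f"Failed by {mtbf_hours:,.0f} hrs: {fraction_failed(mtbf_hours, mtbf_hours):.1%}")  # 63.2%

# Half the population is gone well before the MTBF: at MTBF * ln(2).
print(f"Half failed by: {time_to_fraction(0.5, mtbf_hours):,.0f} hrs")  # 69,315 hrs

# The human example: an 800-year MTBF is just a small yearly chance of random failure.
print(f"Chance of random failure in one year: {fraction_failed(1.0, 800.0):.3%}")
```

Read this way, a big MTBF is a statement about how rarely random failures happen across a population, not about how long any one unit will last.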

-Adam

Join the Apex Ridge Reliability mailing list: We hardly send out messages so your inbox won’t be stressed. Usually the content is an announcement for a webinar, an article, or seminar we are hosting.

Feel free to contact us about our services at www.apexridge.com

About Adam Bahret
