A Series of Unfortunate MTBF Assumptions
The calculation of MTBF results in a larger number when we make a series of MTBF assumptions: we just need more time in the operating hours and fewer failures in the failure count.
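To see how the arithmetic rewards those assumptions, here is a minimal sketch; the unit counts, hours, and failure counts are invented purely for illustration.

```python
# Minimal sketch of the MTBF arithmetic; the unit counts, hours, and failure
# counts below are invented purely for illustration.

def mtbf(total_operating_hours, failures):
    """MTBF = total operating hours / number of failures."""
    return total_operating_hours / failures

# Honest accounting: 1,000 units averaging 2,000 actual operating hours each,
# with 50 failures counted.
print(mtbf(1_000 * 2_000, 50))   # 40,000 hours

# Optimistic accounting: assume 24/7 operation for a full year since shipment
# (8,760 hours per unit) and count only 30 'confirmed hardware' failures.
print(mtbf(1_000 * 8_760, 30))   # 292,000 hours
```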
While we really want to understand the reliability performance of field units, we often make a series of small assumptions that impact the accuracy of MTBF estimates.
Here are just a few of the MTBF assumptions I've seen, in some cases nearly all of them within a single team. Reliability data holds useful information if we gather and treat it well.
Assumptions Around Use
Since we only know the shipment date to a customer and when they call, let's say the unit operated full time from shipment until they called.
We suspect it takes time to actually transport the unit from our factory to the customer. It may even take time to install and place units into service. Yet that is unknown unless we ask some customers for information. Wouldn't want to bother customers.
Also, let's assume every unit is in use: no spares or stored units, nothing sitting in a warehouse or on a store shelf.
We do learn about some failures, let's assume all of them, through a call or product return from the customer. Thus, we can only assume all other units are still in service, operating full time.
This set of assumptions tends to increase the number of hours we count as operating hours, which pads the MTBF (or reliability, for that matter) and makes it look better than it actually is.
Spend the time to understand the interval from shipment until the start of service. This may be a distribution ranging from nearly immediate to months or years before a unit is placed into service. You should know this information.
Spend the time to understand the typical operating hours per day. Some items do work 24/7 while others run only on occasion. Maybe you have different classes of customers that use your product in very different ways. Again, you should know this information.
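As a rough illustration, once you have estimated the delay before service and the hours of use per day, adjusting the operating-hour estimate is straightforward. The function, dates, and numbers below are a hypothetical sketch, not data from any real product.

```python
from datetime import date

# Hypothetical sketch: estimate a unit's operating hours from its shipment
# date, an assumed delay before it enters service, and an assumed duty cycle
# (hours of use per day). Function name, dates, and numbers are illustrative.

def estimated_operating_hours(ship_date, as_of, install_delay_days, hours_per_day):
    days_since_ship = (as_of - ship_date).days
    days_in_service = max(0, days_since_ship - install_delay_days)
    return days_in_service * hours_per_day

as_of = date(2015, 6, 1)
ship = date(2014, 6, 1)

# A 24/7 customer with a one-month install delay vs. an occasional-use
# customer with a three-month delay: very different hour totals.
print(estimated_operating_hours(ship, as_of, install_delay_days=30, hours_per_day=24))  # 8040
print(estimated_operating_hours(ship, as_of, install_delay_days=90, hours_per_day=4))   # 1100
```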
Only Real Failures Count
Have you noticed that sometimes a customer will call to complain that the product isn't working, and when you receive the product back from them, it works just fine? Funny (strange), isn't it? About 25% of product returns (this varies by industry and specific product, of course) have no trouble found, or no fault found. Many of these products are then cleaned up and shipped out to other customers.
Sometimes the customer suggests a product is a failure when it is the wrong color (my ex-wife did this once). Or, it’s a failure if it doesn’t solve the problem they thought it should. Or, it’s a failure if they no longer need the product. Sometimes they call a product a failure when it doesn’t function as expected.
When analyzing returned products we want to know what to do differently or better to avoid future product failures. The units with something missing or broken are great; we can get to a root cause right in the lab and implement design/process changes to fix it.
When the analysis finds nothing wrong, do we question our analysis to make sure we are evaluating the product as the customer did? Rarely. The customer wanted the unit to operate outside on a cold day, while in the nice warm, clean lab it starts just fine. Did we just miss an opportunity to improve product reliability? Probably.
If it's an ordering mistake, the wrong color, a product that doesn't do what we thought it would do, or one that doesn't operate as we expected (where is the on/off switch…), are those failures? Did they return the product? If so, the customer called it a failure.
By not counting software bugs, ordering or use errors, or any other group of claims or reasons for a product return, we still incur the cost of the return and the potential permanent loss of a customer. It's a failure that requires more than hardware-related changes.
Let’s Smooth to View Trends
Smoothing data is just a nicer word for averaging. While MTBF is technically an average, an average of averages 'smooths' a monthly or weekly reported MTBF value just a bit more.
Let's say we ship products weekly and group the products made in a specific week into a cohort for our analysis. Let's say the MTBF values are estimated for each week's production, assuming the week's products all go into service at the same time.
Each week we add another batch to the products in the field, and we age all other batches by one week. Pretty soon that is a lot of weekly MTBF values. With normal variation, let alone active changes to components, design, or processes, the week-to-week variability of MTBF may cloud or obscure the trends represented in the data.
Let's assume a rolling average of 3 or 6 months of data will help us spot trends. Also, let's assume there is no meaningful ramp-up or decline of production at the start, seasonally, or at the end of production.
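Here is a minimal sketch of what that rolling average does, with made-up weekly MTBF values that include a real drop partway through; the 13-week window is only an illustration.

```python
import random

# Sketch of the rolling-average ('smoothing') approach described above. The
# weekly MTBF values are made up and include a real drop at week 14; the
# 13-week (~3 month) window is only an illustration.

random.seed(1)
weekly_mtbf = [50_000 + random.gauss(0, 3_000) for _ in range(13)]   # stable period
weekly_mtbf += [35_000 + random.gauss(0, 3_000) for _ in range(13)]  # after a real drop

def rolling_average(values, window):
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

smoothed = rolling_average(weekly_mtbf, window=13)

# The raw series drops abruptly; the smoothed series drifts down over the
# following 13 weeks, delaying and muting the very trend we want to see.
print([round(x) for x in smoothed])
```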
Do you see any problems with this approach? I’ll leave this one open for your comments on why smoothing may be an issue.
Summary
Our ability to gather and analyze field data often plays a central role in understanding how well a product is performing in the hands of customers. If we make a series of unfortunate assumptions while gathering, interpreting, or presenting the data, we are likely to obscure the very information we seek to understand.
Add your comment on why smoothing as described above is a potential problem. Plus, add some of the other unfortunate MTBF assumptions you have seen (and hopefully exposed and corrected!)
Michael Dashuta says
I would suggest using the trend of early removals as a reliability evaluation factor. This would be the ratio of confirmed removals within a given time frame (3-6 months), for units manufactured during the last two (?) years, to the quantity of units manufactured during the last two (?) years that are working in the field. Of course, this would require good communication and cooperation between the field service and reliability groups.
Fred Schenkelberg says
Thanks for the comment and suggestion. If I understand, the idea is to use a fixed set of ages to make comparisons? Why not use the full set of data? cheers, Fred
SPYROS VRACHNAS says
I counted 5 assumptions in your analysis.
…..operated full time from shipment…..
….every unit is in use……
….all other units are still in service operating……
….the week’s productions all go into service at same time.
…..there are not meaningful ramp/decline of production……
The question remains: How truthful is a “Field MTBF” computed based on these assumptions? Can you rely and make decisions based on this number?
Is there another way?
Fred says
Hi Spyros, thanks for the count and comments. Nice addition as well. As you know there are better ways to deal with field data. Using the time to failure data with a non-parametric or even an appropriate parametric model would be much more informative. cheers, Fred
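For example, something along these lines, sketched here with the open-source lifelines package and invented failure times, tells you far more than a single MTBF value:

```python
from lifelines import KaplanMeierFitter, WeibullFitter

# Sketch only: hours and failure flags are invented, and lifelines is just one
# convenient open-source tool for this kind of analysis.
hours  = [1200, 3400, 5000, 5000, 800, 4500, 5000, 2600]  # time in service, hours
failed = [1,    1,    0,    0,    1,   1,    0,    1]     # 1 = failed, 0 = still running (censored)

# Non-parametric view: Kaplan-Meier estimate of the survival function.
km = KaplanMeierFitter().fit(hours, event_observed=failed)
print(km.survival_function_)

# Parametric view: a two-parameter Weibull fit, whose shape parameter shows
# whether the failure rate is rising or falling, something a single MTBF hides.
wb = WeibullFitter().fit(hours, event_observed=failed)
print(wb.lambda_, wb.rho_)
```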
Kevin Walker says
We do get the hours in service reported on maybe 10% of our returns for repair or overhaul. Not much, but we were able to plot that against the months since shipment, which we do know, and get a decent regression line that shows us not only the average usage (the slope of the line) but also the offset in time from when we ship until the unit enters service (where the line intercepts the time axis). Not great, but better than a complete guess, and actually somewhat repeatable. We were also able to discern a difference in that up-front time offset between product shipped to the OEM and product shipped to the aftermarket for spares and replacement (about half the time before entering service for spares).
Where the usage wasn't constant enough to fit a line, we could at least fit a distribution of times and apply that across the fleet. Still not great, but it turns a WAG (wild guess) into a SWAG (scientific wild guess), right?!
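A minimal sketch of that line fit, using made-up numbers for illustration:

```python
import numpy as np

# Sketch of the line fit described above, using invented numbers. For returns
# that report hours in service, regress those hours against months since
# shipment: the slope is the average usage rate, and the x-intercept is the
# typical delay between shipment and entering service.

months_since_shipment = np.array([6, 9, 12, 15, 18, 24, 30])
reported_hours = np.array([700, 1300, 1800, 2400, 2900, 4100, 5200])

slope, intercept = np.polyfit(months_since_shipment, reported_hours, 1)

usage_per_month = slope                      # average hours of use per month
months_before_service = -intercept / slope   # where the line crosses zero hours

print(f"~{usage_per_month:.0f} hours per month in use")
print(f"~{months_before_service:.1f} months from shipment to entering service")
```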
As far as smoothing, why wipe out one of the few bits of information you can glean from returns – whether things are getting better or worse.
Fred says
Thanks for the note and summary Kevin. Very well stated conclusion about smoothing… cheers, Fred