Is Testing The Only Way to Confirm Reliability?

Abstract

Chris and Fred discuss how we confirm the reliability of something we are making. Or maintaining. Or managing. This is in response to someone raising a question regarding reliability allocation – based on an Accendo webinar. And the question was all about working out how to test that we are on track to meet goals allocated to subsystems and components. So what do we do? Well, listen to this podcast!

Key Points

Join Chris and Fred as they discuss a question that one of our listeners asked us about testing the bits of a system that have had reliability goals allocated to them. To be specific, our listener was talking about MTBF – which is problematic. But there are some key problems that all of us face when trying to work out if we are on track!

Topics include:

It is exceedingly rare for there to be NO data or information about a component. Which means that testing should be our last resort. So why pretend no information exists? If one of your components or subsystems has been used in lots of previous models or versions … then you have run the best test possible! … by your customers! … use that data!
… and it is exceedingly difficult to demonstrate really high reliability. So if you have a reliability target of 99.9998 % … do you realize how many components you need to test to demonstrate that you have met this target with a 90% confidence level? A lot.
It is actually about reliability analysis. Not reliability testing. This is not semantics. Reliability analysis takes all forms of evidence, information, data and expert judgment you have at hand. And if you have none of this … then you test. We know how metal fails. The two Space Shuttle disasters were caused by very well-known failure mechanisms – not some weird ‘space ray’ or anything else that is not of this world. So let’s use the knowledge we have already gathered.
But if you are forced to test for MTBF, then there is an ‘industry’ standard approach. Which is very problematic. But let’s go through the steps for creating the test plan.

1. The first thing you do is assume a constant hazard rate. Which is terrible.

2. Then you work out your MTBF goal.

3. You then need to assume that the product under test is more reliable than your goal. That’s right. And we call this the discrimination ratio (DR). So if you assume a DR of 2, then you will be creating a test plan assuming that your product has an MTBF twice that of your goal (if you don’t do this, then there is very little chance of your thing passing the test).

4. You then need to quantify the risk that an unreliable product will pass your test. This is always statistically possible. So you have to limit it. Let’s say we are happy to accept a 5 % risk. Let’s call this risk ‘β’.

5. Then start by assuming a certain number of allowable failures. Let’s start with 0 … which gives us the shortest test duration! Let’s call the acceptable number of failures ‘r’.

6. Then calculate the test duration using the following equation:

$T=\frac{MTBF_{goal} \times {\chi}^{2}_{(1 - \beta ; 2r+2)}}{2}$

… where χ² is the value of the ‘chi-squared’ random variable with 2r + 2 ‘degrees of freedom’ at which its CDF equals 1 − β (so with β = 5 %, you want the 95th percentile). Explaining this one is the subject of a whole other podcast, but Excel can help you out!

7. But … now you need to work out the probability of actually passing the test if you have a product which exceeds your requirement. And you use this equation:

$\alpha = 1 - F_{Poisson}(r ; \mu = \frac{T}{DR \times MTBF_{goal}})$

where α is the risk of your really reliable thing not passing the test and $F_{Poisson}$ is the CDF of the Poisson distribution … again Excel can help you out!

8. Realize that the risk of not passing the test is way too high … and then go back to step 5. And increase the allowable number of failures. This will lengthen the test, but also reduce the risk of your really reliable thing not passing the test. And you keep doing this until you have enough allowable failures to reduce the risk to something you are happy with. And this is the number of samples you need!
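If you would rather script steps 5 to 8 than click through Excel, here is a minimal Python sketch of the loop. It is an illustration under stated assumptions, not a reference implementation: the function names, the 1,000 hour goal and the 20 % acceptable value for α are all made up for the example. To stay within the standard library it avoids a chi-squared routine and instead uses the fact that, for the even 2r + 2 degrees of freedom, half the chi-squared value in step 6 is the Poisson mean at which the chance of seeing r or fewer failures equals β.

```python
import math


def poisson_cdf(r, mu):
    """P(N <= r) for a Poisson count N with mean mu."""
    term = math.exp(-mu)
    total = term
    for k in range(1, r + 1):
        term *= mu / k
        total += term
    return total


def required_test_time(mtbf_goal, beta, r):
    """Step 6: T = MTBF_goal * chi2(1 - beta; 2r + 2) / 2.

    Half the chi-squared quantile is the Poisson mean m that solves
    poisson_cdf(r, m) == beta, found here by bisection.
    """
    lo, hi = 0.0, 1.0
    while poisson_cdf(r, hi) > beta:  # widen the bracket until it contains m
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if poisson_cdf(r, mid) > beta:
            lo = mid
        else:
            hi = mid
    return mtbf_goal * (lo + hi) / 2.0


def risk_of_failing(T, mtbf_goal, dr, r):
    """Step 7: alpha = 1 - F_Poisson(r; mu = T / (DR * MTBF_goal))."""
    return 1.0 - poisson_cdf(r, T / (dr * mtbf_goal))


def plan_test(mtbf_goal, dr, beta, alpha_max, r_max=100):
    """Step 8: raise the allowable failures r until alpha <= alpha_max."""
    for r in range(r_max + 1):
        T = required_test_time(mtbf_goal, beta, r)
        alpha = risk_of_failing(T, mtbf_goal, dr, r)
        if alpha <= alpha_max:
            return r, T, alpha
    raise ValueError("no plan found; try a larger DR, alpha_max or r_max")


# Hypothetical inputs: 1,000 hour goal, DR = 2, beta = 5 %, alpha_max = 20 %.
r, T, alpha = plan_test(mtbf_goal=1000.0, dr=2.0, beta=0.05, alpha_max=0.20)
print(f"allowable failures r = {r}, test time T = {T:.0f} h, alpha = {alpha:.3f}")
```

With zero allowable failures and β = 5 %, the test time works out to about 3 times the MTBF goal, and with a DR of 2 the risk of a good product failing is about 78 %. That is why step 8 sends you back around the loop. Note that the whole calculation still rests on the constant hazard rate assumption from step 1.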

… but you can test smarter! Find the vital few ways your thing can fail. And then focus (perhaps accelerated) testing on the dominant failure mechanism only. This will save lots of time and money. And of course, assuming a constant hazard rate is just dumb.
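To put a number on the ‘A lot’ from the point about demonstrating 99.9998 % reliability, here is a short sketch. It assumes a simple zero-failure (‘success run’) pass/fail test, where every unit must survive and the sample size n has to satisfy R^n ≤ 1 − C. The episode does not say which method to use, so treat this as one common way to size such a test; the function name is made up for the example.

```python
import math


def zero_failure_sample_size(reliability, confidence):
    """Smallest n for which n survivors out of n units demonstrate
    `reliability` at `confidence`, i.e. reliability**n <= 1 - confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))


n = zero_failure_sample_size(0.999998, 0.90)
print(f"units to test with zero failures: {n:,}")  # about 1.15 million
```

So you would need more than a million units to all pass before you could make that claim. One failure and you need even more.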

Enjoy an episode of Speaking of Reliability. Where you can join friends as they discuss reliability topics. Join us as we discuss topics ranging from design for reliability techniques to field data analysis approaches.