Reliability … why always asking ‘are we there yet’ doesn’t work

Last Verified March 12, 2024

A common refrain from managers and engineers alike (as it relates to making or maintaining reliable products) is:

How do we know if we are there yet?

This makes reliability engineering sound like driving a car, sanding a piece of wood, mowing the lawn, or any other endeavour where efforts perfectly align with progress.

This doesn’t work for reliability. Reliability can be easy to achieve, but it needs to be thought about in a different way. And when you do, everything becomes easier.

When it comes to making reliability happen, there is always a time lag. A big one.

Reliability happens when we incorporate all the simple, basic design characteristics from the start. Perhaps we secure a wire so it doesn’t pull on a solder joint. Perhaps we have a minimum radius on struts to reduce stress concentrators so we don’t get fatigue cracking. Perhaps we think of all the different tolerances of our components and work out what they need to be to avoid a significant minority of our systems being impossible to assemble because individual component tolerances have ‘stacked up.’
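To make the tolerance ‘stack up’ idea concrete, here is a minimal sketch of the arithmetic. The part count and the ±0.1 mm values are made up for illustration, and the function name is hypothetical. It compares the worst-case stack (every part at its limit at once) with the root-sum-square (RSS) estimate, which assumes the individual variations are independent.

```python
import math

def stack_up(tolerances):
    """Return (worst_case, rss) for a list of symmetric +/- tolerances."""
    # Worst case: every part sits at its tolerance limit at the same time.
    worst_case = sum(abs(t) for t in tolerances)
    # RSS: statistical estimate, assuming independent variations.
    rss = math.sqrt(sum(t * t for t in tolerances))
    return worst_case, rss

# Hypothetical example: four parts in a row, each +/-0.1 mm.
tolerances_mm = [0.1, 0.1, 0.1, 0.1]
worst_case, rss = stack_up(tolerances_mm)
print(f"worst case: +/-{worst_case:.2f} mm, RSS: +/-{rss:.2f} mm")
```

If the gap these four parts must fit into only allows ±0.2 mm, the RSS figure says most assemblies will go together, while the worst-case figure says some will not. That is a five-minute calculation at the start of design, and a very expensive discovery on the production line.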

Many of these great ideas need to have been thought of in the first ‘five minutes’ of design when they are trivially easy to make happen. BUT – their impact on reliability becomes apparent much, much later.

Asking if ‘we are there yet’ makes us think we can do something if we are ‘not there yet.’

Otherwise, why would we ask? Think of a production process where we have (for example) lots of design reviews with lots of milestones or ‘gates.’ This implies that there are options we can explore if we miss a milestone or ‘gate.’ Why would we review a design if, no matter how good or bad it is, we will keep doing what we have been doing?

Is there some sort of reserve force of engineering magnificence that we can summon if we declare a ‘reliability emergency’?

No.

Reliability happens at the point of decision. Not the point of measurement.

Reliability is increasingly difficult (and expensive) to bake into our design the further into our production process we go. This means that reliability strategies based on ongoing monitoring of reliability performance can at best only identify what we should have done ages ago. There is never enough time, money, or customer patience to indulge us having to redo stuff we thought we had already done.

Measuring reliability (the traditional way) is hard, expensive, and long.

Especially if you are hoping to design really reliable stuff. If your product, system or service has a service life of 7 years, how long will the test need to be to get enough failures to estimate reliability well enough? Invariably, the answer is ‘too long.’
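A rough sketch shows why the answer is ‘too long.’ It assumes a constant failure rate (an exponential life model), which is a simplifying assumption and often a poor one for wear-out failures, and it assumes a test that must finish with zero failures. The 95% reliability target, 90% confidence level and 20 test units are hypothetical numbers chosen for illustration.

```python
import math

def zero_failure_test_time(reliability, mission_time, confidence):
    """Total unit-time on test needed to demonstrate `reliability` at
    `mission_time` with the given confidence, if no failures occur.
    Assumes a constant failure rate (exponential model)."""
    # Failure rate implied by the target: R = exp(-rate * mission_time)
    rate = -math.log(reliability) / mission_time
    # With zero failures, confidence C requires exp(-rate * T) <= 1 - C
    return -math.log(1.0 - confidence) / rate

# Hypothetical target: 95% reliability over a 7-year service life, 90% confidence.
total_unit_years = zero_failure_test_time(0.95, 7.0, 0.90)
units_on_test = 20
years_per_unit = total_unit_years / units_on_test
print(f"{total_unit_years:.0f} unit-years in total, "
      f"{years_per_unit:.1f} years for each of {units_on_test} units")
```

Under these assumptions the test needs roughly 314 unit-years without a single failure, or more than 15 years for each of 20 units. Acceleration can shorten that, but it brings its own assumptions about how stress maps to life.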

There is widespread pandemonium about ‘passing onerous tests.’

I have seen many projects based on pre-production testing of prototypes that are immediately followed by shambolic high-rate manufacturing that devastatingly removes all the reliability we thought we had. Component defects, assembly errors and all manner of other quality-related issues ruin reliability, budget and schedule.

There is a reason there is something called ‘Design for Manufacturability.’ But if we are interested in passing tests or getting through ‘design gates’ we forget about what lies on the other side of said gates. Meaning we don’t think about manufacturability when we need to be designing it into our system.

And the solution?

Stop asking ‘are we there yet?’

Start asking ‘what do we need to do to get there?’

This changes things from continually identifying what we should have done to continually searching for what we need to do. Using activities like Failure Mode and Effects Analyses (FMEAs) helps identify likely weak points in the ‘first five minutes’ of design. Using things like Highly Accelerated Life Testing (HALT) to push preliminary designs to (and beyond) their limits shines a light on whatever weak points are left. We call these weak points our VITAL FEW. We then target other activities (like data analysis and statistics) on those VITAL FEW and nothing else.
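One common way to turn an FMEA worksheet into a short list is the Risk Priority Number (RPN), the product of severity, occurrence and detection scores, each typically rated from 1 to 10. The sketch below is only an illustration of that ranking step: the failure modes, their scores and the function name are all hypothetical, and RPN is just one of several prioritization schemes a team might use.

```python
def rank_failure_modes(modes, top_n=2):
    """Rank (name, severity, occurrence, detection) tuples by RPN
    and return the top_n as (name, rpn) pairs."""
    scored = [(name, sev * occ * det) for name, sev, occ, det in modes]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

# Hypothetical worksheet rows: (failure mode, severity, occurrence, detection)
fmea_rows = [
    ("wire pulls on solder joint", 8, 6, 7),
    ("fatigue crack at strut radius", 9, 4, 8),
    ("tolerance stack-up blocks assembly", 5, 5, 3),
    ("label adhesive peels", 2, 3, 2),
]

vital_few = rank_failure_modes(fmea_rows, top_n=2)
for name, rpn in vital_few:
    print(f"{rpn:4d}  {name}")
```

With these made-up scores, the solder joint and the strut radius come out on top, so that is where HALT, data analysis and statistics would be pointed first.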

And let’s say that for whatever reason, we still need to pass some sort of reliability demonstration test (due to regulators or contractual requirements). If we have focused on asking ‘what do we need to do to get there,’ we can relax and know that our amazing product, system or service will pass that test with flying colours.

Wouldn’t that be nice?

So tell me what your experiences are. Can you relate to any of the scenarios I talked about? Or better yet, do you have any examples of how working out what you needed to do to ‘get there’ yielded wonderful results?
