Reliability Goals and Allocation

Abstract

Chris and Fred discuss reliability goals … and how we ‘allocate’ them to components and subsystems. This is really important, as subsystems and components are often built and designed by different designers or organizations. So how do we translate system-level goals to MEANINGFUL lower-level goals?

Key Points

Join Chris and Fred as they discuss reliability goals and allocation. Setting reliability goals can be really challenging … mainly because it forces us to think and then draw some sort of a line in the sand! So how do we set RELIABILITY goals and then make these relevant to each part of our design team?

Topics include:

What is a reliability goal? Something that is unambiguous, objective, not ignored, not short-cut or otherwise dismissed as a ‘nice to have.’ A reliability goal is not complying with all standards and tests. The Blowout Preventer (BOP) on the Deepwater Horizon should be the final word on this. So a reliability goal needs to be some sort of characteristic particular to your product, system or service that represents what the user or customer wants. It is not what we have always done. It is not the reliability specification that is ‘as high as we can get it’ for our testing budget. It is something that matters.
But there are some organizations that have no reliability goals … so why can’t I? Netflix (for example) focuses on continually trying to ‘break’ their system, analyze what went wrong, and then fix it. This sounds like ‘build-test-fix’ … which is almost always bad. Oh … and Netflix doesn’t have reliability goals, and yet they run one of the most reliable data networks around. So if they don’t have goals, and all they do is ‘build-test-fix’ … why can’t we? The reason comes down to culture. Netflix is testing their system beyond what is ‘normal.’ They delete virtual instances, turn off data centers and do lots of other things which REALLY stress their systems. They are continually looking for the weakest point of their system, even if finding that weakest point requires what might seem like ridiculous or ‘unreasonable’ stresses. And their system (as of today) is actually ‘there’ when it comes to reliability performance. They are focusing on CONTINUAL IMPROVEMENT, as opposed to ‘build-test-fix’ until we pass a test and then stop. These are two very different approaches.
… are we there yet? This is a classic question we ask for ‘build-test-fix.’ That is, we keep ‘doing reliability stuff’ until we reach some sort of goal. The problem with this is that reliability decisions are most effective (and cheapest) when they are implemented early in the design. But … their effect can only be measured months or years later. This really slow feedback loop dooms most organizations, as it takes too long for designers to learn whether their decision was ‘good’ or not. Reliability Growth Testing (RGT), where the system is operated under ‘at use’ conditions, makes this issue really apparent. You need a ‘fully functional’ prototype to do RGT. By this stage, it is really expensive to change the design to improve reliability. So RGT often devolves into doing the least intrusive corrective action. But … you can track improvement in reliability. The other option is simply investing all your resources into making reliability happen as opposed to measuring reliability later on. This is always faster and cheaper.
The key for success is …? A simple philosophy. Some organizations focus all their efforts on doing (proper) FMEAs. Others focus on using HALT to keep improving each generation of their product (sort of like Netflix). Successful organizations tend to have worked out which approach works best for them, and make sure that is deeply embedded in company culture.
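To make the ‘allocation’ idea concrete, here is a minimal Python sketch of one common textbook approach. It is not necessarily the method Chris and Fred have in mind. It assumes a series system (every subsystem must work) with independent failures, so the system reliability is the product of the subsystem reliabilities. The function names, the subsystem names and the weights are all hypothetical and only there for illustration.

```python
import math


def equal_allocation(system_goal: float, n_subsystems: int) -> float:
    """Give every subsystem the same target: R_i = R_sys ** (1 / n).

    Assumes a series system with independent subsystem failures.
    """
    if not 0.0 < system_goal < 1.0:
        raise ValueError("system_goal must be strictly between 0 and 1")
    if n_subsystems < 1:
        raise ValueError("n_subsystems must be at least 1")
    return system_goal ** (1.0 / n_subsystems)


def weighted_allocation(system_goal: float, weights: dict) -> dict:
    """Split the goal by weight: R_i = R_sys ** (w_i / sum of weights).

    A larger weight hands that subsystem a larger share of the allowed
    unreliability, so it receives a LOWER reliability target.
    The product of all targets equals the system goal.
    """
    if not 0.0 < system_goal < 1.0:
        raise ValueError("system_goal must be strictly between 0 and 1")
    if not weights or any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be non-empty and all positive")
    total = sum(weights.values())
    return {name: system_goal ** (w / total) for name, w in weights.items()}


# Hypothetical example: a 0.95 system goal over some stated mission time,
# split across three made-up subsystems. The weights stand in for a team
# judgment (complexity, novelty, ...), not for measured data.
example_weights = {"pump": 3.0, "controller": 2.0, "sensor": 1.0}
equal_target = equal_allocation(0.95, len(example_weights))
weighted_targets = weighted_allocation(0.95, example_weights)

for name, target in weighted_targets.items():
    print(f"{name}: target reliability {target:.4f}")
```

Whatever scheme is used, the numbers only become a goal in the sense discussed above once each team agrees that its share is meaningful and achievable. The arithmetic is the easy part.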

Enjoy an episode of Speaking of Reliability, where you can join friends as they discuss reliability topics. Join us as we discuss topics ranging from design for reliability techniques to field data analysis approaches.