When something fails, what should we do?
A natural question when something fails is
Why did it fail?
The answer is not always obvious or easy to sort out.
One of my favorite examples was a circuit board with a small burn mark where a component had exploded off the board. The customer didn’t notice the missing part; our engineering team did.
The customer noticed the features they wanted were not working anymore. The box went dark and didn’t power up anymore. It was dead. So they returned it.
That is the failure mode – the loss of feature or function. It’s what the customer notices.
Failure mechanisms and root cause
In order to answer the “why did it fail?” question in a useful manner, we need to determine the sequence of events that led to the failure. Root cause analysis is the process of determining this chain of events.
The cause may be faulty material or assembly, damage or design error. It may also include poor decisions and human error. Generally, we look for the physical or chemical reason for the failure.
We should also explore the design, assembly, supply chain, and customer-related processes to find where an error or weakness contributed to the failure.
Failure mechanisms are the material or code faults that lead to failure: thin insulation leading to dielectric breakdown, contamination leading to corrosion, faulty code leading to an overvoltage command. These are the actual elements of the product that, if prevented or avoided, would keep the failure from occurring.
Types of failures and timing
Products fail for many reasons, many mechanisms. Most products have literally hundreds of ways they can fail. It’s really a race between mechanisms to cause the failure. Eventually, everything will fail.
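The race between mechanisms can be sketched as a small simulation: each mechanism draws its own time to failure, and whichever comes first is the one the customer sees. This is only an illustration. The mechanism names, the choice of a Weibull distribution, and every parameter value below are assumptions made up for the example, not data from a real product.

```python
import random

# Hypothetical mechanisms: name -> (Weibull shape, Weibull scale in hours).
# All values are invented for illustration only.
MECHANISMS = {
    "solder joint fatigue": (3.0, 40_000.0),
    "capacitor defect": (0.6, 500_000.0),
    "power surge": (1.0, 200_000.0),
}


def first_failure(mechanisms, rng):
    """Draw one time to failure per mechanism and return the earliest."""
    draws = {
        name: rng.weibullvariate(scale, shape)  # (scale, shape) argument order
        for name, (shape, scale) in mechanisms.items()
    }
    winner = min(draws, key=draws.get)
    return winner, draws[winner]


def tally_winners(mechanisms, units, seed=0):
    """Count which mechanism wins the race across many simulated units."""
    rng = random.Random(seed)
    counts = {name: 0 for name in mechanisms}
    for _ in range(units):
        winner, _hours = first_failure(mechanisms, rng)
        counts[winner] += 1
    return counts


if __name__ == "__main__":
    print(tally_winners(MECHANISMS, units=1000))
```

Changing the assumed parameters changes which mechanism tends to win, which is why two units of the same product can fail for entirely different reasons.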
One of the first steps in sorting out the specific cause is determining when it failed. How old was the product when it failed? Early life (just bought and installed) failures tend to cause more customer anguish than the failure of a product that has provided a long life of useful service.
In general, we often talk about three types of failures.
- Early life failures
- Random failures (constant failure rate)
- Wear-out failures
Each type also suggests a set of possible causes. While not always accurate, it’s a good starting place when looking for the root cause.
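One common way to put numbers on these three types is the Weibull distribution, where the shape parameter (often written beta) indicates whether the failure rate is falling, flat, or rising. The article does not name a distribution, so treat the Weibull model here as an assumption; the function names and the tolerance band around beta = 1 are hypothetical choices for this sketch.

```python
def weibull_hazard(t, beta, eta):
    """Instantaneous failure rate h(t) = (beta / eta) * (t / eta) ** (beta - 1).

    t    -- age of the product (same time unit as eta), must be > 0
    beta -- shape parameter, must be > 0
    eta  -- scale parameter (characteristic life), must be > 0
    """
    if t <= 0 or beta <= 0 or eta <= 0:
        raise ValueError("t, beta, and eta must all be positive")
    return (beta / eta) * (t / eta) ** (beta - 1)


def classify_failure_pattern(beta, tolerance=0.05):
    """Map a fitted shape parameter to one of the three failure types.

    The tolerance band is an arbitrary illustrative choice, not a standard.
    """
    if beta < 1 - tolerance:
        return "early life"  # decreasing failure rate
    if beta > 1 + tolerance:
        return "wear-out"    # increasing failure rate
    return "random"          # approximately constant failure rate


if __name__ == "__main__":
    for beta in (0.5, 1.0, 3.0):
        rates = [weibull_hazard(t, beta, eta=1000.0) for t in (100, 500, 900)]
        print(beta, classify_failure_pattern(beta), rates)
```

A fitted beta is a clue about where to look first, not proof of the root cause.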
Early life failures
These failures are generally due to latent defects within a product or damage that occurs to a product. A few examples include:
- faulty components
- faulty assembly
- transportation damage
- installation damage
Early life (infant mortality) failures tend to exhibit a decreasing failure rate over time. This may be due to only a subset of products having the faulty batch of components, for example.
Random failures
Generally, these are events that happen to a product or component because of an outside agent or event. They tend to occur unpredictably and with random frequency. A few examples include:
- Lightning strike
- Severe overloading
- Drop or impact
- Accidental operation
The probability of these failures can be monitored, estimated, and predicted. The failure rate is constant, meaning each hour (or other unit of time) has the same chance of failure as any other hour.
Most failure mechanisms have either an increasing or decreasing failure rate, yet some have a very small change over a period of time of interest, thus effectively constant.
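To make "each hour has the same chance of failure" concrete, here is a minimal sketch assuming the exponential distribution, which is the usual model for a constant failure rate. The rate of 0.0001 failures per hour is a made-up value, and the function names are hypothetical.

```python
import math


def survival(t, rate):
    """Probability a unit is still working at age t under a constant failure rate."""
    return math.exp(-rate * t)


def conditional_failure_probability(age, window, rate):
    """Probability of failing within `window` hours, given survival to `age`."""
    return 1.0 - survival(age + window, rate) / survival(age, rate)


if __name__ == "__main__":
    rate = 1e-4  # assumed failures per hour, for illustration only
    for age in (0, 1_000, 5_000):
        print(age, conditional_failure_probability(age, window=1, rate=rate))
```

Under this assumption the chance of failing in the next hour comes out the same whether the unit is brand new or has already run 5,000 hours, apart from floating-point rounding.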
Wear-out failures
Think ‘second law of thermodynamics.’ Most materials change (degrade) over time. Water, oxygen, physical wear, and many other factors tend to erode a product’s ability to function. Wear-out failures include:
- metal fatigue (crack formation)
- corrosion (chemical change)
- abrasive wear (brake pads)
- polymer loss of elasticity, crazing (chain scission)
There are many ways products wear out. The rate of use, the operating temperature, and many other local operating and environmental factors contribute to the rate of wear-out.
Becoming aware of a product failure and starting to determine why it failed is an exploratory process. The clues of when the failure occurs may help frame the initial investigation.
By determining the failure mechanisms and root cause, you and the team can determine the best course of action to mitigate or prevent other failures.