“Failures are gold”
Early in my career the engineering manager relished discovering equipment failure.
Whether the cause was human, electronic, mechanical, or software in nature, the glint in his eye soon gave way to a flood of possibilities. He enjoyed investigating the fundamental reasons a failure occurred.
He wanted to know the cause(s) so we could design and operate a better system, one less likely to trigger those causes again. He was relentless in the pursuit of understanding what happened.
At times in my career, I’ve sent suspected bad parts or confirmed failed components to vendors for root cause analysis. That was often pure folly.
The responses often included:
- “Yes, it failed” with no other explanation
- “It’s working now, we do not see a failure here.”
- “We know about this and it’s fixed already.”
- “Oh, it was electrostatic discharge, it was your fault.” (They had not received the part yet.)
- “You overstressed the part, it was a victim of poor design or handling.”
Vendors rarely provided a meaningful root cause analysis with actionable information for improvement. If our design was at fault, in what way?
Failure Mechanisms and Modes
A failure mode is the observed symptom that indicates a failure. It’s the plume of smoke rising from the back of the monitor. It’s the lack of response to the power switch. It’s the wrong answer appearing in your calculator. It’s how you know the product is not working.
A failure mechanism is the physical, chemical, thermodynamic, or other process that results in failure, which we can then observe as a failure mode.
Failure mechanisms are where the science lives, and that is what excited my boss. Understanding the science of failures allows us to select better materials, improve manufacturing processes, or minimize loads, all with the effect of improving reliability.
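One way to keep the two ideas apart in a failure log is to record them in separate fields. The sketch below is illustrative only: the class name, field names, and example entries are hypothetical and do not come from any standard or tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FailureRecord:
    """One observed failure. All names here are hypothetical."""
    mode: str  # what was observed (the symptom)
    candidate_mechanisms: List[str] = field(default_factory=list)
    confirmed_mechanism: Optional[str] = None  # set only after analysis

    def root_cause_known(self) -> bool:
        # A mode alone never counts as a root cause.
        return self.confirmed_mechanism is not None


record = FailureRecord(
    mode="no response to power switch",
    candidate_mechanisms=["corrosion", "mechanical bending", "current overload"],
)
```

A single mode can map to several candidate mechanisms, which is why the record keeps the confirmed mechanism empty until the analysis is done.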
When a failure occurs, do root cause analysis
The common term “failure analysis” generally implies that we:
- verify the failure occurred (try the power switch again)
- gather the evidence of failure (what are the symptoms or failure modes?)
- recommend corrective or preventative action, which may include short- and long-term solutions
This common approach tends to jump to action before understanding what actually happened. Suppose a power supply is not working, and it works again after you replace the analog controller component. Is it fair to conclude the replaced part was the root cause? No.
What about the replaced part failed? Was it faulty due to corrosion or mechanical bending? Was the part merely a victim of another element in the system that caused a current overload, which in turn caused the part to fail?
Elements of a system fail for many different reasons, and until you fully understand the root cause of the failure, any corrective or preventative action is little more than a guess at what will work.
Root Cause Analysis (RCA) includes the detailed process of determining the failure mechanisms and the factors that cause that failure to occur.
RCA helps you understand whether the faulty part is due to material contamination, poor material selection, over-stress, excessive degradation, etc. Once you understand that the component cannot withstand repeated inrush current cycles, because each cycle accumulates damage to the aluminum lines on the IC, you can either change to a part that can handle the stress or reduce the stress.
Simply replacing the part and saying the vendor is at fault for poor workmanship is not going to solve the problem.
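The idea that each inrush cycle adds a little damage can be put into rough numbers. The sketch below assumes a linear damage-accumulation model (often called Miner's rule), which is my assumption for illustration and not necessarily the right model for any given part. The function name and the cycles-to-failure figures are made up.

```python
def accumulated_damage(history):
    """Linear damage sum (assumed model): each cycle at a given stress
    level uses up 1/N of the part's life, where N is the number of
    cycles to failure at that level. history is a list of
    (cycles_applied, cycles_to_failure) pairs."""
    return sum(applied / to_failure for applied, to_failure in history)


# Hypothetical figures: the part survives 10,000 inrush cycles at the
# present stress level, or 100,000 if the inrush current is limited.
N_PRESENT = 10_000
N_REDUCED = 100_000

power_cycles = 8_000
damage_present = accumulated_damage([(power_cycles, N_PRESENT)])
damage_reduced = accumulated_damage([(power_cycles, N_REDUCED)])
```

Under this model a damage sum near 1.0 means the part is close to end of life. Both remedies, a more robust part or a lower stress, work the same way: they raise the number of cycles the part can survive.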
Failure analysis is largely done in the lab, with the tools and expertise to tease out the root cause of a failure mode. Recently Bhanu Sood of NASA Goddard presented on failure analysis at the IPC APEX Expo. His slide set, Failure Analysis for Improved Reliability, is available online.
He provided an extensive overview of failure mechanisms and failure analysis techniques, along with a couple of case studies. It’s a very good read and will certainly spark your interest in celebrating failures.
Think of what you can learn and improve. I hope you get as excited by failures as my former boss did.