
I usually write articles about topics I personally struggled to understand from the available sources, such as books and online resources. I believe most technical concepts are fairly straightforward at their core, but the way we express ideas and translate our understanding into writing often makes them harder for others to grasp. That’s an area where we can all continue to improve.
As part of that journey, my goal with the Breaking Bad for Reliability newsletter is to be a communicator of Reliability Engineering principles, and I am doing this mainly for two categories of people:
- People who want to become reliability engineers but have minimal information about their responsibilities.
- People or companies who want to hire reliability engineers but don’t have a clear understanding of what they actually need, or what skills to focus on in their hiring process.
In this context, one of the topics I have been planning to write about is Failure Mode and Effects Analysis (FMEA), a widely used engineering tool in product and process risk management. Early in my career as a reliability engineer, I read a lot about FMEAs and grasped the basic idea behind them, but in practice my FMEA sessions never went the way they were described in books and academic papers. As with many engineering concepts, what’s on paper and what happens in reality don’t always align — and that’s what motivated me to write about it.
Let me be very clear here: if you are here to learn how to do an FMEA, this is probably not the right source for you. There are many good trainings and books out there on the subject, and personally, the best book I have read so far is Effective FMEAs by Carl S. Carlson. Carl did an amazing job of systematically laying out the entire process.
However, my main goal in this article is different. I want to talk about the practical challenges — the things that are rarely written anywhere — that I have experienced throughout my career. So, let’s jump in.
OWNERSHIP
The first thing I want to focus on is ownership: who should own FMEAs?
I believe in the saying, “If everyone owns it, then no one really owns it.”
In my career, I have seen systems, reliability, or design teams take ownership of FMEAs. There are pros and cons to each approach, which I will talk about in a bit, but at least having someone responsible is a big step forward.
When reliability (under systems engineering) owns it, the process is usually systematic and step-by-step: functions are identified, associated failure modes and mechanisms are listed, risks are assessed, and so on. But it generally lacks ownership from the design teams who have the ultimate knowledge about the current design. Reliability engineers often do not have deep technical knowledge of the design, so they end up chasing design engineers for information. Meanwhile, design engineers, who already have a full plate of tasks, don’t prioritize FMEAs unless the benefits are clear to them. When those benefits are not communicated, the process turns into a nightmare for reliability engineers, who eventually end up working in isolation.
On the other hand, when design teams take full ownership without any reliability engineer’s involvement, the opposite problem emerges. A huge amount of time is spent cataloging individual component failure modes, including extremely unlikely ones, but the system perspective is lost. What begins as a system risk analysis exercise quickly turns into a time-consuming documentation effort with little connection to meaningful system design decisions.
A slightly better version is when design teams still own the process but are supported by reliability engineers. In this case, reliability engineers act as facilitators, helping teams structure risks, prioritize them, and burn them down in a meaningful way. Ownership remains with design, but the process gains the structure and discipline needed to produce real value.
In reality, there is no single right answer. As engineers like to say, “it depends,” and it really does. Considering how complex and interconnected today’s technological products and processes are, having a systematic way to identify risks, prioritize them against system goals, and manage them proactively is critical to getting real value out of FMEAs. In my view, systems teams (including reliability engineers) should own FMEAs early in the design cycle, when design decisions are still being made at a high, system level. Here the focus should be on system-level risks using a top-down approach, which I explore in the next section. As the design matures and details solidify, ownership should shift to design teams, while systems and reliability engineers step back into the role of facilitators to support the process.
TOP-DOWN OR BOTTOM-UP?
This is a sensitive topic for many reliability engineers. In particular, professionals from a defense industry background often advocate for the bottom-up approach and refer to MIL-STD-1629, which was later cancelled by the DoD.
The biggest issue I see with bottom-up is the risk of wasting time on risks that do not matter at the system level within the specific operational and environmental conditions. When individual component teams start working on FMEAs without a clear system view — without high-level functions, constraints, and interfaces — they end up listing every possible failure mode, even those that have little or no impact in the system context. For example, a bearing may have dozens of different failure modes, but whether those modes matter depends entirely on the system. Is the bearing used in a car engine, a kid’s scooter, or a rocket engine pump? Without context, you waste energy analyzing irrelevant details.
To illustrate: imagine a system that consists of 5 components, and each component has 2 functions. Each function has 2 failure modes, and let’s say each failure mode has 3 causes. That alone multiplies into 60 risks that need to be identified, assessed, and managed. You can see how quickly the numbers grow. The problem is that many of those mechanisms or causes at the lower levels of the physical hierarchy may be irrelevant, or represent extremely low risks in the context of the actual system being built in that specific environment and use profile. When you follow a bottom-up approach, this kind of noise is unavoidable and can exhaust your resources long before you create any real value.
[Figure: Failure mode progression, the systematic elimination of failure causes that do not need to be carried down to lower levels.]
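The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the counting argument: the dictionary keys and the function name `total_risk_lines` are made up for illustration, and the numbers are the ones used in the paragraph above.

```python
from math import prod

# Counts taken from the example in the text (illustrative, not real program data).
counts = {
    "components": 5,
    "functions_per_component": 2,
    "failure_modes_per_function": 2,
    "causes_per_failure_mode": 3,
}


def total_risk_lines(counts):
    """Number of FMEA rows if every branch is carried down to the lowest level."""
    return prod(counts.values())


print(total_risk_lines(counts))  # 5 * 2 * 2 * 3 = 60
```

Because the total is a product, every extra level of the hierarchy, or every extra item per level, multiplies the row count rather than adding to it.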
This is why I prefer the top-down approach. You begin with the big picture, define system-level risks, then work your way down, filtering out what is irrelevant and focusing on the few risks that truly matter — often called the “vital few.”
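One way to picture the top-down filtering is as a tree walk that drops a whole branch as soon as it is judged irrelevant at the system level. The sketch below assumes a hypothetical `Node` type with a `relevant` flag; in a real FMEA that judgment comes from the team's assessment against system functions and operating conditions, not from a boolean. The example risks are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    relevant: bool = True  # hypothetical flag: does this matter in the system context?
    children: list = field(default_factory=list)


def top_down(node):
    """Return the risks kept, skipping any branch judged irrelevant."""
    if not node.relevant:
        return []  # nothing below an irrelevant item is ever analyzed
    kept = [node.name]
    for child in node.children:
        kept.extend(top_down(child))
    return kept


# Invented example hierarchy, for illustration only.
system = Node("loss of coolant flow", children=[
    Node("pump fails to deliver flow", children=[
        Node("bearing seizure"),
        Node("housing discoloration", relevant=False, children=[
            Node("paint defect"),
        ]),
    ]),
])

print(top_down(system))
```

The point of the sketch is that "paint defect" is never visited: once its parent is filtered out, none of its causes need to be identified, assessed, or managed.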
In many organizations I have worked at, bottom-up was the default simply because that was the way things had always been done. Changing that mindset was often an uphill battle. But once the benefits of a top-down approach were clearly communicated, it almost always turned into a success.
TIMING
Another important aspect of FMEAs is timing: when should you start?
Think of two extremes. On one end, you wait until the design is frozen and run FMEAs just to document findings. On the other, you start when there is only an idea — during brainstorming and concept trades. Personally, I prefer the second.
The purpose of FMEAs is to identify and manage risks structurally, and the earlier this is embedded into the decision-making process, the better. I like doing functional FMEAs in the early stages, when little is known about the physical design, and then refining them as the design matures. The later you start, the more likely FMEAs become a pencil-whipping exercise that adds no value.
In practice, however, reliability engineers are often brought into programs late in the cycle. By then, FMEAs may already have been performed poorly, people are frustrated, and trust in the process is gone. That makes the reliability engineer’s job much harder, as they must first prove the value of FMEAs and often try to salvage existing ones. In some cases, I found it easier to start from scratch rather than trying to fix a broken process. If you hear skepticism about the usefulness of FMEAs, it almost always means the purpose of the process was not well understood, and the common mistakes I described earlier were made.
WHERE TO STOP
Another aspect of FMEAs that I personally struggled with — and where I wasted an incredible amount of energy and resources early in my career — is the question we should all be asking: “How deep in the physical hierarchy should we go?”
To give you an idea, take a pump. Do we stop at the impeller and simply note “impeller breaks,” or do we go deeper and analyze specific crack mechanisms on the impeller surface? Or take a printed circuit board: should we stop at the board level, or continue breaking it down into every resistor, capacitor, and diode?
My rule of thumb is simple: stop at the point where you no longer have meaningful control. If you cannot make design changes at that level, if you lack the data or visibility to properly assess risk, or if the component is entirely sourced from an external supplier whose internal design you cannot influence, then drilling down further only produces paperwork. It adds complexity without making the system any more reliable.
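The rule of thumb above can be written down as a simple check. This is a minimal sketch, assuming three yes/no questions per item; the `Item` fields and the function name `should_go_deeper` are hypothetical, and in practice each answer is an engineering judgment rather than a flag.

```python
from dataclasses import dataclass


@dataclass
class Item:
    name: str
    can_change_design: bool   # can the team make design changes at this level?
    has_risk_data: bool       # is there data or visibility to assess risk here?
    supplier_black_box: bool  # sourced externally, internal design not influenceable?


def should_go_deeper(item):
    """Decompose further only while the team still has meaningful control."""
    return (
        item.can_change_design
        and item.has_risk_data
        and not item.supplier_black_box
    )


# Invented examples, for illustration only.
impeller = Item("in-house impeller", True, True, False)
purchased_pcb = Item("purchased PCB", False, False, True)

print(should_go_deeper(impeller))       # True
print(should_go_deeper(purchased_pcb))  # False
```

Any single "no control" answer is enough to stop, which matches the rule as stated: each of the three conditions on its own turns further decomposition into paperwork.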
The reason this matters is that going too deep drains resources and dilutes focus. I have seen teams spend weeks cataloging resistor-level failure modes in a purchased PCB. It looked impressive on paper, but in practice it contributed nothing to the reliability of the final product. What truly makes a difference in that situation is specifying clear performance requirements, testing effectively, and qualifying suppliers — not listing every possible failure of a resistor you don’t design or manufacture.
There is also the problem of complexity creep. Every function branches into modes, every mode into causes, and soon you are staring at a spreadsheet so large that no one can realistically use it. That is when FMEAs lose credibility and get dismissed as “just compliance paperwork.” By contrast, if you stop at the right level — the highest level where your team can still influence the outcome — you preserve clarity. The FMEA remains lean, credible, and actionable. Most importantly, it directs attention to risks you can actually manage, rather than drowning you in noise that you cannot.
IN SUMMARY
FMEA is just another tool in the Design for Reliability toolkit. We should not perform it just for the sake of tradition or compliance. Like any tool, its purpose is to provide information that helps us make better decisions and improve the design.
At its best, FMEA is not about filling out sections or checking boxes. It is about the actions it drives, the insights it uncovers, and the design decisions it informs. That is where the real value lies.