Author: Mark Latino
RCA and How to Understand the Basics of Component Failure
When performing a PROACT® Root Cause Analysis (RCA) there is a data collection step called ‘Preserve’ (or the PR in the PROACT acronym) which requires the team to collect failed parts, conduct interviews, obtain paper data and positional information after an undesirable event occurs.
The method also has a step to construct a logic tree and hypothesize all of the possible ways an undesirable failure mode can occur. This paper explores what internal knowledge is helpful when examining failed parts and how that knowledge verifies the physical possibilities on the logic tree.
When leading an RCA investigation, the investigator will collect failed parts like bearings, mechanical seals, shafts, etc. The broken parts are inspected to determine what forces the part experienced as the event unfolded. Knowing the forces present allows the RCA team to verify whether a given hypothesis did or did not occur.
There are two types of mechanisms that cause mechanical failures: either material is lost or the material is overpowered. Material loss in turn has two mechanisms: 1) corrosion or 2) erosion/wear.
Overpowering also has two mechanisms: 1) the material is overpowered by a single load application or 2) the material is overpowered over time by fatigue.
Together these give four all-inclusive buckets, or hypotheses, used in logic trees for how a material can fail: corrosion, erosion/wear, overload, and fatigue.
Failure Mechanisms for Overpowering
For now, we will talk about the mechanisms that overpower materials. Because 90% of all mechanical failures are caused by fatigue, we will talk about fatigue first.
Fatigue occurs when a material is subjected to repeated loading and unloading. When the loads are above a certain threshold, microscopic cracks begin to form at the surface of stressed areas. Eventually a crack reaches a critical size, at which point the remaining material can no longer support the load and the material suddenly fractures.
Overload failures occur in two forms based on whether the material is brittle or ductile. If the material is brittle, the failure is called a brittle overload fracture. Brittle overload occurs instantly, usually from a single load application. If the material is ductile, the material will deform plastically before it fails.
When analyzing the failure type, an easy way to identify a fatigue failure is that there has to be an origin (or origins) plus one other feature. If the failed part has an origin plus progression marks, it is fatigue. If the part has an origin plus a final fracture zone, it is also fatigue. Sometimes the load variations are so minor that you can’t visually see the progression marks, but there is still a final fracture zone. These are the most frequent indicators. There are other indicators that we won’t get into right now, as they add unnecessary complexity.
Brittle overload failures can be identified by a ‘salt and pepper’ look on the surface. The salt and pepper appearance arises because the fracture moves across the surface so fast. The origin can be found by following the chevron marks, which look like arrows and point to the origin. Chevron marks are only present in brittle overload failures. If the fracture was a tension failure, you will most likely have a hinged lip. We see this with fastener failures and sometimes alignment pin failures. The final visual clue is that the failed pieces look as if they can be put back together perfectly. Don’t worry, pictures are coming!!
Ductile overload failures are visually identified by material deformation. Ductile failures happen in the plastic range of the stress-strain curve, so they will have a ‘cup and cone’ appearance when they fail in tension. There may also be a fibrous look to the surface. We often see this in wire rope failures because wire rope is a ductile material and the job it performs is lifting. Therefore, it tends to fail most often in tension.
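The visual cues described above can be sketched as a small rule set. This is only an illustration of the reasoning, not a substitute for a trained eye: the `FractureSurface` fields and the `classify_overpowering` function are hypothetical names introduced here, and the order in which the rules are checked is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FractureSurface:
    """Visual observations from a failed part (hypothetical field names)."""
    has_origin: bool = False
    progression_marks: bool = False
    final_fracture_zone: bool = False
    chevron_marks: bool = False
    salt_and_pepper: bool = False
    pieces_fit_together: bool = False
    visible_deformation: bool = False
    cup_and_cone: bool = False
    fibrous_surface: bool = False


def classify_overpowering(s: FractureSurface) -> str:
    """Apply the article's visual rules; rule order is an assumption."""
    # Fatigue: an origin "plus one" (progression marks or a final fracture zone).
    if s.has_origin and (s.progression_marks or s.final_fracture_zone):
        return "fatigue"
    # Ductile overload: deformation, cup-and-cone, or a fibrous surface.
    if s.visible_deformation or s.cup_and_cone or s.fibrous_surface:
        return "ductile overload"
    # Brittle overload: chevrons, salt-and-pepper look, or pieces that fit back together.
    if s.chevron_marks or s.salt_and_pepper or s.pieces_fit_together:
        return "brittle overload"
    return "undetermined"


shaft = FractureSurface(has_origin=True, progression_marks=True, final_fracture_zone=True)
print(classify_overpowering(shaft))
```

A part showing none of these cues comes back as "undetermined", which is the signal to send it to a metallurgist rather than guess.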
Now let’s show how this would work when performing an RCA.
For example, let’s say you have experienced an unexpected pump shaft failure on PCH-112. You have collected the failed shaft, and it looks like the shaft below (Figure 1).
The logic tree being developed by the RCA team states the Event was ‘PCH-112 Unexpectedly Lost Function’ and the only Failure Mode is ‘Shaft Failed’. The first level of hypotheses answers the question “How can a shaft fail?”
There are four all-inclusive buckets (as stated earlier) for how a shaft can fail; the shaft can erode, corrode, fatigue, and/or be overloaded.
The logic tree at this point would look similar to the one below (Figure 2). To determine which possibilities did and did not occur, we will use the failed shaft inspection results to help us.
When interpreting a logic tree, just remember that the top box is the Event or Undesirable Outcome that forced us to take action (PCH-112 Unexpectedly Lost Function). This happened because the Pump Shaft Failed (Failure Mode). We know these to be true, because we can see the evidence with our eyes.
Level to level in a logic tree is essentially a cause-and-effect relationship. Underneath the Mode level, as we explore the physics of the failure, we simply keep asking ‘How Could?’ This will generate hypotheses which have to be proven or disproven with hard evidence (not hearsay). Figure 2 shows how we expressed our four hypotheses for how our shaft could have failed.
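For readers who think in data structures, the top of this logic tree can be sketched as a simple nested structure. The `Node` class and `render` helper are hypothetical names used only for illustration; PROACT itself does not prescribe any particular representation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One box on the logic tree (hypothetical structure)."""
    label: str
    kind: str  # "event", "mode", or "hypothesis"
    children: List["Node"] = field(default_factory=list)

    def add(self, label: str, kind: str = "hypothesis") -> "Node":
        """Attach a child box one level down and return it."""
        child = Node(label, kind)
        self.children.append(child)
        return child


def render(node: Node, depth: int = 0) -> List[str]:
    """Return the tree as indented text lines, one per box."""
    lines = ["  " * depth + f"[{node.kind}] {node.label}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


event = Node("PCH-112 Unexpectedly Lost Function", "event")
mode = event.add("Shaft Failed", "mode")
# The four all-inclusive answers to "How can a shaft fail?"
for how in ("Erosion", "Corrosion", "Fatigue", "Overload"):
    mode.add(how)

print("\n".join(render(event)))
```

Each call to `add` answers one ‘How Could?’ question, so the depth of the structure mirrors the cause-and-effect levels of the tree.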
If an analyst doesn’t have any knowledge of fracture basics, they would most likely have to send the broken part out to an internal or external expert (metallurgist) to analyze. They would then have to wait for the report explaining the forces present at the time of failure.
If an analyst has basic metallurgical knowledge, they can identify those forces themselves (with their trained eye) and move the RCA forward. Let’s take a look at what this part is telling us.
The arrows in the photograph above point to progression marks (Figure 3). There are many progression marks across the surface of the fracture. Progression marks are only present in fatigue failures. They represent the propagation of a crack. Cracks need load fluctuations to propagate across the shaft surface. The more rapid the growth the farther apart the progression marks will be. The information at this point has verified the failure as fatigue. What other information is present in the part?
In the photograph below (Figure 4) we can see the crack origin. When we follow the progression marks backwards, they point to the crack origin, which is always the point of highest stress. This is usually a sharp corner; in this case it is the sharp corner of the keyway.
When the progression marks are followed away from the origin, we will find the Final Fracture Zone (FFZ). The FFZ is the point where the material could no longer support the remaining load and it breaks. The FFZ also has information for the investigator. The larger the FFZ, the heavier the load was at the time of failure. This part’s FFZ was small. Therefore, the load was minimal. What we have here is a fatigue failure that started in a sharp corner of the keyway under minimal load (Figure 5).
The shaft’s side view below also has some information to contribute. Figure 6 shows the part turned up on its side; the break is at about a 45-degree angle. This indicates some torsion and/or bending was also going on at the time of failure.
The information from the failed part also helps the investigator with the next step of data collection. The question now is “How long was the shaft in service?” Why does this matter to us?
If the shaft had been in service two+ years and the loads were minimal, what would need to change to make the failure happen? Most investigative teams would likely be interested in any operational changes, possibly an increase in throughput. Obtaining process data before and during the time of the event would be able to verify if the throughput was increased or not.
Another possibility (hypothesis) might be there was shaft corrosion present, severe enough to lower the material’s fatigue strength, which could cause the failure at normal operating loads.
Now let’s say the shaft was in service for only two days, what direction would be most logical to pursue now?
Usually when the service time is short, investigative teams would first focus data collection on the shaft itself. Some concerns to investigate could include:
1. Was the correct shaft installed?
2. Was the shaft ordered from stores or stock?
3. Was the shaft material correct for the service?
Obtaining the equipment specifications, maintenance manuals, drawings, etc. and comparing them to the actual shaft dimensions and chemistry would also help to validate the hypothesis.
The other direction would be to investigate everything about the installation. Some things of concern might be:
1. Was the procedure followed?
2. Was a procedure even used?
3. Was it aligned properly?
4. Was a baseline vibration signature performed after initial start-up?
These concerns center more on human installation errors.
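The way service time steers the next round of data collection can be sketched as a simple lookup. The function name `next_questions` and the 30-day cutoff between short and long service are assumptions made purely for illustration; the article only contrasts two days with two-plus years, and a real team would set its own boundary.

```python
def next_questions(days_in_service, short_service_days=30):
    """Suggest lines of inquiry from service time.

    short_service_days is a hypothetical cutoff, not a value from the article.
    """
    if days_in_service < 0:
        raise ValueError("days_in_service cannot be negative")

    if days_in_service <= short_service_days:
        # Short service: look at the part itself and at how it was installed.
        return {
            "shaft itself": [
                "Was the correct shaft installed?",
                "Was the shaft ordered from stores or stock?",
                "Was the shaft material correct for the service?",
            ],
            "installation": [
                "Was the procedure followed?",
                "Was a procedure even used?",
                "Was it aligned properly?",
                "Was a baseline vibration signature performed after initial start-up?",
            ],
        }

    # Long service under minimal load: ask what changed.
    return {
        "operational changes": [
            "Was throughput increased before or during the event?",
        ],
        "material condition": [
            "Was corrosion severe enough to lower the fatigue strength?",
        ],
    }


for topic, questions in next_questions(2).items():
    print(topic, "->", len(questions), "questions")
```

Returning questions rather than answers keeps the sketch honest: service time only tells the team where to look next, and each question still has to be settled with hard evidence.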
Let’s move back to the logic tree and see if any of the possibilities can be eliminated. The information the part provides, read with a basic understanding of material failure, allows the investigator to determine what did and did not happen.
The hypothesis blocks also have a number in the bottom, left-hand corner, which is what we call the ‘Confidence Factor’ of the verification method used. The ‘0’ indicates with 100% confidence that erosion, corrosion, and overload did NOT occur. The ‘5’ indicates with 100% confidence that fatigue DID occur (See Figure 7).
Since progression marks can only occur in fatigue failures, the lead investigator is 100% confident in the fatigue conclusion. Since there were no visual signs of erosion, corrosion, or overload, the lead investigator is 100% confident they did not occur.
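Recording confidence factors against each hypothesis can be sketched as follows. Only the two endpoints come from the article (0 means it did not occur, 5 means it did); treating the values in between as whole-number degrees of confidence is an assumption, and the `record` and `still_open` helpers are hypothetical names.

```python
DID_NOT_OCCUR = 0  # 100% confident the hypothesis did NOT occur
DID_OCCUR = 5      # 100% confident the hypothesis DID occur


def record(tree_level, hypothesis, confidence):
    """Store a confidence factor; the integer 0-5 scale is an assumption."""
    if not isinstance(confidence, int) or not 0 <= confidence <= 5:
        raise ValueError("confidence factor must be an integer from 0 to 5")
    tree_level[hypothesis] = confidence


def still_open(tree_level):
    """Hypotheses that have not been ruled out and deserve a further 'How could?'"""
    return [name for name, cf in tree_level.items() if cf != DID_NOT_OCCUR]


level = {}
record(level, "Erosion", DID_NOT_OCCUR)
record(level, "Corrosion", DID_NOT_OCCUR)
record(level, "Fatigue", DID_OCCUR)
record(level, "Overload", DID_NOT_OCCUR)

print(still_open(level))
```

Only the hypotheses left open are carried down to the next level of the tree, which is why the fracture surface evidence narrows the investigation so quickly.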
The question now is “How could the pump shaft be fatigued?” (Figure 8). There are only two kinds of fatigue: thermal and mechanical. To verify thermal or low cycle fatigue, the analyst could view the part using a powerful microscope and visually see the effects of heat fluctuations. In this case, thermal fatigue signs were not present, and it was ruled out.
The “How could?” question is again applied, “How could the pump shaft have mechanically fatigued?” There are four possible all-inclusive hypotheses: Misalignment, Unbalance, Resonance, and Looseness.
This level can be verified using vibration data taken before the failure occurred. Vibration trend data is extremely valuable for verifying types of vibration. A vibration signature history can quickly validate all four hypotheses.
The results verify there was misalignment present before the failure. This becomes a physical root cause. If the misalignment was not present, then the shaft would not have failed. The physical root was determined using basic understanding of what the fracture surface was telling us.
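The chain of verified hypotheses from the Event down to the physical root can be sketched as a walk through the tree. Each node here is a hypothetical `(label, verified, children)` tuple. The article reports only misalignment as verified at the last level, so the other three vibration hypotheses are left as `None` (not reported) rather than assumed to be ruled out.

```python
def verified_paths(node, path=()):
    """Return every chain of verified boxes from this node down to its deepest verified box."""
    label, verified, children = node
    if verified is not True:
        return []  # ruled out (False) or not reported (None): stop here
    path = path + (label,)
    found = []
    for child in children:
        found.extend(verified_paths(child, path))
    # If nothing below is verified, the chain ends at this box.
    return found or [path]


tree = (
    "PCH-112 Unexpectedly Lost Function", True, [
        ("Shaft Failed", True, [
            ("Erosion", False, []),
            ("Corrosion", False, []),
            ("Overload", False, []),
            ("Fatigue", True, [
                ("Thermal", False, []),
                ("Mechanical", True, [
                    ("Misalignment", True, []),
                    ("Unbalance", None, []),   # outcome not stated in the article
                    ("Resonance", None, []),   # outcome not stated in the article
                    ("Looseness", None, []),   # outcome not stated in the article
                ]),
            ]),
        ]),
    ],
)

for chain in verified_paths(tree):
    print(" -> ".join(chain))
```

Reading the printed chain from left to right retraces the cause-and-effect logic of the analysis, ending at the physical root.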
If we were to continue, the next levels down would get into the human and latent root causes (which is another article that explores human reasoning and decision-making). In the interim, if you’d like to download this free Personnel Error Diagnostic Chart, it would help you complete your RCA and navigate the human and latent roots.
Author’s Note: I want to thank my mentors Neville Sachs and Edward Sullivan for teaching me how important it is to pay attention to the fracture surface and question everything about that surface. There are many things we have visually seen like hammer marks, vise marks, chisel marks, welded nuts used as an additional set screw, and the like. The markings can tell a story about how much of a problem the equipment has been for maintenance.
About the Author: Mark Latino is currently President of Reliability Center, Inc. (RCI). Mark came to RCI after 19 years in corporate America, during which he acquired a wealth of reliability, maintenance, and manufacturing experience. He worked for Weyerhaeuser Corporation in a production role during the early stages of his career. He was an active part of Allied Chemical Corporation’s (now Honeywell) Reliability Strive for Excellence initiative, which was started in the 70s to define, understand, document, and live the reliability culture, until he left in 1986. Mark spent 10 years with Philip Morris, primarily in a production capacity and ending in a reliability engineering role. Mark is a graduate of Old Dominion University and holds a BS Degree in Business Management that focused on production and operations.