Recently a power outage caused approximately 2,000 homes to lose power on a very cold day. The newspaper headline read, “All-day outage caused by worn wiring.”
This seems like a reasonable statement and, like many other newspaper headlines, it appears to go a long way toward explaining why 2,000 homes and businesses lost power for 5 ½ hours, and why 300 lost power for a total of 11 ½ hours.
I suspect that many of us take these types of headlines at face value and chalk it up to “it is just the newspaper” or just normal journalism.
I try hard to question statements like these. When I did, I thought that most of us have probably had something fail that was traced to worn wires. We all know that a worn wire will cause problems; that much is implied in the headline. I think the more important question is: what caused the worn wires?
The problem with stopping too soon is, if you aren’t careful, you may convince yourself of all sorts of things, without questioning whether there was more to the story or not.
Was there perhaps more to uncover, or at least to think about? Can worn wiring by itself be the cause of this incident? As you probably guessed, I think they stopped too soon.
A simple solution?
Let me ask those that are reading this – “have you ever seen worn wiring that did not cause an actual incident?” I have and I’ll bet others have too.
A picture is worth a thousand words, so let’s put one up to discuss.
Figure 1 presents a cause and effect analysis that ends at worn wiring. I couldn’t bring myself to put down just “outage caused by worn wiring”; there were too many questions.
Thinking it through, wouldn’t the outage be caused by a lack of power, which would be caused by some type of short, which in turn might have been caused by worn wiring? Lots of questions arise once we start putting our thoughts down on paper.
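The chain of questions above can be written down as a small data structure. The following is only a sketch: the node names are taken from the paragraph above, while the dictionary layout and the `cause_chain` helper are illustrative assumptions rather than part of any RCA tool.

```python
# Each effect maps to the cause suggested in the text.
# None marks the point where the analysis stopped asking "why?".
CAUSED_BY = {
    "outage": "lack of power",
    "lack of power": "short",
    "short": "worn wiring",
    "worn wiring": None,  # the headline's stopping point
}


def cause_chain(effect, caused_by):
    """Follow 'caused by' links from an effect until no cause is recorded."""
    chain = [effect]
    seen = {effect}
    while caused_by.get(chain[-1]) is not None:
        next_cause = caused_by[chain[-1]]
        if next_cause in seen:
            # Guard against a circular chain, which would loop forever.
            raise ValueError(f"circular cause detected at {next_cause!r}")
        chain.append(next_cause)
        seen.add(next_cause)
    return chain


chain = cause_chain("outage", CAUSED_BY)
print(" <- ".join(chain))
print(f"Analysis stopped at: {chain[-1]}")
```

Adding an entry for "worn wiring" in place of `None` is the code equivalent of asking what caused the worn wires.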
I believe that years of acceptance have conditioned many of us not to even question these things. Aside from the fact that the investigators may have missed some intermediate steps, if we stop at worn wiring the obvious solution is to replace the worn wiring.
I’m sure all of your customers, whether they be consumers or internal operations, are glaring at you, or calling on the phone, and asking you to fix this and get them up and running and back to normal. So the wiring is replaced and everything goes back to normal.
What might the diagram look like if someone says, “Hey, wait a minute, what caused the worn wiring?” I took the liberty of guessing at what the rest of the diagram might look like, just to make the point of this discussion, and have included it below.
Figure 2 identifies what might be the potential causes if we didn’t stop at worn wiring. Some interesting things potentially pop up past the worn wiring stopping point.
In any investigation, it is up to the group to determine when to stop, and unfortunately, I see too many groups stop too soon. I want to reiterate that I have taken liberties with the possible causes, just to make a point in this article, since I wasn’t involved in the investigation.
It is important to note that if you stop at worn wiring you will not see these causes and not seeing what is shown in Figure 2 might cause you to miss some effective solutions.
I want to be clear that I am not arguing against replacing the worn wiring. It is necessary, it is a perfectly viable solution, and it needs to be done. The first order of business is to get things back to normal. The question is, will this keep the problem from recurring?
I think the answer to that depends on some definitions.
What is a solution?
The whole purpose of doing an RCA is to find the underlying causes of an incident so that we can propose solutions to prevent it from recurring. To continue, some discussion of solutions is appropriate, so I ask the question, “What makes a good solution to an incident?”
“If you can’t measure it you can’t manage it” is a quote that floats around and is sometimes incorrectly attributed to Dr. Deming.
This is probably not true in all cases but I believe that if you can put a measure on it you can manage or improve it. Let’s define some criteria on what makes a good solution, so we can manage it.
Four criteria that can be used to measure the effectiveness of a solution are:
- Does this solution prevent recurrence?
- Is this solution within your control?
- Does this solution meet your goals and objectives?
- Does this solution cause other unacceptable problems that you are aware of?
Most likely the replacement of worn wiring solution, offered above, meets all of the criteria given with the possible exception of the first one.
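The four criteria can be sketched as a simple checklist. This is a minimal illustration, not a formal scoring method: the `SolutionCheck` class and its field names are hypothetical, and the answers entered for the wiring replacement reflect the tentative judgment in the paragraph above, not findings from the actual investigation.

```python
from dataclasses import dataclass


@dataclass
class SolutionCheck:
    """Answers to the four solution criteria for one proposed solution."""
    name: str
    prevents_recurrence: bool
    within_control: bool
    meets_goals: bool
    causes_unacceptable_problems: bool

    def failed_criteria(self):
        """Return the criteria this solution does not satisfy."""
        failed = []
        if not self.prevents_recurrence:
            failed.append("prevents recurrence")
        if not self.within_control:
            failed.append("within your control")
        if not self.meets_goals:
            failed.append("meets goals and objectives")
        if self.causes_unacceptable_problems:
            failed.append("causes no other unacceptable problems")
        return failed

    def is_effective(self):
        return not self.failed_criteria()


# Assumed answers for the example: only the first criterion is in doubt.
replace_wiring = SolutionCheck(
    name="Replace worn wiring",
    prevents_recurrence=False,
    within_control=True,
    meets_goals=True,
    causes_unacceptable_problems=False,
)

print(replace_wiring.name, "fails:", replace_wiring.failed_criteria())
```

Note that the fourth criterion is inverted relative to the other three: a "yes" there counts against the solution.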
If the underlying causes of the worn wiring are not fixed the problem will probably recur. In this particular case, the original wiring was installed in 1989 so it had worked for 28 years. Not a bad run. What do we do if we say replacement will not prevent recurrence?
Doesn’t the answer depend on the time component that you choose to use? The reliability equation specifies a mission time, so why shouldn’t we apply a time component to solutions?
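To make the mission-time idea concrete, here is a small sketch using the simplest textbook form of the reliability equation, R(t) = exp(-t / MTBF), which assumes a constant failure rate. That assumption is mine, and worn wiring is really a wear-out failure, so the exponential model is only a stand-in. Treating the 28 years of service as a mean time between failures is also purely illustrative.

```python
import math


def reliability(mission_years, mtbf_years):
    """Probability of surviving the mission time under a constant failure rate.

    Uses R(t) = exp(-t / MTBF), the exponential reliability model.
    """
    if mission_years < 0 or mtbf_years <= 0:
        raise ValueError("mission time must be >= 0 and MTBF must be > 0")
    return math.exp(-mission_years / mtbf_years)


# Hypothetical MTBF, borrowed from the 28 years the original wiring lasted.
ASSUMED_MTBF_YEARS = 28.0

for mission in (1, 5, 10, 28):
    r = reliability(mission, ASSUMED_MTBF_YEARS)
    print(f"mission time {mission:>2} years -> reliability {r:.2f}")
```

The same solution looks very different depending on the mission time chosen, which is the point of asking what time horizon "prevents recurrence" is supposed to cover.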
How do we fix the problem?
I believe we must consider a phased approach. There are most likely three levels of solutions: immediate, common, and systemic.
Let’s use the following descriptions for each of the solution types:
Immediate – solutions that only prevent an identical incident from occurring again on the same equipment, the same type of equipment or other equipment in the same part of the process.
Common – solutions that eliminate the problem and the likelihood of similar incidents in the future on other equipment throughout the process or facility.
Systemic – solutions applied to causes that will result in improved management systems, processes, and work culture, and that will reduce the likelihood of many other incidents occurring throughout the facility or company.
Specifying the above allows us to say that three types of solutions may be necessary, depending on the issue.
We may choose to implement one or all three types if we have carried the investigation far enough. In our wiring example, we need an immediate solution to get the system back up and running NOW for our customers. But if we follow the diagram, we see that there may not be an inspection process, and that gap would apply to many types of equipment within the organization. This might lead to a solution of implementing an inspection process, but that won’t immediately fix the worn wires in our case, which is why we need to replace them.
The diagram also raises the possibility that failure modes aren’t being analyzed or looked for, which would have implications at the highest level of the organization. Implementing a solution for that won’t fix the immediate issue of the worn wiring either. Still, I think it is tough to argue that establishing a failure mode system is not a valuable long-term solution.
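The three levels and the wiring example can be sketched as a small plan check. The `Level` enum, the `PLAN` list, and the `missing_levels` helper are hypothetical names introduced for illustration, and the assignment of each solution to a level is my reading of the discussion above.

```python
from enum import Enum


class Level(Enum):
    IMMEDIATE = "immediate"  # same or similar equipment, same part of the process
    COMMON = "common"        # similar incidents across the process or facility
    SYSTEMIC = "systemic"    # management systems, processes, and work culture


# Solutions discussed for the wiring example, with an assumed level for each.
PLAN = [
    ("Replace the worn wiring", Level.IMMEDIATE),
    ("Implement an inspection process", Level.COMMON),
    ("Establish failure mode analysis", Level.SYSTEMIC),
]


def missing_levels(plan):
    """Return the solution levels that no item in the plan addresses."""
    covered = {level for _, level in plan}
    return [level for level in Level if level not in covered]


# A plan that stops at the immediate fix leaves two levels unaddressed.
stopped_too_soon = PLAN[:1]
print([level.value for level in missing_levels(stopped_too_soon)])
print([level.value for level in missing_levels(PLAN)])
```

A check like this is only a reminder to ask the question; whether a common or systemic solution is actually warranted still depends on how far the investigation was carried.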
If you stop your RCA investigation too soon you may not identify those causes that will lead to solutions that extend beyond the immediate situation.
While implementing the immediate solution may get you back up and running, looking for common and systemic causes will be where there is significant opportunity to make long-term cultural changes in your company.
Looking at the overall issue, we may want multiple solutions: one to fix the immediate issue, and others to reduce the number of common and systemic issues that exist in our organizations.
The big dollars are saved by not stopping too soon and by finding common and systemic causes, since they are causing multiple problems within your organization.