Here is yet another example of why it’s important to understand the failure modes that leave your system vulnerable to a complete shutdown. Delta Air Lines is learning this lesson the hard way today, after having to inform customers around the world that all of its flights would be on hold or even canceled due to a “system wide outage”.
Delta listed the cause of the outage as a power failure near its worldwide headquarters in Atlanta, Georgia, while Georgia Power believes it was a failure of Delta’s own equipment that caused the power outage.
While each company points the finger at the other, the reality is that Delta’s customers around the world are sitting at airports or at home, wondering when the problems will be resolved and when Delta will be able to accommodate their travel needs.
The irony of it all is that it didn’t have to happen. Better than any other industry, aviation has shown the world the importance of developing a maintenance strategy by assessing all reasonable and likely failure modes, yet Delta apparently never applied the tools it used to make aircraft reliable to its computer systems. A thorough analysis by a team of system experts from Delta and Georgia Power would, with a high degree of certainty, have discovered and discussed the failure mode responsible for today’s outage. On top of that, the team would have recommended a strategy to address or mitigate the failure and ensure continued service.
So what happened?
How could one of the world’s largest air carriers find itself grounding every flight around the world and, as the hours passed, have no reasonable answer for customers as to why their flights had been grounded or canceled and when it expected to return to service?
My guess is that someone with little understanding of the importance of hidden failures convinced Delta management that the redundant systems they had in place would ensure continued service regardless of what failure might occur. For those who don’t work in the field of maintenance and reliability, it is like being convinced that you never have to worry about the brakes on your car failing because you have an emergency brake. That is true only up to a point: if you never test the emergency brake to make sure it works properly, it might not work when you need it. As I like to tell my customers, “Redundancy builds complacency.” Don’t ever lull yourself into believing that because you have a back-up nothing bad can ever happen.
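To put rough numbers on the emergency-brake point, here is a minimal sketch of how test frequency affects the chance that a back-up has already failed, unnoticed, when it is finally called on. It assumes the back-up fails at a constant random rate and that each test would reveal the failure so it can be repaired. The function name and the 10-year mean time between failures are hypothetical values chosen for illustration, not figures from Delta or Georgia Power.

```python
import math


def average_unavailability(mtbf_years: float, test_interval_years: float) -> float:
    """Average fraction of time a hidden (standby) function is in a failed state.

    Assumes a constant failure rate (exponential model) and that each test
    finds and fixes any failure, so the clock restarts after every test.
    For short intervals this is roughly test_interval / (2 * mtbf).
    """
    x = test_interval_years / mtbf_years  # failure rate times test interval
    # Chance of having failed by time t since the last test is 1 - exp(-t / mtbf);
    # averaging that over one test interval gives the expression below.
    return 1.0 - (1.0 - math.exp(-x)) / x


MTBF_YEARS = 10.0  # assumed for illustration only

for label, interval in [("monthly", 1 / 12), ("yearly", 1.0), ("every 5 years", 5.0)]:
    u = average_unavailability(MTBF_YEARS, interval)
    print(f"Back-up tested {label}: unavailable about {u:.1%} of the time")
```

Under these assumptions, the longer a back-up goes untested, the more likely it is to be dead at the moment the primary system fails. That is the arithmetic behind “Redundancy builds complacency.”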
What to Do
While the airline that until today had one of the top records for customer satisfaction and for on-time departures and arrivals looks for answers, it’s a good time to think about your own company. Are your systems vulnerable to the same type of failure? What are the potential consequences to your business should a complete system failure occur? If the answers are as bleak as those faced by Delta Air Lines today, take a tip from someone who has been helping companies mitigate failures for two decades: find yourself a great facilitator, put your team of experts together, and find and mitigate the failure modes before they occur!
As usual, I’m interested in your feedback on this story. Has your company ever suffered a similar event? Have you performed FMEA or RCM on your computer systems? If so, what were some of the tasks you implemented to mitigate the failure modes that would result in a system-wide shutdown? And, maybe most fun of all, if you were impacted by this event, what did you have to do to make it to your destination?