Guest post by Kenneth Latino.
I have been in the business of maintenance and reliability for nearly 35 years. I have had the luxury of working with many industries during my career, both as a practitioner in a large paper mill and as a maintenance/reliability consultant. I have been able to see what works and what does not. So, these are some of the mistakes I have made personally and have also seen others make. It is far from a comprehensive list, but it covers some of the more important reasons and some potential solutions.
1. Reliability not seen as a site imperative
Reliability is not a group but rather a culture. We cannot expect to be safe just because we have a safety department, and the same goes for reliability. Safety is everyone’s responsibility and so is reliability. We drive the safety message from the very highest levels of a company. We also need to do this in the reliability world. The reliability person cannot always be in the field to make sure the pump is properly aligned or that the proper PM was done on time. This means that everyone must see the need to perform reliability activities and be held accountable for making sure they are performed properly. We would never allow someone to skip locking out a piece of equipment before working on it. Yet, we push frontline workers to make repairs quickly, even though we know it is not the right long-term decision.
Solution: Consider forming a Reliability Steering Team at every site. Make sure that someone from the site leadership team is a part of this team. Meet regularly to go over reliability initiatives and ensure that tasks are being implemented on schedule. Make reliability part of your plant’s everyday vernacular. Make reliability a topic in the morning production meetings and celebrate successes when there is a reliability improvement.
2. Expecting software/technology to solve problems
Some people see software and other technologies as the panacea of reliability success. The fact is, software tends to only provide value if it is enabling a defined work process. I have seen many companies implement maintenance management solutions but have loosely defined maintenance work processes. The system then provides virtually no value to the organization and in some cases stunts productivity of workers.
Solution: Before you embark on new technical solutions, ensure that you have documented work processes. This means that everyone involved knows their role and how they need to use the new technology to drive reliability improvement.
3. Setting impractical goals for a reliability initiative
A lot of consultants guide their clients to perform reliability activities in a systematic manner, sort of a “paint by numbers” approach. For example, they might start by saying that you must perform criticality analysis on all your assets first, develop detailed asset strategies second, MRO processes third, etc. The problem with this is that these activities take a tremendous amount of effort from the site(s). Most sites are not given additional resources to perform this additional work. Imagine trying to do criticality at a large site with over 50,000 assets. Even if it took just ten minutes per asset (which is not realistic), you are looking at 8,333 hours of work per person involved. At roughly 2,080 working hours per year, that is about four working years per person, and that is just to define the criticality. And when you are finished, you have not even changed the behavior of the equipment in the field.
Solution: Be more pragmatic and focus on getting some quick wins. One way to do this is to perform bad actor analysis at your site and develop reliability improvement projects on the top bad actors. Once the root causes are known and corrective actions implemented, you can then ensure that the asset criticality is correct and that you develop an asset strategy that will maintain those improvements over the long haul. Busy sites generally do not have the patience for a “paint by numbers” approach. They need to see some “wins” which will provide fuel to keep the reliability initiative moving.
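A bad actor ranking of this kind can be sketched in a few lines of Python. This is only a minimal illustration, not a prescribed method: the field names (`asset`, `repair_cost`, `downtime_hours`, `cost_per_downtime_hour`), the asset tags, and all of the figures are made up for the example, and it assumes each corrective work order can be tied to a single asset.

```python
from collections import defaultdict


def rank_bad_actors(work_orders, top_n=3):
    """Return the top_n assets by total failure cost, highest first.

    Each result is a tuple of (asset, number_of_events, total_cost).
    Total cost = repair cost + downtime hours * cost per downtime hour.
    """
    totals = defaultdict(lambda: {"events": 0, "cost": 0.0})
    for wo in work_orders:
        entry = totals[wo["asset"]]
        entry["events"] += 1
        entry["cost"] += (
            wo["repair_cost"]
            + wo["downtime_hours"] * wo["cost_per_downtime_hour"]
        )
    ranked = sorted(
        totals.items(), key=lambda item: item[1]["cost"], reverse=True
    )
    return [(asset, d["events"], d["cost"]) for asset, d in ranked[:top_n]]


# Hypothetical corrective work order history (all values are invented).
work_orders = [
    {"asset": "P-101", "repair_cost": 4000, "downtime_hours": 3, "cost_per_downtime_hour": 2000},
    {"asset": "P-101", "repair_cost": 3500, "downtime_hours": 2, "cost_per_downtime_hour": 2000},
    {"asset": "C-200", "repair_cost": 20000, "downtime_hours": 1, "cost_per_downtime_hour": 2000},
    {"asset": "P-101", "repair_cost": 4200, "downtime_hours": 4, "cost_per_downtime_hour": 2000},
    {"asset": "F-310", "repair_cost": 800, "downtime_hours": 0, "cost_per_downtime_hour": 2000},
]

for asset, events, cost in rank_bad_actors(work_orders, top_n=2):
    print(f"{asset}: {events} events, ${cost:,.0f} total")
```

In this invented data set, the pump with three modest failures outranks the compressor with one expensive repair, which is the kind of result that helps pick the first reliability improvement projects.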
4. No dedicated resources for proactive work
One of my former Mill Managers once told me that we need both “run the mill people” and “improve the mill people”. How many times have you seen reliability engineers taken away from improvement work and used as troubleshooters on the “failure of the day” or to work an outage because we are low on resources? I am not naïve enough to think that we can live in a perfect world where everyone can always stay in their lane. However, the problem arises when using proactive people on reactive problems becomes the norm, and they are basically treated as a group of “extra” resources we can use at will. You need to have a culture where you are always driving proactive work forward even if you have reactive issues going on.
Solution: Dedicate reliability resources to focusing on things that will have a positive impact in the future. For example, they should be focused on failure elimination activities to remove bad actors. They should be developing asset strategies that are focused on removing known failure modes. Resist the temptation to take these valuable resources off their proactive work only to support daily reactive work. The site will know that reliability is important when they see that these resources remain dedicated to their proactive charter, even when the plant has pressing reactive issues.
5. Not focusing on the basics
Sometimes we get caught up with shiny new things in the marketplace. I see this a lot with things related to Artificial Intelligence (AI) and Machine Learning (ML). While these technologies have tremendous potential, we often cannot even do the basic “blocking and tackling” activities to keep our plants running smoothly. If your site is culturally reactive in nature, first make sure you are working on the basics before moving on to these other initiatives. For example, if you have a huge maintenance backlog and you often break the maintenance schedule to perform work, then you need to focus on what is wrong with your maintenance work management and failure elimination processes.
Solution: First focus on making sure you do the “right” work well before diverting resources to early adopter activities. The “right” work includes Work Management, Root Cause Analysis, Storeroom Management, Precision Maintenance, Operator Rounds, Asset Strategy, etc. AI and ML will no doubt make a significant dent in our future improvement efforts, but be sure that you have the culture and processes in place to take advantage of them.
6. Limited involvement of frontline workers
I once had a maintenance millwright tell me: “You guys think you run this plant, but the fact is we actually run it.” This stuck with me because many salary (management) types think that they define all the rules and guidelines, and the plant will just adopt them. This is rarely the case because many times the rules and guidelines are not practical and do not fit what really happens in the field. If the frontline workers are not involved in helping define both strategic and tactical plans, those plans will likely fail or, at a minimum, deliver less than expected results.
Solution: Always provide forums for frontline workers to vet new processes that they are expected to carry out in the field. This will ensure that the processes are both practical and accepted by the people who have to make them work. Plants run better when there is good alignment between salary and hourly (frontline) workers. This is not always easy, but in the long run, better communication will drive better outcomes.
7. Focusing on sporadic failures instead of chronic failures (defect elimination)
When most people think of performing a Root Cause Analysis (RCA), it is usually for a significant sporadic failure such as a large fire, explosion, or power outage. While this is important, we often overlook the significance of analyzing chronic failures. These are the events that are repetitive in nature, and they often get quick fixes just to get the equipment back up and running. However, the data shows that these failures cost a site much more in lost productivity and cost over time due to their frequency.
Solution: Determine bad actor equipment (chronic failures) and create a process to perform RCA. I often like to call this work a “reliability project” to separate it from the traditional RCAs on sporadic failures. These “reliability projects” should be discussed on a regular basis with the Reliability Steering Team. Corrective actions need to be documented and tracked to completion. Removing these defects causes a virtuous cycle by reducing the need for reactive work and thus providing more time for additional proactive activities.
8. Limited career advancement for reliability engineers
Often we put bright young engineers into Reliability Engineer roles. This is great at first because they get a lot of exposure to plant equipment and its failure mechanisms. The problem is that in many cases, there is no upward mobility in that role. Perhaps, if you stay long enough, you might get to be the Reliability Supervisor. If we expect people to stay in roles that provide value, we need to make sure there is a progression path that makes them want to stay in those roles. What typically happens is that they work for a year or two and then do not see a path forward for their career. So, they logically find roles in operations, maintenance, or project engineering where there are more defined career paths with greater compensation. Let’s be clear that we are all effectively trading our time for money. Yes, it is important to love what you do, but you also must make a good living for you and your family.
Solution: Work with your HR department to develop a career path for your reliability team. For example, if you have vibration techs who are in salaried positions, you could work with HR to have multiple levels in that job category based on time in the role and the level of certifications they have achieved (e.g., Vib Tech Level 1, Level 2, Level 3). This will encourage them to want to stay and to continually improve their skills to achieve a promotion and higher pay. We need to make reliability a role that encourages continued learning and continuity. Otherwise, it is just a rotation of people in and out of the role.
9. Poor work order data quality
Reliability is all about data. It is very difficult to make good decisions with limited or even bad data. How often do you see work orders written to a high-level functional location that say something like “pipe is leaking” or “pump broke”? This makes it very difficult to use this data to determine bad actors and to try to predict future failure events.
Solution: Ensure that we train end users to write corrective notifications (aka Work Requests) to the lowest possible level of the hierarchy and to provide adequate descriptions. Explain why this is important to the site and how the data will be used. Operational/Maintenance Coordinators should ensure the accuracy of the notifications and set proper priorities. When the work is complete, we need to ensure accurate time confirmations and provide adequate history notes and repair codes.
10. Short term focus on maintenance cost
A typical site spends months coming up with an amount for their annual maintenance budget. They present this number to upper management only to be told to cut it by 20%. There is no analysis of which 20% to cut, just an instruction to lower the number. Once the year starts, everyone tries to quickly do their major repair projects because they know proactive jobs will be cut as the year progresses due to the unexpected cost of reactive work. Sound familiar? All this does is postpone needed work for another year. Eventually you run out of time and the equipment fails unexpectedly. Now you have to spend the money, and probably a lot more, because you were not prepared for it. It is a vicious cycle. At the end of the day, you can plan for the repairs or the equipment will plan for you. Either way you spend the money, except it costs much more for unplanned repairs!
Solution: Perform risk-based budgeting. Every year, make a list of all the large maintenance projects that need to be performed that year. For each one of those jobs, determine the risk to the operation if the work is not performed that year. For example, if we cut a specific job from the budget, we will have a 75% chance of having a failure causing $250,000 in unplanned repair cost and 2 days of production loss at $500,000 per day. So, the risk of not doing that job is 0.75 x ($250,000 + (2 x $500,000)) = $937,500. So, when management asks you to cut some jobs to lower the budget, you can ask them what risks they are willing to take. In this example, cutting the job saves $200,000 but carries a risk of $937,500. They are much less likely to cut once they know the risk they are taking.
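The risk calculation above is easy to repeat for every job on the list. The Python sketch below is one possible way to do that, not a standard tool: "Job A" uses the figures from the example, while "Job B" and all of the function and field names are hypothetical and exist only to show how jobs can be compared.

```python
def risk_of_deferral(p_failure, repair_cost, downtime_days, loss_per_day):
    """Expected cost of NOT doing a job this year.

    risk = probability of failure * (unplanned repair cost + production loss)
    """
    return p_failure * (repair_cost + downtime_days * loss_per_day)


def rank_cut_candidates(jobs):
    """Return (name, savings_if_cut, risk_if_cut), lowest risk first.

    Jobs at the top of the list are the least risky ones to defer.
    """
    rows = [
        (
            job["name"],
            job["budget"],
            risk_of_deferral(
                job["p_failure"],
                job["repair_cost"],
                job["downtime_days"],
                job["loss_per_day"],
            ),
        )
        for job in jobs
    ]
    return sorted(rows, key=lambda row: row[2])


jobs = [
    # Figures taken from the example in the text.
    {"name": "Job A", "budget": 200_000, "p_failure": 0.75,
     "repair_cost": 250_000, "downtime_days": 2, "loss_per_day": 500_000},
    # Hypothetical second job, invented purely for comparison.
    {"name": "Job B", "budget": 120_000, "p_failure": 0.25,
     "repair_cost": 100_000, "downtime_days": 1, "loss_per_day": 500_000},
]

for name, savings, risk in rank_cut_candidates(jobs):
    print(f"{name}: cutting saves ${savings:,.0f}, risk is ${risk:,.0f}")
```

Presenting the savings and the risk side by side for every job gives management a clear view of what they are trading when they ask for a lower number.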
I hope you enjoyed reading this list and most importantly I hope it helps you to avoid some of the same mistakes. Not every situation is the same, so there might be some disagreement on some of these. So remember, these are just my experiences and observations over a long period of time working in this space. Happy to discuss and offer additional guidance. Please feel free to reach out at firstname.lastname@example.org.
Ken is currently the Managing Director of Prelical Solutions, LLC. He has extensive Maintenance and Reliability experience in both continuous process and batch manufacturing plants. His work while at Meridium, WestRock and GE has helped large asset intensive companies to get increased production rates and reduced maintenance costs while improving safety and environmental performance.
Ken has an extensive background in Root Cause Analysis (RCA), Reliability Improvement Work Processes, Reliability Centered Maintenance (RCM), Failure Modes and Effects Analysis (FMEA) and a host of other areas of Asset Performance Management (APM).
He also has a strong background in the use of SAP EAM/GE Digital (Meridium) APM and has developed enterprise level work processes and tools to improve asset performance. These tools and work processes include tank integrity management, motor management, RCA, paper machine roll management and history, maintenance budget forecasting and analysis and many others.