In the first installment of this series we described the basics behind proactive maintenance and some of the considerations users need to make.
The second installment describes RCM – the “gold standard” for reliability program development and physical asset related risk management. This article is for those who are in “panic” or “fire fighting” mode. If you don’t have a proactive program, equipment runs until it breaks and you can’t seem to get ahead of it, then this one is for you. In a few cases you may have a PM program but you’re not getting the results you want. You could be overdoing overhauls, not doing enough predictive work, not following up on what you find, or the maintenance actions are simply inappropriate for the failures that occur in your circumstances.
What I’ve described is, sadly, fairly common. It’s particularly evident in industries where profit margins are razor thin and the accountants are driving a hyper-focus on costs. The result is that training is often lacking or non-existent, wages are a bit lower than competitors (and so are the skills), understanding of reliability and failures is sorely lacking, trust in vendor specified maintenance programs is high, but those programs are not well suited to the equipment and the “lean” resources are really beyond lean – they are anorexic. Cost cutting hasn’t just cut excess fat, it has eliminated any form of improvement, cut into the muscle and sometimes even the bone. If your company isn’t going bankrupt, it is very likely a takeover target because it would be a bargain!
Disclaimer: What I share here and in my book, “Uptime – Strategies for Excellence in Maintenance Management” (3rd edition) are guidelines only. Use them at your own risk. While we believe they can help you out of a bad situation, they won’t be perfectly tailored and could, in some cases, be wrong. If you don’t understand why that could happen, then you really need to call for help.
We cannot be held responsible for disappointing results. This is not a replacement for RCM or even PM Review/Optimization. We cannot guarantee you will get improved results. However, what’s here reflects years of experience (both good and bad) and an application of the logic underlying RCM. I stress though, it IS NOT an RCM program so it is not the best you can do by any means. Use it at your own risk.
If you are an RCM user or even if you think you are, then this is not for you. If you are not getting the desired results from your RCM derived program, this is not a suitable replacement. Your RCM assumptions, decision process or decisions resulting in tasks and frequencies should reflect your operating context. If you are not getting the results you expect, then you need to revisit that program using RCM. There’s no quick fix for flawed analyses or an operating context that has changed over time.
Condition Based Maintenance:
Condition based maintenance involves two steps – 1. checking for faults that are developing and then 2. acting on those faults when you find them. You must do both – observe, then act. I have seen numerous companies with CBM programs that are not acting on what they find. The program doesn’t help and it is incredibly frustrating for the technicians who do the condition monitoring too. If you’ve ever thought of giving up on your oil analysis, or vibration analysis, then you are likely in this boat. The whole program will fail if you do not act on the defects you find in a timely manner – your work management processes will need to be capable of handling the demands that will arise and your supervision will need to make sure it happens. Supervisors who are addicted to the praise they get from fixing breakdowns, or who do not believe in CBM, may be undermining your program.
Most of your checks should reveal that things are operating satisfactorily. You can expect that for much of the checking you do, it ends there. That does not mean it is ineffective or wasting effort. Only a small portion should reveal defects. DO NOT stop inspecting and checking only because you do not find problems frequently – you shouldn’t find problems in any given equipment all that often. If you do, then you’ve probably got deeper design or operating problems that cannot be corrected with maintenance alone. Our suggestions are:
- If it rotates at high speeds (e.g.: pumps or rotary compressors greater than 1800 rpm) use an overall velocity or accelerometer reading on vibrations. You’ll need to determine what normal is in each case by checking vibrations when the equipment is known to be operating well. Readings will be in mm/s or in/s. If readings are “high,” then act to correct the fault; if they are normal or only slightly elevated, then leave it alone.
- For low speed rotational equipment (e.g.: crushers, mills, autoclaves) you need to use displacement readings. Anything > ½ bearing clearance is bad. Figure out those clearances (bearing manuals should tell you) and start from those. Readings might be in “mils”.
- For anything at very high speeds (e.g.: turbines, turbo-expanders, axial flow compressors) you’ll be best served with accelerometers and acceleration (g’s) readings.
- The interval between readings should be shorter than the warning period you usually experience for problems from the equipment. In the past, if you’ve discovered a problem existed but the equipment was still functioning, how long were you able to run before you had to take it down or before it failed? It could be days, weeks or months. Halve that time and use the result as your checking interval. For all other vibration readings – take them no less frequently than weekly. You can generate “routes” to follow (taking readings at every accessible bearing housing and right over the bearings, take one radial and one axial reading each). The routes can be loaded into your CMMS or dealt with on checklists. I’ve seen companies who have contractors taking these readings on a monthly, quarterly or even half yearly basis. That simply isn’t enough. Stop wasting your money (cut the program entirely), or spend more (increase inspection frequency).
- Keep track of your readings so you can trend them graphically (spreadsheets work well for this but your maintenance software may have a capability to display trends). Some vibration analysis equipment comes with its own software as does oil analysis and others. If trending is difficult to set up, then simply look at past readings when you take the new ones. If your checklist contains columns you can keep readings from multiple checks on the same sheet and see trends quite easily. When the readings trend upwards, begin to monitor more frequently (we suggest daily). Your goal is to allow the degrading condition to continue (extend the life of the equipment) but then catch it just before it crashes. Keep in mind that readings taken under load and at operating speeds may be bad but the equipment may not reveal problems if rotated by hand with no load. In the shop, any equipment taken out of service before it has totally failed may appear to be in good condition. This is normal. The stress of higher speeds and loads (which you cannot duplicate in the shop) is actually what allows you to detect the problems.
- Whenever equipment is installed (i.e.: after a repair) take a set of CBM readings to use as a baseline. Later readings can be compared to it. Begin the trend at the baseline. If possible (and this is difficult to do so don’t dwell on it), take readings when equipment is under constant load and at the same load for all readings (i.e.: make sure it’s running in the same conditions each time) or readings will be variable. It’s difficult to achieve those same conditions every time, so over time you will get used to seeing a normal range of variation and realize that at first it can be alarming. You might check one day and get a high reading, next day it is low. If you are concerned, always monitor more frequently before taking repair action. Do not take a single high reading as an indicator for removal – you need to watch it a bit to be sure it’s bad.
- For critical equipment with large open oil sumps – the kind with drains for oil replacement and sampling, not smaller sumps that are typically sealed with screwed plugs – you can use oil analysis. Take oil samples, analyze for particulates, water and contaminants, and act on recommendations from the analysis lab. You can do some simple tests yourself – you can feel grit, see dirty oil and, using a heater, test for excessive moisture. For more in-depth analysis, send samples to a lab. Sample monthly, act on results within a week unless the lab analysis points to greater urgency. Create routes for sampling just as you did for vibration checks. Note: if the oil change frequency from the manufacturer is only a month or two (or less), then omit the equipment from this sampling / analysis and stick purely with oil changes. There’s little point in checking a sample from a freshly filled oil sump!
- The best oil analysis techniques to use are ferrographic (visual inspection of particles under a microscope) and particle counts. Do not rely on spectrographic methods unless you are very familiar with them. That method can give increasing readings while damage is minor, but then the readings fall, giving a false sense that things are getting better on their own. Of course that’s just when it gets bad enough to worry about. Spectrographic methods rely on microscopic particle sizes and cannot detect larger (more harmful) particles. Don’t get all excited about what metals are in the oil samples – that info is useful only for diagnostics and only really worth doing if the machine is critical and un-spared (i.e.: no backup).
- For smaller or less critical equipment make sure the oil is uncontaminated by dirt and water. Water emulsified into your oil will give it a milky look. Dirt will discolor the oil – like the oil in your car, it goes in clear and dark gold, comes out black. Water in small quantities can be detected using a spatter test – heat a small sample of oil in a spoon over an open flame or a plate burner (in a safe environment). If the oil spatters it has water in it.
- Carry out infrared (IR) inspections using IR cameras on all electrical switch-gear quarterly. Electrical problems are often caused by mechanical looseness or mis-alignments that reduce surface contact areas. The current flows through those smaller areas and heats them up. Look for hot spots. Ideally this is done with the cabinets open if it is safe to do so (watch for arc-flash risks) but can still give results even with cabinets closed if problems are significant. There are now special windows you can install on cabinet doors to look inside with your IR gear without the need to open the doors.
- IR may also be useful for any other areas where problems show up as heat – blocked pipes carrying normally hot fluids (downstream looks cooler), rotating equipment couplings (no need to remove guards to check this), bypassing steam traps (the drain stays hot), loose or misaligned belts (the sheaves and belts are hot), excessively loaded motors will appear hotter than most do, hot bearings on one end of a machine but not the other, motors with dirty cooling fins, gearboxes overheating, uneven exhaust temperatures on engines, blocked or partially blocked heat exchangers, damaged insulation on tanks, pipes, exchangers, walls, the roof, etc.
- Visual inspections should be carried out by operators doing regular “rounds” twice in each shift – at the start (immediately before or after shift change) and half way through the shift. Operators need to be taught what to look for (see below under “cleaning”). Any anomalies need to be logged and reported as work requests for further inspection and action. Operators need to know that any condition they feel is “uncomfortable” is a potential problem that must be dealt with. Better to be overly cautious and find a problem than to miss it and potentially suffer downtime or worse.
- Visual inspections by the maintenance foreman should also be done once per shift. These can be used to investigate any reports from operators of potential problems, to see that operators are indeed doing their cleaning and not missing the obvious, and to spot less easily identified flaws that operators might miss.
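The interval and trending rules in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a substitute for your own judgment: the function names are hypothetical, the `rise_factor` of 1.25 and the three-reading `window` are made-up values you would tune to your own baseline and normal variation, and the weekly fallback is our reading of “no less frequently than weekly” for equipment with no known warning period.

```python
from typing import Optional, Sequence

WEEKLY_DAYS = 7.0  # fallback interval when no warning period is known
DAILY_DAYS = 1.0   # suggested interval once readings trend upwards


def check_interval_days(warning_period_days: Optional[float]) -> float:
    """Halve the usual warning period; fall back to weekly if it is unknown."""
    if warning_period_days is None:
        return WEEKLY_DAYS
    if warning_period_days <= 0:
        raise ValueError("warning period must be positive")
    return warning_period_days / 2.0


def displacement_is_bad(displacement_mils: float,
                        bearing_clearance_mils: float) -> bool:
    """Low speed equipment: more than half the bearing clearance is bad."""
    return displacement_mils > bearing_clearance_mils / 2.0


def is_trending_up(readings: Sequence[float], baseline: float,
                   window: int = 3, rise_factor: float = 1.25) -> bool:
    """True only when the last `window` readings all exceed the baseline by
    `rise_factor` (an assumed threshold), so one high reading is ignored."""
    if len(readings) < window:
        return False
    limit = baseline * rise_factor
    return all(r > limit for r in readings[-window:])


def next_interval_days(readings: Sequence[float], baseline: float,
                       normal_interval_days: float) -> float:
    """Switch to daily monitoring once the trend is clearly upwards."""
    if is_trending_up(readings, baseline):
        return DAILY_DAYS
    return normal_interval_days


if __name__ == "__main__":
    normal = check_interval_days(28)      # four weeks of warning in the past
    history = [1.0, 1.3, 1.4, 1.6]        # hypothetical mm/s readings
    print(normal, next_interval_days(history, baseline=1.0,
                                     normal_interval_days=normal))
```

Requiring several consecutive elevated readings reflects the advice not to pull equipment on a single high reading; how many readings and how large a rise you require is a site decision.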
This is work that is done regardless of the equipment condition. You are restoring good operating conditions through cleaning or other restoration type activities, replacing working fluids or parts (like dirty filters). You do it at a regular interval that is shorter than the usual time to failure. If you don’t have good records that tell you what the times to failure are, then ask your maintainers and operators about their experience. They probably have a good “gut feel” for it – that is usually valuable information and surprisingly accurate!
- For mobile equipment – in the absence of a thorough RCM analysis, follow the manufacturer recommendations for oil / filter / component changes. This could be overkill in some cases but doing it generally won’t hurt anything, so long as you don’t need to do a lot of disassembly work to get to it. At worst, you use too many filters and oil. If you have experience that varies from the recommendations (e.g.: longer frequencies), then go with the site experience.
- Ditto for overhauls of mobile equipment components. I often say that manufacturers ask you to do too much overhaul work because they are being overly conservative and they have no incentive to take a risk. After all, you buy materials and parts from them. However, in the case of most mobile equipment, they do have a good deal of experience maintaining it and their recommendations, while still conservative, should not be ignored without good data from your own operating experience.
- For any plant equipment with closed oil sumps, change oil and filters at manufacturer’s recommended frequency.
- Conveyors – lubricate rollers and idlers, and watch belts for signs of damage, tracking error or slippage. If you have a lot of conveyors, then have someone dedicated to this – he / she can walk the route and, on reaching the end, start over.
- Cleaning – keep equipment and surrounding areas clean. This is done to avoid contaminants that can get into your equipment. It also reveals minor problems early. It eliminates dirt or spill related safety hazards and fosters a sense of pride in the work area. This applies to mobile equipment as well as shops and plants. Anyone doing cleaning must know how to do it without harming equipment (i.e.: no water hoses aimed at bearing housings) and they must know what to watch for (drips, dirt accumulating on oil films). Cleaners should be trained (this doesn’t take long) to look for obvious signs of equipment distress – leaks that soil cleaned areas soon after cleaning, higher vibrations or sounds than normal, cracks in grout or foundations of equipment or in floors near equipment, loose guards, etc. Your maintainers can probably come up with a very good and site specific list of things to watch for in these operator inspections.
- Heat exchanger cleaning – experience at the site probably reveals where heat exchangers have been problematic. In those cases schedule cleaning at an interval shorter than the interval at which the problems arise. Note that dirt will reduce heat transfer capability, decrease process efficiency and increase energy costs.
- Roads – keep graded and clear of large rocks or other debris that can harm truck tires. If you have dirt roads, keep them watered to reduce dust in the air which can get into equipment and storage areas and contaminate equipment and spare parts.
- Shops – clean them up! Avoid sources of contamination such as dusty laydown areas outside – pave them. If the shop is hot, ventilate using filtered air if you are in a normally dusty environment.
- For all back up equipment – testing once per month will prove that it is capable of starting.
- If the equipment is subject to wear out type failures (e.g.: air compressor valves), then use testing to prove operation and auto start (fake the low pressure condition), run for a short time and then switch back to normally operating equipment. Avoid equalizing the running hours on backup equipment that fails with usage. Don’t simply swap equipment around and equalize hours if it tends to fail due to wear out or age.
- If the equipment is subject to random types of failures (e.g.: mechanical seals and bearings on centrifugal pumps), then the test is accomplished by starting the standby equipment and putting it into operation until the next test interval (i.e.: swapping the equipment back and forth actually works well in this situation). Equal running hours is not a problem where failures are random.
- For safety devices you want to test them periodically to prove that they will work. Note that you will find there are far more of these than you might expect once you really start looking for them. The best test is a full “end to end” test if possible, not a simple push the test button and watch the lights go on – that only tests the bulbs. Consider things like high / low level / pressure / temperature alarms and stops, process parameter driven stops or alarms, fire alarms, etc. Simulate the real alarm condition if it is possible to do safely. Don’t forget warning signs and escape route signs – if they are missing or obscured with dirt they can’t do their job when needed.
- Frequency – the more frequently you test the device the greater the risk reduction you achieve. If you have a critical protective device you will want to test it more frequently than devices that are not all that critical. A good start is to test most devices monthly unless testing is highly impractical. For things like safety valves that may be covered by some sort of legislation, do them at the legislated frequency and in the manner the regulations call for. These guidelines are no substitute for following regulations that you may be subjected to. Only a thorough RCM analysis with good data can reveal alternative frequencies and tests to what regulators will typically require, and even then, you need to get the regulator to give you approval to change.
- Signage – check that all warning signs are where they should be, clearly visible (unobstructed) and in good condition (e.g.: lit) so they can be read. If this includes emergency escape route signs, make sure they are pointing the right way! While doing this, inspect escape routes to ensure they are unobstructed by debris, tools, etc.
- Safety equipment (fire extinguishers, first aid kits, etc.): these are probably inspected / tested in accordance with regulations – make sure this is happening. Also make sure that access to them is not obstructed and their locations are clearly marked.
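The difference between testing standby equipment that wears out and standby equipment that fails randomly can be summarized in a small lookup. The sketch below is illustrative only: `FailurePattern`, `standby_test_plan` and the 30-day figure (our stand-in for “once per month”) are assumptions made for the example, not part of any standard.

```python
from enum import Enum


class FailurePattern(Enum):
    WEAR_OUT = "wear_out"  # e.g. air compressor valves
    RANDOM = "random"      # e.g. mechanical seals and bearings on pumps


MONTHLY_TEST_DAYS = 30  # assumed stand-in for "once per month"


def standby_test_plan(pattern: FailurePattern) -> dict:
    """Return a simple test plan for backup equipment by failure pattern."""
    if pattern is FailurePattern.WEAR_OUT:
        # Prove it starts, then go back to the duty unit so the backup
        # does not accumulate running hours.
        return {
            "interval_days": MONTHLY_TEST_DAYS,
            "action": "simulate the start condition, run briefly, switch back",
            "equalize_hours": False,
        }
    # Random failures: leaving the standby unit in service until the next
    # test is itself the test, so running hours end up roughly equal.
    return {
        "interval_days": MONTHLY_TEST_DAYS,
        "action": "swap duty and standby, run until the next test",
        "equalize_hours": True,
    }


if __name__ == "__main__":
    for pattern in FailurePattern:
        print(pattern.name, standby_test_plan(pattern))
```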
Operational Basic Care (Mobile Equipment)
- Carry out circle checks of equipment before using it. The operators need a good checklist and supervisors must make sure they do this thoroughly. Get your maintainers to work with operators to create the checklists.
- The best checks require that actual readings be recorded for trending purposes. Simply saying it was checked is not a good practice. The operator needs to know what he / she is checking for – e.g.: correct oil level is important, not just the presence of oil on the dipstick!
- Drive (operate) the equipment within its operating parameters and do not tolerate abusive equipment operating practices. Industrial equipment is robust, but it is not designed for play – abuse can cause failures.
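To show what “record the actual reading” might look like on an electronic checklist, here is a minimal sketch. The `CheckItem` fields, the `record_reading` helper and the oil level limits are all hypothetical; your maintainers and operators would supply the real items and acceptable ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckItem:
    name: str
    unit: str
    low: float   # lowest acceptable reading (hypothetical limit)
    high: float  # highest acceptable reading (hypothetical limit)


def record_reading(item: CheckItem, value: float) -> dict:
    """Store the measured value, not just a tick mark, and flag it if it
    falls outside the acceptable range."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError("record a measured value, not a yes/no check")
    return {
        "item": item.name,
        "value": float(value),
        "unit": item.unit,
        "in_range": item.low <= value <= item.high,
    }


if __name__ == "__main__":
    oil_level = CheckItem("engine oil level", "mm above low mark", 5.0, 20.0)
    print(record_reading(oil_level, 12.0))  # a normal reading
    print(record_reading(oil_level, 2.0))   # oil present, but level too low
```

Because the value is stored rather than a simple “checked”, the same records can later be trended in the way described for condition monitoring readings.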
For equipment that is not critical to operations, or where a failure is unlikely to cause a safety or an environmental problem, you might be comfortable running it to failure. Failures will always require repair, but if the losses or risks (safety, production, quality, etc.) that come with them are negligible, then you can probably tolerate the failure and save the money you might otherwise spend preventing or predicting failures. Allowing equipment to run to failure actually maximizes the average age at which the equipment fails. All the proactive methods aimed at reducing or eliminating failures are done before you get to the failed state.
To accept run-to-failure as your strategy, the equipment must meet a few simple criteria:
- Worst case failure has little to no impact on production (e.g.: consider running spared equipment to failure if it meets the rest of these criteria).
- Worst case failure has little to no environmental impact.
- Worst case failure does not create a safety hazard.
- The cost of repair after failure is less than the cost of preventive maintenance over time.
- The cost of repair after failure is less than the cost of predictive maintenance over time.
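The five criteria above amount to a simple all-or-nothing screen, sketched below. The `FailureAssessment` fields and the function name are hypothetical, and the sketch assumes all three costs are expressed over the same period (for example, expected cost per year) so that they can be compared directly.

```python
from dataclasses import dataclass


@dataclass
class FailureAssessment:
    production_impact_negligible: bool
    environmental_impact_negligible: bool
    creates_safety_hazard: bool
    repair_cost: float       # expected cost of repairs after failure
    preventive_cost: float   # cost of preventive maintenance, same period
    predictive_cost: float   # cost of predictive maintenance, same period


def run_to_failure_acceptable(a: FailureAssessment) -> bool:
    """All five criteria must hold; failing any one rules out run-to-failure."""
    return (
        a.production_impact_negligible
        and a.environmental_impact_negligible
        and not a.creates_safety_hazard
        and a.repair_cost < a.preventive_cost
        and a.repair_cost < a.predictive_cost
    )


if __name__ == "__main__":
    # Hypothetical figures for a small spared pump.
    spared_pump = FailureAssessment(True, True, False, 800.0, 1500.0, 2000.0)
    print(run_to_failure_acceptable(spared_pump))
```

The judgments behind the three yes/no fields are still yours to make from the worst case failure; the code only makes explicit that every criterion has to pass.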
A simple way to tell if run-to-failure is acceptable is to ask yourself: in the past, when the equipment failed, did you really need to put it back into service in a hurry? If the answer is “no” then you have a candidate for run-to-failure. Note that returning equipment to service in a hurry because someone wants their spare back is not (on its own) a good criterion. Consider the actual risks – severity and likelihood.
For equipment that you choose to run-to-failure you should make an annotation in your CMMS or in the equipment register to let people know that you’ve made that choice. If you don’t do this, there is a good chance that operators will put high work priorities on jobs that really don’t need them causing excessive work and squandering scarce and valuable maintenance resources.
If you read this and realize that your maintenance work force isn’t sufficiently sized to accommodate this work, then you are probably truly understaffed. You need overtime or contractors or new hires to help you out. If you do this sort of proactive work though, you should see the number of in-service failures go down and your ability to schedule work improve.
Keep in mind that these guidelines are by no means a replacement for thorough and informed analysis and evidence based decision making. The list here is not all-inclusive, but it should get you started and it should help you get things “under control” so that you free up resources to apply more thorough analysis like RCM. As you work your way through these guidelines and apply them in your operation you may see other equipment or systems that are not specifically identified here. Hopefully this has given you enough of a basic understanding of principles to enable you to handle those unique cases. If not, then you should consider an RCM program.
Ultimately, RCM is where you want to go in all cases. Anything less than that will invariably miss something. If it misses something that is critical to your business, or that can result in a safety problem or environmental non-compliance then you are ultimately to blame. Failures don’t always get you into legal troubles, but accidents can.
Imagine how you might answer a judge at an inquiry into a fatal accident caused by a preventable equipment failure if he asks you, “did you do everything in your power to prevent that failure?” If you’ve done RCM, and done it well, then you can honestly answer “yes”, even if your analysis was flawed. After all, you are human and you will make mistakes. But if you haven’t done that analysis, or you have just followed guidelines like these (with their disclaimer and cautions), then you have not done as much as you can.