(Reproduced from the article “Death of a Reliability Engineer” by Dev Raheja, Reliability Review, Vol. 30, March 2010 with permission)
When I first wrote the article in March 1990, I implied an ‘F’ grade for reliability engineers. Now, almost 20 years later, I would give them an ‘E’. Yes, there is a little improvement, but nothing you can write to your mother about.
The MTBF cancer was widespread and is still widespread in the DoD. The only reason I upgraded the reliability engineer from F to E is that MTBF is no longer used in some industries, such as the automotive industry. They use failure rates instead, to hide their shame.
Failure rate is just the reciprocal of MTBF. Good job! Same old corn flakes with a new product name!
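For the record, the two numbers carry exactly the same information; one is simply the other turned upside down:

```latex
% Failure rate and MTBF: the same quantity, inverted
\lambda = \frac{1}{\text{MTBF}}
\qquad\Longleftrightarrow\qquad
\text{MTBF} = \frac{1}{\lambda}
```

(And even this tidy identity quietly assumes the failure rate is constant, which is the very assumption the rest of this article takes to task.)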
Several recent discussions cause me to recall some highlights of a conversation I had over a decade ago with the late Dr. Austin Bonis while we were conducting the first ASQC course on Reliability Engineering. He made the following interesting statement:
The design engineer knows a lot but is never able to do a lot; the quality control engineer does not know a lot, therefore does not do a lot; the reliability engineer knows a lot, does a lot, but too late!
At this time, I am moved to amend his statement so that it reflects a frequent problem, as follows:
“The reliability engineer knows a lot, does a lot, but what he does is usually wrong!” Also, “If the basic reliability work is done with a lot of mistakes, then it does not matter if the work is done too late!”
Some Good, Some Bad
There are some excellent reliability engineers. They have prepared for their tasks with a basic engineering education that included physics, chemistry, and fundamental concepts and principles. They combine this knowledge with statistical theory.
On the other hand, many so-called reliability engineers ignore physics and chemistry, and fail to consider design reliability lessons. They jump straight into statistics. They are lost if hard data are not available. They ignore the fact that reliability can be improved without statistical analysis.
All one has to do is study the failure modes and accelerated-testing results, and be aware of customer problems. The bottom line in reliability is to prevent all failures during the useful life. Some reliability engineers will not agree with this statement. They think a certain level of failures is unavoidable.
In my opinion, such engineers should go and perform time and motion studies rather than work as reliability engineers.
A Case History
An engineer was assigned to work as a Reliability Engineer. He had taken a statistics course in college and therefore felt prepared for the position. Soon he began to encounter difficulty in applying his knowledge of statistics. The test data was never enough; the field data was never complete and was full of errors. He kept complaining about the lack of data collection effort. Meanwhile, many shoddy products went out the door. No one knew what to do when only a small quantity of data existed.

After a few product disappointments, the company decided to launch a full-fledged reliability program to assure high reliability in the design. He then proceeded to apply his skills. The first item on the agenda was reliability prediction. Our reliability engineer was eager to see this task done right. Now he had a bunch of numbers in his grip. He could compute failure rates for each component from MIL-HDBK-217E, add them up, and calculate the MTBF. The design engineer was not involved. Why involve him? The product was already designed and being tested!
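To make concrete what that prediction exercise amounts to, here is a minimal sketch of the parts-count arithmetic; the part names and failure rates below are invented for illustration and are not taken from any handbook.

```python
# Minimal sketch of a MIL-HDBK-217-style parts-count prediction.
# The part names and failure rates below are illustrative only.

# Failure rates in failures per million hours (FPMH), assumed constant.
part_failure_rates_fpmh = {
    "microprocessor": 0.80,
    "memory": 0.45,
    "power supply": 2.10,
    "connectors": 0.30,
    "capacitors (total)": 1.25,
}

# Parts-count method: simply add up the failure rates of every part.
system_lambda_fpmh = sum(part_failure_rates_fpmh.values())
system_lambda = system_lambda_fpmh / 1_000_000   # failures per hour

# Under the constant-failure-rate assumption, MTBF is just the reciprocal.
mtbf_hours = 1.0 / system_lambda

print(f"System failure rate: {system_lambda_fpmh:.2f} failures per million hours")
print(f"Predicted MTBF: {mtbf_hours:,.0f} hours")
```

The whole prediction reduces to an addition and a division; everything else is bookkeeping over the parts list.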
As the years passed, the reliability engineer discovered that many gross assumptions had been made in his education. The fundamentals of engineering were overlooked. He had been taught to assume that the failure rate is constant, which makes the calculations simple. Even some of the industry “experts” in aerospace companies seemed to make that assumption.
The real failure rate was rarely constant for any component because of infant-mortality failures from manufacturing defects. Even electronic components showed a decreasing failure rate with time. For mechanical failure mechanisms, the failure rates increased with time. The decreasing failure rates would have been good news, except that the starting failure rate was anywhere from 10 to 15 percent, which made the customer mad as hell.
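The article does not name a life distribution, but the usual way to describe decreasing, constant, and increasing failure rates in one formula is the Weibull hazard function. The sketch below is illustrative only; the parameter values are assumed, not field data.

```python
# Sketch of Weibull hazard rates h(t) = (beta/eta) * (t/eta)**(beta - 1)
# for the three regimes mentioned above. Parameters are illustrative only.

def weibull_hazard(t, beta, eta):
    """Instantaneous failure rate of a Weibull distribution at time t (t > 0)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

eta = 10_000.0  # characteristic life in hours (assumed)

for beta, regime in [(0.5, "infant mortality (decreasing)"),
                     (1.0, "constant (exponential)"),
                     (3.0, "wear-out (increasing)")]:
    rates = [weibull_hazard(t, beta, eta) for t in (100.0, 1_000.0, 10_000.0)]
    trend = " -> ".join(f"{r:.2e}" for r in rates)
    print(f"beta = {beta:3.1f}  {regime:32s} h(t) at 100/1k/10k h: {trend}")
```

The constant-rate case is the single point beta = 1; assuming it for every component amounts to assuming away both infant mortality and wear-out.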
Eventually, management got the message that the customer is not supposed to pay for 10 percent defective products, and therefore some in-house screening was added. This raised the cost of reliability, but the reliability engineer felt more secure. Soon the screening tests became standard operating procedure. This was helpful because management was interested in technical merit. Sometimes the customer wanted this information.
Unfortunately, the result of all his work (which cost about four man-months) had very little to do with the real MTBF. The failure rates in MIL-HDBK-217E were outdated and had been collected over a large variety of applications. They were based on the assumption of a constant failure rate, which implies that the failure distribution for components is exponential. Very few real components actually had this failure distribution.
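The logical step is worth spelling out: a constant hazard rate forces the exponential life distribution, and the familiar MTBF identity falls out of it.

```latex
% Constant hazard rate lambda => exponential life distribution
R(t) = \exp\!\left(-\int_0^t \lambda \, du\right) = e^{-\lambda t},
\qquad
\text{MTBF} = \int_0^\infty R(t)\, dt = \frac{1}{\lambda}
```

So a handbook prediction built on constant failure rates is only as good as the exponential assumption behind it.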
The components, even electronic ones, followed several shapes of failure distributions. All these shapes were ignored. It was too much work to determine the real failure distribution, and such data did not exist in the data banks. Since the whole industry had already been using MIL-HDBK-217E to make reliability predictions, our reliability engineer had no choice but to go along.
The predictions, to be credible, should have been a combined result of a review of past experience, qualification tests, and MIL-HDBK-217E. I suppose a conservative prediction is better than no prediction. Such predictions can always be adjusted by multiplying with so-called experience factors, sometimes crudely called fudge factors!
Arrhenius Model
Our reliability engineer was told that the Arrhenius model applies to electronic components. He was not sure, but he did not question the judgment of those with over 20 years of experience. He did find out later that many failures in electronics are mechanical, and the Arrhenius model does not apply to such failures. He found it convenient to use Arrhenius as long as everyone around him was a believer in it. He also used the activation-energy constant from the published data of device manufacturers. This never made sense to him, since his devices were not built exactly the same way the original manufacturer built them, but he had to go along with it.
This company never had the money to run a few experiments to assess the real activation energy.
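For context, the standard Arrhenius acceleration-factor calculation looks like the sketch below. The activation energy and the temperatures are made-up illustrative values, which is exactly the kind of borrowed number the engineer was uneasy about.

```python
import math

# Sketch of the Arrhenius acceleration factor between a stress (test)
# temperature and a use temperature. All numeric values are illustrative.

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K

def arrhenius_acceleration_factor(ea_ev, t_use_c, t_stress_c):
    """AF = exp[(Ea/k) * (1/T_use - 1/T_stress)], temperatures in kelvin."""
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k))

# Borrowed activation energy (0.7 eV) and typical burn-in temperatures (assumed).
af = arrhenius_acceleration_factor(ea_ev=0.7, t_use_c=55.0, t_stress_c=125.0)
print(f"Acceleration factor: {af:.1f}x")

# The factor is extremely sensitive to Ea, which is the engineer's complaint:
for ea in (0.5, 0.7, 0.9):
    print(f"Ea = {ea} eV -> AF = {arrhenius_acceleration_factor(ea, 55.0, 125.0):.1f}x")
```

A shift of 0.2 eV in the borrowed constant changes the answer by several times over, which is why taking the device manufacturer's published value on faith never made sense to him.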
The management had a great TQM (Total Quality Management) program. But that was only in name, not in spirit. When the time came to put money on the table for quality improvements, the management was very unhappy. The TQM program meant that you talk up improved quality but do not spend time or money implementing the new effort.
FMEA and Fault Trees
After becoming frustrated with the make-believe world of reliability numbers (my opinion), the reliability engineer sought more tools.
He found Failure Mode, Effects, and Criticality Analysis (FMECA) and Fault Tree Analysis (FTA). But he did not quite know how to use them correctly. The experts confused him more than they helped. Every expert had his own way, and industry was already misusing these tools. Many were using these tools to perform reliability estimates and modeling rather than to improve the product design. Reliability engineers labored many months to perform these analyses. The analyses helped draw attention to the failure rates but did not make much impact on the design engineer. Management was satisfied because these are the tools everyone is supposed to use.
The MIL-HDBK-217E predictions, the FMECA, and the fault trees were impressive. They dazzled management. That is, until the recession hit the industry. Then management began searching for places to cut costs, and non-essential tasks became a target. They found reliability engineering to be a non-essential cost. The reliability engineer was given the pink slip, and the whole reliability engineering department was eliminated to achieve profitability.
What Went Wrong?
There is a long list of things that went wrong. I will mention only a few. The right tools, such as FMECA and fault trees, were used by the reliability engineer, but he was not qualified to use them because he did not know all the details of the design. He should always use these tools together with the design engineer and the manufacturing engineer, as a team, BEFORE the design is released, not after. The tools should have been used for design improvement.
The MIL-HDBK-217E should be used for comparing design options, not for field reliability prediction. There are too many assumptions in the MIL-HDBK-217E which the user is not aware of.
The screening tests were used for inspecting the product rather than for learning to eliminate the failure modes and lower the production costs. These observations show that the reliability engineer requested design changes that increased, rather than reduced, the costs.
Look Into The Mirror
The above example is not uncommon. Look at yourself in the mirror. Possibly you will find similarities in your situation. I find indications worldwide that if the misuse of the tools continues, the professional death of the reliability engineer is likely. This is one place the constant failure rate applies. Not to the product. To the reliability profession! Actually, there is nothing wrong with the reliability profession. The problem lies with the professors. They hardly teach reliability in engineering schools, and those who do teach reliability tend to emphasize applied statistics instead of robust design. I hope the universities will do something positive, not only to prevent the death of a reliability engineer, but also the death of a design engineer, who is responsible for reliability.
Dev Raheja is an International Reliability Consultant from Baltimore, Maryland. He originally wrote this article for the March 1990 edition of Reliability Review. A Fellow of ASQ, he can be reached at Draheja@aol.com.
James Karl says
I was a multi-certified reliability engineer who was widely known. I had plant engineering and design experience. I retired out of frustration after all of the senior management reliability supporters were “fired” in a very political internal purge. The new regime did not support any of my programs. I consider reliability mathematics to be simple, yet there was no management support for it (not surprising; maybe they couldn’t do basic math?).
Ann Arbor is a terrible place, not one for any reliability engineer, very disappointing and medieval. In general, interest in reliability seems to occur only south of the Michigan state line. Managers in A.A. look down on engineering numerical fundamentals, and micromanage engineers into being full-time managers themselves.
I put a lot of work and study into my certifications, all for nothing. All I can say is: don’t come to Michigan, you will be so disappointed. Look elsewhere…
Dev Raheja says
Hi James
I agree with you. I wrote “Death of a Reliability Engineer” about 20 years back. Nothing has changed. In my experience, math is not the problem. The problem is the wrong knowledge promoted by reliability engineers. They talk of MTBFs and mean times to failure, which are of no interest to senior managers. They are interested in zero failures with a high return on investment, based on Phil Crosby’s book “Quality is Free.” Read my book “Design for Reliability” for examples. I do that all the time with my clients. Phil Crosby at ITT was my client. With this approach, I get the attention of the Presidents quickly. At Harley-Davidson, the President required all the Purchasing staff to take my training so they could get zero failures from suppliers. The suppliers saved money because they did not have to pay for warranty failures. Let me know if you have some other ideas. Thanks for your post.
Dev Raheja
James Karl says
Thanks Dev
Why is it so hard to get support for reliability engineering in Michigan? Go to Wisconsin, Indiana, Ohio, and Tennessee, and it thrives. The petrochemical industry here does seem to embrace it, though.
Dev Raheja says
Thanks, James, for the second question. My answer is that reliability in Wisconsin, Indiana, Ohio, and Tennessee may be thriving, but their accomplishments are just barely acceptable, which is not good enough.
There are some good companies, but very few. Hyundai is one of them. Their market share was going down for years; then they decided to design for a 10-year minimum life and offered a 10-year warranty to convince customers. Their market share shot up right away. They are making good profits now.
Corning Glass Works is another good company. Their policy is not to ship any product if they know it is not going to be a high-quality product. They include reliability as part of quality. Example: their customer Sony ordered glass screens for TVs. A technician at Corning discovered through life testing that the glass screens could fade after 8 years (reliability). They halted production right away and told Sony to stop their TV production until the problem was corrected. Sony told Corning to ship them anyway, because most users buy a new TV within 8 years anyway. But Corning convinced Sony not to use these potentially defective glasses by offering to pay the cost of stopping production at Sony. THIS IS A GREAT EXAMPLE OF HIGH RELIABILITY.
Ken Neubeck says
I have been a reliability engineer since the 1970s, primarily in the aerospace field, with ten years in the commercial field (product support of bank ATMs).
There are no young reliability engineers left in the United States. Applied Math was a good field in the past to draw reliability talent, but now these applied mathematicians are going elsewhere.
Worse, management does not realize that the MTBF of a product is a point estimate based on the complexity of the components that make up the product. It is not necessarily a selling point. Rather, field performance is the true measure of a product’s reliability. I am the last reliability engineer left on Long Island… I will be turning off the lights soon…
Baldev G Raheja says
I agree. Field performance is the true measure of Reliability.