Death of a Reliability Engineer
(Reproduced from the article “Death of a Reliability Engineer” by Dev Raheja, Reliability Review, Vol. 30, March 2010 with permission)
When I first wrote the article in March 1990, I implied an “F” grade for reliability engineers. Now, almost 20 years later, I would give them an “E”. Yes, there is a little improvement, but nothing you can write to your mother about.
The MTBF cancer was widespread and is still widespread in the DoD. The only reason I upgraded the reliability engineer from F to E is that the MTBF is no longer used in some industries, such as the automotive industry. They use failure rates instead to hide their shame.
Failure rate is just the reciprocal of MTBF. Good job! Same old corn flakes with a new product name!
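The reciprocal relationship is easy to verify. A minimal sketch, with a hypothetical MTBF value, showing that quoting a failure rate instead of an MTBF conveys exactly the same information:

```python
# Failure rate (lambda) is just the reciprocal of MTBF, so renaming the
# quantity changes nothing. The MTBF value below is hypothetical.
mtbf_hours = 50_000.0            # hypothetical MTBF, in hours
failure_rate = 1.0 / mtbf_hours  # failures per hour

# Round-tripping recovers the original number exactly.
assert abs(1.0 / failure_rate - mtbf_hours) < 1e-9
print(failure_rate)  # 2e-05 failures per hour -- same corn flakes
```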
Several recent discussions cause me to recall some highlights of a conversation I had over a decade ago with the late Dr. Austin Bonis while we were conducting the first ASQC course on Reliability Engineering. He made the following interesting statement:
The design engineer knows a lot but is never able to do a lot; the quality control engineer does not know a lot, therefore does not do a lot; the reliability engineer knows a lot, does a lot, but too late!
At this time, I am moved to amend his statement so that it reflects a frequent problem, as follows:
“The reliability engineer knows a lot, does a lot, but what he does is usually wrong!” Also, “If the basic reliability work is done with a lot of mistakes, then it does not matter if the work is done too late!”
Some Good, Some Bad
There are some excellent reliability engineers. They have prepared for their tasks with a basic engineering education that included physics, chemistry, and fundamental concepts and principles. They combine this knowledge with statistical theory.
On the other hand, many so-called reliability engineers ignore physics and chemistry, and fail to consider design reliability lessons. They jump straight into statistics. They are lost if hard data are not available. They ignore the fact that reliability can be improved without statistical analysis.
All one has to do is study the failure modes, accelerated testing results and be aware of the customer problems. The bottom line in reliability is to prevent all failures during the useful life. Some reliability engineers will not agree with this statement. They think a certain level of failures is unavoidable.
In my opinion, such engineers should go and perform time and motion studies rather than work as reliability engineers.
A Case History
An engineer was assigned to work as a Reliability Engineer. He had taken a statistics course in college and therefore felt prepared for the position. Soon he began to encounter difficulty in applying his knowledge of statistics. The test data was never enough; the field data was never complete and was full of errors. He kept complaining about the lack of data collection effort. Meanwhile many shoddy products went out the door. No one knew what to do when only a small quantity of data existed.

After a few product disappointments, the company decided to launch a full-fledged reliability program to assure high reliability in the design. He then proceeded to apply his skills. The first item on the agenda was reliability prediction. Our reliability engineer was eager to see this task done right. Now he had a bunch of numbers in his grip. He could compute failure rates for each component from MIL-HDBK-217E, add them up, and calculate the MTBF. The design engineer was not involved. Why involve him? The product is already designed and being tested!
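The parts-count procedure the engineer followed can be sketched in a few lines. The part names and failure rates below are illustrative placeholders, not actual MIL-HDBK-217E values:

```python
# Sketch of the parts-count prediction method: sum the assumed-constant
# per-part failure rates, then invert to get a series-system MTBF.
# All numbers are hypothetical, not real handbook values.
part_failure_rates = {          # failures per million hours (illustrative)
    "resistor": 0.002,
    "capacitor": 0.010,
    "ic": 0.150,
    "connector": 0.050,
}
total_rate = sum(part_failure_rates.values())  # lambda_system, in FPMH
mtbf_hours = 1e6 / total_rate                  # convert FPMH to hours

# Note: this MTBF is only meaningful if every part really has a constant
# failure rate (exponential life), which the article argues is rarely true.
```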
As years passed, the reliability engineer discovered that many gross assumptions had been made in his education. The fundamentals of engineering were overlooked. He was taught to assume that the failure rate is constant, which makes calculations simple. Even some of the industry “experts” in aerospace companies seemed to make that assumption.
The real failure rate was rarely constant for any component because of infant mortality failures from manufacturing defects. Even electronic components showed decreasing failure rate with time. For mechanical failure mechanisms the failure rates increased with time. The decreasing failure rates would have been good news except that the starting failure rate was anywhere from 10 to 15 percent, which made the customer mad as hell.
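These two regimes can be made concrete with the Weibull hazard function, a standard way to model non-constant failure rates (the parameter values below are illustrative, not from the article):

```python
# The Weibull hazard h(t) = (beta/eta) * (t/eta)**(beta - 1) shows why a
# single constant failure rate is a poor model: shape beta < 1 gives a
# decreasing rate (infant mortality); beta > 1 an increasing rate
# (wear-out); only beta == 1 is constant. Parameters are hypothetical.
def weibull_hazard(t, beta, eta):
    """Instantaneous failure rate at time t for Weibull(beta, eta)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

early = weibull_hazard(100.0, beta=0.5, eta=10_000.0)    # infant mortality
later = weibull_hazard(5_000.0, beta=0.5, eta=10_000.0)
assert early > later  # decreasing hazard for beta < 1

wear1 = weibull_hazard(100.0, beta=3.0, eta=10_000.0)    # wear-out regime
wear2 = weibull_hazard(5_000.0, beta=3.0, eta=10_000.0)
assert wear1 < wear2  # increasing hazard for beta > 1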
Eventually, the management got the message that the customer is not supposed to pay for 10 percent defective products and therefore some in-house screening was added. This raised the cost of reliability but the reliability engineer felt more secure. Soon the screening tests became the standard operating procedures. This was helpful because management was interested in technical merit. Sometimes the customer wanted this information.
Unfortunately, the result of all his work (which cost about four man-months) had very little to do with the real MTBF. The failure rates in MIL-HDBK-217E were outdated and had been collected over a large variety of applications. They were based on the assumption of a constant failure rate, which implies the failure distribution for components is exponential. Very few real components actually had this failure distribution.
The components, even electronics, followed several shapes of failure distributions. All these shapes were ignored. It was too much work to determine the real failure distribution. Such data did not exist in the data banks. Since the whole industry had already been using MIL-HDBK-217E to make reliability predictions, our reliability engineer had no choice but to go along.
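The equivalence between a constant failure rate and an exponential life distribution is worth spelling out, because it is exactly the hidden assumption being criticized. A minimal sketch, with a hypothetical rate:

```python
# A constant failure rate lambda is mathematically equivalent to an
# exponential life distribution, R(t) = exp(-lambda * t), which is
# "memoryless": under this model a well-aged part is treated as exactly
# as reliable as a brand-new one. The rate below is hypothetical.
import math

lam = 1e-5  # hypothetical constant failure rate, failures per hour

def reliability(t):
    """Probability of surviving to time t under the exponential model."""
    return math.exp(-lam * t)

# Conditional survival over the next 1000 hours is the same at any age:
new_part = reliability(1_000.0)
old_part = reliability(51_000.0) / reliability(50_000.0)
assert abs(new_part - old_part) < 1e-12
```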
The predictions, to be credible, should have been a combined result of the review of past experience, qualification tests, and MIL-HDBK-217E. I suppose a conservative prediction is better than no prediction. Such predictions can always be adjusted by multiplying by so-called experience factors, sometimes crudely called fudge factors!
Our reliability engineer was told that the Arrhenius model applies to electronic components. He was not sure, but he did not question the judgment of those with over 20 years’ experience. He did find out later that many failures in electronics are mechanical. The Arrhenius model did not apply to such failures. He found it convenient to use Arrhenius as long as everyone around was a believer in it. He also used the activation energy constant from the published data of device manufacturers. This never made sense to him, since his devices were not built exactly the same as the original manufacturer built them; but he had to go along with it.
This company never had money to run a few experiments to assess the real Activation Energy constant.
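For reference, the Arrhenius model the engineer was told to use computes a temperature acceleration factor from an activation energy. The values below are illustrative textbook numbers, and the activation energy varies by failure mechanism, which is exactly why borrowed constants are suspect:

```python
# Hedged sketch of the Arrhenius acceleration model (illustrative values).
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, eV per kelvin

def arrhenius_af(ea_ev, t_use_k, t_stress_k):
    """Acceleration factor between use and stress temperatures (kelvin)."""
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use_k - 1.0 / t_stress_k))

# 0.7 eV activation energy, 55 C use vs 125 C stress -- a common
# textbook pairing, not a value from the article:
af = arrhenius_af(0.7, 328.15, 398.15)

# The model only describes thermally activated (chemical) mechanisms;
# mechanical failures, as the article notes, follow other physics entirely.
```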
The management had a great TQM (Total Quality Management) program. But that was only in name, not in spirit. When the time came to put money on the table for quality improvements, the management was very unhappy. The TQM program meant that you talk up improved quality but do not spend time or money implementing the new effort.
FMEA and Fault Trees
After becoming frustrated with the make-believe world of reliability numbers (my opinion), the reliability engineer sought more tools.
He found the Failure Mode, Effects, and Criticality Analysis (FMECA) and the Fault Tree Analysis (FTA). But he did not quite know how to use them correctly. The experts confused him more than they helped. Every expert had his own way, and industry was already misusing these tools. Many were using them to perform reliability estimates and modeling rather than to improve the product design. Reliability engineers labored many months to perform these analyses. They helped draw attention to the failure rates but did not make much impact on the design engineer. Management was satisfied because these are the tools everyone is supposed to use.
The MIL-HDBK-217E predictions, the FMECA, and the Fault Trees were impressive. They dazzled management. That is, until the recession hit the industry. Then management began searching for places to cut costs, and non-essential tasks became a target. They found reliability engineering to be a non-essential cost. The reliability engineer was given the pink slip, and the whole reliability engineering department was eliminated to achieve profitability.
What Went Wrong?
There is a long list of things that went wrong. I will mention only a few. The right tools, such as FMECA and the Fault Trees, were used by the reliability engineer, but he was not qualified to use them because he did not know all the details of the design. He should always use these tools together with the design engineer and the manufacturing engineer, as a team, BEFORE the design is released, not after. The tools should have been used for design improvement.
The MIL-HDBK-217E should be used for comparing design options, not for field reliability prediction. There are too many assumptions in the MIL-HDBK-217E of which the user is not aware.
The screening tests were used for inspecting the product rather than for learning to eliminate the failure modes and lower the production costs. These observations show that the reliability engineer requested design changes that increased, rather than reduced, the costs.
Look Into The Mirror
The above example is not uncommon. Look at yourself in the mirror. Possibly you will find similarities in your situation. I find indications worldwide that where misuse of the tools continues, the professional death of the reliability engineer is likely. This is one place the constant failure rate applies. Not to the product. To the reliability profession! Actually, there is nothing wrong with the reliability profession. The problem lies with the professors. They hardly teach reliability in engineering schools. Those who do teach reliability tend to emphasize applied statistics instead of robust design. I hope the universities will do something positive, not only to prevent the death of a reliability engineer, but also the death of a design engineer who is responsible for reliability.
Dev Raheja is an International Reliability Consultant from Baltimore, Maryland. He originally wrote this article for the March 1990 edition of Reliability Review. A Fellow of ASQ, he can be reached at Draheja@aol.com