What’s Wrong Now? Shotgun Repair

Shotgun repair is trying to fix a system problem by replacing parts until the problem goes away. It is often done without regard to the parts’ age-specific reliability. Should you test before replacing a part? Which test(s) should you do? In which order? For how long? Which part should you replace next if the test gave no indication of what’s wrong? What if the test indication is imperfect or the fault is intermittent? What if more than one part has failed?

Automotive OBD (On-Board Diagnostics), its successors, and its foreign equivalents allow a mechanic to plug into the OBD system to collect vehicle data and diagnose problems. OBD-II provides a standardized series of diagnostic trouble codes, or DTCs (see https://en.wikipedia.org/wiki/OBD-II_PIDs), which “allow a person to rapidly identify and remedy malfunctions within the vehicle.” But OBD-II does not necessarily specify which part caused the failure or which part to replace.

Wabash Magnetics Crank Position Sensors

Uniphase had 20% NTF (no trouble found) for lasers and Apple had 50% NTF for computer service parts when I worked for them. Wabash Magnetics claimed that 89-90% of returned crankshaft position sensors were NTF. When a car engine runs rough, mechanics suspect the crank sensor and swap in a new one. That usually doesn’t cure the problem. The auto mechanics’ creed was, “Never remove a part you installed if it didn’t fix the problem.” Returns around the end of the first year (figure 1) look like WEAP, the Warranty Expiration Anticipation Phenomenon [George].

Figure 1. Crank Position Sensor return counts by month since introduction.

Optimal test sequence?

Expert systems attempt to combine experience with the cost or time of tests and the probabilities that tests detect the problem. The objective of an optimal test sequence is to minimize the expected cost or time to detect the failed part [Arruda et al., Boumen et al.].

The same problem formulation occurs in new-product testing when tests have to be done sequentially [Ahmed and Chateauneuf]. Wayne Smith, Apple product reliability manager, told me, “Our job is to make sure that Apple doesn’t ship a computer that doesn’t work.” He said he got only a few computers to test and not enough time-to-market to do all the testing he wanted to do.

What is the optimal test sequence? Which test(s) should you do? In which order? For how long? Estimate the fixed cost and the cost per hour of downtime for each fault. Costs may depend on how many times the same fault has already occurred before repair. Estimate the age-specific probability that each part is the cause of the failure, conditional on system failure.

Lorne G. (Jack) Mitten first published a proof of the optimal test sequence that people already knew: “Test what gives the biggest bang per buck.” I.e., test first for the cause with the largest ratio of probability to test cost or time [Mitten]. [Jack Mitten hired me to teach at the University of British Columbia right out of UC Berkeley. I am grateful for the chance.] This optimal test policy assumes that a failed part stays failed and that tests are perfect. Probable causes depend on the parts’ ages; i.e., on their age-specific field reliability. The test sequence should therefore be age specific, because the optimal test sequence depends on the age of the system.
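The ratio rule can be sketched in a few lines of Python. This is a minimal illustration, not Mitten's derivation: the probabilities and costs below are hypothetical, and the sketch assumes that exactly one part has failed, that tests are perfect, and that every test up to and including the one that finds the fault is paid for. A brute-force search over all orders is included only as a check on the rule.

```python
from itertools import permutations

def expected_cost(order, prob, cost):
    """Expected cost to find the failed part when parts are tested in `order`.

    Assumes exactly one failed part, perfect tests, and that testing
    stops as soon as the failed part is found.
    """
    total, spent = 0.0, 0.0
    for part in order:
        spent += cost[part]          # cumulative cost through this test
        total += prob[part] * spent  # incurred only if this part is the cause
    return total

def ratio_rule(prob, cost):
    """Order parts by decreasing probability-to-cost ratio."""
    return sorted(prob, key=lambda part: prob[part] / cost[part], reverse=True)

# Hypothetical inputs: P[part is the cause | system failed] and test costs.
prob = {"A": 0.5, "B": 0.3, "C": 0.2}
cost = {"A": 6.0, "B": 2.0, "C": 4.0}

best_by_rule = ratio_rule(prob, cost)
best_by_search = min(permutations(prob),
                     key=lambda order: expected_cost(order, prob, cost))
print(best_by_rule, expected_cost(best_by_rule, prob, cost))
print(best_by_search)
```

With these made-up numbers, part A is the most probable cause but is not tested first, because its test is expensive relative to its probability.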

For example, assume Weibull reliabilities for parts A, B, and C. The Excel WEIBULL.DIST() function parametrizes the Weibull cumulative distribution function as 1-EXP(-(t/Beta)^Alpha). The optimal sequence depends on the cost or time of each test and the age-specific probability of part failure. Ratios P[Part Life<=t]/Cost are computed in an Excel spreadsheet. The first set of A, B, and C columns in table 1 gives the parameters and probabilities for each part. The second set of A, B, and C columns lists the optimal test sequence, depending on the age in the first column at which a system failure occurs.

Table 1. Example of optimal part test sequences depending on product ages.

Parameter/Parts	A	B	C	A	B	C
Alpha (shape)	0.5	1	1.5
Beta (scale)	3	6	6.646
Mean	6	6	6
Cost or time	6.3	10	12
Age t	P[Life>t]	P[Life>t]	P[Life>t]
0	1	1	1
1	0.561	0.846	0.943	3	2	1
2	0.441	0.716	0.847	1	3	2
3	0.367	0.606	0.738	1	2	3
4	0.315	0.513	0.627	1	2	3
5	0.275	0.435	0.521	3	2	1
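The Weibull quantities behind an example like this can be sketched in Python. The shape, scale, and cost values are taken from table 1; the function names are illustrative, and the ranking simply applies the ratio P[Life<=t]/Cost as described in the text. It is not the author's Excel workbook, so its orderings are not expected to reproduce the sequence columns of table 1.

```python
import math

# Shape (Alpha), scale (Beta), and test cost or time from table 1.
PARTS = {
    "A": {"alpha": 0.5, "beta": 3.0, "cost": 6.3},
    "B": {"alpha": 1.0, "beta": 6.0, "cost": 10.0},
    "C": {"alpha": 1.5, "beta": 6.646, "cost": 12.0},
}

def weibull_cdf(t, alpha, beta):
    """P[Life <= t] = 1 - exp(-(t/beta)**alpha), the Excel WEIBULL.DIST form."""
    return 1.0 - math.exp(-((t / beta) ** alpha))

def weibull_mean(alpha, beta):
    """Mean life = beta * Gamma(1 + 1/alpha)."""
    return beta * math.gamma(1.0 + 1.0 / alpha)

def ratio_order(t):
    """Parts sorted by decreasing P[Life <= t] / cost at system age t."""
    def ratio(name):
        p = PARTS[name]
        return weibull_cdf(t, p["alpha"], p["beta"]) / p["cost"]
    return sorted(PARTS, key=ratio, reverse=True)

# All three parts have (approximately) the same mean life of 6.
for name, p in PARTS.items():
    print(name, round(weibull_mean(p["alpha"], p["beta"]), 2))

# Ratio-based order at each age, under this sketch's assumptions.
for age in range(1, 6):
    print(age, ratio_order(age))
```

The survival probability P[Life>t] is 1 - weibull_cdf(t, alpha, beta); for part A at age 1 that is about 0.561.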

Optimal Policy Depends on Test Time and Sequence

What if the objective is to minimize test time to find which part caused system failure? Don’t parts age while other parts are being tested?

Suppose a series system and its parts continue to age during sequential testing, perhaps because the system must operate continuously. The parts’ probabilities of failure, conditional on system failure, increase during the testing. This may lead to futile testing if the system failure indication is faulty, or it may lead to additional failure(s) while testing. For the simple example in table 2, please assume that there is at most one failure. A sequel will describe how multiple-failure testing is done on the Space Station using fault tree analysis [Iverson et al., Vesely et al., Lambert and Yadigaroglu].

Table 2 enumerates all test sequence permutations (ABC, BCA, CAB, BAC, ACB, CBA) with the P[Part Life<=t]/time ratios and their sums. Table 2 uses the same Weibull distribution parameters as table 1, but the test times in table 2 are shorter so that the example fits in this article. The age-specific failure probabilities are the same as in table 1 and account for the ages of the parts when they are tested. Each entry in columns A, B, and C in the lower part of table 2 is the probability of failure at the time of the part’s test divided by its test time; the Total column is their sum. It looks like the optimal test sequence according to the sums of the P[Part Life<=t]/time ratios is CBA, followed by either BCA or CAB, tied.

Unfortunately, that’s wrong! If you want to minimize expected total test time (last column), then the optimal test sequence is BCA, with an expected test time of 2.5 time units. The expected total test time (E[Time]) accounts for possible deterioration of parts while they await tests. For example, in test sequence ABC, part B waits 1 time unit while A is tested. If the test determines that A has not failed, part B is one time unit older when its test begins. Similarly, if part B has not failed, part C is 1+2=3 time units older when its test begins.

Table 2. Optimal part test sequence(s) depend on components’ ages at the times they are tested, which depend on test sequence.

Parameter/Parts	A	B	C
Alpha (shape)	0.5	1	1.5
Beta (scale)	3	6	6.646
Mean	6	6	6
Test time	1	2	3
Probability/Time	A	B	C	Total	E[Time]
ABC	0.561	0.303	0.141	1.006	2.931
BCA	0.243	0.423	0.246	0.912	2.500
CAB	0.315	0.423	0.174	0.912	3.134
BAC	0.368	0.423	0.209	1.000	3.555
ACB	0.561	0.217	0.283	1.061	3.031
CBA	0.243	0.257	0.314	0.814	3.671
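Enumerating test orders when parts age during testing can be sketched as follows. The Weibull parameters and test times come from table 2, but the expected-time model is an assumption made for illustration: each part's failure probability is evaluated at its age when its own test finishes, and the probabilities are normalized to sum to 1 to represent exactly one failed part. The author's workbook may compute E[Time] differently, so the numbers printed here are not expected to match table 2.

```python
import math
from itertools import permutations

# Weibull shape, scale, and test time for each part, from table 2.
PARTS = {
    "A": (0.5, 3.0, 1.0),
    "B": (1.0, 6.0, 2.0),
    "C": (1.5, 6.646, 3.0),
}

def weibull_cdf(t, alpha, beta):
    """P[Life <= t] in the Excel WEIBULL.DIST parametrization."""
    return 1.0 - math.exp(-((t / beta) ** alpha))

def expected_test_time(order, start_age=0.0):
    """Expected time to locate the fault for one test order.

    Assumed model (not the author's workbook): each part's failure
    probability is evaluated at its age when its test finishes, and the
    probabilities are normalized to sum to 1 (exactly one failed part).
    """
    elapsed, weights, finish_times = 0.0, [], []
    for name in order:
        alpha, beta, test_time = PARTS[name]
        elapsed += test_time  # parts waiting for their tests keep aging
        weights.append(weibull_cdf(start_age + elapsed, alpha, beta))
        finish_times.append(elapsed)
    total = sum(weights)
    return sum(w * t for w, t in zip(weights, finish_times)) / total

# Evaluate all 3! = 6 test orders and list them from best to worst.
results = {"".join(order): expected_test_time(order)
           for order in permutations(PARTS)}
for order, value in sorted(results.items(), key=lambda kv: kv[1]):
    print(order, round(value, 3))
```

Because the age at which each part is tested depends on the order, a single probability-to-time ratio per part no longer determines the best sequence; enumeration is feasible for three parts but grows factorially with the number of parts.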

If you would like the Excel workbook that does the computations for tables 1 and 2, ask pstlarry@yahoo.com. Meanwhile, start estimating the nonparametric, age-specific reliability of your systems and their parts; GAAP requires statistically sufficient ships and returns counts.

References

Hussam Ahmed and Alaa Chateauneuf, “Optimal Number of Tests to Achieve and Validate Product Reliability,” Reliability Engineering & System Safety, Volume 131, pp. 242-250, 2014

Edilson F. Arruda, Basilio B. Pereira, Clarissa A. Thiers, and Bernardo R. Tura, “Optimal testing policies for diagnosing patients with intermediary probability of disease,” Artificial Intelligence in Medicine, Volume 97, pp. 89-97, June 2019

R. Boumen, I. S. M. de Jong, J. M. van de Mortel-Fronczak, and J. E. Rooda, “Test Time Reduction By Optimal Test Sequencing,” INCOSE, 2005

L. L. George, “Test vs. Field Reliability: Read it and WEAP,” Test Magazine, June 2018

D. L. Iverson, L. L. George, and F. A. Patterson-Hine, “Fault Tree Based Diagnosis With Optimal Test Sequencing for Field Service Engineers,” Technology 2004, NASA, Washington, DC, Nov. 8-10, 1994

H. E. Lambert and G. Yadigaroglu, “Fault Trees for Diagnosis of System Fault Conditions,” Nuc. Sci. and Eng., Vol. 62, pp. 20-34, 1977

L. G. Mitten, “An Analytic Solution to the Least Cost Testing Sequence Problem,” J. of Ind. Eng., pp. 16-17, Jan.-Feb. 1960

W. E. Vesely, F. F. Goldberg, N. H. Roberts, and D. F. Haasl, “Fault Tree Handbook,” NUREG-0492, U.S. Nuclear Regulatory Commission, Washington, DC, 1981

About Larry George
