“Aircraft LRUs test NFF (No-Failure-Found) approximately 50% of the time” [Anderson]. Wabash Magnetics claimed returned crankshaft position sensors had 89–90% NTF (No-Trouble-Found), Uniphase had 20%, and Apple Computer had 50% [George].
Figure 1. Actuarial TNI (Trouble Not Indicated) return rates for Freightliner crank sensors show quarterly and annual periodicity. These rates are derived from population ship and return counts. Crank sensors require an ohmmeter to check resistance against limits that differ for each part number.
Shotgun tests test or replace everything and then decide whether the failure has been fixed. Compare the cost of a shotgun test with the cost and test time of the optimal test sequence, which runs until a test recognizes the failure. The simple solution of sequencing tests in order of bang-per-buck, P[Part Failure]/(Cost of part test) [Mitten], appears to be optimal even if part failures are intermittent.
This article describes the optimal test sequence and test time to locate intermittent failures: https://en.wikipedia.org/wiki/Intermittent_fault. NTF, NFF, DNI, DNF, CND, FNF, and other acronyms indicate the problem. Using a specific test sequence may not be enough to optimize the process of detecting and eliminating intermittent failures. Tests could depend on time spent testing as well as test costs and the probabilities of detecting an intermittent failure. Testing for intermittent failures has been the subject of many publications.
“NFF (No-Failure-Found) costs the DoD between $2 and $10 billion annually,” [Anderson]. Ken Anderson’s presentation shows worst cases and describes some automated test equipment to detect specific intermittent failures.
Sensitivity vs. specificity: testing could also generate false positives. “Type I and type II error probabilities are included and single-pass sample paths are required. The model accounts for the expected costs of testing components, false positive termination, and no-defect-found outcome” [Reller]. Her thesis includes a BASIC program for an optimal test sequence, similar to table 2 in this article, for eight subsystems.
Intermittent failures could depend on the ages of the system or the subsystem containing the failure. The intermittence rate could increase, the duration of the intermittence could increase, or the detectability of the failure could improve with age [Ahmad et al., Nakagawa et al.]. Those authors describe test methods, fault tree analysis, and Markov models, but not test sequencing.
What is the optimal diagnostic method for multiple failures with imperfect or unreliable tests? Shakeri et al. describe algorithms for diagnostics, including repetitive tests and multiple faults or failures.
Where could we get the data for test time, test cost, intermittent failure probabilities, and imperfect test detection probabilities? Ask people with experience. Update probabilities with Bayes law. Apply the results to combat readiness [Vatterott et al.]. Their article won the INFORMS Franz Edelman award, perhaps because of its comprehensive scope from origin to implementation.
Optimal test sequencing is also important for testing new products [Boumen et al.] and in medical diagnostics [Arruda et al.]. Software testing might also optimize the test sequence [Srivastava].
Geometric Distributions of Intermittent Failure Detection
Automated Test Equipment is usually designed to detect one kind of failure in one subsystem. Some tests simply indicate whether a failure is present or absent at the moment when the test is done. This means that the number of tests to find an intermittent failure in one system has the geometric distribution with some probability of detection at each test. What if failures could occur in more than one subsystem? What if failures could be of more than one type?
Assume an intermittent failure has probability p of being detected when tested, because the failure is present and the test detects it. The number of tests to detect the failure has the geometric distribution, the discrete equivalent of the continuous exponential distribution, with the “memoryless” property. The probability that the failure is first detected on the second test is p*(1-p), and the mean number of tests to successful failure detection is 1/p.
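A minimal sketch (my own illustration, not from the article) of the geometric distribution of tests to detection, assuming each test independently detects the failure with constant probability p:

```python
# Sketch: number of tests until an intermittent failure is detected, when
# each test detects it independently with probability p, is geometric.

def detection_probabilities(p, n_tests):
    """P[failure first detected on test t] = p * (1-p)**(t-1), t = 1..n_tests."""
    return [p * (1 - p) ** (t - 1) for t in range(1, n_tests + 1)]

def mean_tests_to_detection(p):
    """Geometric mean: expected number of tests until first detection."""
    return 1.0 / p

if __name__ == "__main__":
    p = 0.2
    print(detection_probabilities(p, 3))   # ≈ [0.2, 0.16, 0.128]
    print(mean_tests_to_detection(p))      # 5.0
```

With p = 0.2, detection takes 5 tests on average, matching the 1/p mean above.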
A shotgun test tests all subsystems (or replaces parts) and then decides whether a failure has been detected or eliminated. Compare the shotgun test with the optimal test sequence for intermittent failures, which terminates testing on failure detection. Consider testing a series-subsystem product with intermittent failures, different test costs or times, and different detection probabilities, whether due to the intermittency of the failure or to imperfect tests. With two subsystems, tests could be conducted in order 1 then 2, and repeated if no failure is found. Is it optimal to repeat the sequence 1 then 2 if no failure is found? It could depend on the intermittent failure occurrence and detection probabilities.
If there is more than one subsystem to be tested, the cost or time of the optimal test sequence depends on the results of previous passes through the subsystems. Table 1 shows the probabilities and costs for a shotgun test of both subsystems on successive passes, when earlier passes do not detect the failure. A shotgun test tests both parts and then indicates whether a failure is detected. Costs could be either time or $$$. Ptest1 is the probability of detecting the subsystem failure on the first pass, and Ptest2 is the probability of detecting it on the second pass, conditional on no-failure-found on the first pass. “Means” are the geometric expected numbers of passes until the failure in either subsystem is found, and the expected cost.
Table 1. Success probabilities and costs for shotgun test of two independent subsystems in series. The first four passes or shotgun replacements are shown. The probability of detection on the second pass is Ptest2 = Ptest1/(1-0.3), because 1-0.3 is the probability of no failure found on the first-pass test sequence: P(A|B) = P(A and B)/P(B), where event A is detection on the second pass, event B is the event that a second pass is needed, and A-and-B is the event of a failure detection on the second pass.
| Subsystem | 1 | 2 |
|---|---|---|
| Cost or time | $6 | $10 |
| Ptest1 | 0.1 | 0.2 |

| Pass | P[Detect 1] | P[Detect 2] | P[Detect] | Total Cost |
|---|---|---|---|---|
| 1st Pass | 0.1 | 0.2 | 0.3 | $16 |
| 2nd Pass | 0.143 | 0.286 | 0.429 | $32 |
| 3rd Pass | 0.175 | 0.350 | 0.525 | $48 |
| 4th Pass | 0.211 | 0.421 | 0.632 | $64 |
| Means | 10 passes | 5 passes | 3.33 passes | $53.33 |
On the first pass, the shotgun test costs $16 and the shotgun detection probability is 0.3. On the second pass, if necessary, the cost is another $16 and the detection probability is 0.429; the total cost of two shotgun passes is $32. The expected number of shotgun passes is 1/0.3 = 3.33, and the expected cost is 3.33 passes times $16 per pass = $53.33.
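The Table 1 arithmetic can be checked with a short script (my own sketch). It applies the rule from the table caption: the conditional detection probability on each later pass is Ptest1 divided by (1 minus the previous pass's total detection probability).

```python
# Sketch reproducing Table 1: shotgun-test both subsystems each pass.
# Per-pass conditional probabilities follow the caption's rule
# Ptest(pass n) = Ptest1 / (1 - total detection probability of pass n-1).

costs = [6.0, 10.0]          # test cost or time per subsystem
p_first = [0.1, 0.2]         # first-pass detection probabilities
pass_cost = sum(costs)       # a shotgun pass tests everything: $16

rows = []
prev_total = 0.0
for n in range(1, 5):
    p = [pi / (1.0 - prev_total) for pi in p_first] if n > 1 else p_first[:]
    total = sum(p)
    rows.append((n, round(p[0], 3), round(p[1], 3), round(total, 3), n * pass_cost))
    prev_total = total

mean_passes = 1.0 / sum(p_first)        # geometric mean, 1/0.3 ≈ 3.33 passes
mean_cost = mean_passes * pass_cost     # ≈ $53.33

for r in rows:
    print(r)
print(round(mean_passes, 2), round(mean_cost, 2))
```

The printed rows match the four passes of Table 1, and the means match 3.33 passes and $53.33.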
Table 2 is an example of the optimal test sequence assuming testing stops on detection of the failure, using the same cost or time data as in my previous article on shotgun and optimal test sequence, and the same as table 1. Mitten’s rule sequences tests in decreasing order of P[Test]/Cost; test 1’s ratio is 0.1/$6 = 0.0167 and test 2’s is 0.2/$10 = 0.02, so Mitten’s rule would put test 2 first. Table 2, which accounts for termination on detection and for repeated passes, nevertheless favors sequence 1,2.
The total test time or cost depends on the optimal test sequences at successive passes. Testing stops when the single failure is detected. The probability of detecting the failure on the first test is 0.1 for subsystem 1 or 0.2 for subsystem 2, depending on the test sequence. If the first test doesn’t detect a failure, the other subsystem is tested, and the probability of detecting the failure somewhere in the pass is 0.28 regardless of the sequence, even though testing stops if the first test detects a failure. The probability of NOT detecting an intermittent failure, 1-0.28 = 0.72, is the same regardless of test sequence, equivalent to shotgun testing every subsystem. The pass is repeated if no failure is found.
The expected costs differ depending on the test sequence on successive passes, if they are necessary. I tried different costs and detection probabilities, but the optimal test sequence remained the same for pass 1 and pass 2. Although it appears that the optimal test sequence on successive passes is 1,2, I wouldn’t claim that generalizes, especially to more than two subsystems. Obviously the expected cost of a sequential test is less than that of shotgun tests, because sequential tests terminate on failure detection.
Table 2. Two subsystems with one intermittent failure. The 1,2 and 2,1 denote the test sequence alternatives. The first pass test sequence 1,2 costs less than 2,1, and the second pass optimal test sequence is also 1,2, conditional on the fact that the first pass does not detect the intermittent failure.
| Subsystem | 1 | 2 |
|---|---|---|
| Cost | $6 | $10 |
| Ptest1 | 0.1 | 0.2 |

| 1st Pass | P[Detect 1] | P[Detect 2] | Total P | 1-Pass Cost | E[Cost] | Total |
|---|---|---|---|---|---|---|
| 1 | 0.1 | | | $6 | $0.60 | |
| 2 | | 0.2 | | $10 | $2.00 | |
| 1,2 | | 0.18 | 0.28 | $16 | $4.48 | $5.08 |
| 2,1 | 0.08 | | 0.28 | $16 | $4.48 | $6.48 |

| 2nd Pass | P[Detect 1] | P[Detect 2] | Total P | 2-Pass Cost | E[Cost] | Total |
|---|---|---|---|---|---|---|
| Ptest2 | 0.138889 | 0.277778 | | | | |
| 1 | 0.138889 | | | $6.00 | $0.83 | |
| 2 | | 0.277778 | | $10.00 | $2.78 | |
| 1,2 | | 0.24 | 0.38 | $16.00 | $6.05 | $11.96 |
| 2,1 | 0.100309 | | 0.38 | $16.00 | $6.05 | $15.31 |
The optimal test sequence for the second pass seems to be the same as for the first pass! The optimal test sequence is in order 1,2 with expected cost $5.08 for the first pass and expected cost for two passes if necessary is $11.96, including the cost $5.08 of the first pass.
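Table 2 can also be checked with a sketch. This is my own reconstruction; the cost accounting, the first-test detection contribution plus the total pass detection probability times the full pass cost, is inferred from the table's numbers rather than stated in closed form in the article.

```python
# Sketch reproducing Table 2 (my reconstruction of its cost accounting).

def pass_metrics(seq, p, cost):
    """seq: test order as subsystem indexes, e.g. (0, 1) for sequence 1,2."""
    first, second = seq
    p_second = (1.0 - p[first]) * p[second]   # detect on 2nd test of the pass
    total_p = p[first] + p_second             # detect somewhere in the pass
    e_cost = p[first] * cost[first] + total_p * sum(cost)
    return total_p, e_cost

cost = [6.0, 10.0]
p1 = [0.1, 0.2]

tp12, ec12 = pass_metrics((0, 1), p1, cost)   # sequence 1,2
tp21, ec21 = pass_metrics((1, 0), p1, cost)   # sequence 2,1
print(round(tp12, 2), round(ec12, 2))         # 0.28 5.08
print(round(tp21, 2), round(ec21, 2))         # 0.28 6.48

# Second pass, conditional on no detection in the first pass:
p2 = [pi / (1.0 - tp12) for pi in p1]         # ≈ [0.138889, 0.277778]
tp2, ec2 = pass_metrics((0, 1), p2, cost)
# Cumulative cost of 1,2 then 1,2, including the single-test contribution
# 0.138889 * $6 ≈ $0.83 that Table 2 folds into its running total:
print(round(ec12 + p2[0] * cost[0] + tp2 * sum(cost), 2))   # 11.96
```

Both passes favor sequence 1,2, matching the table's $5.08 and cumulative $11.96.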
Optimal Test Time for an Intermittent Failure
If you replace every part, the intermittent failure might go away. “Shotgun” repair replaces parts until the intermittent failure goes away. But you still have to wait to determine whether the failure goes away after all part replacements. And the failure may come back after the shotgun replacement, so you have to do it again.
You may complain that the optimal test sequence might be irrelevant for intermittent failures; the real problem is how long to test before doing shotgun replacement of every part that could have failed. It seems reasonable to spend some time waiting for an intermittent failure instead of just running a sequence of tests and hoping the intermittent failure is present when the tests are done. Test time costs money, and different tests may incur different costs, as well as the costs of replacement parts if the test is part replacement to see whether the failure goes away.
Alternatives include combinations of: ignore the problem, test for problems, replace the most probable cause(s), verify the fix, and repeat as required. You probably won’t repeat this indefinitely, because eventually you’ll replace everything. Troubles mount if more failures occur before the original failures are cured, or if test tactics cause additional failures. Nevertheless, it may be worthwhile to optimize test time for all tests.
The objective is to derive the maximum test time T to minimize the expected cost of testing vs. shotgun replacement. The presumption is that testing actually detects the failed part, if the system demonstrates the failure while the system is under test.
Problem formulation: the objective is to find the optimal system test time T depending on the cost of test time CT, replacement parts’ costs C, and the intermittent failure rate lambda. I.e., minimize the total time or cost per failure (including repeated tests and parts), subject to limited time, tools, and spares inventory; imperfect diagnostics; and not exceeding the warranty replacement cost plus intangible costs. Intangible costs include frustration for both customers and service people, manpower, facilities, tools, spares, repeated service demands, and deleterious WOM (Word-Of-Mouth).
Murphy’s assumption: P[part i causes failure|test shows failure j] > 0; i.e., several parts might cause failure j, and part i might cause several failures. The probability that replacing the part identified with the failure fixes the problem is less than 100%, replacement may cause other problems, and results may depend on what was replaced last time and how long the test was.
Caution: neurohazards ahead! Mathematics and statistics! Nomenclature:
k = the number of parts to be tested or replaced
lambda = intermittent failure rate, spelled out because the Greek letter λ doesn’t always render.
p = probability the correct part is identified and fixed or replaced if intermittent failure is observed during a test.
1/(k-i+1) := probability the correct part is replaced when the failure is not seen in the i-th test. The shotgun probability 1/(k-i+1) could be replaced with Pareto weights if FRUs are replaced in order of observed field-failure frequency.
T*CT+C := cost of test time T, where CT is the cost per unit time and C is the fixed cost of the test or the part cost, assumed the same for the part that causes the intermittent failure and for the other parts.
If the failure is observed during every test, the expected number of tests until a correct repair is 1/p. With no testing, the expected number of shotgun repairs or replacements is (1+k)/2.
P[correct repair at the first test] = P[correct and no failure in test] + P[correct and failure] = P[k,p,T,lambda] := Exp[-lambda*T]/k + p*(1-Exp[-lambda*T]). Each attempt costs T*CT+C, and the expected number of attempts is 1/P[k,p,T,lambda], so the expected cost is
EC[CT,C,k,p,T,lambda] := (T*CT+C)/(Exp[-lambda*T]/k + p*(1-Exp[-lambda*T]))
To find the optimal test time T, solve the expected cost derivative equation dEC[CT,C,k,p,T,lambda]/dT == 0 for T. The solution is pretty messy. I used Mathematica.
Figure 2 shows results for some inputs with different numbers of parts or FRUs.
Figure 2. Test costs $1 per unit time, part costs $10, 0.9 = P[replace right component|failure appears], and 0.1 = failure rate per unit time.
The probability of correct repair at the first service call is the accidental probability of replacing the failed part, even though it’s an intermittent failure, plus the probability of replacing the correct part when the failure occurs during the test:
P[correct|no failure in test]*P[no failure in test time T]+P[correct|failure]*P[failure in test time T]
Assuming system failures occur as Poisson(lambda) process events, P[No failure in test time T] = Exp[-lambda*T]. Then the probability of correct repair is P[k,p,T,lambda]:=
(1/k)*Exp[-lambda*T]+p*(1-Exp[-lambda*T])
The cost of the repair at any attempt is T*CT+C, and the expected number of repairs is
1/P[k,p,T,lambda], so the expected cost is
EC[CT,C,k,p,T,lambda] := (T*CT+C)/((1/k)*Exp[-lambda*T] + p*(1-Exp[-lambda*T]))
With no testing, the expected number of shotgun repairs is (1+k)/2, and the total cost is C*(1+k)/2. If testing is attempted, then the maximum test time is obtained by finding the test time T that equates the marginal costs of testing and shotgun repair.
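Lacking Mathematica, a grid search approximates the optimal test time. This sketch is my own numeric stand-in: it uses P[k,p,T,lambda] as defined above and takes the expected cost as the cost per attempt, T*CT+C, times the geometric expected number of attempts, 1/P. The parameters follow the Figure 2 caption; the k values are illustrative.

```python
import math

# Sketch: choose test time T to minimize expected cost (T*CT + C)/P, where
# P = P[k,p,T,lambda] is the probability of a correct repair per attempt.
# Figure 2 caption parameters: CT = $1/unit time, C = $10 per part,
# p = 0.9 = P[replace right component|failure appears], lambda = 0.1.

def p_correct(k, p, T, lam):
    """P[correct repair]: lucky shotgun pick if no failure shows, else p."""
    return math.exp(-lam * T) / k + p * (1.0 - math.exp(-lam * T))

def expected_cost(CT, C, k, p, T, lam):
    """Cost per attempt times geometric expected number of attempts, 1/P."""
    return (T * CT + C) / p_correct(k, p, T, lam)

def optimal_T(CT, C, k, p, lam, t_max=100.0, step=0.01):
    """Coarse grid search for the cost-minimizing test time."""
    return min((i * step for i in range(int(t_max / step) + 1)),
               key=lambda T: expected_cost(CT, C, k, p, T, lam))

if __name__ == "__main__":
    for k in (2, 5, 10):
        T = optimal_T(CT=1.0, C=10.0, k=k, p=0.9, lam=0.1)
        ec = expected_cost(1.0, 10.0, k, 0.9, T, 0.1)
        shotgun = 10.0 * (1 + k) / 2        # parts-only shotgun cost, C*(1+k)/2
        print(k, round(T, 2), round(ec, 2), shotgun)
```

With these inputs, testing is not worthwhile for k = 2 (the optimal T is zero), but as k grows the optimal test time and the savings over shotgun replacement both increase, which is the qualitative pattern Figure 2 illustrates.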
Conclusions
This article describes and compares alternatives for testing for intermittent failures: shotgun (replace everything and hope the failure doesn’t happen again), optimal test sequence repeated if necessary, and optimal time to test whether using shotgun or optimal test sequence. These alternatives are not all there is to know about dealing with intermittent failures.
Pretty obviously shotgun replacement of all suspected failed parts may not cure intermittent failures. I conjecture that the optimal test sequence may be the same for repeated passes if failure is not found on the first pass. I checked a three-subsystem optimal test sequence, and it did not change on multiple passes, even with different costs and failure detection probabilities. The optimal time T to spend testing is a complicated but computable formula.
If you want the Excel workbook for tables 1 and 2 (including 3 subsystems) or the Mathematica notebook for optimal test time, let me know (pstlarry@yahoo.com). If you want Mathematica solutions, send your cost and failure parameter data. If intermittent events do not have Poisson counts or exponential distribution intermittency, send field data, and I will estimate distributions and try to adapt test sequences and time to optimize intermittent failure testing.
References
Ken Anderson, “Intermittent Fault Detection Technology Reduces No Fault Found (NFF) & Enables Cost Effective Readiness,” Universal Synaptics, https://www.usynaptics.com/wp-content/uploads/2020/08/Intermittent-Fault-Detection-Overview.pdf, 2018
Edilson F. Arruda, Basilio B. Pereira, Clarissa A. Thiers and Bernardo R. Tura “Optimal testing policies for diagnosing patients with intermediary probability of disease,” Artificial Intelligence in Medicine, Volume 97, Pages 89-97, June 2019
R. Boumen, I.S.M. de Jong, J.M. van de Mortel-Fronczak and J.E. Rooda “Test Time Reduction by Optimal Test Sequencing,” INCOSE, 2005
L. L. George, “What’s Wrong Now? Shotgun Repair,” https://accendoreliability.com/whats-wrong-now-shotgun-repair/ Sept. 2022
L. L. George, “What’s Wrong Now? Multiple Failures,” https://accendoreliability.com/whats-wrong-now-multiple-failures/#more-499735, Sept. 2022
L. G. Mitten, “An Analytic Solution to the Least Cost Testing Sequence Problem,” J. of Ind. Eng., pp. 16-17, Jan.-Feb. 1960
T. Nakagawa and K. Yasui, “Optimal testing-policies for intermittent faults,” IEEE Transactions on Reliability, Volume 38, Issue 5, December 1989
T. Nakagawa, M. Motoori, K. Yasui “Optimal testing policy for a computer system with intermittent faults,” Reliability Engineering & System Safety, Volume 27, Issue 2, pp. 213-218, 1990
Susan R. Reller “Reliability Diagnostic Strategies for Series Systems Under Imperfect Testing,” Masters Thesis, VPI, Sept. 1987
Mojdeh Shakeri, Krishna R. Pattipati, Vijaya Raghavan, and A. Patterson-Hine “Optimal and Near-Optimal Algorithms for Multiple Fault Diagnosis with Unreliable Tests,” IEEE Transactions on Systems, Man, and Cybernetics—Part C: Applications and Reviews, Vol. 28, No. 3, August 1998
Praveen Ranjan Srivastava “Optimal Test Sequence Generation Using Firefly Algorithm”, Swarm and Evolutionary Computation Journal, Vol 8, No 1, pp. 44-53, Feb, 2013
Andrew Vatterott and 16 other authors “Bayesian Networks for Combat Equipment Diagnostics,” Interfaces, February 2017
I should have cited Eileen Bjorkman, “Test and Evaluation Resource Allocation Using Uncertainty Reduction [entropy] as a Measure of Test Value,” George Washington Univ. PhD thesis, Aug. 2012.
Not only is entropy reduction a good idea for test planning, her thesis also reviews alternative test-planning articles beyond sequential test plans!