In those situations where we sample without replacement, meaning the odds change after each sample is drawn, we can use the hypergeometric distribution for modeling. Great, sounds like statistician talk. So, let’s consider a real situation.
While working with a component vendor on a new part, we agreed to conduct some testing. We agreed on the testing conditions and determined that if the 30 samples provided passed the test, we would accept the component and use it in our products. The vendor was happy with the test conditions and even helped to set up and conduct the test. They were very confident the testing would pass, given their stated failure rate of 100 ppm (a 1/10,000 chance) based on their internal testing and field history.
The total production of components, which provided the evidence that everything was all right, was about 500,000 units. The testing proceeded, and 28 of the 30 samples failed. Everyone involved was surprised by this result.
- The test setup was done properly.
- No damage occurred during shipment or setup.
- The test measurements were accurate and repeatable.
- Even the vendor representative agreed everything was done correctly.
A lot of meetings, discussions, and failure analysis occurred. The 28 of 30 units were valid failures. Everyone involved accepted this result.
So, my boss received the argument that the 28 failures among the 30 samples drawn at random* were due to simple random chance.
Interesting
Having just learned about the hypergeometric distribution, I decided to calculate the probability of 28 of 30 units failing when the population of 500,000 enjoyed a 1/10,000 failure rate. Given that failure rate, we would expect 500,000 × 1/10,000 = 50 failures, meaning that out of the entire half million units only about 50 are expected to fail.
One way to express the probability mass function of the hypergeometric distribution is
$$ \large\displaystyle f(x,N,n,m)=\frac{\binom{m}{x}\binom{N-m}{n-x}}{\binom{N}{n}}$$
where $$ \large\displaystyle \binom{m}{x}=C_{x}^{m}=\frac{m!}{x!(m-x)!}$$
This calculates the probability of exactly x successes in a sample of n taken from a population of N that contains m successes. For this problem we define a test failure as a 'success', and we are interested in the chance that a sample of n = 30, taken from the half million units (N = 500,000) containing an expected m = 50 'successes', would result in exactly x = 28 'successes'.
$$ \large\displaystyle f(28,500{,}000,30,50)=\frac{\binom{50}{28}\binom{500{,}000-50}{30-28}}{\binom{500{,}000}{30}}=\frac{\left( 8.87498\times {{10}^{13}} \right)\left( 1.24975\times {{10}^{11}} \right)}{3.50802\times {{10}^{138}}}\approx 3.16\times {{10}^{-114}}$$
This means there is roughly a 1 in 10^114 chance of this result if the population really does have the claimed low failure rate. We determined the test was valid and the argument of 'random chance' was not plausible. We did not accept the new component into our application.
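If you want to check the arithmetic yourself, here is a minimal sketch in Python using only the standard library's math.comb; the optional scipy cross-check at the end assumes scipy's hypergeom parameter order (population size, number of 'successes', sample size).

```python
# Minimal sketch of the calculation above, standard library only.
from math import comb

N = 500_000   # total production population
n = 30        # samples tested
m = 50        # expected defective units: 500,000 x 1/10,000
x = 28        # observed failures in the sample

# Hypergeometric pmf: P(X = x) = C(m, x) * C(N - m, n - x) / C(N, n)
p = comb(m, x) * comb(N - m, n - x) / comb(N, n)
print(f"P(exactly {x} failures in {n} samples) = {p:.3e}")  # ~3.16e-114

# Optional cross-check with scipy (parameter order: k, M, n, N):
# from scipy.stats import hypergeom
# print(hypergeom.pmf(x, N, m, n))
```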
*A truly random sample is very unlikely – most likely the samples came from a few lots that were immediately available. We are assuming the samples and lots are representative of the overall population.
Related:
Sample Size – success testing (article)
Hypothesis Tests for Variance Case I (article)
Paired-Comparison Hypothesis Tests (article)
Sorin Voiculescu says
I fully agree with the interpretation of the results, both from the point of view of the final client and from the statistical one. The present case set out to contradict the "simple random chance" claim and did so successfully.
I am just trying to relate the crushing results to real components. We have, on one hand, a very confident vendor and, on the other, a failed test. It seems the testing procedure cannot be questioned. I was wondering, from the component vendor's point of view, how did they get there?
- Was it design related?
  - Was the failure root cause identified as repetitive – could reliability growth have been a solution?
  - From the text, it seems a previous design was updated: was the failure root cause related to the design/operating-conditions update?
  - Had any development testing been done prior to the reliability testing?
- Was it production related?
  - Was it a production-line-related root cause – had the units previously been exposed to ESS/HASS?
Fred Schenkelberg says
Hi Sorin,
Good thoughts and questions, all of which (and more) were considered and explored at the time. Part of the story was the use of the components in a slightly different environment, hence all the testing. The slight rise in temperature affected the solder joint metallization, causing embrittlement. Cheers,
Fred
John Evans says
Fred,
This is an interesting example, and it is quite stunning that any reasonable person would suggest the outcome occurred by pure chance; although pretty much nothing people say in defense of their products or designs surprises me anymore.
With regards to the hypergeometric distribution, I doubt I would have bothered with the slight extra effort of calculating it over the binomial distribution for this example. I pretty much consider 500,000 units to be an infinite, rather than finite, population. If the production lot were less than a few hundred, given the sample size, perhaps I would use the hypergeometric distribution. I would be very interested in knowing others' thoughts on what a reasonable finite population size would be as the cut-off for using the hypergeometric distribution in this case.
With regards to the PoF, what was the time frame from the application of the higher temperature to the testing? Solder embrittlement would take some time to occur. That said, I once experienced a similar problem with a CMOS camera chip. There was a very high infant failure rate in production from a product that was claimed to have very low defectivity. SEM, EDX, and other analyses revealed the problem. Field data clearly showed a decreasing TTF from time of deployment for newer and newer equipment. The issue turned out to be that the CMOS camera chips were all from the same production batch, which was acquired from the manufacturer as a last-time buy. They had been sitting in storage, and the TTF, when normalized for actual chip age, was highly consistent. The verification and validation report that we were able to obtain from the manufacturer (after much effort) revealed that the CMOS chips were actually fabricated a year or two before the batches used for verification/validation. Clearly the manufacturer had a problem earlier and couldn't pass verification with the older production lots. They did not, however, seem to have a problem selling thousands of these unverified devices as a last-time buy to a customer.
Fred Schenkelberg says
Hi John,
Thanks for the note. I've given much thought to when the assumption of an infinite population applies, yet that would be an interesting exercise. For the PoF I'll duck behind the shield of the non-disclosure agreement I'm under. Needless to say, many different paths exist for what would normally be considered a 'good' solution to go bad. Henry Petroski's book Design Paradigms talks about altering an existing design just a little, then a little more, and so on, until it leads to a design failure – it is not uncommon.
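As for the infinite-population question, here is a rough sketch of that exercise in Python (standard library only; the 10% defective fraction, the sample of 30, and the population sizes below are illustrative choices, not the numbers from this post), comparing the hypergeometric probability with its binomial approximation as the population shrinks:

```python
# Rough comparison: hypergeometric pmf vs. its binomial approximation
# as the population size N shrinks, with the defective fraction held fixed.
from math import comb

def hypergeom_pmf(x, N, n, m):
    """P(exactly x 'successes' in a sample of n drawn from N containing m)."""
    return comb(m, x) * comb(N - m, n - x) / comb(N, n)

def binom_pmf(x, n, p):
    """Binomial (infinite-population) approximation with success fraction p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, x, frac = 30, 3, 0.10          # sample size, failures of interest, defective fraction
for N in (500_000, 5_000, 1_000, 300, 100):
    m = int(N * frac)             # defectives in the population
    h = hypergeom_pmf(x, N, n, m)
    b = binom_pmf(x, n, frac)
    print(f"N={N:>7}  n/N={n/N:6.1%}  hypergeometric={h:.4f}  "
          f"binomial={b:.4f}  rel. diff={abs(b - h) / h:.1%}")
```

A common rule of thumb is that the binomial is adequate when the sample is less than about 5–10% of the population, which for a sample of 30 puts the cut-off in the neighborhood of a few hundred units – roughly where your instinct lands.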
cheers,
Fred
Askhat Turlybayev says
This means that the product was not used or tested under the originally specified product mission profile conditions, which in turn means that the results are invalid. In this situation, 28 failures out of 30 tested items is to be expected.
Fred Schenkelberg says
In some cases I would agree – here both buyer and vendor thought the product would work. There was a test based on use conditions that would reveal a suspected failure mechanism (agreed upon as the most likely to occur), and we thought it would be a very rare occurrence. The testing results were a complete surprise. I'm sorry that I cannot be more specific – pesky NDAs and such. And do recall that the point of the article was the calculation addressing the claim that the result was due to random chance – it was possible, just not very likely.
cheers,
Fred