This nonparametric test evaluates whether two continuous cumulative distributions are significantly different.

For example, if we assume two production lines making the same product create the same resulting dimensions, comparing a set of samples from each line may reveal whether that assumption holds.

The concept is to sort the data from the two distributions (in our case, lines) into a single ascending list, keeping with each data point an identifier for the line it came from.

Then we count the runs. A run is an unbroken sequence of values from the same line, so each change from one line to the other starts a new run.

If there are too few runs, the two samples most likely come from different distributions.

Small samples here means each source provides 10 or fewer samples. With more samples, we can use a normal approximation for the distribution of the run count, which we will discuss in a separate article.

## Example: Determine the number of runs

Let’s say we have two sets of 8 samples, one from each production line, and we measure the output voltage at a key test point on each product (of course, you should measure something important to the final product’s quality or reliability).

We have line A and line B with the following data:

| Line | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| A | 17.65 | 12.95 | 20.20 | 25.00 | 15.50 | 12.75 | 27.05 | 25.20 |
| B | 27.45 | 25.10 | 17.95 | 15.70 | 27.25 | 25.30 | 10.30 | 10.90 |

Let’s order the values and tag each value with its line letter so we can keep track of which value came from which line.

10.30 B | 10.90 B | 12.75 A | 12.95 A | 15.50 A | 15.70 B | 17.65 A | 17.95 B |

20.20 A | 25.00 A | 25.10 B | 25.20 A | 25.30 B | 27.05 A | 27.25 B | 27.45 B |

The runs are defined by which line the data came from. In this example, the first two (lowest) values are from line B, so those two values form one run.

Then there are three values from line A, 12.75, 12.95, and 15.50, which form another run. Continuing through the list this way gives a total of 11 runs.
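The sort-tag-count steps are easy to automate. Here is a minimal Python sketch, assuming no value appears in both samples (ties across the two lines make the ordering, and so the count, ambiguous); `count_runs`, `line_a`, and `line_b` are illustrative names, not from any library.

```python
def count_runs(sample_a, sample_b):
    """Count runs in the pooled, sorted data from two samples.

    Assumes no value appears in both samples; ties across the samples
    would make the ordering, and so the run count, ambiguous.
    """
    # Tag each value with its source line, then sort the pooled list by value.
    tagged = sorted([(x, "A") for x in sample_a] + [(x, "B") for x in sample_b])
    labels = [line for _, line in tagged]
    if not labels:
        return 0
    # The first value starts a run; every change of line starts another.
    return 1 + sum(1 for prev, cur in zip(labels, labels[1:]) if cur != prev)


line_a = [17.65, 12.95, 20.20, 25.00, 15.50, 12.75, 27.05, 25.20]
line_b = [27.45, 25.10, 17.95, 15.70, 27.25, 25.30, 10.30, 10.90]

print(count_runs(line_a, line_b))  # 11
```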

## How many runs is too few?

Now, let’s say the two lines really were very different and produced dramatically different results.

We may have all the line A values centered tightly around 20 and all the line B values centered around 50.

Ordering the values would place the 8 line A values first, followed by the 8 line B values, creating just two runs.

If the two lines were a little closer, such that they just overlapped by a couple of values, we may have 4 runs: A, B, A, B.

Yet, if that was 4 A’s, followed by 4 B’s, then 4 A’s and the remaining 4 B’s, that is pretty close to being an overlap of the two distributions.

Or is it?

The Wald Wolfowitz approach is to calculate the probability of each possible number of runs by counting the possible arrangements of the two samples (with binomial coefficients). We then tally these probabilities, starting from the fewest runs, until we reach a reasonable critical value that defines the threshold for making a decision.

## The Wald Wolfowitz 2 (small) Sample Run Test

The null hypothesis is that the two samples are from the same distribution, where $F(x)$ and $G(x)$ are the cumulative distribution functions of the two sources.

$$ \displaystyle\large {{H}_{o}}:F\left( x \right)=G\left( x \right)$$

The alternative hypothesis is that the two samples are not from the same distribution.

$$ \displaystyle\large {{H}_{1}}:F\left( x \right)\ne G\left( x \right)$$

The test statistic takes some work to determine. We need to calculate the probability of exactly 2 runs, then 3, 4, 5, and so on, continuing until we reach the observed number of runs or the critical value of interest.

The test statistic is then the sum of these probabilities, that is, the cumulative probability of observing that many runs or fewer. For an even number of runs use:

$$ \displaystyle\large P\left( R=2k \right)=\frac{2\left( \begin{array}{l}{{n}_{1}}-1\\k-1\end{array} \right)\left( \begin{array}{l}{{n}_{2}}-1\\k-1\end{array} \right)}{\left( \begin{array}{c}{{n}_{1}}+{{n}_{2}}\\{{n}_{1}}\end{array} \right)}$$

where R is the number of runs, here an even number equal to 2k with k a positive integer, and $n_{1}$ and $n_{2}$ are the numbers of samples from the two sources.

For an odd number of runs, R = 2k + 1, use:

$$ \displaystyle\large P\left( R=2k+1 \right)=\frac{\left( \begin{array}{c}{{n}_{1}}-1\\k\end{array} \right)\left( \begin{array}{c}{{n}_{2}}-1\\k-1\end{array} \right)+\left( \begin{array}{c}{{n}_{2}}-1\\k\end{array} \right)\left( \begin{array}{c}{{n}_{1}}-1\\k-1\end{array} \right)}{\left( \begin{array}{c}{{n}_{1}}+{{n}_{2}}\\{{n}_{1}}\end{array} \right)}$$
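The two formulas translate directly into code. A minimal sketch using Python’s standard `math.comb` and `fractions.Fraction` so the probabilities stay exact; `run_probability` is a hypothetical helper name, and it assumes at least one sample from each source.

```python
from fractions import Fraction
from math import comb


def run_probability(n1, n2, r):
    """Exact P(R = r) for the number of runs when both samples share one distribution.

    n1, n2: number of samples from each source (both >= 1); r: number of runs.
    """
    if r < 2:
        return Fraction(0)
    a, b = n1 - 1, n2 - 1
    k, odd = divmod(r, 2)
    if odd:
        # Odd number of runs, R = 2k + 1
        ways = comb(a, k) * comb(b, k - 1) + comb(b, k) * comb(a, k - 1)
    else:
        # Even number of runs, R = 2k
        ways = 2 * comb(a, k - 1) * comb(b, k - 1)
    # Divide by the number of equally likely orderings, C(n1 + n2, n1)
    return Fraction(ways, comb(n1 + n2, n1))


for r in range(2, 7):
    print(r, round(float(run_probability(8, 8, r)), 5))
```

With 8 samples from each source this prints 0.00016, 0.00109, 0.00761, 0.02284, and 0.06853 for R = 2 through 6. Because `math.comb` returns 0 when k exceeds n, run counts that are impossible for unequal sample sizes come out as probability 0, and the probabilities over all R sum to exactly 1.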

That is a lot of calculating when there are many samples, so we use a normal approximation when there are more than 10 samples from each source. Here, though, we have only 8 samples from each source, so we need to calculate the probabilities directly.

In this case, with $n_{1}$ and $n_{2}$ both equal to 8, the probability of just two runs, R = 2 and therefore k = 1, is:

$$ \displaystyle\large P\left( R=2 \right)=\frac{2\left( \begin{array}{c}8-1\\1-1\end{array} \right)\left( \begin{array}{c}8-1\\1-1\end{array} \right)}{\left( \begin{array}{c}8+8\\8\end{array} \right)}=\frac{2}{12,870}=.00016$$

And for R = 3, so again k = 1, the calculation is:

$$ \displaystyle\large P\left( R=3 \right)=\frac{\left( \begin{array}{c}8-1\\1\end{array} \right)\left( \begin{array}{c}8-1\\1-1\end{array} \right)+\left( \begin{array}{c}8-1\\1\end{array} \right)\left( \begin{array}{c}8-1\\1-1\end{array} \right)}{\left( \begin{array}{c}8+8\\8\end{array} \right)}=\frac{14}{12,870}=.00109$$

For R = 4, k = 2, P(R = 4) = 0.00761

For R = 5, k = 2, P(R = 5) = 0.02284

And, for R = 6, k = 3, P(R = 6) = 0.06853.

Let’s tally these up to find the cumulative probability of observing a given number of runs or fewer.

$$ \displaystyle\large P\left( R\le 3 \right)=0.00016+0.00109=.00125$$

And,

$$ \displaystyle\large P\left( R\le 4 \right)=0.00016+0.00109+0.00761=.00886$$

And,

$$ \displaystyle\large P\left( R\le 5 \right)=0.00016+0.00109+0.00761+0.02284=.03170$$

And,

$$ \displaystyle\large P\left( R\le 6 \right)=0.00016+0.00109+0.00761+0.02284+0.06853=.10023$$
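Under the null hypothesis every ordering of the 8 A labels and 8 B labels is equally likely, so these cumulative probabilities can also be checked by brute force: enumerate all 12,870 orderings and count how many have R or fewer runs. A minimal sketch with `itertools.combinations`; the helper names are illustrative. Exact enumeration gives P(R ≤ 3) = 16/12,870 ≈ 0.00124, a hair below the 0.00125 obtained by summing rounded terms, while P(R ≤ 5) = 408/12,870 ≈ 0.0317 and P(R ≤ 6) = 1,290/12,870 ≈ 0.1002.

```python
from itertools import combinations


def count_label_runs(labels):
    """Number of runs in a sequence of labels (a new run at each change)."""
    return 1 + sum(1 for prev, cur in zip(labels, labels[1:]) if cur != prev)


def cumulative_by_enumeration(n1, n2, r_max):
    """P(R <= r_max) by checking every equally likely ordering of the labels."""
    n = n1 + n2
    hits = total = 0
    # Each choice of positions for the n1 'A' labels is one distinct ordering.
    for a_positions in combinations(range(n), n1):
        chosen = set(a_positions)
        labels = ["A" if i in chosen else "B" for i in range(n)]
        total += 1
        if count_label_runs(labels) <= r_max:
            hits += 1
    return hits / total


for r in range(3, 7):
    print(r, round(cumulative_by_enumeration(8, 8, r), 5))
```

Enumeration grows quickly with sample size, so it is only a cross-check of the formulas for small samples like these.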

We can now compare these cumulative probabilities with a critical value. Each one is the probability, if the null hypothesis is true, of observing that number of runs or fewer.

For example, if we are willing to take a relatively small risk, say a 5% risk (95% confidence), of concluding that the two distributions are different when they are actually the same, we select 0.05 as the critical value.

With 8 samples from each source, the test statistic is 0.0317 for 5 or fewer runs and 0.1002 for 6 or fewer runs. Only the first falls below 0.05.

Thus, if we have 5 or fewer runs, we have 95% confidence that the two sources, in this case the two production lines, are different.

In this case, we have 11 runs, which is more than the 5 or fewer associated with the critical value, so we cannot conclude there is sufficient evidence that the two lines are different.
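Putting the pieces together, here is a hedged sketch of the whole small-sample test on the example data: sort and tag, count runs, compute P(R ≤ observed runs) from the formulas, and compare it with the 0.05 risk level. It assumes no ties across the two samples; `runs_test` and the other names are hypothetical.

```python
from math import comb


def count_runs(sample_a, sample_b):
    """Runs in the pooled, sorted data (assumes no ties across samples)."""
    tagged = sorted([(x, "A") for x in sample_a] + [(x, "B") for x in sample_b])
    labels = [line for _, line in tagged]
    return 1 + sum(1 for prev, cur in zip(labels, labels[1:]) if cur != prev)


def run_probability(n1, n2, r):
    """P(R = r) under the null hypothesis, for r >= 2."""
    a, b = n1 - 1, n2 - 1
    k, odd = divmod(r, 2)
    if odd:  # R = 2k + 1
        ways = comb(a, k) * comb(b, k - 1) + comb(b, k) * comb(a, k - 1)
    else:  # R = 2k
        ways = 2 * comb(a, k - 1) * comb(b, k - 1)
    return ways / comb(n1 + n2, n1)


def runs_test(sample_a, sample_b, alpha=0.05):
    """Return (runs, p_value, reject) where p_value = P(R <= observed runs).

    Too few runs is the evidence that the two distributions differ,
    so only the lower tail is used.
    """
    n1, n2 = len(sample_a), len(sample_b)
    runs = count_runs(sample_a, sample_b)
    p_value = sum(run_probability(n1, n2, r) for r in range(2, runs + 1))
    return runs, p_value, p_value <= alpha


line_a = [17.65, 12.95, 20.20, 25.00, 15.50, 12.75, 27.05, 25.20]
line_b = [27.45, 25.10, 17.95, 15.70, 27.25, 25.30, 10.30, 10.90]

runs, p_value, reject = runs_test(line_a, line_b)
print(runs, round(p_value, 4), reject)  # 11 0.8998 False
```

For the production-line data this gives 11 runs and P(R ≤ 11) = 11,580/12,870 ≈ 0.90, far above 0.05, so the null hypothesis is not rejected.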

## Tables to make this quicker

This approach requires quite a bit of calculation to determine the test statistic.

Yet the probabilities depend only on the sample sizes and the count of runs, not on the actual values measured.

Thus, we can calculate a table for various sample sizes and specific confidence levels or risk thresholds.

Of course, this has been done already. Here is one example with a critical value of 0.05 (95% one-sided confidence):

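| $n_{1}$ | $n_{2}$ | Critical R |
|---|---|---|
| 10 | 10 | 6 |
| 10 | 9 | 6 |
| 10 | 8 | 6 |
| 10 | 7 | 5 |
| 10 | 6 | 5 |
| 10 | 5 | 4 |
| 10 | 4 | 3 |
| 10 | 3 | 3 |
| 10 | 2 | 2 |

Each critical R is the largest number of runs whose cumulative probability, calculated with the formulas above, is at most 0.05.

Thus, if we have 10 samples from one source and 8 from the other, and the number of runs is 6 or fewer, we reject the null hypothesis that the two sources create the same results (in other words, we conclude they are not the same).

Where a table below shows no critical R value, as for 7 and 2 samples, even the minimum possible count of 2 runs is too likely under the null hypothesis (2/36, or about 0.056, which is above 0.05). No count of runs then lets us conclude with 95% one-sided confidence that the two sources are different.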

| $n_{1}$ | $n_{2}$ | Critical R |
|---|---|---|
| 9 | 9 | 6 |
| 9 | 8 | 5 |
| 9 | 7 | 5 |
| 9 | 6 | 4 |
| 9 | 5 | 4 |
| 9 | 4 | 3 |
| 9 | 3 | 2 |
| 9 | 2 | 2 |

| $n_{1}$ | $n_{2}$ | Critical R |
|---|---|---|
| 8 | 8 | 5 |
| 8 | 7 | 4 |
| 8 | 6 | 4 |
| 8 | 5 | 3 |
| 8 | 4 | 3 |
| 8 | 3 | 2 |
| 8 | 2 | 2 |

| $n_{1}$ | $n_{2}$ | Critical R |
|---|---|---|
| 7 | 7 | 4 |
| 7 | 6 | 4 |
| 7 | 5 | 3 |
| 7 | 4 | 3 |
| 7 | 3 | 2 |
| 7 | 2 | none |

| $n_{1}$ | $n_{2}$ | Critical R |
|---|---|---|
| 6 | 6 | 3 |
| 6 | 5 | 3 |
| 6 | 4 | 3 |
| 6 | 3 | 2 |
| 6 | 2 | none |

| $n_{1}$ | $n_{2}$ | Critical R |
|---|---|---|
| 5 | 5 | 3 |
| 5 | 4 | 2 |
| 5 | 3 | 2 |
| 5 | 2 | none |

With fewer than 8 total samples we are not able to make a determination using this method.
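The critical values can be recalculated for any pair of sample sizes, or any risk level, by accumulating the exact probabilities until they exceed the chosen risk. This is a minimal sketch under the same convention as the worked example (the critical R is the largest count of runs whose cumulative probability is at most α); `cumulative_runs` and `critical_runs` are illustrative names, not from any library. It also shows why fewer than 8 total samples never works at α = 0.05: even R = 2 has probability 2 divided by C($n_{1}$ + $n_{2}$, $n_{1}$), and that coefficient is at most 35 when $n_{1}$ + $n_{2}$ ≤ 7, so the smallest possible probability is 2/35 ≈ 0.057.

```python
from math import comb


def cumulative_runs(n1, n2, r):
    """Exact P(R <= r) for the number of runs under the null hypothesis."""
    a, b = n1 - 1, n2 - 1
    ways = 0
    for runs in range(2, r + 1):
        k, odd = divmod(runs, 2)
        if odd:  # R = 2k + 1
            ways += comb(a, k) * comb(b, k - 1) + comb(b, k) * comb(a, k - 1)
        else:  # R = 2k
            ways += 2 * comb(a, k - 1) * comb(b, k - 1)
    return ways / comb(n1 + n2, n1)


def critical_runs(n1, n2, alpha=0.05):
    """Largest run count r with P(R <= r) <= alpha, or None if none qualifies."""
    critical = None
    r = 2
    while r <= n1 + n2 and cumulative_runs(n1, n2, r) <= alpha:
        critical = r
        r += 1
    return critical


# Rebuild the small-sample table, largest n1 first.
for n1 in range(10, 4, -1):
    for n2 in range(n1, 1, -1):
        print(n1, n2, critical_runs(n1, n2))
```

Changing `alpha` (for example to 0.01) produces the corresponding table for a different risk threshold.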


Vivek Namboodiripad says

Hi Fred, Thanks for sharing. I am totally new to this test. Is this related to the Run test for randomness which is also mentioned in a later post? I find the approach same. I am yet to go through it fully.

Fred Schenkelberg says

Yes Vivek, it is very similar to the Run test described in another post. The way the data is arranged to find the runs is a little different, yet the same concept is at the heart of it. cheers, Fred

pradnya says

this eg u gave contains different observations in both samples.

please explain if there are same observations in both the samples repeated then how to calculate total no of runs?

Fred Schenkelberg says

Hi Pradnya, the two values would be listed next to each other in the ordering: one from group A and the next, with the same value, from group B (of course, you could list B then A). The run count is based on which group each reading is from, so the count may change if the listing is reversed. The process is not exact, yet it will get you within one run count.

For example, let’s say the lowest two readings are both 10.1, one from each group, and the next value in the sort is 10.5 from group B. One way to order these is A, B, B, resulting in two runs. If we reverse the listing of the first two, like this, B, A, B, then the run count is three. Like I said, not exact.

I would count the runs with both ordering of groups and if it doesn’t change the result, not a big deal. If the change of one run in the count changes the result, then select the more conservative result for your given situation.

Cheers,

Fred

Ajeet kumar says

why we take critical value in this test one sided.

Fred Schenkelberg says

Hi Ajeet, doing a quick review of the test procedure I’m not sure it could be set up for a one-sided test. The non-parametric method is really only looking at differences and able to determine if two groups are different enough to reveal the difference in the count of reversals. If anyone has better information that would be great. cheers, Fred