It seems that anytime we draw a sample, it should be taken randomly. Statistics books and papers regularly advise using a random sample. The validity of the conclusions drawn from an experiment may hinge on how randomly the samples were selected.
As a regular exercise, I ask students to pick a number between zero and 10. The most commonly selected digit is 7, followed by 3. Not a random outcome at all. Philip J. Boland and Kevin Hutchinson asked various groups of students to create a random sequence of digits and found the same preference for 7 (Student Selection of Random Digits, Journal of the Royal Statistical Society, Series D (The Statistician), Vol. 49, No. 4 (2000), pp. 519-529).
So, if you find yourself wondering about the randomness of a sequence, there are various ways to test whether it is random. There are frequency, serial, poker, and gap tests, among others. One simple test is to count the number of runs above and below the average and compare that count to the expected number of runs for the given sample size.
Runs defined
The data can be a set of Bernoulli trials (pass/fail, heads/tails, etc.) or a set of numbers. In each case, we generally arrange the data in the order collected. A run is an unbroken stretch of like outcomes. Thus a set of 10 coin flips may appear as:
H H T H H T T T H T
The first run is the first two H’s, then there is a run of one, the T, and so on. In all, this group of data contains 6 runs.
For a set of numbers, the same idea applies: one run ends and the next one starts when the next data point falls on the other side of the average value for the dataset. Let’s look at a set of ten digits.
3 8 7 4 1 2 9 9 5 6
The average is 5.4, therefore the first run is the first digit, 3, as the next digit, 8, is above 5.4. 8 and 7 make up the next run as they remain above the average, then a new run starts with the 4. In all there are 6 runs.
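The counting rule described above can be sketched in a few lines of Python. The function names are made up for illustration, and the handling of values exactly equal to the average (they are skipped here) is an assumption, since the article does not address ties.

```python
from itertools import groupby
from statistics import mean


def count_runs(labels):
    """Count runs in a sequence of categorical outcomes (e.g. 'H'/'T')."""
    # groupby collapses each stretch of consecutive equal items into one group
    return sum(1 for _ in groupby(labels))


def runs_about_mean(values):
    """Return (runs, n_above, n_below) for numeric data split at the mean."""
    avg = mean(values)
    # Assumption: values exactly equal to the mean are skipped
    sides = [v > avg for v in values if v != avg]
    n_above = sum(sides)
    return count_runs(sides), n_above, len(sides) - n_above


digits = [3, 8, 7, 4, 1, 2, 9, 9, 5, 6]
print(runs_about_mean(digits))  # (6, 5, 5)
```

For coin flips, count_runs can be applied directly to the list of H and T labels in the order collected.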
Are the data sets from random processes?
Fortunately, Frieda S. Swed and C. Eisenhart published Tables for Testing Randomness of Grouping in a Sequence of Alternatives in The Annals of Mathematical Statistics, Vol. 14, No. 1 (March 1943), pp. 66-87. The paper includes tables that give the probability of observing a given number of runs for data with given characteristics (the counts of values above and below the average).
In both cases above, the data contain five values above the average (or heads) and five values below the average (or tails), and we observe 6 runs, so there is little evidence that the sequence of values is not random. If we had seen either 2 or 10 runs, each of which has a less than 5% chance of occurring, we might suspect the data collected are not random.
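For small samples these probabilities can also be computed directly. The sketch below uses the standard combinatorial count of arrangements with exactly r runs, assuming every ordering of the n1 and n2 items is equally likely. The function name runs_pmf is hypothetical, and the output has not been compared entry by entry against the published tables.

```python
from math import comb


def runs_pmf(r, n1, n2):
    """Probability of exactly r runs among all orderings of n1 and n2 items."""
    if r < 2 or n1 < 1 or n2 < 1:
        return 0.0
    total = comb(n1 + n2, n1)  # number of equally likely arrangements
    k, odd = divmod(r, 2)
    if not odd:
        # r = 2k: each kind of item is split into k runs; either kind may start
        ways = 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1)
    else:
        # r = 2k + 1: one kind forms k + 1 runs, the other forms k runs
        ways = (comb(n1 - 1, k) * comb(n2 - 1, k - 1)
                + comb(n1 - 1, k - 1) * comb(n2 - 1, k))
    return ways / total


for r in (2, 6, 10):
    print(r, round(runs_pmf(r, 5, 5), 4))
# 2 0.0079
# 6 0.2857
# 10 0.0079
```

With five values on each side, 2 runs and 10 runs each occur less than 1% of the time, while 6 runs is the single most likely count.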
Basic table
Using a significance level of α = 0.05 and with n1 being the count of points above the average and n2 being the count of points below the average, we can establish limits for a range of cases.
n1 + n2 Plotted Points | Smallest Run Limit | Average # of Runs | Largest Run Limit |
8 | 1 | 5 | 9 (not possible) |
10 | 2 | 6 | 10 |
12 | 3 | 7 | 11 |
14 | 3 | 8 | 13 |
16 | 4 | 9 | 14 |
18 | 5 | 10 | 15 |
20 | 6 | 11 | 16 |
22 | 7 | 12 | 17 |
24 | 7 | 13 | 19 |
26 | 8 | 14 | 20 |
28 | 9 | 15 | 21 |
30 | 10 | 16 | 22 |
34 | 11 | 18 | 25 |
40 | 14 | 21 | 28 |
50 | 20 | 26 | 32 |
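The Average # of Runs column can be checked against the usual formula for the expected number of runs, 2·n1·n2/(n1 + n2) + 1. The sketch below assumes the table was built for the balanced case n1 = n2, which is an inference from the listed averages and not something the table states. The function name is illustrative.

```python
def expected_runs(n1, n2):
    """Expected number of runs for n1 items of one kind and n2 of the other."""
    return 2 * n1 * n2 / (n1 + n2) + 1


# Average # of Runs column, keyed by n1 + n2, copied from the table above
table_average = {8: 5, 10: 6, 12: 7, 14: 8, 16: 9, 18: 10, 20: 11, 22: 12,
                 24: 13, 26: 14, 28: 15, 30: 16, 34: 18, 40: 21, 50: 26}

for n, listed in table_average.items():
    half = n // 2  # assumption: equal counts above and below the average
    print(n, expected_runs(half, half), listed)
```

When n1 and n2 differ, the expected count drops below n/2 + 1, so the listed averages apply only to evenly split data.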
Thus, with five heads and five tails, n1 + n2 equals 10 and we observe 6 runs. Since 6 runs is the expected value and falls between the limits of 2 and 10, the sequence is probably random. Or we could say there isn’t anything funny about the data to indicate it is not a random sequence of values.
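The table lookup can be written as a small helper. This is a sketch with hypothetical names. It treats a run count at or beyond either limit as suspicious, which follows the statement above that seeing 2 or 10 runs in ten points would raise doubts. It handles only the sample sizes listed in the table.

```python
# (smallest run limit, largest run limit) keyed by n1 + n2, copied from the table
RUN_LIMITS = {
    8: (1, 9), 10: (2, 10), 12: (3, 11), 14: (3, 13), 16: (4, 14),
    18: (5, 15), 20: (6, 16), 22: (7, 17), 24: (7, 19), 26: (8, 20),
    28: (9, 21), 30: (10, 22), 34: (11, 25), 40: (14, 28), 50: (20, 32),
}


def suspect_nonrandom(n_points, runs):
    """True when the run count sits at or beyond a table limit."""
    low, high = RUN_LIMITS[n_points]  # raises KeyError for sizes not tabulated
    # Assumption: the limits themselves count as suspicious
    return runs <= low or runs >= high


print(suspect_nonrandom(10, 6))  # False
print(suspect_nonrandom(10, 2))  # True
```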