Introduction
When analyzing a continuous variable or type of measurement using statistics, an analyst often assumes the data is normally distributed. But how can this assumption be verified? While there are numerical normality tests, an alternative approach is to use graphical methods. The old adage, “A picture is worth a thousand words,” captures the idea that the human mind is good at discerning patterns.
If data plotted on normal probability paper shows a linear trend, the data is said to follow a normal distribution. The analysis can then proceed using the normal distribution, and a wealth of statistical methods becomes available.
Probability paper is available for many distributions, including the normal, lognormal, Weibull, and exponential. Each probability paper design is specific to one distribution type. This article focuses on normal paper. Other probability papers and plots will be covered in future articles.
Normal Probability Paper
Consider this normal probability paper from Weibull.com, figure 1.
Figure 1
All normal papers have a linear x-axis. In figure 1, the x-axis is divided into 10 major intervals, each divided into 5 sub-intervals. Normal paper designs are available with a different number of intervals. The x-axis range should span the data being analyzed. The number of intervals and the tick mark labels should be selected to create an easily read scale.
The y-axis is a non-linear cumulative probability scale that is unique to the normal distribution. In figure 1, it spans the range from 0.1% to 99.9%. Other normal papers may span a different range; 1% to 99% is commonly found. If computer software is used, the y-axis range can be any values greater than 0% and less than 100%. Sometimes the y-axis is marked with decimal values rather than percent.
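To see why the probability scale is non-linear, the position of each tick mark can be computed from the inverse of the standard normal cumulative distribution. The following is a minimal sketch using Python's standard library; the particular set of tick labels is an illustrative choice, not taken from the paper in figure 1.

```python
from statistics import NormalDist

# Tick labels in percent. This selection is illustrative only.
tick_percents = [0.1, 1, 10, 50, 90, 99, 99.9]

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Height of each tick on a linear ruler, measured in standard
# deviations above or below the 50% grid line.
tick_positions = [std_normal.inv_cdf(p / 100) for p in tick_percents]

for pct, z in zip(tick_percents, tick_positions):
    print(f"{pct:>5}%  ->  height {z:+.3f}")
```

The ticks are symmetric about the 50% line, and equal steps in percent are stretched more and more toward the tails.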
Cumulative Normal Probability
A statistical notation for describing a normal distribution is $-N(\mu,\sigma^2)-$. Here, the leading character N indicates the normal distribution. L would be used for a lognormal distribution; W, a Weibull; and E, an exponential. Distribution parameters are provided within the parentheses. For the normal, the parameters are the mean $-\mu-$ and the variance $-\sigma^2-$. Recognize that these parameters are generally unknown.
Consider the two normal distributions, $-N(10,4)-$ and $-N(5,1)-$, which have standard deviations $-\sigma=2-$ and $-\sigma=1-$. The cumulative normal for each is plotted in figure 2, and both display as straight lines. The $-\mu-$’s control where each distribution line crosses the 50% grid line. The $-\sigma-$’s control the slope of each line. All cumulative normal distributions display as a straight line when plotted on normal paper.
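The straight-line behavior can be checked numerically. The sketch below maps the cumulative probability of each distribution onto the normal paper's vertical scale, measured in standard normal units. The function names `paper_height` and `slope_between` are hypothetical helpers introduced only for this illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()

def paper_height(x, mu, sigma):
    """Vertical position on normal paper of the cumulative probability at x."""
    p = NormalDist(mu, sigma).cdf(x)
    return std_normal.inv_cdf(p)

def slope_between(mu, sigma, x1, x2):
    """Slope of the plotted cumulative normal between x1 and x2."""
    return (paper_height(x2, mu, sigma) - paper_height(x1, mu, sigma)) / (x2 - x1)

# N(10, 4) has sigma = 2; N(5, 1) has sigma = 1.
for mu, sigma in [(10, 2), (5, 1)]:
    left = slope_between(mu, sigma, mu - 2 * sigma, mu)
    right = slope_between(mu, sigma, mu, mu + 2 * sigma)
    crossing = paper_height(mu, mu, sigma)  # 0 corresponds to the 50% line
    print(f"mu={mu}, sigma={sigma}: slopes {left:.4f} and {right:.4f}, "
          f"height at mu = {crossing:.4f}")
```

For each distribution the slope is the same on both sides of the mean and equals $-1/\sigma-$, and the line crosses the 50% grid line at $-x=\mu-$.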
Figure 2
Sample Statistics
From the data, the sample mean is calculated with equation 1 and sample standard deviation is calculated with equation 2.
$$\bar{x}=\frac{1}{N}\sum_{i=1}^{i=N}x_i$$
(1)
$$s=\sqrt{\frac{1}{N-1}\sum_{i=1}^{i=N}(x_i-\bar{x})^2}$$
(2)
Because of random sampling variation, $-\bar{x}-$ is only an estimate of $-\mu-$ and will generally differ from it. Similarly, $-s-$ is an estimate of $-\sigma-$.
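Equations 1 and 2 translate directly into code. The following sketch applies them to the five-value sample that this article analyzes in the Plot the Data section. The function names are chosen here for illustration.

```python
from math import sqrt

def sample_mean(values):
    """Equation 1: arithmetic mean of the sample."""
    return sum(values) / len(values)

def sample_std(values):
    """Equation 2: sample standard deviation with the N - 1 divisor."""
    n = len(values)
    x_bar = sample_mean(values)
    return sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))

# Sample data from this article.
data = [12.0694, 11.4538, 9.3931, 10.5877, 8.4254]

x_bar = sample_mean(data)
s = sample_std(data)
print(f"x_bar = {x_bar:.3f}, s = {s:.3f}")
```

Python's standard library offers the same calculations as `statistics.mean` and `statistics.stdev`.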
Plot the Data
Consider any data set with sample size N. To plot the data:
1. Sort the data from lowest to highest values, creating a revised ordering.
2. Assign an order number from 1 to N to each value.
3. For each order number, estimate the median rank, which is an estimate of the median cumulative probability for each data value.
4. Plot each data value against its cumulative probability on the normal probability paper.
Step 3 requires a formula to calculate the median rank. If the data is complete, that is, it has no missing or incomplete values, then Benard’s approximation formula may be used, equation 3.
$$MR(i)=\frac{i-0.3}{N+0.4}$$
(3)
Let’s consider the sample data 12.0694, 11.4538, 9.3931, 10.5877, and 8.4254. We want to determine if it is normally distributed. The process steps are followed, yielding the results in table 1.
x | Order Number (i) | MR(i) |
8.4254 | 1 | 0.1296 |
9.3931 | 2 | 0.3148 |
10.5877 | 3 | 0.5000 |
11.4538 | 4 | 0.6852 |
12.0694 | 5 | 0.8704 |
Table 1
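The first three steps can be sketched in a few lines of code. This example uses the median rank approximation with the denominator N + 0.4, which is the form that reproduces the values in table 1. The function name `benard_median_rank` is a label chosen for this illustration.

```python
def benard_median_rank(i, n):
    """Approximate median rank for order number i in a complete sample of size n."""
    return (i - 0.3) / (n + 0.4)

# Sample data from this article.
data = [12.0694, 11.4538, 9.3931, 10.5877, 8.4254]

ordered = sorted(data)  # step 1: sort from lowest to highest
n = len(ordered)

# Steps 2 and 3: assign order numbers 1..N and compute the median ranks.
table = [(x, i, benard_median_rank(i, n)) for i, x in enumerate(ordered, start=1)]

for x, i, mr in table:
    print(f"{x:8.4f} | {i} | {mr:.4f}")
```

The middle value of an odd-sized sample always receives a median rank of exactly 0.5.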
Then, Minitab was used to create the probability plot, figure 3.
Figure 3
Minitab’s normal probability plot includes a graph and some basic statistics in the upper right. The middle blue line in the graph represents the best fit normal distribution. The 95% confidence bounds bracket the best fit line. The red dots are the data and show a linear trend about the best fit line. Using equations 1 and 2, $-\bar{x}=10.386-$ and $-s=1.487-$. Visually, the best fit line crosses the 50% grid line close to 10.4. The data was drawn from the normal population $-N(10,1)-$. The sample estimates differ from the population parameters because of sampling variation.
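A line can also be fitted to the plotted points by hand. The sketch below converts each median rank to its height on the normal paper and fits a least-squares line of the data values against those heights. This is an illustrative assumption only: Minitab's fitting method is not described in this article, and the sketch does not attempt to reproduce it.

```python
from statistics import NormalDist, mean

# Sample data from this article, sorted from lowest to highest.
data = sorted([12.0694, 11.4538, 9.3931, 10.5877, 8.4254])
n = len(data)

# Median ranks, using the N + 0.4 denominator that reproduces table 1.
median_ranks = [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]

# Height of each point on the probability scale, in standard normal units.
z = [NormalDist().inv_cdf(p) for p in median_ranks]

# Least-squares line x = intercept + slope * z.
z_bar, x_bar = mean(z), mean(data)
slope = (sum((zi - z_bar) * (xi - x_bar) for zi, xi in zip(z, data))
         / sum((zi - z_bar) ** 2 for zi in z))
intercept = x_bar - slope * z_bar

print(f"line crosses 50% at x = {intercept:.3f}")  # graphical estimate of the mean
print(f"x changes by {slope:.3f} per standard normal unit")  # graphical estimate of sigma
```

Because the median ranks are symmetric about 0.5, the fitted line crosses the 50% grid line at the sample mean. The slope-based estimate of the standard deviation will generally differ somewhat from the value given by equation 2.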
Normally Distributed Variables
Consider two larger data sets with 100 values each. They are drawn from populations that are $-N(10,4)-$ and $-N(5,1)-$. A Minitab normal probability plot is shown in Figure 4.
Figure 4
The Minitab graphic includes a probability plot, with key labeling and statistics in the upper right.
The plot shows the two samples are normally distributed. The $-N(10,4)-$ data, marked with black circles, follows the black solid best fit line. Similarly, the $-N(5,1)-$ data, marked with red squares, follows the red dashed best fit line. So different normally distributed data follow their respective best fit lines.
Again, the lines bracketing each best fit line are 95% confidence intervals on the predicted percentiles. The data are contained within the confidence intervals, apart from some outliers in the upper tails.
The best fit line for each data set intersects the 50% cumulative probability close to the population mean. The tabular statistics in the plot are calculated from the data and show the same. The observed pattern is that different normally distributed variables cross the 50% grid line at different x values, each close to its population mean. Also, the slope changes as the standard deviation changes. Higher standard deviations produce patterns with shallower slopes.
The main purpose of the probability plot is to assess normality. These two examples of normally distributed data produced patterns where the data followed a linear trend on normal probability paper.
Non-Normal Variables
What happens if non-normal data is plotted? Consider data values from a population that is lognormally distributed with parameters $-\mu=2.15-$ and $-\sigma=1-$, i.e., $-L(2.15,1)-$. When plotted on normal probability paper, figure 5, the data does not follow a straight line.
Figure 5
How well do the fitted normal probabilities agree with the data? The best fit line indicates the probability of having data below 0 is about 30%, or about 30 of 100 samples. However, none of the data is less than 0. From the graph, it is apparent that something forbids values below 0. At high values, there is also less data than predicted. For example, the fitted probability of data less than 50 is about 82%, so 18% should exceed 50. With 100 samples, we expect about 18 values above 50, but only 5 values exceed 50. The normal distribution predicts more data in both tails of the distribution than is seen in the data. One concludes the data is not normally distributed.
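The tail counts can be compared with what the lognormal population itself predicts. The sketch below assumes that the parameters of $-L(2.15,1)-$ are the mean and standard deviation of the natural logarithm of the data, which is the usual convention but is not stated explicitly in this article. The 30% and 18% figures are the readings from the best fit normal line quoted above.

```python
from math import log
from statistics import NormalDist

n_samples = 100

# Readings from the best fit normal line, as quoted in the text.
normal_below_0 = 0.30
normal_above_50 = 0.18

# Assumption: ln(X) is normal with mean 2.15 and standard deviation 1.
mu_log, sigma_log = 2.15, 1.0
lognormal_below_0 = 0.0  # a lognormal variable cannot be negative
lognormal_above_50 = 1 - NormalDist(mu_log, sigma_log).cdf(log(50))

expected_normal_above_50 = n_samples * normal_above_50
expected_lognormal_above_50 = n_samples * lognormal_above_50

print(f"expected below 0:  normal fit {n_samples * normal_below_0:.0f}, "
      f"lognormal {n_samples * lognormal_below_0:.0f}")
print(f"expected above 50: normal fit {expected_normal_above_50:.0f}, "
      f"lognormal {expected_lognormal_above_50:.1f}")
```

Under this assumption the lognormal population predicts roughly 4 values in 100 above 50 and none below 0, which is much closer to the observed counts than the normal fit.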
Conclusion
This brief introduction to probability plots considers the use of normal probability paper to determine if data is normally distributed. If normally distributed data is plotted on normal probability paper, it will form a straight line with random variation about a best fit line. If data is not normally distributed, the plotted points will deviate from a straight line.
These normal probability plots were created with computer software, which makes it easy to perform a normal distribution analysis of data.
In future articles, other probability distribution papers and plots, censored data, and adding confidence limits to the best fit line will be discussed.
Dennis Craggs, Consultant
Quality, Reliability and Analytics Services
(810)964-1529