In another post, I started the discussion about variability and interquartile range. This is part 2 of that discussion and will focus on variance.
With rare exceptions, most distributions or groups of data require more than one parameter (or statistic) to fully describe both the location and the spread (scale and shape) of the data. Is the data clumped tightly about some value, or spread out over a wide range?
Let’s consider a sample of five height measurements taken from five men in a company.
The readings in inches are: 63, 67, 61, 66, 68
The sample mean is
$$ \large\displaystyle \bar{y}=\frac{\sum\nolimits_{i}{{{y}_{i}}}}{n}=\frac{325}{5}=65$$
One way to describe the variation is to consider the differences between the individual values and the mean. The differences in inches are: -2, 2, -4, 1, 3. We could average these differences to get the average difference from the mean, yet that will always turn out to be zero. Not very informative.
Another thought would be to use the absolute values of the differences, although the resulting value is difficult to interpret. So, let’s consider another common way to remove the sign of a value: square it. Thus the list of squared differences becomes 4, 4, 16, 1, 9.
The average of this list has the property of being equal to or greater than zero, yet it is not in the same units as the data. In this case, it is inches squared, and we are not talking about an area with reference to the heights of people. We call this mean of the squared deviations the variance, and it has useful properties for statistical theory that are well beyond the scope of this entry.

We can take the square root of the variance to find the standard deviation, which is back in the original units of the data. More on that in a moment.
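Written out for the five height readings, the average of the squared deviations is:

$$ \large\displaystyle \frac{4+4+16+1+9}{5}=\frac{34}{5}=6.8\ \text{in}^{2}$$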
Population Variance, σ²
The variance of a set of n measurements y₁, y₂, …, yₙ which include all items in the population with mean ȳ is the sum of the squared deviations divided by n.
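In symbols, using the same notation as the sample mean above:

$$ \large\displaystyle {{\sigma }^{2}}=\frac{\sum\nolimits_{i}{{{\left( {{y}_{i}}-\bar{y} \right)}^{2}}}}{n}$$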
Sample Variance, s²
The variance of a set of n measurements y₁, y₂, …, yₙ with mean ȳ is the sum of the squared deviations divided by n-1. [Ott, page 79]
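Applied to the height readings, which we treat as a sample:

$$ \large\displaystyle {{s}^{2}}=\frac{\sum\nolimits_{i}{{{\left( {{y}_{i}}-\bar{y} \right)}^{2}}}}{n-1}=\frac{34}{4}=8.5\ \text{in}^{2}$$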
It is the square root of the variance that is of most interest and easiest to interpret for most statistical discussions. We call this value the standard deviation.
Population Standard Deviation, σ
The standard deviation of a set of n measurements is defined to be the positive square root of the variance. [Ott, page 84]
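In symbols:

$$ \large\displaystyle \sigma =\sqrt{{{\sigma }^{2}}}=\sqrt{\frac{\sum\nolimits_{i}{{{\left( {{y}_{i}}-\bar{y} \right)}^{2}}}}{n}}$$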
Sample Standard Deviation, s
The sample standard deviation of a set of n measurements y₁, y₂, …, yₙ with mean ȳ is the positive square root of the sum of the squared deviations divided by n-1.
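Continuing the height example:

$$ \large\displaystyle s=\sqrt{{{s}^{2}}}=\sqrt{8.5}\approx 2.92\ \text{inches}$$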
For groups of data that are roughly bell-shaped (normal distribution), the standard deviation provides the following information about the spread of the data. Keeping these values in mind may prove useful when estimating solutions or checking values from a standard normal table. The short sketch after this list illustrates these proportions.
- The interval ȳ ± s contains approximately 68% of the measurements
- The interval ȳ ± 2s contains approximately 95% of the measurements
- The interval ȳ ± 3s contains approximately 99.7% of the measurements
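Here is a minimal Python sketch (assuming NumPy is available in your environment) that computes the sample variance and standard deviation for the height data using the n-1 divisor, then checks the 68/95/99.7 proportions on simulated normal data. The simulated mean of 65 and standard deviation of 3 are illustrative choices for the demonstration, not values from the discussion above.

```python
import numpy as np

# The five height readings from the example (inches)
heights = np.array([63, 67, 61, 66, 68])

# ddof=1 gives the n-1 divisor, i.e., the sample variance and std dev
s2 = heights.var(ddof=1)   # 8.5 square inches
s = heights.std(ddof=1)    # about 2.92 inches
print(f"sample variance = {s2:.2f}, sample std dev = {s:.2f}")

# Empirical rule check on simulated normal data
# (mean 65 and std dev 3 are illustrative assumptions)
rng = np.random.default_rng(42)
x = rng.normal(loc=65, scale=3, size=100_000)
ybar, sd = x.mean(), x.std(ddof=1)
for k in (1, 2, 3):
    within = np.mean(np.abs(x - ybar) <= k * sd)
    print(f"within {k} std dev: {within:.3f}")  # roughly 0.683, 0.954, 0.997
```

The ddof=1 argument matters here: NumPy divides by n by default, which corresponds to the population variance definition, while ddof=1 switches to the n-1 divisor of the sample variance.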
Another concept related to the standard deviation, and not tied to the normal distribution, is Chebyshev’s inequality. More on that in another post.
Reference
Ott, R. Lyman. An Introduction to Statistical Methods and Data Analysis, 4th ed., Duxbury Press, Belmont, CA, 1993.
This was my graduate data analysis course text, and I recommend you use a comprehensive statistics text that you are very familiar with during the exam.
Related:
Hypothesis Tests for Variance Case I (article)
Statistical Terms about Variation (article)
Levene’s Test (article)