Statistical Terms about Variation

“Statistics is the language of variation.” I’m sure that is a quote from someone other than me, and it is true: statistics is all about variation. In this post, let’s explore some of the ways statisticians talk about data, and specifically about the amount of dispersion in the data.

Just as mean, median, and mode describe where the center of the data lies, there are a few ways to describe the dispersion, or spread, of the data.

When we have only one data point, we have only an estimate of the center, or mean, of the data. With only one point, well, that’s all we know; it tells us nothing about spread.

If we have two points or more and they are all the same, we probably would question the data collection or measurement processes. And, it would be rather uninteresting. Most of the time when we have more than one data point, they are different. It is the description or summary of this difference that is the subject of this note.

Range is by far the easiest term to understand. It’s the difference between the smallest and largest value in the dataset. So, for example, if we have

1,2,3,4,5

as a dataset, then the range is 5 (the largest) minus 1 (the smallest), or 4. Simple. Sort the data to find the high and low, and a quick calculation determines the range.
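As a minimal sketch of that calculation in Python (the function name `data_range` is just an illustrative choice, not something from a library):

```python
def data_range(values):
    """Return the largest value minus the smallest value."""
    # max() and min() scan the data directly, so no explicit sort is needed.
    return max(values) - min(values)


data = [1, 2, 3, 4, 5]
print(data_range(data))  # 5 - 1 = 4
```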

While not terribly useful by itself, the range does provide a very quick way to compare the spread of two datasets.

Interquartile range (IQR) is a little more complicated, and plays a similar role. I think of it as the counterpart to the median: the data at the ends of the sorted values (the tails) do not significantly influence the IQR, whereas a single outlier at the end of the dataset directly changes the range. For example, expanding the dataset above with one value, say 50, gives the dataset 1, 2, 3, 4, 5, 50. We would find the range to be 49, yet that does not describe where most of the data occurs.

The IQR is one way to describe the spread of the bulk of the data. It is the distance between the largest and smallest values of the central 50% of the data; that is, consider only the half of the data centered on the median.

To calculate IQR, first order the data. Let’s use an example of the following dataset.

2, 4, 7, 9, 3, 8, 5, 2, 8, 5, 1

and, sorted, the dataset is

1, 2, 2, 3, 4, 5, 5, 7, 8, 8, 9

Then, find the median (the middle value, marked here in brackets)

1, 2, 2, 3, 4, [5], 5, 7, 8, 8, 9

which in this case is 5.

Now consider the numbers below and above the median.

(1, 2, 2, 3, 4), 5, (5, 7, 8, 8, 9)

Find Q1 and Q3, the medians of the first half and second half of the data respectively (marked here in brackets).

(1, 2, [2], 3, 4), 5, (5, 7, [8], 8, 9)

This gives Q1 = 2 and Q3 = 8.

Subtract Q1 from Q3 to find the IQR

IQR = Q3 – Q1 = 8 – 2 = 6
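The steps above can be sketched in Python. This is only one way to do it: it assumes the "median of each half" convention used in this example (dropping the middle value when the count is odd), and other quartile conventions exist that can give slightly different answers. The function name `iqr_median_of_halves` is made up for illustration.

```python
from statistics import median


def iqr_median_of_halves(values):
    """Return (Q1, Q3, IQR) using the median-of-halves convention.

    Assumes at least two values. For an odd count, the middle value
    (the median itself) is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    q1 = median(lower)
    q3 = median(upper)
    return q1, q3, q3 - q1


data = [2, 4, 7, 9, 3, 8, 5, 2, 8, 5, 1]
q1, q3, iqr = iqr_median_of_halves(data)
print(q1, q3, iqr)  # 2 8 6
```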

Example based on the work of Stephanie at Statistics How To at http://www.statisticshowto.com/articles/how-to-find-an-interquartile-range-in-statistics/ on Nov 20th, 2011.

As you can see, the IQR, like the median, is not sensitive to the extremes of the data.
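To see that insensitivity in numbers, here is a small sketch comparing range and IQR for the dataset 1 to 5 used earlier for the range, with and without the extra value of 50. It again assumes the median-of-halves convention for quartiles, and `spread_summary` is a hypothetical helper name.

```python
from statistics import median


def spread_summary(values):
    """Return (range, IQR), with quartiles taken as medians of each half."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    # Skip the middle value when the count is odd.
    upper = ordered[half + 1:] if len(ordered) % 2 else ordered[half:]
    data_range = ordered[-1] - ordered[0]
    iqr = median(upper) - median(lower)
    return data_range, iqr


without_outlier = spread_summary([1, 2, 3, 4, 5])
with_outlier = spread_summary([1, 2, 3, 4, 5, 50])
print(without_outlier)  # range 4, IQR 3
print(with_outlier)     # range 49, IQR still 3
```

Under this convention the single value of 50 moves the range from 4 to 49 while the IQR stays at 3.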

As far as I know, there isn’t a measure of variation that parallels the mode. For the counterpart to the mean, we’ll explore variance.

Variance is important to statisticians. It is the second central moment of the data’s distribution (the mean, or expected value, is the first moment). If you are curious about moments as used in statistics, take a graduate level statistics course in data analysis.

Variance of the population often is represented by σ²; whereas, for a sample we often use s². (The standard deviations, σ and s, are the square roots of these.) More on variance and standard deviation in Part 2.
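As a brief preview, and only as a sketch, Python's standard `statistics` module provides both versions: `pvariance` treats the data as the whole population (dividing by n), while `variance` treats it as a sample (dividing by n − 1). The dataset is the small 1 to 5 example from earlier.

```python
from statistics import pvariance, variance

data = [1, 2, 3, 4, 5]

# Mean is 3, so the squared deviations are 4, 1, 0, 1, 4 and sum to 10.
population_var = pvariance(data)  # 10 / 5
sample_var = variance(data)       # 10 / (5 - 1)

print(population_var, sample_var)
```

The sample value comes out larger because it divides the same sum of squared deviations by one less.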
