Building a Basic Box Plot
One of the first things to do when faced with a set of numbers is to plot them. A histogram is often the first choice, or maybe a dot plot. Improve your data plotting skills and let your data reveal a bit more by using a box plot.
An Example Box Plot
Here’s some data.
2.860928 17.671176 3.679519 12.683250 15.954954 2.185074
10.089316 29.102870 27.585598 5.700319 18.738644 1.694618
11.233156 79.872179 58.078349 11.434015 1.331777 4.846609
14.558336 3.445164 38.214733 12.080222 4.226581 2.426053
15.648076 6.978497 23.055192 8.722669 1.893071 2.748054
Interesting, isn’t it? Is it normally distributed? Does it have a single mode? Is there a long tail, or are there outliers? A table of numbers is difficult to interpret, so we plot the data.
Here is the same data as a basic box plot.
To read a box plot, let’s step through the various markings. The dark line within the box is the median of the data. The upper and lower edges of the box (the hinges) bound the interquartile range, the middle half of the data from the 25th percentile to the 75th percentile.
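As a rough check on those markings, the median and hinges can be computed directly. This is a minimal sketch that types in the 30 values printed above and uses base R's fivenum(); the names given to the result are only labels chosen for this illustration.

```r
# The 30 values printed above
x <- c(2.860928, 17.671176, 3.679519, 12.683250, 15.954954, 2.185074,
       10.089316, 29.102870, 27.585598, 5.700319, 18.738644, 1.694618,
       11.233156, 79.872179, 58.078349, 11.434015, 1.331777, 4.846609,
       14.558336, 3.445164, 38.214733, 12.080222, 4.226581, 2.426053,
       15.648076, 6.978497, 23.055192, 8.722669, 1.893071, 2.748054)

# fivenum() returns Tukey's five-number summary:
# minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
names(five) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
print(five)

# Height of the box: the distance between the two hinges
box_height <- five[["upper_hinge"]] - five[["lower_hinge"]]
print(box_height)
```

The hinges are close to, but not always identical to, the quartiles returned by quantile(), which uses a different interpolation rule by default.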
The dashed lines extend out to the small horizontal lines, the whiskers, which mark the most extreme non-outlier data points. When there are no outliers, the whiskers mark the full range of the data.
The two dots above the upper whisker indicate potential outliers. Here the outliers are identified using the interquartile range criterion: a data point that lies more than 1.5 times the interquartile range beyond either edge of the box is designated an outlier and is not used to determine the location of the whiskers.
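The outlier rule can be worked through by hand. The sketch below assumes the rule is applied to the hinge spread, as base R's boxplot.stats() does with its default coef of 1.5; the variable names are chosen here for illustration.

```r
# The 30 values printed above
x <- c(2.860928, 17.671176, 3.679519, 12.683250, 15.954954, 2.185074,
       10.089316, 29.102870, 27.585598, 5.700319, 18.738644, 1.694618,
       11.233156, 79.872179, 58.078349, 11.434015, 1.331777, 4.846609,
       14.558336, 3.445164, 38.214733, 12.080222, 4.226581, 2.426053,
       15.648076, 6.978497, 23.055192, 8.722669, 1.893071, 2.748054)

# Lower and upper hinges (edges of the box) and their spread
hinges <- fivenum(x)[c(2, 4)]
spread <- diff(hinges)

# Fences sit 1.5 spreads beyond each edge of the box
lower_fence <- hinges[1] - 1.5 * spread
upper_fence <- hinges[2] + 1.5 * spread

# Points outside the fences are flagged as potential outliers
outliers <- x[x < lower_fence | x > upper_fence]

# Whiskers end at the most extreme points still inside the fences
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

print(outliers)
print(whiskers)

# Cross-check against the values boxplot() itself uses
stats <- boxplot.stats(x)
print(stats$out)
print(stats$stats[c(1, 5)])
```

For this data set the two flagged points are the two largest values, which matches the two dots in the plot.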
The width of the box and whiskers is arbitrary and adjusted for plot legibility.
Why Plot Data Using a Box Plot
Like a histogram, a box plot provides some information about the shape of the dataset. Unlike a histogram, there are no bin widths to choose, and the choice of bin width can change the appearance, and thus the interpretation, of the plot.
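To see the bin width effect, here is a small sketch that bins the same 30 values two ways with hist(). The break points (widths of 20 and 5) are arbitrary choices made for this illustration, and plot = FALSE is used so only the counts are returned.

```r
# The 30 values printed above
x <- c(2.860928, 17.671176, 3.679519, 12.683250, 15.954954, 2.185074,
       10.089316, 29.102870, 27.585598, 5.700319, 18.738644, 1.694618,
       11.233156, 79.872179, 58.078349, 11.434015, 1.331777, 4.846609,
       14.558336, 3.445164, 38.214733, 12.080222, 4.226581, 2.426053,
       15.648076, 6.978497, 23.055192, 8.722669, 1.893071, 2.748054)

# Same data, two bin widths; plot = FALSE returns the counts without drawing
coarse <- hist(x, breaks = seq(0, 80, by = 20), plot = FALSE)
fine   <- hist(x, breaks = seq(0, 80, by = 5),  plot = FALSE)

print(coarse$counts)  # 4 wide bins
print(fine$counts)    # 16 narrow bins
```

The wide bins lump most of the data into a single bar, while the narrow bins spread it out and leave gaps in the tail. The box plot has no such setting to tune.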
A box plot provides basic information about the location of the center (median) of the data along with where the bulk of the data lies. If the median line is centered within the box, the data is roughly symmetrical. If the median is closer to one edge of the box, the data is skewed, with the longer tail extending toward the opposite edge.
The lines out to the whiskers provide information on the range of the data set (ignoring points identified as outliers), and the individual dots indicate potential outliers. If the whiskers are equally distant from the box, this again supports symmetry; unequal distances indicate skewness.
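These symmetry checks can also be done numerically. The following sketch compares the two halves of the box and the two whisker lengths using boxplot.stats(); the variable names are hypothetical labels for this example.

```r
# The 30 values printed above
x <- c(2.860928, 17.671176, 3.679519, 12.683250, 15.954954, 2.185074,
       10.089316, 29.102870, 27.585598, 5.700319, 18.738644, 1.694618,
       11.233156, 79.872179, 58.078349, 11.434015, 1.331777, 4.846609,
       14.558336, 3.445164, 38.214733, 12.080222, 4.226581, 2.426053,
       15.648076, 6.978497, 23.055192, 8.722669, 1.893071, 2.748054)

# stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker
s <- boxplot.stats(x)$stats

# Distance from the median to each edge of the box
box_lower <- s[3] - s[2]
box_upper <- s[4] - s[3]

# Length of each whisker, measured from the box edge
whisker_lower <- s[2] - s[1]
whisker_upper <- s[5] - s[4]

print(c(box_lower = box_lower, box_upper = box_upper))
print(c(whisker_lower = whisker_lower, whisker_upper = whisker_upper))
```

For this data set the median sits close to the middle of the box, yet the upper whisker is far longer than the lower one, which together with the two high outliers points to a long right tail.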
I use a box plot as it’s quick, clear, and informative.
Using Excel or R
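Using R, I first created a data set of 30 values using a random number generator for a lognormal distribution and assigned it to the variable x. Note that rlnorm() takes its parameters on the log scale: a log-mean of log(10) and a log-standard deviation of log(2.5), which corresponds to a median of 10 and a geometric standard deviation of 2.5 on the original scale.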
x <- rlnorm(30, meanlog = log(10), sdlog = log(2.5))
Then I used just the default boxplot() command, boxplot(x), to create the box plot above.
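Because the values come from a random number generator, rerunning rlnorm() will not return the same 30 numbers. A minimal sketch for reproducing the plot is to type in the printed values, or to fix a seed before sampling; the seed value 42 and the variable y below are arbitrary choices for this example, not what was used originally.

```r
# Option 1: plot the exact 30 values printed above
x <- c(2.860928, 17.671176, 3.679519, 12.683250, 15.954954, 2.185074,
       10.089316, 29.102870, 27.585598, 5.700319, 18.738644, 1.694618,
       11.233156, 79.872179, 58.078349, 11.434015, 1.331777, 4.846609,
       14.558336, 3.445164, 38.214733, 12.080222, 4.226581, 2.426053,
       15.648076, 6.978497, 23.055192, 8.722669, 1.893071, 2.748054)
boxplot(x)  # default settings, vertical box plot

# Option 2: draw a fresh, repeatable sample (arbitrary seed, assumed here)
set.seed(42)
y <- rlnorm(30, meanlog = log(10), sdlog = log(2.5))
boxplot(y)  # a different sample, so the box and outliers will differ
```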
That is pretty easy.
In Excel, select the data series, then on the Insert tab choose the Statistical charts group and pick the Box and Whisker option. How easy it is to create a box plot depends on the version of your spreadsheet; older versions may not offer a built-in box plot chart.
The next time you have a set of data, plot it. Plot it multiple ways and let the data show you what information it may contain.
Also published on Medium.