When confronted with a stack of data, do you think about creating a histogram, too? Just tallied the 50th measurement of a new process? That means it’s time to craft a histogram, right?
There isn’t another data analysis tool as versatile. A histogram (bar chart) can deal with count, categorical, and continuous data (technically, graphs of the first two would be bar charts). It likes a lot of data yet reveals the secrets of even smaller sets. A histogram should be on your shortlist of most often used graphing tools.
What is a Histogram or Bar Chart?
A histogram is one of the 7 basic tools of quality control. It provides a representation of the distribution of numerical data. A histogram is an estimate of the probability distribution of a continuous variable.
A very close cousin, the bar chart, also provides a graphical representation of categorical variables. A bar
A histogram or bar chart is a means to visualize the data. Use it to explore the spread, shape, and a range of other details concerning a set of data.
First Look and Clean Up
The phrase ‘just plot the data’ means to get the numbers out of a column or table and onto a graph. Take a look at the data. The easiest and (in my opinion) best way is to use a histogram or bar chart.
This first look helps to identify data errors, anomalies, or potential outliers. Just trying to create a graph with the data will identify entries that are not numeric. If there were ‘28’ dogs and ‘1a’ cats at the kennel last Tuesday, well, a simple bar graph cannot plot the value ‘1a’, which is most likely a data entry error.
The first plot may reveal a potential outlier or another error in the dataset. For example, if the kennel only has the capacity to house 50 creatures and last Thursday the headcount was 238, while the day before it was 24 and the day after it was 32, it is likely that one of the three digits in ‘238’ is wrong or the entire entry is wrong. Or there really was a surge in the population for that one day. It is worth checking either way.
The initial plot helps you, the analyst, to cleanse the data of any obvious or clerical errors, and helps to quickly identify potential outliers as well.
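As a rough sketch of that first look, the snippet below separates entries that can be plotted from entries that deserve a second look, then draws a quick text bar chart. The function name `split_valid_entries`, the day labels, and the capacity check are my own assumptions for illustration, loosely echoing the kennel example; flagged entries are meant to be checked, not automatically deleted.

```python
def split_valid_entries(raw_counts, capacity):
    """Separate plottable counts from entries that need a second look."""
    valid, suspect = [], []
    for label, raw in raw_counts:
        try:
            value = int(raw)
        except ValueError:
            # Entries such as '1a' cannot be plotted at all.
            suspect.append((label, raw, "not numeric"))
            continue
        if value < 0 or value > capacity:
            # Plottable, but implausible given what we know about the process.
            suspect.append((label, raw, "outside 0..capacity"))
        else:
            valid.append((label, value))
    return valid, suspect


# Hypothetical headcounts echoing the kennel example (labels are made up).
raw_counts = [("Tue dogs", "28"), ("Tue cats", "1a"),
              ("Wed", "24"), ("Thu", "238"), ("Fri", "32")]
valid, suspect = split_valid_entries(raw_counts, capacity=50)

# Quick text bar chart of the usable entries.
for label, value in valid:
    print(f"{label:<9} {'#' * value} {value}")

# Entries to check against the original records.
for label, raw, reason in suspect:
    print(f"check {label!r}: {raw!r} ({reason})")
```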
What is the Dataset’s Shape?
A simple look at the data, in graphical form, allows you to see if the data is symmetric or skewed, unimodal, bimodal, or multimodal. To see the shape, having more data certainly helps.
Do the counts balance out in an on/off situation, or is there an overwhelming number in one category? You can see the result. While not as precise as a hypothesis test the bar chart may be all that is needed to be convincing.
One note of caution: be true to the data. With continuous data you need to explore different bin sizes to allow the data to reveal its secrets. With categorical data, the bins may or may not benefit from grouping; again, a little exploration will help the data tell its story.
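To illustrate why exploring bin sizes matters, here is a minimal sketch that bins the same made-up readings two ways. The helper `bin_counts` and the data are assumptions for illustration only; with the wide bins the two clusters merge into one lump, while the narrower bins show both.

```python
import math
from collections import Counter


def bin_counts(values, bin_width):
    """Return (bin_start, count) pairs for equal-width bins, empty bins included."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    first, last = min(counts), max(counts)
    return [(k * bin_width, counts.get(k, 0)) for k in range(first, last + 1)]


# Hypothetical readings with two clusters, near 12 and near 20.
readings = [10, 11, 11, 12, 12, 12, 13, 13, 14,
            18, 19, 19, 20, 20, 20, 21, 21, 22]

for width in (10, 2):
    print(f"bin width = {width}")
    for start, count in bin_counts(readings, width):
        print(f"  {start:>3} | {'#' * count}")
```

Integer readings and integer widths keep the bin arithmetic exact here; real measurements with decimals would need a little more care at the bin edges.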
Count and Categorical Data
For categorical data a simple plot assists in seeing the relative proportions. Organizing the data into a Pareto chart then provides a means to focus attention on the groups with the highest counts, which may be useful when considering failure modes.
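A Pareto ordering is just the category counts sorted from largest to smallest, usually with a running cumulative share. The sketch below assumes a hypothetical list of failure-mode labels and a helper named `pareto_table`; neither comes from a real dataset.

```python
from collections import Counter


def pareto_table(observations):
    """Return (category, count, cumulative share) rows, largest count first."""
    counts = Counter(observations)
    total = sum(counts.values())
    rows, running = [], 0
    for category, count in counts.most_common():
        running += count
        rows.append((category, count, running / total))
    return rows


# Hypothetical failure modes recorded for ten returned units.
failures = (["seal leak"] * 6) + (["connector"] * 3) + (["firmware"] * 1)

rows = pareto_table(failures)
for category, count, cumulative in rows:
    print(f"{category:<10} {'#' * count:<7} {cumulative:.0%}")
```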
If you are considering modeling the data using a binomial distribution, the process of creating a bar chart provides your estimates of the proportions.
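For a two-category tally, the bar heights divided by the total are the proportion estimates. The sketch below also reports the usual standard error of a binomial proportion, sqrt(p(1 - p)/n), which is my addition rather than something the chart itself shows; the pass/fail data and the name `estimate_proportion` are hypothetical.

```python
import math
from collections import Counter


def estimate_proportion(outcomes, success):
    """Estimate the proportion of `success` outcomes and its standard error."""
    counts = Counter(outcomes)
    n = sum(counts.values())
    p_hat = counts[success] / n
    std_err = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, std_err


# Hypothetical inspection results: 18 passes and 2 failures.
outcomes = ["pass"] * 18 + ["fail"] * 2

p_hat, std_err = estimate_proportion(outcomes, success="pass")
print(f"estimated pass proportion: {p_hat:.2f} (standard error {std_err:.3f})")
```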
Continuous Data
The histogram is a first visualization of a continuous dataset’s underlying distribution. It is a first look at the probability density function (PDF). With scant data, avoid making distribution selections based on the histogram, yet with datasets over about 50 readings you certainly will have some ideas around which distributions, if any, may describe the population.
One consideration when creating histograms is understanding the measurement system and the measurement error involved. Creating bins narrower than the width of the measurement error is likely to lead your interpretation of the data astray.
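One way to see the problem is to treat the gauge resolution as a stand-in for the width of the measurement error (an assumption on my part) and count how many empty bins a too-narrow width creates. The helpers `safe_bin_width` and `count_empty_bins` and the readings are hypothetical, and integer units are used to keep the arithmetic exact.

```python
import math


def safe_bin_width(requested_width, resolution):
    """Round the requested width up to a whole multiple of the resolution."""
    multiples = max(1, math.ceil(requested_width / resolution))
    return multiples * resolution


def count_empty_bins(values, bin_width):
    """Count empty equal-width bins between the lowest and highest reading."""
    occupied = {math.floor(v / bin_width) for v in values}
    return (max(occupied) - min(occupied) + 1) - len(occupied)


# Hypothetical readings from a gauge that reports to the nearest 5 units.
resolution = 5
readings = [100, 105, 105, 110, 110, 110, 115, 115, 120]

too_narrow = 2
adjusted = safe_bin_width(too_narrow, resolution)

print(f"width {too_narrow}: {count_empty_bins(readings, too_narrow)} empty bins")
print(f"width {adjusted}: {count_empty_bins(readings, adjusted)} empty bins")
```

The gaps produced by the narrow bins come from the gauge, not from the process, which is the kind of false feature to guard against.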
Comparisons
You’ve created a good histogram or bar chart. Great. Now run your experiment, continue production for another week, then make a comparison. Can you see the difference? While not as ‘precise’ as the various statistical analysis methods available, a graph may be convincing on its own or set expectations on what the statistical analysis result may reveal.
Overlaying data on a plot is a classic way to detect shifts in the mean or variance. Before revealing the plot, note your hypothesis: what should you see? Did the process improve, did the mean shift up, is the variance reduced? Reveal the plot and compare your expectations to what you just learned.
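A fair overlay needs both samples binned on the same edges. The sketch below does that for two made-up samples and prints them side by side, along with the mean and sample variance of each. The helper `shared_bin_counts` and the before/after data are assumptions for illustration.

```python
import math
import statistics
from collections import Counter


def shared_bin_counts(before, after, bin_width):
    """Bin both samples on the same edges so the rows line up."""
    def binned(values):
        return Counter(math.floor(v / bin_width) for v in values)

    b, a = binned(before), binned(after)
    keys = range(min(min(b), min(a)), max(max(b), max(a)) + 1)
    return [(k * bin_width, b.get(k, 0), a.get(k, 0)) for k in keys]


# Hypothetical measurements before and after a process change.
before = [48, 49, 50, 50, 51, 52, 53, 47, 50, 51]
after = [51, 52, 52, 53, 53, 53, 54, 54, 55, 52]

rows = shared_bin_counts(before, after, bin_width=2)
print("bin | before | after")
for start, n_before, n_after in rows:
    print(f"{start:>3} | {'#' * n_before:<6} | {'o' * n_after}")

for name, sample in (("before", before), ("after", after)):
    print(f"{name}: mean {statistics.mean(sample):.1f}, "
          f"variance {statistics.variance(sample):.2f}")
```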
Proportion Above or Below a Target Value
One question a histogram or bar chart may answer concerns proportions that meet specific criteria. For example, given the set of measurements, what proportion are within the specifications? What percentage are defective?
You may not need to fit a distribution or do any math more complex than counting or summation.
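The counting really is that simple. A minimal sketch, assuming hypothetical measurements and specification limits of 9.5 and 10.5 (the function name `proportion_within` is also mine):

```python
def proportion_within(values, lower, upper):
    """Fraction of values that fall inside the limits, inclusive."""
    inside = sum(1 for v in values if lower <= v <= upper)
    return inside / len(values)


# Hypothetical measurements with lower and upper specification limits.
measurements = [9.4, 9.6, 9.8, 9.9, 10.0, 10.0, 10.1, 10.2, 10.3, 10.6]
within = proportion_within(measurements, lower=9.5, upper=10.5)

print(f"within specification: {within:.0%}")
print(f"outside specification: {1 - within:.0%}")
```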
What Other Reasons Are There?
Let’s think about this for a moment. Such a simple graphing technique, how useful can it be when we have so many powerful visualization tools available? When presented with a new dataset, or a dataset that is a simple vector of measurements or a set of counts for a set of categories, a histogram or bar chart may be your best, if not your only, option.
Besides the above reasons to use such charts, you may need to:
- illustrate your report
- illustrate your proposal
- consider different measurement techniques
- evaluate a new material
- explore a design change idea
- start a data modeling project
- confirm or challenge assumptions
- think visually about a problem
- estimate statistical parameters
- illustrate a forecast or trend
Should I continue? You get the idea, right? These plots are foundational to our ability to grapple with data in a meaningful way. If you use histograms or bar charts, how so? Add your favorite uses in the comments below.
PS: the tally of reasons above is short of the 50 reasons mentioned in the title; please help this article get to 50 solid reasons if that is possible.
Larry George says
11. It’s easy to make a histogram
12. Management understands bar charts
13. Excel Histogram data analysis does it
13.1 Excel Histogram now sorts bins for Paretos
13.2 Excel Histogram now will accumulate to show cdf
14. Your choice of bin size to show or hide features, cut off part of sample, hide outliers
15. Viewers will think it’s from a random sample even if it isn’t
Selvamany Srinivasan says
16. Check for normality