Data is ubiquitous today, but turning it into product decisions takes time and effort. Data visualization is a useful technique for identifying patterns in a dataset and for deriving exploratory statistics to analyze trends.
What is a histogram and when to use it?
A histogram is an exploratory tool for visualizing the full distribution of continuous data, not just its mean or median. We can understand the data better if we group, or ‘bin’, the observations and plot the result as a graph called a histogram. For categorical data, a bar chart is used instead. A histogram is the visual representation of a frequency table, with bins on the x-axis and count (or percentage) on the y-axis. Bins are consecutive, non-overlapping ranges of values, usually of equal width, that divide the dataset into groups; the height of each bar is the count of observations in that bin.
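To make the link between a frequency table and a histogram concrete, here is a minimal sketch in plain Python. The `lifetimes` values and the `frequency_table` helper are made up for illustration; the bin width of 2 is an arbitrary choice.

```python
# Hypothetical sample: product lifetimes in years (made-up values for illustration).
lifetimes = [1.2, 2.5, 2.9, 3.1, 3.3, 3.8, 4.0, 4.4, 4.6, 5.1, 5.5, 6.7, 7.2, 9.8]


def frequency_table(values, bin_width, start=0):
    """Count observations in equal-width bins [lower, lower + bin_width).

    Returns a dict mapping each bin's lower edge to its count.
    Bins that contain no observations are simply absent from the result.
    """
    counts = {}
    for v in values:
        k = int((v - start) // bin_width)   # index of the bin that v falls into
        lower = start + k * bin_width       # lower edge of that bin
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


table = frequency_table(lifetimes, bin_width=2)

# Print the table as a text histogram: one row per bin, one '#' per observation.
for lower, count in table.items():
    print(f"{lower:>2}-{lower + 2:<2} | {'#' * count} ({count})")
```

Each row of the printed table corresponds to one bar of the histogram: the bin range is the position on the x-axis and the count is the bar height.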
A histogram is a good choice when checking how data is spread within a range. For example, if we want to understand the lifetime distribution of a particular product in years, we can use a histogram to find the count of observations within each age range. It is most useful when the dataset has a large number of observations and we want to understand their distribution.
One step to create a histogram using Excel and Python
Option 1: Excel
Select data -> Insert Chart -> Histogram. The example above shows the histogram of 100 random numbers between 1 and 500.
Option 2: Python
After importing the matplotlib and seaborn data visualization libraries, a histogram can be created with a single function call.
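As a rough sketch of the Python option, the snippet below draws a histogram of 100 random integers between 1 and 500 with matplotlib. The seed, the choice of 10 bins, the axis labels, and the output filename `histogram.png` are assumptions made for this example, not settings from the original chart.

```python
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt

random.seed(42)  # fixed seed so the figure is reproducible
data = [random.randint(1, 500) for _ in range(100)]

fig, ax = plt.subplots()
# ax.hist returns the per-bin counts, the bin edges, and the drawn bar objects.
counts, bin_edges, _ = ax.hist(data, bins=10, range=(0, 500), edgecolor="black")
ax.set_xlabel("Value")
ax.set_ylabel("Count")
ax.set_title("Histogram of 100 random numbers between 1 and 500")
fig.savefig("histogram.png")
```

If seaborn is installed, `seaborn.histplot(data, bins=10)` should draw a comparable chart on the current matplotlib axes.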
Exploratory Data Analysis & Storytelling using a Histogram
When we group data in a histogram, patterns can be identified, such as which observations occur the most. The tallest bar marks the mode, the most common range of values. We can also talk about the mean and median of the observations. The overall shape of a histogram helps determine whether the data is symmetric or skewed. Keep in mind that this shape depends on the choice of bin size, so the same data can look different with wider or narrower bins. A bimodal histogram (two peaks) is a particularly interesting distribution that often suggests a missing or lurking variable. More about this distribution in a future article.
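The points above can be illustrated with a short sketch using only Python's standard library. The `values` sample and the `bin_counts` helper are hypothetical, and comparing the mean with the median is only a rule of thumb for the direction of skew, not a formal test.

```python
import statistics

# Hypothetical right-skewed sample (made-up values for illustration).
values = [2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 9, 12, 18, 25]

mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Rule of thumb: a mean well above the median suggests a long right tail,
# a mean well below it suggests a long left tail.
if mean > median:
    shape = "right-skewed"
elif mean < median:
    shape = "left-skewed"
else:
    shape = "roughly symmetric"

print(f"mean={mean:.2f} median={median} mode={mode} shape={shape}")


def bin_counts(data, bin_width):
    """Group integer data into equal-width bins keyed by each bin's lower edge."""
    counts = {}
    for v in data:
        lower = (v // bin_width) * bin_width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


# The same data looks quite different depending on the bin width.
for width in (2, 10):
    print(f"bin width {width}: {bin_counts(values, width)}")
```

With the narrow bins the peak and the long tail are both visible, while the wide bins collapse most observations into a single bar, which is why it is worth trying more than one bin size before drawing conclusions.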