Data Outliers and Questions
When looking at a pile of data, sometimes there is a data point that is not like the others. It attracts attention because it is different from the rest of the data.
When I spot something odd in a dataset, I wonder if there is something to learn here. Is this an opportunity to make a discovery or improve a process?
All too often it is tempting to remove the outlier as a mistake, or to drop it because it doesn’t make any sense and ‘messes up’ the analysis.
The Definition of an Outlier
My computer’s built-in dictionary defines an outlier, in relation to statistics, as:
a data point on a graph or in a set of results that is very much bigger or smaller than the next nearest data point.
A couple of other definitions that may be helpful are:
A physical defect that does not correlate with a known process, equipment or procedure and is outside the expected or actual probability-density function of time or location.
An apparent deviant observation in a sample.
The hard part of these definitions is that they do not define how much difference has to exist to call a data point an outlier. There are general guidelines, yet none are crisp enough to clearly determine whether a bit of information is unique or as expected.
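To make "general guidelines" concrete, here is a minimal sketch of one widely used rule of thumb, Tukey's fences, which flags points more than 1.5 times the interquartile range beyond the quartiles. The rule is not part of the definitions quoted above, the multiplier 1.5 is a convention rather than a law, and the readings are hypothetical values made up for illustration.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # "inclusive" treats the data as the whole population of interest;
    # other quartile conventions give slightly different fences.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]

# Hypothetical measurements, invented for this example.
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 14.6]

print(tukey_fences(readings))
print(flag_outliers(readings))  # [14.6]
```

Changing k moves the line between "unusual" and "outlier", which is exactly the lack of crispness described above. A flag from a rule like this is a prompt to ask questions, not a verdict.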
Causes of Outliers
There are three main reasons an outlier may appear in your dataset.
- It is the result of a clerical error. The data point experienced a transcription error, such as a transposed or extra digit, a measurement recorded in the wrong field, or some other benign error.
- It is the rare yet expected event within the variation of the subject of the measurements. If you are doing a random sample, there is a small but nonzero probability of selecting a sample that is more than 4 standard deviations from the center of the population. It can happen, yet is rare.
- Something changed or is really just different in some meaningful manner. We use this concept for control charts to spot changes in a process. The outlier is from a different process, or something has changed in the process that makes it more likely for this item to have a ‘strange’ value.
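How rare is "more than 4 standard deviations"? The answer depends on the distribution. As a sketch, assuming the population is normally distributed (an assumption added here, not stated above), the two-sided tail probability can be computed with the standard library:

```python
import math

def two_sided_tail_probability(k):
    """P(|Z| > k) for a standard normal Z (assumes normality)."""
    return math.erfc(k / math.sqrt(2))

for k in (2, 3, 4):
    p = two_sided_tail_probability(k)
    print(f"beyond {k} sd: p = {p:.2e}, about 1 in {1 / p:,.0f}")
```

Under that assumption, a point beyond 4 standard deviations turns up roughly 6 times in 100,000 draws. With heavier-tailed data such points are more common, so the same observation is weaker evidence that something changed.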
There are other reasons that are worth mentioning:
- A large fluctuation in measurement error
- Sampling an item from a different population
- Sampling bias such that nearly all items selected are similar, except one or a few
- Non-random sampling practices
- Process startup or shut down anomalies
- Material degradation
- Damaged sample
As is clear, there are more than just clerical errors at play creating outliers.
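The control chart idea mentioned in the third main cause can be sketched briefly. This is one common form, an individuals chart with limits estimated from the average moving range. The function names and the data are hypothetical, and a real chart would also apply run rules that this sketch leaves out.

```python
import statistics

def individuals_chart_limits(baseline):
    """Return (lower, center, upper) for an individuals (X) chart.

    Sigma is estimated as the average moving range divided by d2 = 1.128,
    the usual constant for ranges of two consecutive points.
    """
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    center = statistics.fmean(baseline)
    sigma_hat = statistics.fmean(moving_ranges) / 1.128
    return center - 3 * sigma_hat, center, center + 3 * sigma_hat

def out_of_control(points, lower, upper):
    """Return (index, value) pairs that fall outside the limits."""
    return [(i, x) for i, x in enumerate(points) if x < lower or x > upper]

# Hypothetical data: a stable baseline period, then new measurements.
baseline = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 10.3, 9.9, 10.1, 10.0]
new_points = [10.1, 9.9, 11.4, 10.0]

lower, center, upper = individuals_chart_limits(baseline)
print(out_of_control(new_points, lower, upper))  # [(2, 11.4)]
```

A point outside the limits signals that the process may have changed. It does not say which of the causes listed above is responsible.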
Questions, Troubleshooting, and Understanding Outliers
When you identify what is possibly an outlier, what do you do? Hopefully, your action is not to quickly dismiss the data point and move on.
I recommend that you start asking questions:
- Given what is known about the population, is this measurement possible?
- Is there anything about the sample obviously different from other samples?
- What else is different about this sample?
- Are the measurement devices and the measurement process stable and capable?
- Where could errors have occurred in the data stream between measurement and now?
- Where in the sampling process could we create subgroups with a bias toward one group over others?
- Is there more than one path to create the feature measured?
- Is the outlier worth investigating to learn how it occurred (is it better and worth replicating, or worse and worth avoiding)?
- Was this likely a clerical error?
As a last resort, I recommend conducting your data analysis with and without the outlier data. If the results and next steps based on the analysis do not change with or without the outliers, then leave the outliers in the dataset. If the result does change, you need to work to understand whether the outliers represent nothing more than a simple error, the actual variation to expect, or a physical or chemical change. The last of these may be your big discovery.
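A minimal sketch of the with-and-without comparison, assuming the analysis is a simple numeric summary. The readings and the choice of which index to treat as suspect are hypothetical; in practice you would rerun whatever analysis actually drives your decision.

```python
import statistics

def summarize(values):
    """Basic summary statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

def compare_with_and_without(values, suspect_indices):
    """Summarize the full data and the data minus the suspect points."""
    suspect = set(suspect_indices)
    kept = [v for i, v in enumerate(values) if i not in suspect]
    return summarize(values), summarize(kept)

# Hypothetical measurements; index 8 holds the suspected outlier.
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 14.6]

with_all, without = compare_with_and_without(readings, [8])
for key in with_all:
    print(f"{key:>6}: with = {with_all[key]:.4g}   without = {without[key]:.4g}")
```

In this made-up example the median is the same either way, while the mean and standard deviation shift noticeably. Whether the outlier matters therefore depends on which statistic the next decision rests on.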
When dealing with data you will find outliers, those items not like the others. Such bits of data may occur for many reasons and may represent something quite novel and interesting or an error.
As with a root cause analysis, you should only take action once you know the underlying cause(s) of the outlier information. Your action may be to improve the measurement system or data collection process, to launch a study to improve your process, or even to launch a study to confirm your discovery.
What is your definition of an outlier and what do you do when you find one? Leave a comment below and add to the discussion.
Also published on Medium.