Once word got out that I was taking graduate-level courses in statistics, I dreaded the knock on the door. Colleagues, some of whom I knew and others from some far reach of the company, would ask if I could take a look at their data. My classes had not taught me the necessary first steps to take when handed a stack of data.
I’ve lost count of the number of data sets I’ve reviewed and analyzed. I know there are important considerations and questions before creating the first plot. Let’s review the essential first steps you should take when presented with data.
Is there a decision related to this data?
Why are you looking at this data? I find it difficult not to jump in and start the analysis, yet which analysis are you attempting to accomplish? A great first question concerns the decision this dataset is to inform. Is this a comparison, an optimization, or an exploration?
If the question is “Will this design create a product that meets our reliability goal?” that helps to guide your next steps. If the decision is about which vendor better meets our requirements, that suggests a range of analysis options.
The type, quality, and quantity of data depend on the decision the analysis is to inform. Thus, when first encountering a dataset, start with what information you will need from the data. Then assess whether the data is sufficient to provide that information.
Data Collection and Errors
Let’s say the dataset provided has 1,000 entries, each the time to failure of one of 1,000 products. This would be complete if the organization shipped only 1,000 units. If they shipped 100,000 units, what happened to the other 99,000? While failure-only data is fine for some situations, it is not enough information to estimate the impact on future warranty claims.
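To see why the missing units matter, here is a minimal sketch that estimates mean time to failure two ways: from the failures alone, and with the surviving units counted as still running (right-censored). It assumes a constant failure rate (an exponential model), which is a simplification chosen for illustration and may not suit your product. The function name and the numbers are hypothetical.

```python
def exponential_mttf(failure_times, n_survivors=0, censor_time=0.0):
    """Estimate mean time to failure under an assumed exponential model.

    The estimate is total operating time divided by the number of failures.
    Survivors add operating time up to censor_time but add no failures.
    """
    if not failure_times:
        raise ValueError("at least one failure is needed for this estimate")
    total_time = sum(failure_times) + n_survivors * censor_time
    return total_time / len(failure_times)


# Hypothetical data: 3 failures out of 100 shipped units, with the
# other 97 units still working when observation stopped at 300 hours.
failures = [100.0, 200.0, 300.0]
mttf_failures_only = exponential_mttf(failures)
mttf_with_survivors = exponential_mttf(failures, n_survivors=97, censor_time=300.0)
print(mttf_failures_only, mttf_with_survivors)
```

Leaving out the survivors makes the product look far less reliable than the data supports, which is why the shipped quantity and the observation period belong alongside the failure times.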
An often-forgotten aspect of data collection is measurement error. Every measurement system has some error included. None are perfect. Understanding the measurement system used to collect the data may prompt additional questions on the quality of the data and the magnitude of the measurement error.
Another detail to understand concerns the completeness of the data. Is the data a random or not-so-random sample? Or does the dataset include measurements from all items in the population? This affects the type of analysis and how to interpret the results.
Consider the measurement frequency. A measurement system that records events as they happen is different from one that checks for events once a month. With periodic checks you have interval data: you know only that the event occurred sometime between two checks, and that requires different handling and analysis.
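As a rough illustration of what interval data looks like, the sketch below turns "found at inspection X" records into (lower, upper) bounds on the actual event time. The inspection schedule, the function name, and the assumption that every unit started at time 0 are all hypothetical.

```python
def to_intervals(detected_at, inspection_times):
    """Bound each event between the inspection that found it and the one before.

    detected_at: inspection times at which each event was first seen.
    inspection_times: every scheduled inspection time.
    Returns a list of (lower, upper) pairs; the event lies in (lower, upper].
    """
    schedule = sorted(inspection_times)
    intervals = []
    for found in detected_at:
        position = schedule.index(found)  # raises ValueError if not a scheduled time
        lower = schedule[position - 1] if position > 0 else 0
        intervals.append((lower, found))
    return intervals


# Hypothetical inspections every 30 days; three events found at days 30, 90, and 60.
inspections = [30, 60, 90]
event_intervals = to_intervals([30, 90, 60], inspections)
print(event_intervals)
```

Treating the upper bound as the exact event time would overstate how long each unit survived, so it is worth carrying both bounds into the analysis.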
Data format and organization
To this point, we haven’t looked at the data within the dataset. Take a look at the data now. This is the start of the data clean-up process. Things like missing data or inconsistent date formats affect whether various software packages can use the data. Are the missing data a clerical error or deliberate?
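A quick first pass at this clean-up might count blank entries and entries that do not match any expected date format. The sketch below uses only the Python standard library; the list of formats and the sample values are assumptions for illustration.

```python
from datetime import datetime

# Formats to try, in order. These are assumptions; use the formats that
# actually appear in your data. Ambiguous entries such as 03/04/2021 need
# a decision about month-first versus day-first before parsing.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def parse_date(text):
    """Return a date, or None if the entry is blank or matches no format."""
    cleaned = (text or "").strip()
    if not cleaned:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None


def summarize_dates(raw_values):
    """Return (parsed dates, count of blanks, count of unrecognized entries)."""
    parsed = [parse_date(value) for value in raw_values]
    blank = sum(1 for value in raw_values if not (value or "").strip())
    unrecognized = sum(1 for p in parsed if p is None) - blank
    return parsed, blank, unrecognized


# Hypothetical column with mixed formats, blanks, and one free-text entry.
raw = ["2021-03-04", "03/05/2021", "06.03.2021", "", "next Tuesday", None]
parsed_dates, blank_count, unrecognized_count = summarize_dates(raw)
print(blank_count, unrecognized_count)
```

The counts do not say whether a blank is a clerical error or deliberate; they only tell you how many entries to ask about.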
Ideally, the columns have informative labels that help you understand the dataset. “Column 1,” “column 2,” etc., don’t provide the necessary information about what is within each column. Dozens of columns of 4-digit numbers without labels or a legend are not useful.
While some software packages can handle data presented using Nevada charts, not all can. This may require organizing the data for the intended analysis.
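As one example of such reorganizing, the sketch below flattens a Nevada chart into one record per group of units, which many analysis tools accept. It assumes the common layout in which each row is a shipment period with its quantity shipped and each column is the period in which returns arrived. The function name, the integer period numbering, and the data are hypothetical.

```python
def nevada_to_records(shipped, returns, last_period):
    """Flatten a Nevada-style chart into (age, count, status) records.

    shipped: {ship_period: quantity shipped}
    returns: {ship_period: {return_period: quantity returned}}
    last_period: final period covered by the chart
    Units never returned are treated as still working (censored) at last_period.
    """
    records = []
    for ship_period, quantity in shipped.items():
        failed = 0
        for return_period, count in returns.get(ship_period, {}).items():
            records.append((return_period - ship_period, count, "failed"))
            failed += count
        survivors = quantity - failed
        if survivors > 0:
            records.append((last_period - ship_period, survivors, "censored"))
    return records


# Hypothetical chart: two shipment periods observed through period 3.
shipped = {1: 100, 2: 120}
returns = {1: {2: 3, 3: 2}, 2: {3: 4}}
records = nevada_to_records(shipped, returns, last_period=3)
for record in records:
    print(record)
```

Counting unreturned units as still working is itself an assumption; units may have failed without being returned, so check how returns were recorded.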
One thing that has often caused me problems is a column of numbers with a few data entries stored as text. These are hard to spot, yet when expecting numbers, most software packages balk when confronted with a field with text.
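One way to spot these entries before the analysis software complains is to try converting every value and list the ones that fail. This is a minimal sketch in plain Python; the function name and sample column are hypothetical.

```python
def find_non_numeric(values):
    """Return (row_index, value) pairs for entries that do not parse as numbers.

    Note that float() accepts strings such as "nan" and "inf", so those
    would pass this check and may need a separate look.
    """
    problems = []
    for index, value in enumerate(values):
        try:
            float(value)
        except (TypeError, ValueError):
            problems.append((index, value))
    return problems


# Hypothetical column as read from a file: mostly numbers, a few text entries.
column = ["12.5", "13.1", "N/A", "14", "1,250", " 15.2 ", "twelve"]
bad_entries = find_non_numeric(column)
print(bad_entries)
```

Entries such as "1,250" are numbers with formatting, while "N/A" is missing data, so the two kinds of problem call for different fixes.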
Exploring the data
Ok, you understand the decision this data is to inform and understand the dataset, including how it was collected, measurement error, and how the data is organized. Great. Now it’s time to start the analysis. Or is it?
At this point, I recommend plotting the data in a few different ways. Visualize the data to identify basics like the shape or structure of the data. Time series plots, XY plots, and others provide basic information concerning the nature of the data.
For example, if you plot a column by collection date and see long gaps between clusters of measurements, find out why the gaps occurred. Another example is a sudden change in the magnitude of the data, such as values that start as single-digit numbers and then jump to 7-digit values. Was that a change in the measurement system scale being used, or does it accurately reflect what happened?
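Plots reveal these patterns quickly, and a small check can flag them as well. The sketch below lists long gaps between collection dates and points where the magnitude changes sharply. The thresholds (30 days, a factor of 1,000), the function names, and the data are assumptions to adjust for your situation.

```python
from datetime import date


def find_gaps(dates, max_gap_days):
    """Return (start, end, days) for consecutive dates more than max_gap_days apart."""
    ordered = sorted(dates)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        days = (later - earlier).days
        if days > max_gap_days:
            gaps.append((earlier, later, days))
    return gaps


def find_scale_jumps(values, ratio=1000.0):
    """Return indexes where magnitude changes by more than ratio from the prior value."""
    jumps = []
    for index in range(1, len(values)):
        previous, current = abs(values[index - 1]), abs(values[index])
        if previous == 0 or current == 0:
            continue  # skip zeros to avoid dividing by zero
        if max(previous, current) / min(previous, current) > ratio:
            jumps.append(index)
    return jumps


# Hypothetical measurements: a three-month gap, then 7-digit readings.
collected = [date(2023, 1, 2), date(2023, 1, 3), date(2023, 1, 5),
             date(2023, 4, 10), date(2023, 4, 11)]
readings = [4, 7, 5, 4200000, 5100000]
gaps = find_gaps(collected, max_gap_days=30)
jumps = find_scale_jumps(readings)
print(gaps, jumps)
```

A flag from either check is a prompt to ask how the data was collected, not a reason to delete or rescale anything.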
The first step is to know the data, its history, and its behavior. Then do the actual work to conduct the analysis.
Larry George says
Thanks for your observations on data. It’s good to ask, “Why are you doing what you do with the data?” and “What would it be worth to get more data?”
You’re right, not all data come in the form of a relational db: records in rows, factors in columns. Inferring missing values scares me.
Stay tuned for my next article. “While some software packages can handle data presented using Nevada charts, not all can.” Those software packages make Kaplan-Meier reliability estimates and they usually include the variances of the reliability estimates and maybe even confidence bands. Don’t believe them.