When it’s Not Normal: How to Choose from a Library of Distributions
When trying to fit a probability distribution to quantitative results, sometimes the normal distribution doesn’t fit. Minitab has a wealth of distributions to pick from. Do you just pick whichever one Minitab tells you fits the best? Maybe not. Just because a distribution fits your data doesn’t mean it’s a good one to use. We review my top 3 distributions for product testing and some others that come up but may not be appropriate to use.
We’ll also share what you need to think about when picking a distribution:
- think about your purpose of test
- consider your failure mode and how you’re expecting your product to perform – does the typical use case of a distribution fit?
- when you have options, the simpler the distribution the better (i.e. choose a 2-parameter over a 3-parameter)
If not normal, try lognormal and a 2-parameter Weibull distribution first. If your analysis is really complicated and the stakes are high, ask a reliability engineer for some help fitting an accurate model.
Minitab has a help guide on distribution fit for reliability analysis. It lists the available distributions in Minitab, and you can read more about them. Bookmark the page as a starting point to help you. https://support.minitab.com/en-us/minitab/19/help-and-how-to/statistical-modeling/reliability/supporting-topics/distribution-models/distribution-fit/
Reliability engineers also use Gamma, Beta, and Log-Logistic distributions. Here is a link that explains them if you really want to know about these distributions. https://www.weibull.com/hotwire/issue56/relbasics56.htm
Citations
Episode Transcript
We’re doing verification testing and getting quantitative results, or data and numbers. We’re trying to plot a data set to fit a probability distribution, and it doesn’t quite fit a normal distribution. You know, that’s the one that looks like a bell curve and the one everyone really wants to have because it’s the gateway to lots of other statistical tools, but it doesn’t fit it. Then we run the distribution fit function in Minitab and get this weird distribution we’ve never heard of. But it’s got the right fit. Do we use that one? Maybe not. Let’s review after this brief introduction.
Hello and welcome to Quality during Design, the place to use quality thinking to create products others love, for less. My name is Dianna. I’m a senior level quality professional and engineer with over 20 years of experience in manufacturing and design. Listen in and then join the conversation at qualityduringdesign.com.
In this episode we’re talking about choices of distributions of a data set. I’m glad to be able to talk with you about this today because it’s something that comes up quite often.
First, the basic takeaway is going to be just because the distribution fits doesn’t mean it’s a good one to use. Our discussion today is going to be high concept. We’re not going to get into nitty gritty details about math and statistical theory. Doing that in a podcast isn’t something I want to do, and not likely something you want to listen to either.
Secondly, let’s get on the same page with some terminology. A probability density function describes how likely a random variable is to take on values in a given range, and we usually look at it as a plot over the values of that variable. If we look at the probability density function of a normal distribution, it’s going to look like its famous bell curve.
Third thing I’m going to talk about is continuous data sets. Continuous data sets are those that take on fractions of a whole. It’s not discrete data. In other words, it’s not data that’s countable. Time, force, pressure: these are examples of continuous data.
Back to our problem at hand: fitting a distribution. Now, Minitab has lots of probability distributions to choose from. It’s built to fit the needs of many different industries, so its options are meant to fit everyone. Some of those distributions have applications that just don’t match up with product performance testing. So, in other words, just because we found a probability distribution that Minitab says fits our data best over all others, we still need to follow through ourselves on a couple of things before we use that distribution to make decisions.
Now, when picking distributions, there are some things that we need to consider.
First, what is our purpose of test? What are we trying to do with our data? For example, are we trying to show that our device is stronger than a minimum requirement? Is the minimum requirement 5 N and all of our data is over 5000 N? If we don’t need to calculate a performance with a certain reliability and confidence level, then let’s move on. The product is performing way above what we need it to, and any probability distribution is going to be far away from our requirement. It doesn’t make practical sense to spend a lot of time on it. However, if we decide that we need an accurate model, then we need to take care with which distribution we choose. Maybe we need an accurate model because our data set doesn’t have a huge factor of safety. Maybe we need an accurate model because we want to report life data or we’re using values to make a decision about different design features.
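To make the "huge margin versus small margin" point concrete, here is a minimal sketch using only the Python standard library. The two data sets are hypothetical numbers made up for illustration, and the helper name `prob_below` is my own. It estimates the fraction of units falling below a 5 N requirement under a normal model and under a lognormal model. This is only an illustration of the idea, not a substitute for a proper reliability and confidence calculation.

```python
import math
from statistics import NormalDist, mean, stdev

def prob_below(requirement, data):
    """Estimated fraction of units below the requirement under two models.

    Returns (normal_estimate, lognormal_estimate).
    """
    normal = NormalDist(mean(data), stdev(data))
    # A lognormal model is a normal model fitted to the natural logs of the data.
    logs = [math.log(x) for x in data]
    normal_on_logs = NormalDist(mean(logs), stdev(logs))
    return normal.cdf(requirement), normal_on_logs.cdf(math.log(requirement))

# Hypothetical pull-force results in newtons (illustration only, not real data).
huge_margin = [5100, 5230, 5075, 5310, 5190, 5260, 5145, 5280, 5205, 5120]
small_margin = [6.1, 6.8, 7.4, 5.9, 8.2, 6.5, 7.0, 9.1, 6.3, 7.7]

requirement = 5.0  # minimum requirement in newtons, as in the example above

wide_normal, wide_lognormal = prob_below(requirement, huge_margin)
tight_normal, tight_lognormal = prob_below(requirement, small_margin)

print(f"Huge margin:  normal={wide_normal:.3g}, lognormal={wide_lognormal:.3g}")
print(f"Small margin: normal={tight_normal:.3g}, lognormal={tight_lognormal:.3g}")
```

With the huge-margin data, both models put the chance of falling below the requirement at essentially zero, so the choice of distribution does not change the decision. With the small-margin data, the two models give noticeably different estimates, which is when picking the distribution carefully starts to matter.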
When choosing a distribution, the second thing to consider is what type of failures are we testing? What is failing and how is it failing? How are we expecting it to perform? The importance of failure modes comes into play here.
Let’s talk about some common distributions that are not normal. We’re going to review some use cases, their probability density functions (or what they look like on a graph), and generally what we use them for.
First is the exponential distribution. Let’s say that our test is of a product that would likely last forever if it weren’t for an external force. Like electronic parts that fail because of surges or time until a light bulb burns out. An exponential distribution is a model of something that has a constant failure rate. The unit under test doesn’t degrade or wear out over time. Its probability density function looks like a slide: it starts at its highest point at time zero and trails off with a long right tail. So, if our test is of cycles to failure or wear-out, then we don’t choose this one. We may be able to manipulate it to fit, but it wouldn’t be appropriate to do so.
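A small sketch of what "constant failure rate" means in practice, using only the Python standard library. The failure rate of 0.002 per hour is a hypothetical value chosen for illustration. Under an exponential model, the chance of surviving the next 100 hours is the same whether the unit is brand new or has already run for thousands of hours, which is exactly why it is a poor match for wear-out.

```python
import math

def exponential_survival(t, failure_rate):
    """Probability that a unit is still working at time t: R(t) = exp(-rate * t)."""
    return math.exp(-failure_rate * t)

failure_rate = 0.002  # hypothetical failures per hour, for illustration only

def survive_next_100_hours(age):
    """P(survive to age + 100 hours, given the unit already survived to age)."""
    return (exponential_survival(age + 100, failure_rate)
            / exponential_survival(age, failure_rate))

for age in (0, 500, 5000):
    print(f"age {age:>5} h: chance of lasting 100 more hours = "
          f"{survive_next_100_hours(age):.4f}")
```

All three ages print the same probability. If the product we are testing clearly gets more likely to fail as it ages, that is a sign the exponential model does not describe it.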
If our failure mode is wear-out, I have two top contenders that I work with after the normal distribution: the two-parameter Weibull and the Lognormal.
A Weibull distribution is loved by many reliability engineers for its capability to be manipulated and yet still be accurate. An example of this love is a well-known reliability website called weibull.com. I also had a professor in a college course that loved Weibull. The probability density function of a Weibull distribution can mimic other distributions. It’s a shapeshifter. The three-parameter Weibull has shape, scale, and threshold parameters, and the two-parameter version drops the threshold. The shape parameter is what gives it its copycat-type behavior: it can model failure rates that are increasing, decreasing, or constant. So, the Weibull is a good model to try if all others are just not fitting. I always pick a 2-parameter Weibull in my top three distributions to test against. For some reliability folks, it’s their go-to distribution.
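Here is a minimal sketch of that shapeshifting, using the standard failure-rate (hazard) formula for a two-parameter Weibull: (shape / scale) * (t / scale)^(shape - 1). The scale of 1000 hours and the three shape values are hypothetical, picked only to show the three behaviors.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate of a 2-parameter Weibull at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

scale = 1000.0                   # hypothetical characteristic life, in hours
times = [100, 500, 1000, 2000]   # hours at which to evaluate the failure rate

for shape, label in [(0.5, "decreasing (early failures)"),
                     (1.0, "constant (same as exponential)"),
                     (3.0, "increasing (wear-out)")]:
    rates = [weibull_hazard(t, shape, scale) for t in times]
    print(f"shape={shape}: {label}")
    print("   " + ", ".join(f"{r:.5f}" for r in rates))
```

A shape below 1 gives a failure rate that falls over time, a shape of exactly 1 reproduces the constant rate of the exponential, and a shape above 1 gives a rate that climbs, which is the wear-out case.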
A Lognormal distribution is used if a failure mode is a wear-out function, like degradation or corrosion. Its probability density function looks like a normal distribution but is always positive, or to the right of zero, and has a long extended right tail. What’s special about the Lognormal distribution is if our data fits it, then we can transform our data by taking its natural log to make it follow a normal distribution. That then allows us to do all those fun statistical tests that need the assumption of a normal distribution, like confidence intervals and hypothesis tests.
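A rough sketch of that transformation using only the Python standard library. The data is simulated on purpose from a lognormal distribution with made-up parameters, so it is hypothetical and only there to show the mechanics. Taking the natural log removes the long right tail, and a normal-theory interval can then be computed on the log scale and converted back.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def skewness(values):
    """Sample skewness: near 0 for symmetric data, positive for a long right tail."""
    m, s = mean(values), stdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

random.seed(1)  # fixed seed so the simulated data is repeatable
# Simulated times to failure in hours (hypothetical, drawn from a lognormal).
hours = [random.lognormvariate(6.0, 0.8) for _ in range(2000)]
log_hours = [math.log(h) for h in hours]

print(f"skewness of raw data: {skewness(hours):.2f}")
print(f"skewness of ln(data): {skewness(log_hours):.2f}")

# Approximate 95% interval for the mean on the log scale (large-sample z value).
z = NormalDist().inv_cdf(0.975)
half_width = z * stdev(log_hours) / math.sqrt(len(log_hours))
center = mean(log_hours)

# Back-transforming the mean of the logs gives the median on the original scale.
low, high = math.exp(center - half_width), math.exp(center + half_width)
print(f"approx. 95% interval for the median life: {low:.0f} to {high:.0f} hours")
```

One thing to keep in mind when converting back: the exponentiated mean of the logs is the median of the original data, not its mean. For a small sample, a t value would be more appropriate than the z value used here.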
Now, Extreme Value, either smallest or largest, seems to come up quite a lot. In product design verification, I haven’t really found a use case where it fits well. The use case for the extreme value distributions is for extreme phenomena. Meteorologists, safety engineers working on a dam, insurance companies, and people in finance, economics, and material sciences are more likely to use it than engineers doing product design performance testing. The probability density functions of these distributions also have a bell-shaped curve but with extended tails: the smallest extreme value has a long tail to the left and the largest has a long tail to the right. For product design testing, there are other distributions that fit better than an extreme value. Use Weibull instead. Extreme value distributions are used in reliability engineering, so if you really think your case fits then get some help from your favorite reliability engineer. Reliability engineers also use gamma, beta, and log-logistic distributions, and I’ll include a link in the podcast blog if you really want to know about these distributions.
So what do we do with what we’ve been talking about today? Before settling on a probability distribution, we understand the underlying type of failure it’s modeling and we consider the typical use cases of that distribution. Does what we’re testing make sense for what that model is typically used for? Does the behavior of our design failure match the idea behind the model? And we pick the simplest model that we can, which means choosing a 2-parameter model over a 3-parameter model.
Now some insight to action: If our data is not normal, try lognormal and two-parameter Weibull distributions first. If your analysis is really complicated and the stakes are high, ask a reliability engineer for some help fitting an accurate model. On this podcast blog, I will post a link to the Minitab Help guide on distribution fit for reliability analysis. It lists the available distributions in Minitab and you can read more about them. Bookmark the page as a starting point to help.
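For anyone who wants to see the "try lognormal and two-parameter Weibull first" step outside of Minitab, here is a minimal sketch using only the Python standard library. It is not Minitab's method. It is a simple probability-plot check: each distribution has a scale on which its data should fall on a straight line, and the correlation coefficient measures how straight that line is. The data is simulated from a Weibull with made-up parameters, and the function names are my own.

```python
import math
import random
from statistics import NormalDist, mean

def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def plot_correlations(failure_times):
    """Probability-plot straightness for lognormal and 2-parameter Weibull."""
    ordered = sorted(failure_times)
    n = len(ordered)
    log_t = [math.log(t) for t in ordered]
    # Benard's approximation for median ranks (the plotting positions).
    ranks = [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]
    # Lognormal: ln(t) is linear in the standard normal quantile.
    lognormal_scale = [NormalDist().inv_cdf(p) for p in ranks]
    # Weibull: ln(t) is linear in ln(-ln(1 - F)).
    weibull_scale = [math.log(-math.log(1.0 - p)) for p in ranks]
    return {
        "lognormal": correlation(log_t, lognormal_scale),
        "weibull_2p": correlation(log_t, weibull_scale),
    }

random.seed(7)  # fixed seed so the simulated data is repeatable
# Simulated cycles to failure (hypothetical): Weibull with scale 1000, shape 3.
cycles = [random.weibullvariate(1000.0, 3.0) for _ in range(200)]

for name, r in plot_correlations(cycles).items():
    print(f"{name:>10}: r = {r:.4f}")
```

A correlation closer to 1 means a straighter probability plot. As the episode stresses, the higher number is only a starting point: the distribution still has to make sense for the failure mode being tested.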
Please visit this podcast blog and others at qualityduringdesign.com. Subscribe to the weekly newsletter to keep in touch. If you like this podcast or have a suggestion for an upcoming episode, let me know. You can find me at qualityduringdesign.com, on LinkedIn, or you can leave me a voicemail at 484-341-0238. This has been a production of Deeney Enterprises. Thanks for listening.