As an industrial statistics consultant for the past 25 years, I have frequently fielded questions about sample size determination. Unfortunately, I have encountered many instances where a simple rule of thumb (such as "always use 30") was applied regardless of the purpose. Appropriate sample size guidance depends on the goal of the study, the type of data involved, the statistical method being used, and other factors as well. Common activities that typically require sample size determination include:
- Hypothesis Testing (including Equivalence Testing)
- Estimation of statistics like means, standard deviations, proportions
- Calculation of Tolerance Intervals (an interval expected to contain a specified proportion of a process's output)
- Designed Experiments (number of replicates)
- Statistical Process Control Charts (e.g. X-bar charts)
- Acceptance Sampling (to disposition lots or batches of raw materials or finished products)
- Reliability Testing to estimate Reliability performance
- Reliability Testing to demonstrate Reliability performance
All these applications require different assumptions and calculations to determine an appropriate sample size. In this article, we focus on Sample Size determination for Hypothesis Testing. It is assumed that the reader is already familiar with Hypothesis Testing.
Possible Errors in Hypothesis Testing
When performing hypothesis tests, we try to minimize the risk of making an error when deciding whether to reject the Null Hypothesis. The possible errors we could make are:
- Type I – Incorrectly Reject the Null Hypothesis when it’s true
- Type II – Failure to Reject the Null Hypothesis when it’s false
The Type I error is controlled explicitly via the choice of the significance level (α). So, if we choose α = 0.05 and we reject the null hypothesis, we do so with 95% confidence. (In general, our confidence level when we reject the null hypothesis is (1 – α)*100%.)
Type II errors are not controlled explicitly, but rather are a function of α, the practical difference we are trying to detect, the standard deviation, and the sample size. We specify the probability of making a Type II error as β. We typically control the risk of making a Type II error by ensuring an adequate sample size.
It is common to characterize the probability of making a Type II error by specifying or computing the Power, which is simply 1 – β. That is, the Power is the probability of correctly rejecting the null hypothesis when it is false.
The table below summarizes the possible outcomes when making a decision using hypothesis testing. We aim to minimize the risks of being in the top right or lower left quadrants.

| Decision | H0 is True | H0 is False |
| --- | --- | --- |
| Fail to Reject H0 | Correct decision (confidence = 1 – α) | Type II error (probability = β) |
| Reject H0 | Type I error (probability = α) | Correct decision (Power = 1 – β) |
Just as we can depict α graphically, we can also depict β graphically. Recall that α is the area under the distribution curve of our test statistic (assuming the null hypothesis is true) beyond the critical value(s). β comes into play when the null hypothesis is actually false: it is the probability that we fail to reject the null hypothesis in that case.
Example:
An ice cream manufacturer is filling up cartons of ice cream that are labeled as 16 oz. To protect against underfilling, the producer has a target process average of 16.3 oz. The manufacturer plans to select a sample of cartons to test whether the process is running on target or below target. If the process average is running at least 0.15 oz below target, the null hypothesis should be rejected.
Here the null and alternate hypotheses are as follows:
H0: μ = 16.3 oz
H1: μ < 16.3 oz
The picture below represents the hypothesized process ("Original Process") and a potentially changed process that is centered at 16.15 oz. These are distributions of individual cartons; note that the standard deviation of the individuals is 0.1 oz.
The picture below illustrates the critical value (based on our 1-sided test with α = 0.025).
Suppose that the process average has actually shifted to 16.15 oz, an amount deemed significant to the manufacturer. Following the shift, we weigh one container of ice cream to see if a process shift has occurred. As shown in the illustration below, there is a good chance that we will not detect this shift. This is because more than half of the shifted curve (changed process) still falls within the cutoffs of the original distribution (centered at 16.3 oz).
Graphically, β is the area under the shifted curve that falls within the cutoffs (determined by α) of the original curve. The power (1-β) is the area beyond the critical value — it represents the probability that we will detect the change. Clearly, the power is less than 0.50 so there is less than a 50% chance we will detect a shift of 0.15 oz when we are sampling individual ice cream containers.
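The arithmetic behind this picture is easy to verify. Below is a minimal sketch (in Python with scipy, offered as an alternative to the MINITAB workflow used later in this article) that computes β and the Power for the shifted process, assuming the 1-sided z-test setup described above (σ = 0.1, n = 1, α = 0.025):

```python
from scipy.stats import norm

# Hypothesized process: mu0 = 16.3 oz; shifted process: mu1 = 16.15 oz
# Individual cartons have sigma = 0.1 oz; we sample n = 1 carton
# Lower-tailed test at alpha = 0.025: reject H0 if xbar < cutoff
mu0, mu1, sigma, n, alpha = 16.3, 16.15, 0.1, 1, 0.025

se = sigma / n**0.5                      # standard error of the sample mean
cutoff = mu0 - norm.ppf(1 - alpha) * se  # critical value (about 16.104 oz)

beta = 1 - norm.cdf(cutoff, loc=mu1, scale=se)  # shifted-curve area above the cutoff
power = 1 - beta

print(f"cutoff = {cutoff:.3f} oz, beta = {beta:.3f}, power = {power:.3f}")
# cutoff = 16.104 oz, beta = 0.677, power = 0.323
```

With only one carton per sample, the Power is about 0.32, consistent with the "less than 50%" statement above.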
Specifying Power
Specifying the desired power of a statistical test depends on several factors.
1. The severity of making a Type II error
If the occurrence of a Type II error results in adverse consequences, then the power of the test must be high. Suppose you are testing a new drug to determine if the side effects are more severe than the currently approved drug. Failure to detect a true difference because of a Type II error may lead to disastrous consequences. Thus, the power of such a test must be very high (e.g. 0.999).
2. Size of the effect needed to detect
By choosing a sample size large enough, any arbitrarily small change or difference may be detected statistically. The key is determining a practical difference to be detected. A sample size should be selected so that the test has high power to detect a difference that you care about — but low power to detect a meaningless difference. For example, suppose you are trying to detect a difference between two injection molding cavities by weighing parts produced from each cavity. For this application it may be desirable to detect a true average difference of 0.1 ounce or more.
Therefore, the power to detect this size difference should be relatively high. On the other hand, the power to detect an average difference of 0.01 ounce should be low since this difference is not a practical difference in the parts.
Factors Affecting Power
In general, several factors influence the power for a given hypothesis test. These include the significance level (α), the standard deviation (σ) of the data, the sample size (n), and the size of the difference or effect (D). The impact of each factor on power is discussed next.
1. The significance level (α)
As discussed earlier, α is the probability of making a Type I error (the null hypothesis is incorrectly rejected). All other factors being equal, as the probability of a Type I error (α) increases, the probability of a Type II error (β) decreases. Thus, as α increases, the power (=1 – β) increases.
This relationship can be seen in the graphic below. If α increases (say to 0.05), then the cutoff on the original process curve is closer to 16.3. This results in a smaller area of overlap. That is, the area under the changed process curve that falls within the cutoff values of the original curve (β) is less than before. Conversely as we reduce the Type I error (make α smaller), then the area of overlap (β) increases.
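To put numbers on this, the short sketch below (Python with scipy, reusing the ice cream example's values) compares the Power at α = 0.025 and α = 0.05:

```python
from scipy.stats import norm

# Shifted ice cream process (16.3 -> 16.15 oz), sigma = 0.1, n = 1
# Lower-tailed test: reject H0 if xbar < cutoff
for alpha in (0.025, 0.05):
    cutoff = 16.3 - norm.ppf(1 - alpha) * 0.1
    power = norm.cdf((cutoff - 16.15) / 0.1)  # area of shifted curve below cutoff
    print(f"alpha = {alpha}: cutoff = {cutoff:.4f} oz, power = {power:.3f}")
# alpha = 0.025: cutoff = 16.1040 oz, power = 0.323
# alpha = 0.05:  cutoff = 16.1355 oz, power = 0.442
```

Doubling α from 0.025 to 0.05 moves the cutoff closer to 16.3 and raises the Power from about 0.32 to about 0.44.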
2. The standard deviation (σ)
σ is usually not known but is estimated from the data. When σ is estimated we indicate it by σ-hat (σ with a hat on top). A common estimator for σ is s, the sample standard deviation. σ is inversely related to the power. When σ is large, there is a lot of variation in the data (a wide distribution) which makes detecting changes or differences difficult. When σ is small it is relatively easy to detect changes or differences and the resulting power is greater.
This can be seen in the graphic above. Imagine that both the original process and changed process have a smaller standard deviation so that their curves are taller and narrower. Then, the area under the “changed process” curve that falls within the cutoffs of the “original process” curve (β) is smaller and the power is larger.
3. The sample size (n)
The larger our sample size, the closer our estimates are to the true process or population value. As a result, we see less variation in our estimates when we use large sample sizes. This results in the fact that as n increases, the power of the test increases (assuming all other factors that affect β are held constant). The two graphics below illustrate the impact of sample size on the probability of making a Type II error (and power). Increasing the sample size from n = 1 to n = 5 to n = 12 increases the power significantly. Thus, we are much more likely to detect a shift in the process average from 16.3 oz to 16.15 oz with a sample size of 12 than we are with a sample size of 5.
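As a numerical check on these graphics, the sketch below (Python with scipy, same assumptions as before) computes the Power of the 1-sided z test at each of the three sample sizes:

```python
from scipy.stats import norm

def power_lower_tailed_z(mu0, mu1, sigma, n, alpha):
    """Power of a lower-tailed 1-sample z test to detect a shift from mu0 to mu1."""
    se = sigma / n**0.5
    cutoff = mu0 - norm.ppf(1 - alpha) * se
    return norm.cdf(cutoff, loc=mu1, scale=se)  # P(xbar < cutoff | true mean mu1)

for n in (1, 5, 12):
    print(f"n = {n:2d}: power = {power_lower_tailed_z(16.3, 16.15, 0.1, n, 0.025):.3f}")
# n =  1: power = 0.323
# n =  5: power = 0.918
# n = 12: power = 0.999
```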
4. The difference to detect (D)
Assuming all other factors that affect β are held constant, as the difference we are trying to detect (D) increases, the power of the test increases. This is because it is relatively easy to detect large differences while it is more difficult to detect small differences.
For example, in the graphic below, suppose we wanted to detect a shift in the process average from 16.3 oz to 16.29 oz. Here, the “Changed Process” would be practically indiscernible from the “Original Process” and almost the entire “Changed Process” curve would overlap with the cutoffs of the original process. Here β would be very large (almost 100%) indicating that the power to detect this change is almost 0! On the other hand, if the process actually changed from 16.3 oz to 15.3 oz, how easily would we detect this change?
Power / Sample Size Calculations
Depending on the type of hypothesis test being performed (e.g. 1-sample t test, 2-sample t test, 2 proportions test, etc.), appropriate formulas may be utilized to compute the power of the test for a given sample size (n), standard deviation (σ), Type I error probability (α), and difference to detect (D). Alternatively, if the desired power (1 – β) is specified, the necessary sample size may be computed.
The applicable formulas may be found in statistics textbooks. Since most statistical software applications perform these computations, we will focus on the use of software. The use of MINITAB is illustrated here.
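For reference, for a 1-sample z test the required sample size takes the familiar textbook form n = ((z_α + z_β) × σ / D)², rounded up to the next whole number, where z_α and z_β are the standard normal quantiles cutting off upper-tail areas of α and β (use z_α/2 in place of z_α for a 2-sided test). The software results that follow agree with this calculation.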
Power and sample size computations are available via the “Power and Sample Size” choice from the “Stat” menu as illustrated below.
Example: 1-Sample Z Test
Here, we test whether a bowling ball diameter is equal to the target value of 8.55 in.
H0: μ = 8.55 in
H1: μ ≠ 8.55 in
Based on a sample size of n = 30, we estimate xbar = 8.47 and s = 0.20 inches. Also, the significance level (α) = 0.05. What is the power of this hypothesis test to detect a difference of 0.10 inch (i.e., to reject the null hypothesis if the true process average is 0.10 inch away from the test value)?
We use MINITAB and specify the values for n, D, s, and α. (The form of the alternate hypothesis and α are specified under “Options”).
Thus, there is about a 78% chance that we will correctly reject the null hypothesis if the true process average is 0.1 inch away from the test value (8.55 inches) with a sample size of 30.
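As a cross-check outside MINITAB, the sketch below (Python with scipy, treating s = 0.20 as the known σ for the z test) reproduces the power calculation:

```python
from scipy.stats import norm

# 1-sample z test, 2-sided: H0: mu = 8.55 vs H1: mu != 8.55
# n = 30, sigma = 0.20 in, alpha = 0.05, difference to detect D = 0.10 in
n, sigma, alpha, D = 30, 0.20, 0.05, 0.10

se = sigma / n**0.5
z_crit = norm.ppf(1 - alpha / 2)  # 2-sided critical value (1.96)
shift = D / se                    # true shift in standard-error units

# probability the sample mean lands beyond either critical value
power = norm.cdf(-z_crit - shift) + 1 - norm.cdf(z_crit - shift)
print(f"power = {power:.4f}")     # power = 0.7819, i.e. about 78%
```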
A power curve plots the power vs. the difference (for a specified sample size, α, and σ). As can be seen from the curve below, as the difference we are trying to detect gets smaller, the power gets smaller (all other factors held constant).
We may specify multiple sample sizes to see the impact on the power.
So, as the sample size increases, so does the Power. The Power Curves for each of the different sample sizes are shown next.
More typically, when planning a study, we want to specify values for the power and compute the required sample sizes. This is illustrated below.
Thus, to achieve a Power of 0.90, a sample size of 43 is required and to achieve a Power of 0.95, a sample size of 52 is required.
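The textbook formula given earlier can also be inverted directly to solve for n. A minimal sketch (Python), assuming the 2-sided 1-sample z setup above, reproduces these sample sizes:

```python
from math import ceil
from scipy.stats import norm

def n_for_power(D, sigma, alpha, power):
    """Smallest n for a 2-sided 1-sample z test to reach the requested power
    (the usual approximation that ignores the negligible far tail)."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return ceil(((z_a + z_b) * sigma / D) ** 2)

for target in (0.90, 0.95):
    print(f"power {target}: n = {n_for_power(0.10, 0.20, 0.05, target)}")
# power 0.9: n = 43
# power 0.95: n = 52
```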
Other hypothesis tests (and equivalence tests) work similarly. For example, the sample size results below are from a 2-sample t test, where we are interested in detecting a difference in the group averages of at least 0.05. Here the estimated pooled standard deviation (from both groups) is 0.0774 and the alternate hypothesis is 2-sided (not =). The table shows the sample size needed to achieve the desired Power.
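For readers without MINITAB, statsmodels provides an analogous computation for the 2-sample t test. The sketch below uses D = 0.05 and the pooled standard deviation 0.0774 from the example; the Power targets shown are illustrative, and the results should closely match MINITAB's since both are based on the noncentral t distribution:

```python
from statsmodels.stats.power import TTestIndPower

# Standardized effect size: difference to detect / pooled standard deviation
effect_size = 0.05 / 0.0774  # about 0.65

analysis = TTestIndPower()
for power in (0.80, 0.90, 0.95):  # illustrative power targets
    n = analysis.solve_power(effect_size=effect_size, alpha=0.05,
                             power=power, alternative='two-sided')
    print(f"power = {power}: n per group = {n:.1f}")  # round up in practice
```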
Summary
This article aimed to provide a conceptual understanding of determining an appropriate sample size for conducting statistical hypothesis tests. The sample size is a crucial factor in ensuring that changes or differences we need to detect are actually detected, while not being overly sensitive to differences that are not practically important. MINITAB software was utilized to illustrate some examples. Sample size formulas can be found in the MINITAB help or in statistics textbooks.