16  Hypothesis Testing

We finished our discussion of estimation, interval estimation in particular, in the previous chapter. The other part of statistical inference is hypothesis testing. In this chapter, we first discuss the meaning of a hypothesis in statistical analysis, followed by the testing procedures for the population mean μ when σ is known and when σ is unknown. Pay attention to the similarities and differences between estimation and testing.

16.1 Introduction

What is Hypothesis Testing?

In statistics, a hypothesis is a claim or statement about a property of a population, often the value of a population distribution parameter. For example,

  • The mean body temperature of humans is less than 98.6°F. Here the mean body temperature is a property or characteristic of the target population, human beings. We can turn the verbal claim into a brief mathematical expression: μ < 98.6.

  • Marquette students’ IQ scores have a standard deviation equal to 15. The IQ score standard deviation is a characteristic of the population of Marquette students. Mathematically, we can write the claim as σ = 15.

You can see that we usually focus on claims about a population distribution parameter.

The null hypothesis, denoted H0, is a statement that the value of a parameter is equal to some claimed value, or the negation of the alternative hypothesis that will be discussed in a minute. Often H0 represents a skeptical perspective, a claim to be tested, or the current status of the parameter. For example, the claim “the percentage of Marquette female students loving Japanese food is equal to 80%” is an H0 claim because of the key word “equal”. Usually we are not very convinced that the H0 claim is true, and in our analysis we want to test the claim and see whether the evidence and information we collect are strong enough to conclude that the percentage is not equal to 80%.

The alternative hypothesis, denoted H1 or Ha, is a claim that the parameter is less than, greater than or not equal to some value. It is usually our research hypothesis of some new scientific theory or finding. If we think the percentage of Marquette female students loving Japanese food is greater than 80%, this hypothesis is the H1 claim. If after a formal testing procedure, we conclude that the percentage is greater than 80%, we sort of make a new research discovery that overturns the previous claim or status quo that the percentage is equal to 80%.

Let’s do one more exercise. Is the statement “On average, Marquette students consume less than 3 drinks per week.” an H0 or H1 claim? Because of the key words “less than”, it is an H1 claim.

So what is hypothesis testing? Hypothesis testing is a procedure to decide whether or not to reject H0 based on how much evidence there is against H0. If the evidence is strong enough, we reject H0 in favor of H1.


Example

Before we jump into the formal hypothesis testing procedure, let’s talk about a criminal charge example. How a criminal is convicted is similar to the formal testing procedure.

Suppose a person is charged with a crime, and a jury will decide whether the person is guilty or not. We all know the rule: even though the person is charged with the crime, at the beginning of the trial, the accused is assumed to be innocent until the jury declares otherwise. Only if overwhelming evidence of the person’s guilt can be shown is the jury expected to declare the person guilty; otherwise the person is considered not guilty.

If we want to make a claim about whether the person is guilty or not, what are our H0 and H1? Remember that the null hypothesis represents a skeptical perspective or a claim to be tested, or the current status of the parameter, so we have

  • H0: The person is not guilty 🙂

This is how we write a hypothesis: start with “H0:” followed by the statement. Being not guilty is the default status quo of anyone, although the jury may doubt or be skeptical of the person being not guilty. The prosecutors and police detectives are trying their best to collect strong enough evidence to prove guilt beyond a reasonable doubt to the jury. Therefore the alternative hypothesis is

  • H1: The person is guilty 😟

In the example, the evidence could be photos, videos, witnesses, fingerprints, DNA, and so on. How do we decide to keep H0 or to accept H1? After all the evidence, including the defense attorney’s and prosecutor’s arguments, is presented to the jury, the decision rule is the jury’s vote. Finally, to close the case, we need a conclusion, which is the verdict: “guilty” or “not enough evidence to convict”.

Please go through the entire criminal charge process again:

H0 and Ha => Evidence => Decision rule => Conclusion

The process is quite similar to the formal procedure for hypothesis testing.

16.2 How to Formally Do a Statistical Hypothesis Testing

The entire hypothesis testing procedure can be wrapped up in the following six steps. No worries if you don’t have any idea about them yet. We will learn this step by step using a test for the population mean μ.

  • Step 0: Check Method Assumptions

  • Step 1: Set the H0 and Ha in Symbolic Form from a Claim

  • Step 2: Set the Significance Level α

  • Step 3: Calculate the Test Statistic (Evidence)

Decision Rule I: Critical Value Method

  • Step 4-c: Find the Critical Value

  • Step 5-c: Draw a Conclusion Using Critical Value Method

Decision Rule II: P-Value Method

  • Step 4-p: Find the P-Value

  • Step 5-p: Draw a Conclusion Using P-Value Method

  • Step 6: Restate the Conclusion in Nontechnical Terms, and Address the Original Claim

Let’s look at this example: Is the New Treatment Effective?

A population of patients with hypertension has normally distributed blood pressure (BP) with a mean of 150. After 6 months of treatment, the BP of 25 patients from this population was recorded. The sample mean BP is x̄ = 147.2 and the sample standard deviation is s = 5.5.

Source: https://unsplash.com/photos/i1iqQRLULlg

Our goal is to determine whether a new treatment is effective in reducing BP. Let’s learn the testing procedure step by step using this example.


Step 0: Check Method Assumptions

Any statistical method is based on some assumptions. To use the method, and analyze our data appropriately, we have to make sure that the assumptions are satisfied. In this book, most of the distribution-based methods require

  • Random sample

  • The population is normally distributed and/or the sample size n>30.

Source: https://www.pinterest.ph/pin/633387417082544/

Example Step 0: Check Method Assumptions

  • From the question description, the population of hypertension patients is normally distributed.

Step 1: Set the H0 and H1 from a Claim

The first step of testing is to understand the H0 and H1 claims, and express them mathematically using population parameters. The following provides three examples.

  • 🧑‍🏫 The mean IQ score of statistics professors is higher than 120.
    • H0: μ ≤ 120 vs. H1: μ > 120
  • 💵 The mean starting salary for Marquette graduates who didn’t take MATH 4720 is less than $60,000.
    • H0: μ ≥ 60000 vs. H1: μ < 60000
  • 📺 The mean time between uses of a TV remote control by males during commercials equals 5 sec. 
    • H0: μ = 5 vs. H1: μ ≠ 5

Keep in mind that the equality sign is always put in H0, and that H0 and H1 are mutually exclusive. Also, the claims are about population parameters, not sample statistics. We are not sure of the value of the parameter being tested, but we want to collect evidence and see which claim about the parameter is supported by the evidence.

Example Step 1: Set the H0 and H1 from a Claim

The claim that the new treatment is effective in reducing BP means the mean BP is less than 150, which is an H1 claim. So we can write our H0 and H1 as

H0: μ = 150 vs. H1: μ < 150

where μ is the mean blood pressure.


Step 2: Set the Significance Level α

Next, we set the significance level α, which determines how rare or unlikely our evidence must be in order to represent sufficient evidence against H0. It tells us how strong the collected evidence must be in order to overturn the current claim. An α level of 0.05 implies that evidence occurring with probability lower than 5% will be considered sufficient evidence to reject H0. Mathematically, α = P(Reject H0 | H0 is true). As a result, α = 0.05 means that we incorrectly reject H0 5 out of every 100 times we collect a sample and run the test.

Here is the idea. When we want to see if what we care about (the population parameter) is not as described in the null hypothesis H0, we first assume or believe H0 is right; then, based on this, we see if there is sufficient and strong evidence to conclude that it is probably not the case, and find the alternative hypothesis more reasonable.

Let’s explain α by an example. Suppose we would like to test the claim that “The mean IQ of statistics professors is greater than 120,” or in short H0: μ = 120 vs. H1: μ > 120. With a large sample size, we can assume the sample mean X̄ follows a normal distribution. Now, to test whether the mean IQ is greater than 120, we first treat the mean as not greater than 120 unless we later have sufficient evidence to say it is greater than 120. In particular, we need to do the test and analysis on the basis that the mean is as assumed under H0. That is, we first assume the mean is 120, or μ = 120, then see if that assumption really makes sense.

Because X̄ is normal, we do the test under the assumption that X̄ ∼ N(120, σ²) for some σ², say 9. (Let’s focus on μ and ignore σ² at this moment.) If X̄ has mean 120, and from our sample data we got the sample mean x̄ = 121, do you think the claim H0: μ = 120 makes sense? How about if you got x̄ = 127? Now α comes into play. Let me ask you a question. What threshold value of the sample mean x̄ do you think is too large to believe that H0: μ = 120 is a reasonable assumption or data-generating mechanism? What is the threshold that makes you start to believe that H1: μ > 120 makes more sense than H0: μ = 120? The significance level α sets such a threshold. Once α is specified, we know the corresponding sample mean threshold x̄, which is the one such that P(X̄ > x̄) = α.

Figure 16.1 illustrates the significance level α. Once we decide on α, we determine how rare or unlikely our sample mean x̄ must be in order to represent sufficient evidence against H0: μ = 120. In this example, if our x̄ is greater than 125, we would think the evidence is strong enough to conclude that μ = 120 is not so reasonable, because the chance of such a value happening is no larger than α. We instead think H1: μ > 120 makes more sense.

Figure 16.1: Illustration of significance level, alpha.
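Here is a minimal R check of the threshold shown in Figure 16.1. It assumes, as in the discussion above, that under H0 the sample mean follows N(120, 9) (standard deviation 3) and that α = 0.05; these numbers are only illustrative.

## Under H0, assume the sample mean ~ N(120, 3^2); alpha = 0.05
mu_0 <- 120; sd_xbar <- 3; alpha <- 0.05
## threshold x_bar such that P(sample mean > threshold) = alpha under H0
qnorm(alpha, mean = mu_0, sd = sd_xbar, lower.tail = FALSE)
[1] 124.9346

This is close to the 125 cutoff used in the discussion above.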

The entire rationale is the rare event rule.

Rare Event Rule

If, under a given assumption, the probability of a particular observed event is exceptionally small, we conclude that the assumption is probably not correct.

The level α is related to the α used in confidence intervals for defining a “critical value”.

Example Step 2: Set the Significance Level α

There is no α mentioned in the question description. Usually α is set by the researchers themselves. Let’s set α = 0.05. This means we are asking, “Is there sufficient evidence at α = 0.05 that the new treatment is effective?”


Step 3: Calculate the Test Statistic

Setting α is like setting the threshold for determining whether our collected evidence is sufficient or strong enough. In this step, we collect our evidence. The evidence comes from the information we have, which is the sample data. Sample data are the only source we have for inference about the unknown parameter. So to do a test about the parameter, or decide whether a statement about the parameter makes sense, we let the data and evidence speak up.

The evidence used in hypothesis testing is called the test statistic: a sample statistic value used in making a decision about H0. Suppose the test we are interested in is H0: μ = μ0 vs. H1: μ < μ0, where μ0 is some population mean value that could be 150 lbs, 175 cm, 50 ounces, etc. When computing a test statistic, we assume H0 is true. Remember, we are trying to see if there is any strong evidence against H0. We should do our analysis in the world of H0, or the status quo. If the evidence is not sufficient, we stay in our current situation.

When σ is known, the test statistic for testings about μ is

ztest = (x̄ − μ0) / (σ/√n)

When σ is unknown, the test statistic for testings about μ is

ttest = (x̄ − μ0) / (s/√n)

Familiar with them? They are the z score and t score, the sample statistics used for testing. When we calculate the test statistic, we need a value of μ. What value should we use? You are right if you use the value assumed in H0! The test statistic is the evidence we use in testing, and the evidence is collected under the assumption that μ = μ0. We collect evidence to prove that a suspect committed a crime under the assumption that he is innocent, right? We shouldn’t look at any person through colored spectacles, or frame anyone by treating them as a criminal and then making up a fake story about what they never did.

Example Step 3: Calculate the Test Statistic

Since we don’t know the true σ, and only know s, we use the t distribution, and the test statistic is ttest = (x̄ − μ0)/(s/√n) = (147.2 − 150)/(5.5/√25) = −2.55. So if the true mean blood pressure is 150, our test statistic or evidence is about 2.55 standard deviations below the mean. Is this number too weird or uncommon to believe that the mean blood pressure is really 150? We need a decision rule, and that is what we are going to learn in step 4.


Step 4-c: Find the Critical Value

In this step, we set the decision rule. There are two methods in testing, the critical-value method and the p-value method. The two methods are equivalent, leading to the same decision and conclusion. Let’s first talk about the critical-value method.

In step 2, we set the α, and in step 3, we collect the evidence. Now we need a way to decide whether the collected evidence is sufficient or not to reject the H0 claim. The critical value(s) is a value determined by the significance level α that separates the rejection region or critical region, where we reject H0, from the values of the test statistic that do not lead to the rejection of H0.

Which critical value to use depends on whether our test is right-tailed, left-tailed or two-tailed. The right-tailed test, or right-sided test, is a test with H1: μ > μ0. When we are interested in whether μ is greater than some value, say μ0, we focus on the right-hand side of the sampling distribution, because the evidence, the test statistic calculated in step 3, will usually be on the right-hand side of the distribution, and so is the critical value used in the decision rule. Similarly, the left-tailed test, or left-sided test, is a test with H1: μ < μ0. For a two-tailed or two-sided test, we have H1: μ ≠ μ0. In this case, we wonder whether μ is larger or smaller than the assumed μ0, so we need to pay attention to both sides of the sampling distribution.

Figure 16.2 illustrates the rejection regions for the different types of hypothesis tests. Let’s assume σ is known; the unknown-σ case is similar, and we just replace the z score with the t score. Given the significance level α, for a right-tailed test, the critical value is zα, the standard normal quantile such that P(Z > zα) = α, where Z ∼ N(0, 1). For a left-tailed test, the critical value is −zα, or in fact z1−α, the standard normal quantile such that P(Z < −zα) = α, or P(Z > −zα) = 1 − α. When the test is a two-tailed test, there are two critical values, one on the right-hand side and the other on the left-hand side of the distribution. Here, we need to split α equally into α/2: the critical value on the right-hand side is zα/2 such that P(Z > zα/2) = α/2, and the critical value on the left-hand side is −zα/2 such that P(Z < −zα/2) = α/2. Note that by definition, zα and tα,n−1 are always positive and on the right-hand side of the distribution.

Figure 16.2: Rejection regions for the different types of hypothesis tests. Source: https://towardsdatascience.com/everything-you-need-to-know-about-hypothesis-testing-part-i-4de9abebbc8a

The following table is the summary of the critical values under different cases. When σ is known, we use z scores, and when σ is unknown, we use t scores.

Condition     Right-tailed (H1: μ > μ0)   Left-tailed (H1: μ < μ0)   Two-tailed (H1: μ ≠ μ0)
σ known       zα                          −zα                        −zα/2 and zα/2
σ unknown     tα,n−1                      −tα,n−1                    −tα/2,n−1 and tα/2,n−1
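Before applying this to the example, here is a minimal R sketch of how the critical values in the table map to qnorm() and qt(); α = 0.05 and n = 25 are used only as illustrative values.

alpha <- 0.05; n <- 25
## sigma known: z critical values
qnorm(alpha, lower.tail = FALSE)    ## right-tailed: z_alpha
[1] 1.644854
qnorm(alpha)                        ## left-tailed: -z_alpha
[1] -1.644854
qnorm(c(alpha / 2, 1 - alpha / 2))  ## two-tailed: -z_{alpha/2} and z_{alpha/2}
[1] -1.959964  1.959964
## sigma unknown: t critical values with df = n - 1
qt(alpha, df = n - 1, lower.tail = FALSE)   ## right-tailed: t_{alpha, n-1}
[1] 1.710882
qt(alpha, df = n - 1)                       ## left-tailed: -t_{alpha, n-1}
[1] -1.710882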

Example Step 4-c: Find the Critical Value

Since the test is a left-tailed test, σ is unknown, and n = 25, the critical value is −tα,n−1, which is −t0.05,25−1 = −t0.05,24 = −1.711.


Step 5-c: Draw a Conclusion Using Critical Value

The critical value separates the standard normal values into the rejection region and the non-rejection region. For a right-tailed test, the rejection region is any z value greater than zα, and the non-rejection region is any z value smaller than or equal to zα. For a left-tailed test, the rejection region is any z value smaller than −zα, and the non-rejection region is any z value greater than or equal to −zα. For a two-tailed test, the rejection region is the union of the z values smaller than −zα/2 and the z values greater than zα/2.

If the test statistic ztest is in the rejection region, we reject H0. If ztest is not in the rejection region, we do not (or fail to) reject H0. Figure 16.3 shows an example in which we reject H0 in a right-tailed test. The test statistic is 2.5, which is greater than the critical value 1.645, so the test statistic falls in the rejection region.

Figure 16.3: Test statistic inside of critical region. Source: https://www.thoughtco.com/example-of-a-hypothesis-test-3126398

The rejection region for any type of tests is shown in the table below.

Condition     Right-tailed (H1: μ > μ0)   Left-tailed (H1: μ < μ0)   Two-tailed (H1: μ ≠ μ0)
σ known       ztest > zα                  ztest < −zα                |ztest| > zα/2
σ unknown     ttest > tα,n−1              ttest < −tα,n−1            |ttest| > tα/2,n−1

Remember that the test statistic works as our evidence, and the critical value is a threshold for determining whether the evidence is strong enough. When the test statistic is more extreme than the critical value, it means that, from our point of view, the chance of our evidence happening is way too small given the current rules of the game, or under H0. Therefore, we don’t think we live in the world of H0; it probably makes more sense to think we live in the world of H1, where it is commonplace to see such evidence.

Example Step 5-c: Draw a Conclusion Using Critical Value

We reject H0 if ttest < −tα,n−1. Since ttest = (x̄ − μ0)/(s/√n) = (147.2 − 150)/(5.5/√25) = −2.55 and −t0.05,25−1 = −t0.05,24 = −1.711, we have ttest = −2.55 < −1.711 = −tα,n−1, so we reject H0.


Step 4-p: Find the P-Value

Another decision rule is the p-value method. The p-value measures the strength of the evidence against H0 provided by the data: the smaller the p-value, the greater the evidence against H0. As the name implies, the p-value is the probability of getting a test statistic value that is at least as extreme as the one obtained from the data, assuming that H0 is true (μ = μ0). For example, p-value = P(Z ≥ ztest | H0) for a right-tailed test. We are more likely to get a p-value near 0 when H0 is false than when H0 is true, because when H0 is true, ztest will be closer to zero, or located around the center of the distribution (the μ0 value assumed in H0), and its p-value will be around 0.5. On the other hand, when H0 is false, or the true μ is not μ0, the test statistic ztest will be farther away from μ0 and located at either tail of the distribution. Therefore, its p-value will be small.

P-Value Illustration

Since the p-value is a probability, it represents the area under the density curve for values that are at least as extreme as the test statistic’s value. Figure 16.4 shows the p-value for the different tests. Note that the p-value for a two-tailed test depends on whether the test statistic is positive or negative. If the calculated test statistic is on the right (left) hand side, the p-value will be the right (left) tail area times two.

Figure 16.4: Illustration of p-values for different types of hypothesis tests

Mathematically, the p-value for any type of tests is shown in the table below.

Condition     Right-tailed (H1: μ > μ0)   Left-tailed (H1: μ < μ0)   Two-tailed (H1: μ ≠ μ0)
σ known       P(Z > ztest | H0)           P(Z < ztest | H0)          2P(Z > |ztest| | H0)
σ unknown     P(T > ttest | H0)           P(T < ttest | H0)          2P(T > |ttest| | H0)
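To see how the formulas in the table map to R, here is a minimal sketch using the blood pressure test statistic ttest = −2.545 with df = 24 as an illustrative input; only the left-tailed line is the p-value actually needed for this example.

t_test <- -2.545455; df <- 24
pt(t_test, df = df, lower.tail = FALSE)   ## right-tailed: P(T > t_test)
[1] 0.9911218
pt(t_test, df = df)                       ## left-tailed:  P(T < t_test)
[1] 0.008878158
2 * pt(-abs(t_test), df = df)             ## two-tailed:   2 P(T > |t_test|)
[1] 0.01775632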

Example Step 4-p: Find the P-Value

This is a left-tailed test, so the p-value is P(T < ttest) = P(T < −2.55) ≈ 0.01.


Step 5-p: Draw a Conclusion Using P-Value Method

How do we use the p-value to make a decision? Well, here the p-value is like our evidence, and the significance level α is the cut-off for measuring the strength of the evidence. If the p-value ≤ α, we reject H0. If instead the p-value > α, we do not reject (or fail to reject) H0.

Yes, it is a pretty simple decision rule, but the p-value has been misinterpreted and misused for a long time. When we do hypothesis testing, it is dangerous to simply compare the size of the p-value and α, then jump to a conclusion. You can find more about the issues with p-values at XXX.

Example Step 5-p: Draw a Conclusion Using P-Value Method

We reject H0 if the p-value < α. Since p-value =0.01<0.05=α, we reject H0.


Both Methods Lead to the Same Conclusion

Remember I said that both the critical-value method and the p-value method lead to the same conclusion? Figure 16.5 shows why. The test statistic and the critical value are variable values, either z or t scores. The p-value and the significance level α are probabilities, either z or t probabilities. The p-value is computed from the test statistic, and α defines the critical value. A more extreme test statistic implies a smaller p-value, and a smaller α means a more extreme critical value. When we reject H0, the following three statements are equivalent:

  • test statistic is in the rejection region.
  • the test statistic is more extreme than the critical value
  • the p-value is smaller than α.
Figure 16.5: The conclusion is the same regardless of the method used (Critical Value or P-Value).

The following distribution shows the equivalence of the critical-value method and the p-value method in the blood pressure example.


Step 6: Restate the Conclusion in Nontechnical Terms, and Address the Original Claim

The final step in the entire hypothesis testing procedure is to make a verbal conclusion and address the original claim. Figure 16.6 gives you a guideline for how to state the conclusion.

Figure 16.6: Conclusions based on testing results. Source: https://www.drdawnwright.com/category/statistics/

Here is a reminder: we never say we accept H0. Why can’t we say we “accept the null”? The reason is that we are assuming the null hypothesis is true, or is the situation we are currently in, and we are only trying to see if there is evidence against it. Therefore, the conclusion should be stated in terms of rejecting the null. We don’t “accept” H0 when we don’t have evidence against it because we are already in the world of H0.

Figure 16.7: Meme about hypothesis testing conclusions. Source: https://www.pinterest.com/pin/287878601159173631/

Example Step 6: Restate the Conclusion in Nontechnical Terms, and Address the Original Claim

We have an H1 claim and we reject H0, so we conclude that there is sufficient evidence to support the claim that the new treatment is effective.

Below is a demonstration of how to work through the blood pressure example using R.

## create objects for any information we have
alpha <- 0.05; mu_0 <- 150
x_bar <- 147.2; s <- 5.5; n <- 25

## Test statistic
(t_test <- (x_bar - mu_0) / (s / sqrt(n))) 
[1] -2.545455
## Critical value
(t_cri <- qt(alpha, df = n - 1)) 
[1] -1.710882
## p-value
(p_val <- pt(t_test, df = n - 1)) 
[1] 0.008878158

The critical value is −tα,n−1, or the quantile such that P(Tn−1 < −tα,n−1) = α. Therefore, we use qt() to get this t value. Notice that the p-value is the probability that a Student’s t variable with n − 1 degrees of freedom is smaller (more extreme) than the test statistic. In R, we use pt() to get this probability. Without specifying the lower.tail argument, by default both qt() and pt() focus on the lower (left) tail, which is what we need in this left-tailed test.

Below is a demonstration of how to work through the blood pressure example using Python.

import numpy as np
from scipy.stats import t
## create objects to be used
alpha = 0.05; mu_0 = 150
x_bar = 147.2; s = 5.5; n = 25

## Calculate the t-test statistic
t_test = (x_bar - mu_0) / (s / np.sqrt(n))
t_test
-2.5454545454545556
## Calculate the critical t value
t_crit = t.ppf(alpha, df=n-1)
t_crit
-1.7108820799094282
## Calculate the p-value
p_val = t.cdf(t_test, df=n-1)
p_val
0.008878157746280955

The critical value is −tα,n−1, or the quantile such that P(Tn−1 < −tα,n−1) = α. Therefore, we use t.ppf() to get this t value. Notice that the p-value is the probability that a Student’s t variable with n − 1 degrees of freedom is smaller (more extreme) than the test statistic. In Python, we use t.cdf() to get this probability. Both t.ppf() and t.cdf() focus on the lower (left) tail, which is what we need in this left-tailed test.

16.3 Example: Two-tailed z-test

The price of a gallon of 2% milk is normally distributed with a standard deviation of $0.10. Last week the mean price of a gallon of milk was $2.78. This week, based on a sample of size 25, the sample mean price of a gallon of milk was x̄ = $2.80. Under α = 0.05, determine if the mean price is different this week.

Source: https://unsplash.com/photos/BYlHH_1j2GA

Step-by-Step

Step 1: Set the H0 and H1 from a Claim

From the sentence “determine if the mean price is different this week”, we know the claim, or what we are interested in, is an H1 claim. If we let μ be the mean milk price this week, we have the test H0: μ = 2.78 vs. H1: μ ≠ 2.78, where 2.78 is the mean milk price last week.

Step 2: Set the Significance Level α

α=0.05

Step 3: Calculate the Test Statistic

From the question we know that the population is normally distributed and σ is known. So we use the z test, and the test statistic is ztest = (x̄ − μ0)/(σ/√n) = (2.80 − 2.78)/(0.1/√25) = 1.00.

Step 4-c: Find the Critical Value

Since it is a two-tailed test, we have two critical values. Because ztest > 0 and lies on the right-hand side of the standard normal distribution, we compare it with the critical value on the right, which is z0.05/2 = z0.025 = 1.96.

Step 5-c: Draw a Conclusion Using Critical Value

This is a two-tailed test, and we reject H0 if |ztest|>zα/2. Since |ztest|=1<1.96=zα/2, we DO NOT reject H0.

Step 4-p: Find the P-Value

This is a two-tailed test, and the test statistic is on the right (>0), so the p-value is 2P(Z>ztest)= 0.317 .

Step 5-p: Draw a Conclusion Using P-Value Method

We reject H0 if p-value < α. Since p-value =0.317>0.05=α, we DO NOT reject H0.

The critical-value and p-value methods are illustrated in Figure 16.8.

Figure 16.8: Illustration of Critical Value and P-Value methods

Step 6: Restate the Conclusion in Nontechnical Terms, and Address the Original Claim

There is insufficient evidence to support the claim that this week the mean price of milk is different from the price last week.

Below is an example of how to perform the two-tailed z-test in R.

## create objects to be used
alpha <- 0.05; mu_0 <- 2.78; 
x_bar <- 2.8; sigma <- 0.1; n <- 25

## Test statistic
(z_test <- (x_bar - mu_0) / (sigma / sqrt(n))) 
[1] 1
## Critical value
(z_crit <- qnorm(alpha/2, lower.tail = FALSE)) 
[1] 1.959964
## p-value
(p_val <- 2 * pnorm(z_test, lower.tail = FALSE)) 
[1] 0.3173105

Below is an example of how to perform the two-tailed z-test in Python.

## create objects to be used
alpha = 0.05; mu_0 = 2.78
x_bar = 2.8; sigma = 0.1; n = 25
## Calculate the z-test statistic
z_test = (x_bar - mu_0) / (sigma / np.sqrt(n))
z_test
1.0000000000000009
from scipy.stats import norm
## Calculate the critical z value
# z_crit = norm.isf(alpha/2)
z_crit = norm.ppf(1 - alpha/2)
z_crit
1.959963984540054
## Calculate the p-value
p_val = 2 * norm.sf(z_test)
p_val
0.3173105078629137

16.4 Testing Summary

Below is a table that summarizes what we have learned about hypothesis testing in this chapter.

                        Numerical Data, σ known                Numerical Data, σ unknown
Parameter of Interest   Population mean μ                      Population mean μ
Test Type               One-sample z test, H0: μ = μ0          One-sample t test, H0: μ = μ0
Confidence Interval     x̄ ± zα/2 · σ/√n                        x̄ ± tα/2,n−1 · s/√n
Test Stat under H0      ztest = (x̄ − μ0)/(σ/√n)                ttest = (x̄ − μ0)/(s/√n)
p-value under H0        H1: μ < μ0: P(Z ≤ ztest)               H1: μ < μ0: P(Tn−1 ≤ ttest)
                        H1: μ > μ0: P(Z ≥ ztest)               H1: μ > μ0: P(Tn−1 ≥ ttest)
                        H1: μ ≠ μ0: 2P(Z ≥ |ztest|)            H1: μ ≠ μ0: 2P(Tn−1 ≥ |ttest|)

16.5 Type I and Type II Errors

It is important to remember that hypothesis testing is not perfect, meaning that we may make a wrong decision or conclusion. After all, the collected evidence may not be able to present the full picture of what the true population distribution is. There are two types of errors we may commit when doing hypothesis testing: Type I error and Type II error.

If in fact H0 is true but we wrongly reject it, we commit a type I error: we shouldn’t have rejected it, but we did. If H0 is false but we don’t reject it, we make a type II error: we should have figured out that H0 does not make sense. The following table tells us when we make a correct decision and when we don’t. In practice, we will not know for certain whether we made the correct decision or one of these two errors, because we never know the truth!

Decision H0 is true H0 is false
Reject H0 Type I error Correct decision
Do not reject H0 Correct decision Type II error

Back to the crime example, where H0: the person is not guilty vs. H1: the person is guilty. We can make a decision table like

Decision Truth is the person innocent Truth is the person guilty
Jury decides the person guilty Type I error Correct decision
Jury decides the person not guilty Correct decision Type II error

Is it worse to wrongly convict an innocent person (type I error) or to let a perpetrator go free (type II error)? Both hugely and negatively impact our society, and if possible, we should make both errors as rarely as possible.

Figure 16.9: Example of Type I and Type II errors (https://www.statisticssolutions.com/wp-content/uploads/2017/12/rachnovblog.jpg)

If you still don’t get the idea of type I and type II errors, Figure 16.9 is a classic example of the two errors. Of course the null hypothesis is “not pregnant”, and the alternative hypothesis is “pregnant”. Claiming that an old man is expecting a baby is a type I error, and saying a pregnant woman is not having a baby is a type II error.

In statistics, the probability of committing the type I error is in fact the significance level α.

α=P(type I error)=P(rejecting H0 when H0 is true)

If the evidence occurs with probability lower than 5%, it will be considered sufficient evidence to reject H0, even though H0 may actually be the true mechanism giving rise to such evidence.
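If you want to see α as a long-run error rate, here is a small simulation sketch. The settings (μ0 = 150, σ = 5.5, n = 25, 10,000 repetitions) are borrowed from the blood pressure example purely for illustration, with σ treated as known so a simple z test applies; the seed is arbitrary.

set.seed(1)   ## arbitrary seed for reproducibility
mu_0 <- 150; sigma <- 5.5; n <- 25; alpha <- 0.05; B <- 10000

reject <- replicate(B, {
  x <- rnorm(n, mean = mu_0, sd = sigma)            ## data generated with H0 true
  z_test <- (mean(x) - mu_0) / (sigma / sqrt(n))    ## z test statistic
  z_test < qnorm(alpha)                             ## left-tailed rejection rule
})
mean(reject)   ## proportion of (wrong) rejections; should be close to alpha = 0.05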

What is the probability of committing a type II error, the probability that we fail to reject H0 when H0 is a false statement? We call this probability β:

β=P(type II error)=P(failing to reject H0 when H0 is false)

α, β and the sample size n are related: if we fix any two of them, the third is automatically determined. We would of course prefer α to be small, since we would not like to conclude in favor of the research hypothesis falsely. But given the sample size, a small α leads to a large β. On the other hand, too small an α would most likely result in no discovery, because we would be too conservative, set the threshold too high, and fail to reject lots of H0 that should be rejected. In practice, we specify α beforehand, and then select an n that is practical, so β is determined.

It would be great if we correctly reject H0 when H0 is actually false. We hope the probability of this is large. That probability is 1 − β, which is called the power of the test. The power depends on the same factors as β does, including the size of α, the sample size, and the true value of the parameter.

16.6 Statistical Power and Choosing Sample Size*

16.6.1 Power of a Hypothesis Test

β = P(type II error) = P(failing to reject H0 when H0 is false). In statistics, 1 − β = P(rejecting H0 when H0 is false) is called the (statistical) power of the test. We hope a test has high power because it means the test is able to correctly reject H0 when H0 is false, just as we hope all criminals get arrested! The power depends on the same factors as β does, including the size of α, the sample size, and the true value of the parameter.

Now let’s learn about type I errors, type II errors, and power through distributions. A right-tailed example, H0: μ = μ0 vs. H1: μ > μ0, is illustrated in Figure 16.10.

The distribution on the top is the distribution under H0, that is, the distribution with mean μ0. Since it is a right-tailed test, we reject H0 in favor of H1 if the test statistic is greater than the critical value on the right of the distribution, indicated by the green vertical line. If we reject H0 when H0 is actually true, we commit a type I error, and the probability of committing a type I error, which is α, is the green area on the right tail of the distribution.

Figure 16.10: Example of Power. Source

Now, suppose the true population mean is μa > μ0. That is, H0 is false and H1 is true, and the true distribution is the blue curve centered at μa at the bottom. However, in reality we usually do not know the exact value of μa. Remember that we make our decision under the mechanism or assumption of H0, so we actually do our hypothesis testing under the distribution depicted by the purple dashed curve. And we know this distribution because μ0 is the hypothesized value of μ we specify when we do the testing. If we fail to reject the false H0, we commit a type II error. The yellow area is the probability of committing a type II error: it is the probability that we fail to reject H0 when H0 is false and H1 is true (μ = μa), which is β. Although we don’t know the true value μa, for this right-tailed test, the region to the left of the critical value is always the non-rejection region. The size of the yellow area, the chance of committing a type II error, is smaller (larger) if μa is much (only slightly) greater than μ0.

Which part represents the power in the figure? The power is 1 − β, which is the green area under the blue density curve at the bottom. Remember that the total area under a probability density curve is 1, and the power is the total area minus β, the yellow area.

16.6.2 Power Calculation

In this section we illustrate how to calculate the type II error rate and the power. Suppose we randomly sampled 36 values from a normally distributed population with σ = 18 and unknown μ. At α = 0.05, we are going to run the right-tailed test H0: μ = μ0 = 50 vs. H1: μ > 50.

Suppose we are sampling from a normal distribution and σ is known. Then the test statistic is Ztest = (X̄ − μ0)/(σ/√n) ∼ N(0, 1) under H0. Since it is a right-tailed test, we reject H0 if ztest > zα = z0.05 = 1.645.

Now we are going to re-express the rejection region in terms of X̄ because it will help us calculate β. The question is, for what values of X̄ will we reject H0? Remember, we reject H0 if the observed test statistic ztest = (x̄ − μ0)/(σ/√n) > 1.645. We can isolate x̄ in the expression by multiplying both sides by σ/√n and adding μ0. Then we have x̄ > μ0 + 1.645·σ/√n = 50 + 1.645·18/√36 = 54.94.

Therefore, having evidence ztest > 1.645 is equivalent to having evidence x̄ > 54.94. Both represent the same rejection region. One expresses the evidence using the standard normal distribution as the benchmark, and the other uses the hypothesized sampling distribution N(50, 18²/36) as the benchmark. The figure below illustrates this idea. We use different scales to measure the same thing.

We are not able to calculate β or the power without knowing or assuming the true value of μ (which is not μ0 = 50), because β is based on the fact that H0 is false. Suppose the true mean is μ = 56. Let’s calculate P(Type II error) = β.

Now β = P(Do not reject H0 | μ = 56).

Because we reject H0 if X̄ > 54.94, it means that we do not reject H0 if X̄ < 54.94. Plugging this information into the probability, we have

β = P(Do not reject H0 | μ = 56) = P(X̄ < 54.94 | μ = 56) = P((X̄ − 56)/(18/√36) < (54.94 − 56)/(18/√36) | μ = 56) = P(Z < −0.355) = 0.361

pnorm(-0.355)
[1] 0.3612948
pnorm(54.94, mean = 56, sd = 18/sqrt(36))
[1] 0.3619193
norm.cdf(-0.355)
0.36129479561631284
norm.cdf(54.94, loc=56, scale=18/np.sqrt(36))
0.3619192793326037

Therefore, Power = P(Reject H0 | μ = 56) = 1 − P(Do not reject H0 | μ = 56) = 1 − β = 0.639

In general, if the hypothesized mean value is μ0, and the true mean value is μ1, for one sample z test, we have the formula for calculating β as follows. The derivation is left as an exercise.

  • For one-tailed tests (either left-tailed or right-tailed),

β(μ1) = P(Z ≤ zα − |μ0 − μ1|/(σ/√n))

  • For two-tailed tests,

β(μ1) = P(Z ≤ zα/2 − |μ0 − μ1|/(σ/√n))

Note again that to compute β, we need the value μ1 of the true mean under H1, which in reality is usually unknown. But we can definitely check how the value of μ1 affects β through this formula.
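The formulas above translate into a short R helper. This is a minimal sketch (the function name beta_z is made up for illustration, not part of base R); the check at the end reproduces the β ≈ 0.36 and power ≈ 0.64 found in the calculation above.

## Type II error rate for a one-sample z test with hypothesized mean mu_0,
## true mean mu_1, known sigma, sample size n, and significance level alpha.
## Use tail = 1 for a one-tailed test and tail = 2 for a two-tailed test.
beta_z <- function(mu_0, mu_1, sigma, n, alpha = 0.05, tail = 1) {
  z_crit <- qnorm(alpha / tail, lower.tail = FALSE)   ## z_alpha or z_{alpha/2}
  pnorm(z_crit - abs(mu_0 - mu_1) / (sigma / sqrt(n)))
}

beta_z(mu_0 = 50, mu_1 = 56, sigma = 18, n = 36)       ## about 0.36
1 - beta_z(mu_0 = 50, mu_1 = 56, sigma = 18, n = 36)   ## power, about 0.64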

16.6.3 Power Analysis

Back to the milk price example in Section 16.3, we have the test

H0: μ = 2.78 vs. H1: μ ≠ 2.78, and we do not reject H0.

The question here is

  • If |μ0 − μ1| = |2.78 − μ1| ≥ 0.05, is the conclusion that the price has not changed reasonable or acceptable? Let’s see the chance of making the wrong decision.

We check the probability that we do not reject H0 when H0 is false, i.e., the probability that we conclude that the mean milk price has not changed, but it actually did.

Since it is a two-tailed test and |μ0 − μ1| ≥ 0.05, the type II error rate can be as high as β = P(Z ≤ zα/2 − |μ0 − μ1|/(σ/√n)) ≤ P(Z ≤ 1.96 − 0.05/(0.1/√25)) = P(Z < −0.54) = 0.2946. If the actual mean milk price this week is μ1 = 2.83 or μ1 = 2.73 (since |μ0 − μ1| = 0.05), there is about a 3 in 10 chance of making a wrong conclusion. If we think the risk is too high, we need to collect more than 25 samples. With a fixed α, we can increase the testing power, or decrease the type II error rate, by increasing the sample size. This leaves a question: how do we choose a sample size that keeps β at some low level?
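Here is a quick sketch of this calculation in R, reusing the beta_z() helper sketched earlier (with tail = 2 for the two-tailed milk test); the larger sample sizes are included only to show how β shrinks as n grows.

## beta when |mu_0 - mu_1| = 0.05 and n = 25 (about 0.29, as computed above)
beta_z(mu_0 = 2.78, mu_1 = 2.83, sigma = 0.1, n = 25, alpha = 0.05, tail = 2)
## beta drops quickly as the sample size increases
sapply(c(25, 50, 100), function(n) beta_z(2.78, 2.83, 0.1, n, alpha = 0.05, tail = 2))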

Let’s go back to the formula of β.

β(μ1) = P(Z ≤ zα − |μ0 − μ1|/(σ/√n))

What will increase the power (decrease β)?

  • μ0 and μ1 are further apart: When μ0 and μ1 are far apart, the evidence that μ is not μ0 is stronger, and it is more likely we reject H0 when it is false. The chance of making a type II error decreases. Look how β decreases from the blue area to the red area when the true mean value increases from 56 to 60.

  • Larger α: α and β trade off against each other. Everything else held constant, if you increase α, you use smaller critical values and allow more type I errors or false discoveries. But at the same time, rejecting H0 more often decreases the cases where we don’t reject H0 when it is false; in other words, the type II error rate goes down. The figure below shows how β changes when α is increased from 0.05 to 0.2.

  • Smaller σ: When σ is small, given the same locations of the H0 and H1 distributions, or the same |μ1 − μ0|, the two distributions are more separated and have a smaller overlapping region. The figure below shows how β shrinks from the blue area to the red area when σ becomes smaller, because the two distributions become peaky and thin-tailed.

  • Larger sample size n: When the sample size gets large, more information is collected, and therefore the sampling distribution of the sample mean becomes more concentrated. This has the same effect as having a smaller σ.

To keep both α and β at specified levels, the only way is to increase the sample size. With Δ = |μ0 − μ1|, the formulas for the required sample size are shown below. The derivation is left as an exercise.

  • One-tailed test (either left-tailed or right-tailed): n ≥ σ²(zα + zβ)²/Δ²

  • Two-tailed test: n ≥ σ²(zα/2 + zβ)²/Δ²

Example: Sample size

A cereal company sells boxes of cereal with a labeled weight of 16 oz. Production is based on a mean weight of 16.37 oz, so that only a small portion of boxes weigh less than 16 oz. The box weight is normally distributed with σ = 0.225, and the percentage of boxes weighing less than 16 oz is 5%. The company suspects that, due to some production defect, the weight filled in the boxes has a mean less than 16.37, and would like to conduct a test at α = 0.05.

This is an H1 claim, and the test is H0: μ = 16.37 vs. H1: μ < 16.37

Question: How many boxes should be sampled in order to correctly discover that the mean is less than 16.37 with power 0.99, if in fact the true mean weight is 16.27 or less?

  • α = 0.05, σ = 0.225. We have β = 1 − power = 0.01, and Δ = |μ0 − μ1| = 16.37 − 16.27 = 0.1.

  • The formula is n ≥ σ²(zα + zβ)²/Δ².

  • zα=z0.05=1.645, zβ=z0.01=2.33.

  • Thus, n ≥ (0.225²)(1.645 + 2.33)²/(0.1²) = 79.99.

  • They need at least 80 samples to conduct the test under the specified conditions (α=0.05, β=0.01).
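As a quick check of this arithmetic, here is a one-off R sketch of the sample size formula; using exact z quantiles gives a value slightly below the 79.99 obtained with the rounded z0.01 = 2.33, but the answer of at least 80 boxes is the same.

sigma <- 0.225; Delta <- 0.1; alpha <- 0.05; beta <- 0.01
## one-tailed formula: n >= sigma^2 * (z_alpha + z_beta)^2 / Delta^2
n_req <- sigma^2 * (qnorm(alpha, lower.tail = FALSE) + qnorm(beta, lower.tail = FALSE))^2 / Delta^2
ceiling(n_req)   ## n_req is about 79.84, so sample at least 80 boxes
[1] 80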

16.7 Exercises

  1. Here are summary statistics for randomly selected weights of newborn boys: n=207, x¯=30.2hg (1hg = 100 grams), s=7.3hg.

    1. With significance level 0.01, use the critical value method to test the claim that the population mean of birth weights of newborn boys is greater than 30hg.
    2. Do the test in part (a) by using the p-value method.
  2. You are given the following hypotheses: H0: μ = 45 vs. HA: μ ≠ 45. We know that the sample standard deviation is 5 and the sample size is 24. For what sample mean would the p-value be equal to 0.05? Assume that all conditions necessary for inference are satisfied.

  3. Our one-sample z test is H0: μ = μ0 vs. H1: μ < μ0 with a significance level α.

    1. Describe how we reject H0 using the critical-value method and the p-value method.
    2. Why do the two methods lead to the same conclusion?

  1. Hypothesis testing is also called Null Hypothesis Statistical Testing (NHST), statistical testing or test of significance.↩︎