16 Hypothesis Testing
In the previous chapter we finished the discussion of estimation, interval estimation in particular. The other part of statistical inference is hypothesis testing. In this chapter, we first discuss the meaning of a hypothesis in statistical analysis, followed by the testing procedures for the population mean $\mu$.
16.1 Introduction
What is Hypothesis Testing?
In statistics, a hypothesis is a claim or statement about a property of a population, often the value of a population distribution parameter. For example,
- The mean body temperature of humans is less than $98.6^\circ$F. Here the mean body temperature is a property or characteristic of the target population, human beings. We can turn the verbal claim into a brief mathematical expression: $\mu < 98.6$.
- Marquette students’ IQ scores have a standard deviation equal to 15. The IQ score standard deviation is a characteristic of the population of Marquette students. Mathematically, we can write the claim as $\sigma = 15$.
You can see that we usually focus on claims about a population distribution parameter.
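The null hypothesis, denoted $H_0$, is the claim that is assumed to be true at the start of the test. It always contains the equality sign.
The alternative hypothesis, denoted $H_1$ (or $H_a$), is the claim that competes with $H_0$. We side with it only when the evidence against $H_0$ is strong enough.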
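Let’s do one more exercise. Is the statement “On average, Marquette students consume less than 3 drinks per week” an $H_0$ claim or an $H_1$ claim? It is a strict inequality, $\mu < 3$, with no equality sign, so it is an $H_1$ claim.
So what is hypothesis testing? Hypothesis testing¹ is a procedure to decide whether or not to reject $H_0$ based on the evidence provided by the sample data.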
Example
Before we jump into the formal hypothesis testing procedure, let’s talk about a criminal charge example. How a criminal is convicted is similar to the formal testing procedure.
Suppose a person is charged with a crime, and a jury will decide whether the person is guilty or not. We all know the rule: even though the person is charged with the crime, at the beginning of the trial the accused is assumed to be innocent until the jury declares otherwise. Only if overwhelming evidence of the person’s guilt can be shown is the jury expected to declare the person guilty; otherwise the person is considered not guilty.
If we want to make a claim about whether the person is guilty or not, what are our $H_0$ and $H_1$?
- $H_0$: The person is not guilty 🙂
This is how we write a hypothesis: start with $H_0$ (or $H_1$), followed by a colon and the claim.
- $H_1$: The person is guilty 😟
In the example, the evidence could be photos, videos, witnesses, fingerprints, DNA, and so on. How do we decide whether to keep $H_0$ or to reject it in favor of $H_1$?
Please go through the entire criminal charge process again:
The process is quite similar to the formal procedure of hypothesis testing.
16.2 How to Formally Do a Statistical Hypothesis Testing
The entire hypothesis testing procedure can be wrapped up in the following six steps. No worries if they look unfamiliar. We will learn them step by step using a test for the population mean $\mu$.
- Step 0: Check Method Assumptions
- Step 1: Set the $H_0$ and $H_1$ in Symbolic Form from a Claim
- Step 2: Set the Significance Level $\alpha$
- Step 3: Calculate the Test Statistic (Evidence)
- Decision Rule I: Critical Value Method
  - Step 4-c: Find the Critical Value
  - Step 5-c: Draw a Conclusion Using Critical Value Method
- Decision Rule II: P-Value Method
  - Step 4-p: Find the P-Value
  - Step 5-p: Draw a Conclusion Using P-Value Method
- Step 6: Restate the Conclusion in Nontechnical Terms, and Address the Original Claim
Let’s look at this example: Is the New Treatment Effective?
A population of patients with hypertension is normal and has mean blood pressure (BP) of 150. After 6 months of treatment, the BP of 25 patients from this population was recorded. The sample mean BP is $\bar{x} = 147.2$ and the sample standard deviation is $s = 5.5$.
Our goal is to determine whether a new treatment is effective in reducing BP. Let’s learn the testing procedure step by step using this example.
Step 0: Check Method Assumptions
Any statistical method is based on some assumptions. To use the method and analyze our data appropriately, we have to make sure that the assumptions are satisfied. In this book, most of the distribution-based methods require
- a random sample;
- a normally distributed population and/or a sufficiently large sample size $n$.
Example Step 0: Check Method Assumptions
- From the question description, the population of patients with hypertension is normal.
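Step 1: Set the $H_0$ and $H_1$ in Symbolic Form from a Claim
The first step of testing is to understand the claim and express $H_0$ and $H_1$ in symbolic form. Here are some examples:
- 🧑🏫 The mean IQ score of statistics professors is higher than 120.
  - $H_0: \mu = 120$; $H_1: \mu > 120$
- 💵 The mean starting salary for Marquette graduates who didn’t take MATH 4720 is less than $60,000.
  - $H_0: \mu = 60000$; $H_1: \mu < 60000$
- 📺 The mean time between uses of a TV remote control by males during commercials equals 5 sec.
  - $H_0: \mu = 5$; $H_1: \mu \ne 5$
Keep in mind that the equality sign is always put in $H_0$.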
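Example Step 1: Set the $H_0$ and $H_1$ in Symbolic Form from a Claim
The claim that the new treatment is effective in reducing BP means the mean BP is less than 150, which is an $H_1$ claim. We therefore test $H_0: \mu = 150$ against $H_1: \mu < 150$,
where $\mu$ is the mean BP of patients after the treatment.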
Step 2: Set the Significance Level
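Next, we set the significance level $\alpha$, which determines how rare or unlikely our evidence must be before we treat it as sufficient evidence against $H_0$.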
Here is the idea. When we want to see if what we care about (the population parameter) is not as described as in the null hypothesis
Let’s explain
Because
Figure 16.1 illustrates the significance level $\alpha$.
The entire rationale is the rare event rule.
The level
Example Step 2: Set the Significance Level
There is no $\alpha$ specified in the question, so we use the conventional level $\alpha = 0.05$.
Step 3: Calculate the Test Statistic
Setting
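The evidence used in hypothesis testing is called the test statistic: a sample statistic value used in making a decision about $H_0$. It is computed under the assumption that $H_0: \mu = \mu_0$ is true.
When $\sigma$ is known, the test statistic is $z_{test} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$.
When $\sigma$ is unknown, the test statistic is $t_{test} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$.
Familiar with them? Those are the $z$ score and the $t$ score of the sample mean $\bar{x}$, evaluated at $\mu = \mu_0$.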
Example Step 3: Calculate the Test Statistic
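Since we don’t know the true $\sigma$, we use the $t$ test statistic with the sample standard deviation $s$: $t_{test} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{147.2 - 150}{5.5/\sqrt{25}} = -2.55$.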
Step 4-c: Find the Critical Value
In this step, we set the decision rule. There are two methods in testing, the critical-value method and the p-value method. The two methods are equivalent, leading to the same decision and conclusion. Let’s first talk about the critical-value method.
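In step 2, we set the significance level $\alpha$. The critical value is the quantile determined by $\alpha$ that marks where the rejection region begins.
Which critical value to use depends on whether our test is right-tailed, left-tailed, or two-tailed. The right-tailed test, or right-sided test, is the test with $H_1: \mu > \mu_0$. The left-tailed test has $H_1: \mu < \mu_0$, and the two-tailed test has $H_1: \mu \ne \mu_0$.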
Figure 16.2 illustrates rejection regions for the different types of hypothesis tests. Let’s assume
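The following table summarizes the critical values under different cases. When $\sigma$ is known we use $z$ critical values; when $\sigma$ is unknown we use $t$ critical values with $n - 1$ degrees of freedom.
Condition | Right-tailed | Left-tailed | Two-tailed |
---|---|---|---|
$\sigma$ known | $z_{\alpha}$ | $-z_{\alpha}$ | $-z_{\alpha/2}$ and $z_{\alpha/2}$ |
$\sigma$ unknown | $t_{\alpha, n-1}$ | $-t_{\alpha, n-1}$ | $-t_{\alpha/2, n-1}$ and $t_{\alpha/2, n-1}$ |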
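The critical values for each case can also be looked up in software. The following is a minimal sketch, assuming SciPy is available; the helper name `critical_value` and the tail labels `"right"`, `"left"`, `"two"` are illustrative choices, not part of the chapter.

```python
from scipy.stats import norm, t

def critical_value(alpha, tail, df=None):
    """Critical value(s) for a z test (df=None) or a t test with df degrees of freedom.

    `tail` is one of "right", "left", "two" (hypothetical labels for this sketch).
    """
    dist = norm() if df is None else t(df)
    if tail == "right":
        return dist.isf(alpha)            # upper-tail quantile
    if tail == "left":
        return dist.ppf(alpha)            # lower-tail quantile
    if tail == "two":
        upper = dist.isf(alpha / 2)       # split alpha between both tails
        return (-upper, upper)
    raise ValueError("tail must be 'right', 'left', or 'two'")

print(critical_value(0.05, "left", df=24))   # left-tailed t test, n = 25
print(critical_value(0.05, "two"))           # two-tailed z test
```

Passing `df=None` switches to the standard normal distribution, which mirrors the known-$\sigma$ versus unknown-$\sigma$ split.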
Example Step 4-c: Find the Critical Value
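Since the test is a left-tailed test, and $\sigma$ is unknown with $\alpha = 0.05$ and $n = 25$, the critical value is $-t_{\alpha, n-1} = -t_{0.05, 24} = -1.71$.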
Step 5-c: Draw a Conclusion Using Critical Value
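The critical value separates the possible test statistic values into the rejection region and the non-rejection region. For a right-tailed test, the rejection region is any value greater than the critical value; for a left-tailed test, any value less than the (negative) critical value; for a two-tailed test, any value beyond either of the two critical values.
If the test statistic falls in the rejection region, we reject $H_0$. Otherwise, we do not reject $H_0$.
The rejection region for any type of test is shown in the table below.
Condition | Right-tailed | Left-tailed | Two-tailed |
---|---|---|---|
$\sigma$ known | $z_{test} > z_{\alpha}$ | $z_{test} < -z_{\alpha}$ | $|z_{test}| > z_{\alpha/2}$ |
$\sigma$ unknown | $t_{test} > t_{\alpha, n-1}$ | $t_{test} < -t_{\alpha, n-1}$ | $|t_{test}| > t_{\alpha/2, n-1}$ |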
Remember that a test statistic works as our evidence, and the critical value is a threshold to determine whether the evidence is strong enough. When the test statistic is more extreme than the critical value, it means that from our point of view, the chance of our evidence happening is way too small given the current rules of the game, that is, under $H_0$. We therefore conclude that $H_0$ is implausible and reject it.
Example Step 5-c: Draw a Conclusion Using Critical Value
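We reject $H_0$ if $t_{test} < -t_{\alpha, n-1}$. Since $t_{test} = -2.55 < -1.71 = -t_{0.05, 24}$, we reject $H_0$.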
Step 4-p: Find the P-Value
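Another decision rule is the p-value method. The $p$-value is the probability of getting a test statistic value at least as extreme as the one obtained from the data, assuming that $H_0$ is true.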
P-Value Illustration
Since p-value is a probability, in the distribution, it represents the area under the density curve for values that are at least as extreme as the test statistic’s value. Figure 16.4 shows the p-value for different tests. Note that the p-value for a two-tailed test depends on whether the test statistic is positive or negative. If the calculated test statistic is on the right (left) hand side, the p-value will be the right (left) tail area times two.
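Mathematically, the p-value for any type of test is shown in the table below, where $Z \sim N(0, 1)$ and $T_{n-1}$ follows the $t$ distribution with $n - 1$ degrees of freedom.
Condition | Right-tailed | Left-tailed | Two-tailed |
---|---|---|---|
$\sigma$ known | $P(Z > z_{test})$ | $P(Z < z_{test})$ | $2P(Z > |z_{test}|)$ |
$\sigma$ unknown | $P(T_{n-1} > t_{test})$ | $P(T_{n-1} < t_{test})$ | $2P(T_{n-1} > |t_{test}|)$ |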
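The three tail cases translate directly into tail areas of the reference distribution. Below is a minimal sketch, assuming SciPy; the function name `p_value` and its `tail` labels are hypothetical and chosen only for this illustration.

```python
from scipy.stats import norm, t

def p_value(test_stat, tail, df=None):
    """p-value of a z test (df=None) or a t test with df degrees of freedom."""
    dist = norm() if df is None else t(df)
    if tail == "right":
        return dist.sf(test_stat)              # area to the right of the statistic
    if tail == "left":
        return dist.cdf(test_stat)             # area to the left of the statistic
    if tail == "two":
        return 2 * dist.sf(abs(test_stat))     # double the tail beyond |statistic|
    raise ValueError("tail must be 'right', 'left', or 'two'")

# Blood pressure example: left-tailed t test with n = 25
t_test = (147.2 - 150) / (5.5 / 25 ** 0.5)
print(p_value(t_test, "left", df=24))
```

Using `abs(test_stat)` in the two-tailed branch handles both a positive and a negative test statistic with one expression.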
Example Step 4-p: Find the P-Value
This is a left-tailed test, so the $p$-value is $P(T_{n-1} < t_{test}) = P(T_{24} < -2.55) \approx 0.0089$.
Step 5-p: Draw a Conclusion Using P-Value Method
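How do we use the p-value to make the decision? Well, here the p-value is like our evidence, and the significance level $\alpha$ is the threshold that decides whether the evidence is strong enough. We reject $H_0$ if the $p$-value is smaller than $\alpha$; otherwise we do not reject $H_0$.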
Yes, it is a pretty simple decision rule, but the
Example Step 5-p: Draw a Conclusion Using P-Value Method
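We reject $H_0$ if the $p$-value is smaller than $\alpha$. Since the $p$-value $0.0089 < 0.05 = \alpha$, we reject $H_0$.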
Both Methods Lead to the Same Conclusion
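Remember I said that both the critical-value method and the $p$-value method lead to the same conclusion. That is because the following three statements are equivalent:
- the test statistic is in the rejection region;
- the test statistic is more extreme than the critical value;
- the $p$-value is smaller than $\alpha$.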
The following distribution shows the equivalence of the critical-value method and the p-value method in the blood pressure example.
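The equivalence can also be checked numerically. The sketch below, assuming NumPy and SciPy, applies both decision rules of a left-tailed $t$ test to a grid of candidate test statistics and confirms they agree; the helper `decisions_agree` and the grid are illustrative choices.

```python
import numpy as np
from scipy.stats import t

def decisions_agree(alpha, df, test_stats):
    """Return True if both decision rules agree for every candidate statistic."""
    t_crit = t.ppf(alpha, df)                     # left-tailed critical value
    by_critical_value = test_stats < t_crit       # statistic in the rejection region?
    by_p_value = t.cdf(test_stats, df) < alpha    # p-value smaller than alpha?
    return bool(np.all(by_critical_value == by_p_value))

grid = np.linspace(-4, 4, 81)                     # candidate test statistics
print(decisions_agree(alpha=0.05, df=24, test_stats=grid))
```

The two rules agree because the cumulative distribution function is increasing: a statistic lies to the left of the $\alpha$ quantile exactly when its left-tail area is below $\alpha$.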
Step 6: Restate the Conclusion in Nontechnical Terms, and Address the Original Claim
The final step in the entire hypothesis testing procedure is to make a verbal conclusion, and address the original claim. Figure 16.6 gives you a guideline of how we make a conclusion.
Here is a reminder. We never say we accept $H_0$. We either reject $H_0$ or fail to reject $H_0$.
Example Step 6: Restate the Conclusion in Nontechnical Terms, and Address the Original Claim
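We have an $H_1$ claim and we reject $H_0$, so there is sufficient evidence to support the claim that the new treatment is effective in reducing BP.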
Below is a demonstration of how to work through the blood pressure example using R.
## create objects for any information we have
alpha <- 0.05; mu_0 <- 150
x_bar <- 147.2; s <- 5.5; n <- 25
## Test statistic
(t_test <- (x_bar - mu_0) / (s / sqrt(n)))
[1] -2.545455
## Critical value
(t_cri <- qt(alpha, df = n - 1))
[1] -1.710882
## p-value
(p_val <- pt(t_test, df = n - 1))
[1] 0.008878158
The critical value is $-t_{0.05, 24}$, a quantile of the $t$ distribution, so we use qt() to get the value. The p-value is a probability, so we use pt() to get the probability. Without specifying the lower.tail argument, both qt() and pt() focus on the lower (left) tail by default, which is what we need in this left-tailed test.
Below is a demonstration of how to work through the blood pressure example using Python.
import numpy as np
from scipy.stats import t
## create objects to be used
alpha = 0.05; mu_0 = 150
x_bar = 147.2; s = 5.5; n = 25
## Calculate the t-test statistic
t_test = (x_bar - mu_0) / (s / np.sqrt(n))
t_test
-2.5454545454545556
## Calculate the critical t value
t_crit = t.ppf(alpha, df=n-1)
t_crit
-1.7108820799094282
## Calculate the p-value
p_val = t.cdf(t_test, df=n-1)
p_val
0.008878157746280955
The critical value is $-t_{0.05, 24}$, a quantile of the $t$ distribution, so we use t.ppf() to get the value. The p-value is a probability, so we use t.cdf() to get the probability. Both t.ppf() and t.cdf() focus on the lower (left) tail, which is what we need in this left-tailed test.
16.3 Example: Two-tailed z-test
The price of a gallon of 2% milk is normally distributed with a standard deviation of $0.10. Last week the mean price of a gallon of milk was 2.78. This week, based on a sample of size 25, the sample mean price of a gallon of milk was 2.80. Under $\alpha = 0.05$, determine if the mean price is different this week.
Step-by-Step
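Step 1: Set the $H_0$ and $H_1$ in Symbolic Form from a Claim
From the sentence “determine if the mean price is different this week”, we know the claim, or what we are interested in, is an $H_1$ claim. With $\mu$ denoting the mean price of a gallon of milk this week, we test $H_0: \mu = 2.78$ against $H_1: \mu \ne 2.78$.
Step 2: Set the Significance Level
$\alpha = 0.05$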
Step 3: Calculate the Test Statistic
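From the question we know that the population is normally distributed, and $\sigma = 0.10$ is known, so we use the $z$ test statistic: $z_{test} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{2.8 - 2.78}{0.1/\sqrt{25}} = 1$.
Step 4-c: Find the Critical Value
Since it is a two-tailed test, we have two potential critical values. Because the test statistic is positive, we compare it with the upper critical value $z_{\alpha/2} = z_{0.025} = 1.96$.
Step 5-c: Draw a Conclusion Using Critical Value
This is a two-tailed test, and we reject $H_0$ if $|z_{test}| > z_{\alpha/2}$. Since $|z_{test}| = 1 < 1.96 = z_{\alpha/2}$, we do not reject $H_0$.
Step 4-p: Find the P-Value
This is a two-tailed test, and the test statistic is on the right, so the $p$-value is $2P(Z > z_{test}) = 2P(Z > 1) \approx 0.317$.
Step 5-p: Draw a Conclusion Using P-Value Method
We reject $H_0$ if the $p$-value is smaller than $\alpha$. Since the $p$-value $0.317 > 0.05 = \alpha$, we do not reject $H_0$.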
The critical-value and p-value method are illustrated in Figure 16.8.
Step 6: Restate the Conclusion in Nontechnical Terms, and Address the Original Claim
There is insufficient evidence to support the claim that this week the mean price of milk is different from the price last week.
Below is an example of how to perform the two-tailed $z$-test in R.
## create objects to be used
alpha <- 0.05; mu_0 <- 2.78;
x_bar <- 2.8; sigma <- 0.1; n <- 25
## Test statistic
(z_test <- (x_bar - mu_0) / (sigma / sqrt(n)))
[1] 1
## Critical value
(z_crit <- qnorm(alpha/2, lower.tail = FALSE))
[1] 1.959964
## p-value
(p_val <- 2 * pnorm(z_test, lower.tail = FALSE))
[1] 0.3173105
Below is an example of how to perform the two-tailed $z$-test in Python.
import numpy as np
from scipy.stats import norm
## create objects to be used
alpha = 0.05; mu_0 = 2.78
x_bar = 2.8; sigma = 0.1; n = 25
## Calculate the z-test statistic
z_test = (x_bar - mu_0) / (sigma / np.sqrt(n))
z_test
1.0000000000000009
## Calculate the critical z value
# z_crit = norm.isf(alpha/2)
z_crit = norm.ppf(1 - alpha/2)
z_crit
1.959963984540054
## Calculate the p-value
p_val = 2 * norm.sf(z_test)
p_val
0.3173105078629137
16.4 Testing Summary
Below is a table that summarizes what we have learned about hypothesis testing in this chapter.
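 | Numerical Data, $\sigma$ known | Numerical Data, $\sigma$ unknown |
---|---|---|
Parameter of Interest | Population Mean $\mu$ | Population Mean $\mu$ |
Test Type | One sample $z$ test | One sample $t$ test |
Confidence Interval | $\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ | $\bar{x} \pm t_{\alpha/2, n-1}\frac{s}{\sqrt{n}}$ |
Test Stat under $H_0$ | $z_{test} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ | $t_{test} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ |
Reject $H_0$ if $H_1: \mu > \mu_0$ | $z_{test} > z_{\alpha}$ | $t_{test} > t_{\alpha, n-1}$ |
Reject $H_0$ if $H_1: \mu < \mu_0$ | $z_{test} < -z_{\alpha}$ | $t_{test} < -t_{\alpha, n-1}$ |
Reject $H_0$ if $H_1: \mu \ne \mu_0$ | $|z_{test}| > z_{\alpha/2}$ | $|t_{test}| > t_{\alpha/2, n-1}$ |
Reject $H_0$ ($p$-value method) | $p$-value $< \alpha$ | $p$-value $< \alpha$ |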
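The confidence interval and the two-tailed test are two views of the same calculation: at the same $\alpha$, the interval contains $\mu_0$ exactly when the two-tailed test does not reject $H_0$. Here is a minimal sketch for the milk price example, assuming SciPy; the helper name `z_interval` is hypothetical.

```python
from math import sqrt
from scipy.stats import norm

def z_interval(x_bar, sigma, n, alpha=0.05):
    """Two-sided (1 - alpha) confidence interval for the mean when sigma is known."""
    margin = norm.isf(alpha / 2) * sigma / sqrt(n)   # z_{alpha/2} * standard error
    return x_bar - margin, x_bar + margin

lower, upper = z_interval(x_bar=2.8, sigma=0.1, n=25)
print(round(lower, 4), round(upper, 4))
print(lower <= 2.78 <= upper)    # is the hypothesized mean inside the interval?
```

The hypothesized value 2.78 falls inside the interval, which matches the decision not to reject $H_0$ in Section 16.3.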
16.5 Type I and Type II Errors
It is important to remember that hypothesis testing is not perfect, meaning that we may make a wrong decision or conclusion. After all, the collected evidence may not be able to present the full picture of what the true population distribution is. There are two types of errors we may commit when doing hypothesis testing: Type I error and Type II error.
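If in fact $H_0$ is true but we wrongly reject it, we commit a Type I error. If in fact $H_0$ is false but we fail to reject it, we commit a Type II error.
Decision | $H_0$ is true | $H_0$ is false |
---|---|---|
Reject $H_0$ | Type I error | Correct decision |
Do not reject $H_0$ | Correct decision | Type II error |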
Back to the crime example that $H_0$: the person is not guilty versus $H_1$: the person is guilty.
Decision | Truth is the person innocent | Truth is the person guilty |
---|---|---|
Jury decides the person guilty | Type I error | Correct decision |
Jury decides the person not guilty | Correct decision | Type II error |
Is it worse to wrongly convict an innocent person (Type I error) or to let a perpetrator free (Type II error)? Both hugely negatively impact our society, and if possible, we should make the two errors as rarely as possible.
If you still don’t get the idea of type I and type II errors, Figure 16.9 is a classical example of the two errors. Of course the null hypothesis is “not pregnant”, and the alternative hypothesis is “pregnant”. Claiming that an old man is expecting a baby is a type I error, and telling a pregnant woman that she is not having a baby is a type II error.
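In statistics, the probability of committing the type I error is in fact the significance level $\alpha$: $\alpha = P(\text{Type I error}) = P(\text{reject } H_0 \mid H_0 \text{ is true})$.
If $\alpha = 0.05$, evidence that occurs with probability lower than 5% under $H_0$ is considered sufficient evidence to reject $H_0$. So when $H_0$ is in fact true, we wrongly reject it about 5% of the time.
What is the probability of committing the type II error, the probability that we fail to reject $H_0$ when it is false? We denote it by $\beta$: $\beta = P(\text{Type II error}) = P(\text{do not reject } H_0 \mid H_0 \text{ is false})$.
It would be great if we correctly reject $H_0$ when it is false. The probability of doing so, $1 - \beta$, is called the power of the test.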
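A small simulation makes the link between $\alpha$ and the type I error rate concrete. The sketch below, assuming NumPy and SciPy, repeatedly samples from a population where $H_0: \mu = 150$ is true and records how often a left-tailed $t$ test rejects. The population standard deviation of 5.5, the seed, and the number of repetitions are assumptions made only for this illustration.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(1)                   # fixed seed for reproducibility
alpha, mu_0, n, reps = 0.05, 150, 25, 20_000
pop_sd = 5.5                                     # assumed population sd (illustrative)

# Each row is one sample of size n drawn while H0 is true
samples = rng.normal(mu_0, pop_sd, size=(reps, n))
x_bar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)                  # sample standard deviation
t_stat = (x_bar - mu_0) / (s / np.sqrt(n))

# Proportion of samples that (wrongly) reject H0 in a left-tailed test
type1_rate = np.mean(t_stat < t.ppf(alpha, df=n - 1))
print(type1_rate)
```

The printed proportion should land close to $\alpha = 0.05$, up to simulation noise.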
16.6 Statistical Power and Choosing Sample Size*
16.6.1 Power of a Hypothesis Test
Now let’s learn type I error, type II error, and power through distributions. Here a right-tailed example
The distribution on the top is the distribution under
Now, suppose the true population mean is
Which part represents the power in the figure? The power is
16.6.2 Power Calculation
In this section we illustrate how to calculate the type II error rate and power. Suppose we randomly sampled 36 values from a normally distributed population with
Suppose we are sampling from a normal distribution and
Now we are going to re-express the rejection region in terms of
Therefore, having evidence
We are not able to calculate
Now
Because we reject
Therefore, Power
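In general, if the hypothesized mean value is $\mu_0$ and the true mean value is $\mu_1$, then
- For one-tailed tests (either left-tailed or right-tailed), $\beta = P\left(Z < z_{\alpha} - \frac{|\mu_1 - \mu_0|}{\sigma/\sqrt{n}}\right)$.
- For two-tailed tests, $\beta \approx P\left(Z < z_{\alpha/2} - \frac{|\mu_1 - \mu_0|}{\sigma/\sqrt{n}}\right)$.
Note that again to compute $\beta$, we need to specify a value $\mu_1$ for the true mean.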
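The type II error rate and power of a $z$ test can be computed directly once a true mean is specified. Below is a minimal sketch, assuming SciPy; the function name `z_test_power` and the numbers in the usage line are hypothetical and are not the values of the chapter's example. For the two-tailed case the sketch uses the exact expression, which also subtracts the small probability of rejecting in the far tail.

```python
from math import sqrt
from scipy.stats import norm

def z_test_power(mu_0, mu_1, sigma, n, alpha=0.05, two_tailed=False):
    """Power of a one-sample z test (sigma known) when the true mean is mu_1.

    For a one-tailed test, mu_1 is assumed to lie on the side stated in H1.
    """
    shift = abs(mu_1 - mu_0) / (sigma / sqrt(n))    # gap measured in standard errors
    if two_tailed:
        z = norm.isf(alpha / 2)
        beta = norm.cdf(z - shift) - norm.cdf(-z - shift)
    else:
        z = norm.isf(alpha)
        beta = norm.cdf(z - shift)
    return 1 - beta                                  # power = 1 - beta

# Hypothetical inputs, for illustration only
power = z_test_power(mu_0=50, mu_1=52, sigma=6, n=36)
print(round(power, 4))
```

When `mu_1` equals `mu_0`, the function returns $\alpha$, which is a useful sanity check: with no true difference, the only way to reject is a type I error.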
16.6.3 Power Analysis
Back to the milk price example in Section 16.3, we have the test $H_0: \mu = 2.78$ versus $H_1: \mu \ne 2.78$.
The question here is
- If
, is the conclusion that price has not changed reasonable or acceptable? Let’s see the chance of making the wrong decision.
We check the probability that we do not reject
Since it is a two-tailed test, and
Let’s go back to the formula of
What will increase the power (decrease $\beta$)?
- $\mu_0$ and $\mu_1$ are further away: When $\mu_0$ and $\mu_1$ are far apart, the evidence that $\mu$ is not $\mu_0$ is stronger, and we are more likely to reject $H_0$ when it is false. The chance of making a type II error decreases. Look how $\beta$ decreases from the blue area to the red area when the true mean value increases from 56 to 60.
- Larger $\alpha$: $\alpha$ and $\beta$ trade off. With everything else held constant, if you increase $\alpha$, you use smaller (less extreme) critical values and allow more type I errors or false discoveries. But at the same time, rejecting more often reduces the cases in which we fail to reject $H_0$ when it is false. In other words, the type II error rate goes down. The figure below shows how $\beta$ changes when $\alpha$ is increased from 0.05 to 0.2.
- Smaller $\sigma$: When $\sigma$ is small, given the same locations of the distributions under $H_0$ and $H_1$, or the same $|\mu_1 - \mu_0|$, the two distributions are more separated and overlap less. The figure below shows how $\beta$ shrinks from the blue area to the red area when $\sigma$ becomes smaller, because the two distributions become more peaked and thinner-tailed.
- Larger sample size $n$: When the sample size gets large, more information is collected, and the sampling distribution of the sample mean becomes more concentrated around the true mean. This has the same effect as having a smaller $\sigma$.
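The four factors can be compared numerically by changing one at a time. The sketch below, assuming SciPy, starts from the milk price setting ($\mu_0 = 2.78$, $\sigma = 0.10$, $n = 25$, $\alpha = 0.05$, two-tailed). The true mean of 2.80 and the alternative settings are hypothetical values picked only for illustration.

```python
from math import sqrt
from scipy.stats import norm

def two_tailed_power(mu_0, mu_1, sigma, n, alpha):
    """Exact power of a two-tailed z test when the true mean is mu_1."""
    shift = abs(mu_1 - mu_0) / (sigma / sqrt(n))
    z = norm.isf(alpha / 2)
    beta = norm.cdf(z - shift) - norm.cdf(-z - shift)
    return 1 - beta

base = dict(mu_0=2.78, mu_1=2.80, sigma=0.10, n=25, alpha=0.05)
variations = {
    "baseline": {},
    "means further apart": {"mu_1": 2.84},
    "larger alpha": {"alpha": 0.20},
    "smaller sigma": {"sigma": 0.05},
    "larger n": {"n": 100},
}
# Override one setting at a time and recompute the power
powers = {name: two_tailed_power(**{**base, **change})
          for name, change in variations.items()}
for name, value in powers.items():
    print(f"{name:>20}: power = {value:.3f}")
```

Halving $\sigma$ and quadrupling $n$ give the same power here because both halve the standard error $\sigma/\sqrt{n}$.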
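To keep $\alpha$ and $\beta$ at specified levels, the required sample size is
One-tailed test (either left-tailed or right-tailed): $n = \frac{\sigma^2 (z_{\alpha} + z_{\beta})^2}{(\mu_1 - \mu_0)^2}$
Two-tailed test: $n \approx \frac{\sigma^2 (z_{\alpha/2} + z_{\beta})^2}{(\mu_1 - \mu_0)^2}$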
Example: Sample size
A cereal company sells boxes of cereal with a labeled weight of 16 oz. Production is based on a mean weight of 16.37 oz, so that only a small portion of boxes weigh less than 16 oz. The box weight is normally distributed with
This is a
Question: How many boxes should be sampled in order to correctly discover that mean is less than 16.37 with the power of 0.99 if in fact the true mean weight is 16.27 or less?
. . We have , .The formula is
, .Thus,
They need at least 80 samples to conduct the test under the specified conditions (
, ).
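The sample size calculation can be sketched in code using the standard one-tailed formula $n = \sigma^2 (z_{\alpha} + z_{\beta})^2 / (\mu_1 - \mu_0)^2$, rounded up. The values $\sigma = 0.225$ and $\alpha = 0.05$ below are assumptions: they are not stated in the surviving text and were chosen because they reproduce the stated answer of 80 boxes. The function name `required_n` is also hypothetical.

```python
from math import ceil
from scipy.stats import norm

def required_n(mu_0, mu_1, sigma, alpha, power, two_tailed=False):
    """Smallest n that reaches the target power for a z test with sigma known."""
    z_alpha = norm.isf(alpha / 2 if two_tailed else alpha)
    z_beta = norm.isf(1 - power)                 # beta = 1 - power
    return ceil(sigma**2 * (z_alpha + z_beta)**2 / (mu_1 - mu_0)**2)

# sigma and alpha are assumed values (see the note above)
n_boxes = required_n(mu_0=16.37, mu_1=16.27, sigma=0.225, alpha=0.05, power=0.99)
print(n_boxes)   # 80 under these assumptions
```

Rounding up with `ceil` guarantees the achieved power is at least the target, since power increases with $n$.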
16.7 Exercises
-
Here are summary statistics for randomly selected weights of newborn boys:
, hg (1hg = 100 grams), hg.- With significance level 0.01, use the critical value method to test the claim that the population mean of birth weights of females is greater than 30hg.
- Do the test in (c) by using the p-value method.
You are given the following hypotheses:
We know that the sample standard deviation is 5 and the sample size is 24. For what sample mean would the p-value be equal to 0.05? Assume that all conditions necessary for inference are satisfied.-
Our one sample
test is with a significance level .- Describe how we reject
using the critical-value method and the -value method. - Why do the two methods lead to the same conclusion?
¹ Hypothesis testing is also called Null Hypothesis Significance Testing (NHST), statistical testing, or test of significance.