15  Hypothesis Testing

We finished the discussion of estimation, interval estimation in particular, in the previous chapter. The other part of statistical inference is hypothesis testing. In this chapter, we first discuss the meaning of a hypothesis in statistical analysis, followed by the testing procedures for the population mean \(\mu\) when \(\sigma\) is known and when \(\sigma\) is unknown. Pay attention to the similarities and differences between estimation and testing.

15.1 Introduction

What is Hypothesis Testing?

In statistics, a hypothesis is a claim or statement about a property of a population, often the value of a population distribution parameter. For example,

  • The mean body temperature of humans is less than \(98.6^{\circ}\) F. Here the mean body temperature is a property or characteristic of the target population, human beings. We can turn the verbal claim into a brief mathematical expression: \(\mu < 98.6\).

  • Marquette students’ IQ scores have a standard deviation equal to 15. The IQ score standard deviation is a characteristic of the population of Marquette students. Mathematically, we can write the claim as \(\sigma = 15\).

You can see that we usually focus on claims about a population distribution parameter.

The null hypothesis, denoted \(H_0\), is a statement that the value of a parameter is equal to some claimed value, or the negation of the alternative hypothesis that will be discussed in a minute. Often \(H_0\) represents a skeptical perspective, a claim to be tested, or the current status of the parameter. For example, the claim “the percentage of Marquette female students loving Japanese food is equal to 80%” is an \(H_0\) claim because of the key word “equal”. Usually we are not very convinced that the \(H_0\) claim is true; in our analysis we want to test the claim and see whether the evidence and information we collect is strong enough to conclude that the percentage is not equal to 80%.

The alternative hypothesis, denoted \(H_1\) or \(H_a\), is a claim that the parameter is less than, greater than, or not equal to some value. It is usually our research hypothesis of some new scientific theory or finding. If we think the percentage of Marquette female students loving Japanese food is greater than 80%, this hypothesis is the \(H_1\) claim. If, after a formal testing procedure, we conclude that the percentage is greater than 80%, we have in effect made a new research discovery that overturns the previous claim or status quo that the percentage is equal to 80%.

Let’s do one more exercise. Is the statement “On average, Marquette students consume less than 3 drinks per week” an \(H_0\) or \(H_1\) claim? Because of the key words “less than”, it is an \(H_1\) claim.

So what is hypothesis testing? Hypothesis testing 1 is a procedure to decide whether or not to reject \(H_0\) based on how much evidence there is against \(H_0\). If the evidence is strong enough, we reject \(H_0\) in favor of \(H_1\).


Example

Before we jump into the formal hypothesis testing procedure, let’s talk about a criminal charge example. How a criminal is convicted is similar to the formal testing procedure.

Suppose a person is charged with a crime, and a jury will decide whether the person is guilty or not. We all know the rule: even though the person is charged with the crime, at the beginning of the trial, the accused is assumed to be innocent until the jury declares otherwise. Only if overwhelming evidence of the person’s guilt can be shown is the jury expected to declare the person guilty; otherwise the person is considered not guilty.

If we want to make a claim about whether the person is guilty or not, what are our \(H_0\) and \(H_1\)? Remember that the null hypothesis represents a skeptical perspective or a claim to be tested, or the current status of the parameter, so we have

  • \(H_0:\) The person is not guilty 🙂

This is how we write a hypothesis: start with \(H_0:\) followed by the statement. Being not guilty is the default status quo of anyone, although the jury may doubt or be skeptical of the person being not guilty. The prosecutors and police detectives try their best to collect evidence strong enough to prove guilt beyond a reasonable doubt to the jury. Therefore the alternative hypothesis is

  • \(H_1:\) The person is guilty 😟

In the example, the evidence could be photos, videos, witnesses, fingerprints, DNA, and so on. How do we decide to keep \(H_0\) or to accept \(H_1\)? After all the evidence, including the defense attorney’s and prosecutor’s arguments, is presented to the jury, the decision rule is the jury’s vote. Finally, to close the case, we need a conclusion, which is the verdict: “guilty” or “not enough evidence to convict”.

Please go through the entire criminal charge process again:

\(H_0\) and \(H_a\) => Evidence => Decision rule => Conclusion

The process is quite similar to the formal procedure for a hypothesis test.

15.2 How to Formally Do a Statistical Hypothesis Testing

The entire hypothesis testing procedure can be wrapped up in the following six steps. No worries if you don’t have any idea what they mean yet. We will learn them step by step using a test for the population mean \(\mu\).

  • Step 0: Check Method Assumptions

  • Step 1: Set the \(H_0\) and \(H_a\) in Symbolic Form from a Claim

  • Step 2: Set the Significance Level \(\alpha\)

  • Step 3: Calculate the Test Statistic (Evidence)

Decision Rule I: Critical Value Method

  • Step 4-c: Find the Critical Value

  • Step 5-c: Draw a Conclusion Using Critical Value Method

Decision Rule II: P-Value Method

  • Step 4-p: Find the P-Value

  • Step 5-p: Draw a Conclusion Using P-Value Method

  • Step 6: Restate the Conclusion in Nontechnical Terms, and Address the Original Claim

Let’s look at this example: Is the New Treatment Effective?

The blood pressure (BP) of a population of patients with hypertension is normally distributed with a mean of 150. After 6 months of treatment, the BP of 25 patients from this population was recorded. The sample mean BP is \(\overline{x} = 147.2\) and the sample standard deviation is \(s = 5.5\).


Our goal is to determine whether a new treatment is effective in reducing BP. Let’s learn the testing procedure step by step using this example.


Step 0: Check Method Assumptions

Any statistical method is based on some assumptions. To use the method, and analyze our data appropriately, we have to make sure that the assumptions are satisfied. In this book, most of the distribution-based methods require

  • Random sample

  • The population is normally distributed and/or the sample size \(n > 30\).


Example Step 0: Check Method Assumptions

  • From the question description, the population of hypertension patients is normally distributed.

Step 1: Set the \(H_0\) and \(H_1\) from a Claim

The first step of testing is to understand the \(H_0\) and \(H_1\) claims, and express them mathematically using population parameters. The following provides three examples.

  • 🧑‍🏫 The mean IQ score of statistics professors is higher than 120.
    • \(\begin{align}&H_0: \mu \le 120 \\ &H_1: \mu > 120 \end{align}\)
  • 💵 The mean starting salary for Marquette graduates who didn’t take MATH 4720 is less than $60,000.
    • \(\begin{align} &H_0: \mu \ge 60000 \\ &H_1: \mu < 60000 \end{align}\)
  • 📺 The mean time between uses of a TV remote control by males during commercials equals 5 sec. 
    • \(\begin{align} &H_0: \mu = 5 \\ &H_1: \mu \ne 5 \end{align}\)

Keep in mind that the equality sign is always put in \(H_0\), and that \(H_0\) and \(H_1\) are mutually exclusive. Also, the claims are about population parameters, not sample statistics. We do not know the value of the parameter being tested, but we want to collect evidence and see which claim about the parameter the evidence supports.

Example Step 1: Set the \(H_0\) and \(H_1\) from a Claim

The claim that the new treatment is effective in reducing BP means the mean BP is less than 150, which is an \(H_1\) claim. So we can write our \(H_0\) and \(H_1\) as

\(\small \begin{align} &H_0: \mu = 150 \\ &H_1: \mu < 150 \end{align}\)

where \(\mu\) is the mean blood pressure.


Step 2: Set the Significance Level \(\alpha\)

Next, we set the significance level \(\alpha\), which determines how rare or unlikely our evidence must be in order to represent sufficient evidence against \(H_0\). It tells us how strong the collected evidence must be in order to overturn the current claim. An \(\alpha\) level of 0.05 implies that evidence occurring with probability lower than 5% will be considered sufficient evidence to reject \(H_0\). Mathematically, \[\alpha = P(\text{Reject } H_0 \mid H_0 \text{ is true})\] As a result, \(\alpha = 0.05\) means that when \(H_0\) is in fact true, we incorrectly reject it about 5 out of every 100 times we collect a sample and run the test.

Here is the idea. When we want to see whether what we care about (the population parameter) is not as described in the null hypothesis \(H_0\), we first assume or believe \(H_0\) is right; then, on that basis, we see whether there is sufficiently strong evidence to conclude that it is probably not the case and that the alternative hypothesis is more reasonable.

Let’s explain \(\alpha\) with an example. Suppose we would like to test the claim that “The mean IQ of statistics professors is greater than 120,” or in short \(H_0: \mu = 120\) vs. \(H_1: \mu > 120\). With a large sample size, we can assume \(\overline{X}\) follows a normal distribution. Now, to test whether the mean IQ is greater than 120, we first treat the mean as not being greater than 120 unless we later have sufficient evidence to say it is greater than 120. In particular, we need to do the test and analysis on the basis that the mean is as specified under \(H_0\). That is, we first assume the mean is 120, or \(\mu = 120\), then see whether the assumption really makes sense.

Because \(\overline{X}\) is normal, we do the test under the assumption that \(\overline{X} \sim N(120, \sigma^2)\) for some \(\sigma^2\), say \(9\). (Let’s focus on \(\mu\) and ignore \(\sigma\) at this moment.) If \(\overline{X}\) has mean 120, and from our sample data we got the sample mean \(\overline{x} = 121\), do you think the claim \(H_0: \mu = 120\) makes sense? How about if you got \(\overline{x} = 127\)? Now \(\alpha\) comes into play. Let me ask you a question. What is the threshold value of the sample mean \(\overline{x}\) beyond which you think it is too large to believe that \(H_0: \mu = 120\) is a reasonable assumption or data-generating mechanism? What is the threshold that makes you start to believe that \(H_1: \mu > 120\) makes more sense than \(H_0: \mu = 120\)? The significance level \(\alpha\) is such a threshold, expressed as a probability. With \(\alpha\) specified, we know the corresponding sample mean threshold \(\overline{x}^*\), which is the value such that \(P(\overline{X} > \overline{x}^*) = \alpha\).

Figure 15.1 illustrates the significance level \(\alpha\). Once we decide \(\alpha\), we determine how rare or unlikely our sample mean \(\overline{x}\) must be in order to represent sufficient evidence against \(H_0: \mu = 120\). In this example, if our \(\overline{x}\) is greater than 125, we would consider the evidence strong enough to conclude that \(\mu = 120\) is not so reasonable, because the chance of such a value happening is no larger than \(\alpha\). We instead think \(H_1: \mu > 120\) makes more sense.

Figure 15.1: Illustration of significance level, alpha.
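If you want to verify this cutoff numerically, here is a quick sketch in R, assuming (as in the illustration above) that \(\overline{X}\) is normal with mean 120 and standard deviation 3 (\(\sigma^2 = 9\)); these values are only for illustration.

## sample-mean threshold x_bar* such that P(X_bar > x_bar*) = alpha,
## assuming X_bar ~ N(120, sd = 3) as in the illustration above
alpha <- 0.05
qnorm(1 - alpha, mean = 120, sd = 3)  ## about 124.9, the cutoff near 125 shown in Figure 15.1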

The entire rationale is the rare event rule.

Rare Event Rule

If, under a given assumption, the probability of a particular observed event is exceptionally small, we conclude that the assumption is probably not correct.

The level \(\alpha\) is related to the \(\alpha\) used in confidence intervals for defining a “critical value”.

Example Step 2: Set the Significance Level \(\alpha\)

There is no \(\alpha\) mentioned in the question description. Usually \(\alpha\) is set by the researchers themselves. Let’s set \(\alpha= 0.05\). This means we are asking, “Is there sufficient evidence at \(\alpha= 0.05\) that the new treatment is effective?”


Step 3: Calculate the Test Statistic

Setting \(\alpha\) is like setting the threshold for determining whether our collected evidence is sufficient or strong enough. In this step, we collect our evidence. The evidence comes from the information we have, which is the sample data. The sample data is the only source we have for inference about the unknown parameter. So to do a test about the parameter, or decide whether a statement about the parameter makes sense, we let the data and evidence speak.

The evidence used in hypothesis testing is called the test statistic: a sample statistic value used in making a decision about \(H_0\). Suppose the test we are interested in is \(H_0: \mu = \mu_0\) vs. \(H_1: \mu < \mu_0\), where \(\mu_0\) is some population mean value that could be 150 lbs, 175 cm, 50 ounces, etc. When computing a test statistic, we assume \(H_0\) is true. Remember, we are trying to see if there is any strong evidence against \(H_0\). We should do our analysis in the world of \(H_0\), the status quo. If the evidence is not sufficient, we stay in our current situation.

When \(\sigma\) is known, the test statistic for tests about \(\mu\) is

\[\boxed{ z_{test} = \frac{\overline{x} - \color{blue}{\mu_0}}{\sigma/\sqrt{n}} }\]

When \(\sigma\) is unknown, the test statistic for tests about \(\mu\) is

\[\boxed{ t_{test} = \frac{\overline{x} - \color{blue}{\mu_0}}{s/\sqrt{n}} }\]

Familiar with them? They are the \(z\) score and the \(t\) score, the sample statistics used for testing. When we calculate the test statistic, we need a value for \(\mu\). What value should we use? You are right if you use the value assumed in \(H_0\)! The test statistics are the evidence we use in testing, and the evidence is collected under the assumption that \(\mu = \mu_0\). We collect evidence to prove that a suspect committed a crime under the assumption that he is innocent, right? We shouldn’t look at anyone through colored spectacles, or frame someone by treating them as a criminal and making up a fake story about what they never did.
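As a small sketch, the two test statistics can be computed directly from summary statistics. The helper names z_stat() and t_stat() below are our own, not from any package.

## z test statistic when sigma is known
z_stat <- function(x_bar, mu_0, sigma, n) (x_bar - mu_0) / (sigma / sqrt(n))
## t test statistic when sigma is unknown and s is used instead
t_stat <- function(x_bar, mu_0, s, n) (x_bar - mu_0) / (s / sqrt(n))
t_stat(x_bar = 147.2, mu_0 = 150, s = 5.5, n = 25)  ## about -2.55, as in the example below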

Example Step 3: Calculate the Test Statistic

Since we don’t know the true \(\sigma\), and only know \(s\), we use the \(t\) distribution, and the test statistic is \[\small t_{test} = \frac{\overline{x} - \mu_0}{s/\sqrt{n}} = \frac{147.2 - 150}{5.5/\sqrt{25}} = -2.55.\] So if the true mean blood pressure is 150, our test statistic or evidence is about 2.55 standard errors below the hypothesized mean. Is this number too weird or uncommon to believe that the mean blood pressure is really 150? We need a decision rule, and that is what we are going to learn in Step 4.


Step 4-c: Find the Critical Value

In this step, we set the decision rule. There are two methods in testing, the critical-value method and the p-value method. The two methods are equivalent, leading to the same decision and conclusion. Let’s first talk about the critical-value method.

In Step 2 we set \(\alpha\), and in Step 3 we collect the evidence. Now we need a way to decide whether the collected evidence is sufficient or not to reject the \(H_0\) claim. A critical value is a value, determined by the significance level \(\alpha\), that separates the rejection region (or critical region), where we reject \(H_0\), from the values of the test statistic that do not lead to the rejection of \(H_0\).

Which critical value is used depends on whether our test is right-tailed, left-tailed, or two-tailed. A right-tailed test, or right-sided test, is a test with \(H_1: \mu > \mu_0\). When we are interested in whether \(\mu\) is greater than some value, say \(\mu_0\), we focus on the right-hand side of the sampling distribution, because the evidence, the test statistic calculated in Step 3, will usually be on the right-hand side of the distribution, and so is the critical value used in the decision rule. Similarly, a left-tailed test, or left-sided test, is a test with \(H_1: \mu < \mu_0\). For a two-tailed or two-sided test, we have \(H_1: \mu \ne \mu_0\). In this case, we wonder whether \(\mu\) is larger or smaller than the assumed \(\mu_0\), so we need to pay attention to both sides of the sampling distribution.

Figure 15.2 illustrates rejection regions for the different types of hypothesis tests. Let’s assume \(\sigma\) is known, as the unknown-\(\sigma\) case is similar and we just replace the \(z\) score with the \(t\) score. Given the significance level \(\alpha\), for a right-tailed test, the critical value is \(z_{\alpha}\), the standard normal quantile such that \(P(Z > z_{\alpha}) = \alpha\), where \(Z \sim N(0, 1)\). For a left-tailed test, the critical value is \(-z_{\alpha}\), or in fact \(z_{1-\alpha}\), the standard normal quantile such that \(P(Z < -z_{\alpha}) = \alpha\) or \(P(Z > -z_{\alpha}) = 1-\alpha\). When the test is two-tailed, there are two critical values, one on the right-hand side and the other on the left-hand side of the distribution. Here, we split \(\alpha\) equally into \(\alpha/2\) in each tail: the critical value on the right-hand side is \(z_{\alpha/2}\) such that \(P(Z > z_{\alpha/2}) = \alpha/2\), and the critical value on the left-hand side is \(-z_{\alpha/2}\) such that \(P(Z < -z_{\alpha/2}) = \alpha/2\). Note that by definition, \(z_{\alpha}\) and \(t_{\alpha, n-1}\) are always positive and on the right-hand side of the distribution.

Figure 15.2: Rejection regions for the different types of hypothesis tests. Source: https://towardsdatascience.com/everything-you-need-to-know-about-hypothesis-testing-part-i-4de9abebbc8a

The following table is the summary of the critical values under different cases. When \(\sigma\) is known, we use \(z\) scores, and when \(\sigma\) is unknown, we use \(t\) scores.

Condition     Right-tailed \((H_1: \mu > \mu_0)\) Left-tailed \((H_1: \mu < \mu_0)\) Two-tailed \((H_1: \mu \ne \mu_0)\)
\(\sigma\) known \(z_{\alpha}\) \(-z_{\alpha}\) \(-z_{\alpha/2}\) and \(z_{\alpha/2}\)
\(\sigma\) unknown \(t_{\alpha, n-1}\) \(-t_{\alpha, n-1}\) \(-t_{\alpha/2, n-1}\) and \(t_{\alpha/2, n-1}\)
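In R, these critical values come from qnorm() and qt(). Here is a brief sketch; the values \(\alpha = 0.05\) and \(n = 25\) are just for illustration.

alpha <- 0.05; n <- 25
## sigma known: z critical values
qnorm(1 - alpha)                             ## right-tailed:  z_alpha (about 1.645)
qnorm(alpha)                                 ## left-tailed:  -z_alpha (about -1.645)
qnorm(c(alpha / 2, 1 - alpha / 2))           ## two-tailed:   -z_{alpha/2} and z_{alpha/2}
## sigma unknown: t critical values with n - 1 degrees of freedom
qt(1 - alpha, df = n - 1)                    ## right-tailed:  t_{alpha, n-1}
qt(alpha, df = n - 1)                        ## left-tailed:  -t_{alpha, n-1}
qt(c(alpha / 2, 1 - alpha / 2), df = n - 1)  ## two-tailed:   -t_{alpha/2, n-1} and t_{alpha/2, n-1}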

Example Step 4-c: Find the Critical Value

Since the test is a left-tailed test, \(\sigma\) is unknown, and \(n = 25\), the critical value is \(-t_{\alpha, n-1}\), which is \(\small -t_{0.05, 25-1} = -t_{0.05, 24} = -1.711.\)


Step 5-c: Draw a Conclusion Using Critical Value

The critical value separates the standard normal values into the rejection region and the non-rejection region. For a right-tailed test, the rejection region is any \(z\) value greater than \(z_{\alpha}\), and the non-rejection region is any \(z\) value smaller than or equal to \(z_{\alpha}\). For a left-tailed test, the rejection region is any \(z\) value smaller than \(-z_{\alpha}\), and the non-rejection region is any \(z\) value greater than or equal to \(-z_{\alpha}\). For a two-tailed test, the rejection region is the union of any \(z\) value smaller than \(-z_{\alpha/2}\) and any \(z\) value greater than \(z_{\alpha/2}\).

If the test statistic \(z_{test}\) is in the rejection region, we reject \(H_0\). If \(z_{test}\) is not in the rejection region, we do not or fail to reject \(H_0\). Figure 15.3 is an example that we reject \(H_0\) in a right-tailed test. The test statistic is 2.5 which is greater than the critical value 1.645, so the test statistic falls in the rejection region.

Figure 15.3: Test statistic inside of critical region. Source: https://www.thoughtco.com/example-of-a-hypothesis-test-3126398

The rejection region for each type of test is shown in the table below.

Condition     Right-tailed \((H_1: \mu > \mu_0)\) Left-tailed \((H_1: \mu < \mu_0)\) Two-tailed \((H_1: \mu \ne \mu_0)\)
\(\sigma\) known \(z_{test} > z_{\alpha}\) \(z_{test} < -z_{\alpha}\) \(\mid z_{test}\mid \, > z_{\alpha/2}\)
\(\sigma\) unknown \(t_{test} > t_{\alpha, n-1}\) \(t_{test} < -t_{\alpha, n-1}\) \(\mid t_{test}\mid \, > t_{\alpha/2, n-1}\)

Remember that the test statistic works as our evidence, and the critical value is a threshold to determine whether the evidence is strong enough. When the test statistic is more extreme than the critical value, it means that, from our point of view, the chance of our evidence happening is way too small given the current rules of the game, that is, under \(H_0\). Therefore, we don’t think we live in the world of \(H_0\); it probably makes more sense to think we live in the world of \(H_1\), where it is commonplace to see such evidence.

Example Step 5-c: Draw a Conclusion Using Critical Value

We reject \(H_0\) if \(t_{test} < -t_{\alpha, n-1}\). Since \(\small t_{test} = \frac{\overline{x} - \mu_0}{s/\sqrt{n}} = \frac{147.2 - 150}{5.5/\sqrt{25}} = -2.55\) and \(\small -t_{0.05, 25-1} = -t_{0.05, 24} = -1.711\) , we have \(\small t_{test} = -2.55 < -1.711 = -t_{\alpha, n-1}\), so we reject \(H_0\).


Step 4-p: Find the P-Value

Another decision rule is the p-value method. The \(p\)-value measures the strength of the evidence against \(H_0\) provided by the data: the smaller the \(p\)-value, the greater the evidence against \(H_0\). As the name implies, the \(p\)-value is the probability of getting a test statistic value that is at least as extreme as the one obtained from the data, assuming that \(H_0\) is true \((\mu = \mu_0)\). For example, \(p\)-value \(= P(Z \ge z_{test} \mid H_0)\) for a right-tailed test. We are more likely to get a \(p\)-value near 0 when \(H_0\) is false than when \(H_0\) is true. When \(H_0\) is true, \(z_{test}\) will tend to be close to zero, located around the center of its distribution, and its p-value will be around 0.5. On the other hand, when \(H_0\) is false, or the true \(\mu\) is not \(\mu_0\), the test statistic \(z_{test}\) will tend to be far from zero, located in either tail of the distribution, and its p-value will be small.

P-Value Illustration

Since the p-value is a probability, it represents the area under the density curve for values that are at least as extreme as the test statistic’s value. Figure 15.4 shows the p-value for different tests. Note that the p-value for a two-tailed test depends on whether the test statistic is positive or negative. If the calculated test statistic is on the right (left) hand side, the p-value will be the right (left) tail area times two.

Figure 15.4: Illustration of p-values for different types of hypothesis tests

Mathematically, the p-value for any type of tests is shown in the table below.

Condition     Right-tailed \((H_1: \mu > \mu_0)\) Left-tailed \((H_1: \mu < \mu_0)\) Two-tailed \((H_1: \mu \ne \mu_0)\)
\(\sigma\) known \(P(Z > z_{test} \mid H_0)\) \(P(Z < z_{test} \mid H_0)\) \(2P(Z > \,\mid z_{test} \mid \, \mid H_0)\)
\(\sigma\) unknown \(P(T > t_{test} \mid H_0)\) \(P(T < t_{test} \mid H_0)\) \(2P(T > \, \mid t_{test} \mid \, \mid H_0)\)
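In R, these tail probabilities come from pnorm() and pt(). A short sketch: the \(t\) values below come from the blood pressure example, while the \(z_{test}\) value is purely illustrative.

## p-values for a t test: the left-tailed blood pressure example (t_test = -2.55, n = 25)
t_test <- -2.55; n <- 25
pt(t_test, df = n - 1)                               ## left-tailed:  P(T < t_test), about 0.009
2 * pt(abs(t_test), df = n - 1, lower.tail = FALSE)  ## two-tailed:   2 P(T > |t_test|)
## for a z test, replace pt() with pnorm(); e.g. a right-tailed p-value for some z_test
z_test <- 1.8                                        ## illustrative value, not from the examples
pnorm(z_test, lower.tail = FALSE)                    ## right-tailed: P(Z > z_test), about 0.036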

Example Step 4-p: Find the P-Value

This is a left-tailed test, so the \(p\)-value is \(P(T < t_{test})=P(T < -2.55) =\) 0.01


Step 5-p: Draw a Conclusion Using P-Value Method

How do we use the p-value to make the decision? Well, here the p-value is like our evidence, and the significance level \(\alpha\) is the cut-off for measuring the strength of the evidence. If the \(p\)-value \(\le \alpha\) , we reject \(H_0\). If instead the \(p\)-value \(> \alpha\), we do not reject or fail to reject \(H_0\).

Yes, it is a pretty simple decision rule, but the \(p\)-value has been misinterpreted and misused for a long time. When we do a hypothesis test, it is dangerous to simply compare the size of the \(p\)-value and \(\alpha\), then jump to a conclusion. You can find more about the issues with p-values at XXX.

Example Step 5-p: Draw a Conclusion Using P-Value Method

We reject \(H_0\) if the \(p\)-value < \(\alpha\). Since \(p\)-value \(= 0.01 < 0.05 = \alpha\), we reject \(H_0\).


Both Methods Lead to the Same Conclusion

Remember I said both the critical-value method and the \(p\)-value method lead to the same conclusion? Figure 15.5 shows why. The test statistic and critical value are variable values, either \(z\) or \(t\) scores. The p-value and significance level \(\alpha\) are probabilities, either \(z\) or \(t\) probabilities. The p-value is computed from the test statistic, and \(\alpha\) defines the critical value. A more extreme test statistic implies a smaller p-value, and a smaller \(\alpha\) means a more extreme critical value. When we reject \(H_0\), the following three statements are equivalent:

  • the test statistic is in the rejection region;
  • the test statistic is more extreme than the critical value;
  • the p-value is smaller than \(\alpha\).

Figure 15.5: The conclusion is the same regardless of the method used (Critical Value or P-Value).

The following distribution shows the equivalence of the critical-value method and the p-value method in the blood pressure example.


Step 6: Restate the Conclusion in Nontechnical Terms, and Address the Original Claim

The final step in the entire hypothesis testing procedure is to make a verbal conclusion, and address the original claim. Figure 15.6 gives you a guideline of how we make a conclusion.

Figure 15.6: Conclusions based on testing results. Source: https://www.drdawnwright.com/category/statistics/

Here is a reminder. We never say we accept \(H_0\). Why can’t we say we “accept the null”? The reason is that we are assuming the null hypothesis is true or the situation we are currently in. We are trying to see if there is evidence against it. Therefore, the conclusion should be in terms of rejecting the null. We don’t accept \(H_0\) when we don’t have evidence against it because we are already in the world of \(H_0\).

Figure 15.7: Meme about hypothesis testing conclusions. Source: https://www.pinterest.com/pin/287878601159173631/

Example Step 6: Restate the Conclusion in Nontechnical Terms, and Address the Original Claim

We have an \(H_1\) claim and we reject \(H_0\), so we conclude: there is sufficient evidence to support the claim that the new treatment is effective.


Example Calculation in R

Below is a demonstration of how to work through the blood pressure example using R.

## create objects for any information we have
alpha <- 0.05; mu_0 <- 150; 
x_bar <- 147.2; s <- 5.5; n <- 25

## Test statistic
(t_test <- (x_bar - mu_0) / (s / sqrt(n))) 
[1] -2.545455
## Critical value
(t_cri <- qt(alpha, df = n - 1)) 
[1] -1.710882
## p-value
(p_val <- pt(t_test, df = n - 1)) 
[1] 0.008878158

The critical value is \(-t_{\alpha, n-1}\), or the quantile such that \(P(T_{n-1} < -t_{\alpha, n-1}) = \alpha\). Therefore, we use qt() to get the \(t\) value. Notice that the p-value is the probability that a Student’s \(t\) variable with \(n-1\) degrees of freedom is smaller (more extreme) than the test statistic. In R, we use pt() to get this probability. Without specifying the lower.tail argument, both qt() and pt() focus on the lower or left tail by default, which is what we need in this left-tailed test.
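If the test had been right-tailed instead, one option is to set lower.tail = FALSE. This short sketch reuses alpha, n, and t_test from the code above and is for comparison only, not part of the blood pressure example.

## right-tailed versions, for comparison only
qt(alpha, df = n - 1, lower.tail = FALSE)   ## critical value  t_{alpha, n-1}
pt(t_test, df = n - 1, lower.tail = FALSE)  ## p-value  P(T > t_test)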

15.3 Example: Two-tailed z-test

The price of a gallon of 2% milk is normally distributed with a standard deviation of $0.10. Last week the mean price of a gallon of milk was $2.78. This week, based on a sample of size 25, the sample mean price of a gallon of milk was \(\overline{x} = 2.80\). Under \(\alpha = 0.05\), determine if the mean price is different this week.



Step-by-Step

Step 1: Set the \(H_0\) and \(H_1\) from a Claim

From the sentence “determine if the mean price is different this week”, we know the claim we are interested in is an \(H_1\) claim. If we let \(\mu\) be the mean milk price this week, we have the test \(\small \begin{align}&H_0: \mu = 2.78 \\ &H_1: \mu \ne 2.78 \end{align}\) where 2.78 is the mean milk price last week.

Step 2: Set the Significance Level \(\alpha\)

\(\small \alpha = 0.05\)

Step 3: Calculate the Test Statistic

From the question we know that the population is normally distributed, and \(\sigma\) is known. So we use the \(z\)-test, and the test statistic is \(\small z_{test} = \frac{\overline{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{2.8 - 2.78}{0.1/\sqrt{25}} = 1.00\)

Step 4-c: Find the Critical Value

Since it is a two-tailed test, we have two potential critical values. Because \(z_{test} > 0\), it lies on the right-hand side of the standard normal distribution, so we compare it with the critical value on the right, which is \(\small z_{0.05/2} = 1.96\).

Step 5-c: Draw a Conclusion Using Critical Value

This is a two-tailed test, and we reject \(H_0\) if \(|z_{test}| > z_{\alpha/2}\). Since \(\small |z_{test}| = 1 < 1.96 = z_{\alpha/2}\), we DO NOT reject \(H_0\).

Step 4-p: Find the P-Value

This is a two-tailed test, and the test statistic is on the right \((> 0)\), so the \(p\)-value is \(2P(Z > z_{test})=\) 0.317 .

Step 5-p: Draw a Conclusion Using P-Value Method

We reject \(H_0\) if \(p\)-value < \(\alpha\). Since \(p\)-value \(= 0.317 > 0.05 = \alpha\), we DO NOT reject \(H_0\).

The critical-value and p-value method are illustrated in Figure 15.8.

Figure 15.8: Illustration of Critical Value and P-Value methods

Step 6: Restate the Conclusion in Nontechnical Terms, and Address the Original Claim

There is insufficient evidence to support the claim that this week the mean price of milk is different from the price last week.


Calculation in R

Below is an example of how to perform the two-tailed \(z\)-test in R.

## create objects to be used
alpha <- 0.05; mu_0 <- 2.78; 
x_bar <- 2.8; sigma <- 0.1; n <- 25

## Test statistic
(z_test <- (x_bar - mu_0) / (sigma / sqrt(n))) 
[1] 1
## Critical value
(z_crit <- qnorm(alpha/2, lower.tail = FALSE)) 
[1] 1.959964
## p-value
(p_val <- 2 * pnorm(z_test, lower.tail = FALSE)) 
[1] 0.3173105

15.4 Testing Summary

Below is a table that summarizes what we have learned about hypothesis testing in this chapter.

Numerical Data, \(\sigma\) known Numerical Data, \(\sigma\) unknown
Parameter of Interest Population Mean \(\mu\) Population Mean \(\mu\)
Test Type One sample \(\color{blue}{z}\) test \(H_0: \mu = \mu_0\) One sample \(\color{blue}{t}\) test \(H_0: \mu = \mu_0\)
Confidence Interval \(\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\) \(\bar{x} \pm t_{\alpha/2, n-1} \frac{\color{blue}{s}}{\sqrt{n}}\)
Test Stat under \(H_0\) \(z_{test} = \frac{\overline{x} - \mu_0}{\frac{\sigma}{\sqrt{n}}}\) \(t_{test} = \frac{\overline{x} - \mu_0}{\frac{\color{blue}{s}}{\sqrt{n}}}\)
\(p\)-value under \(H_0\)
Left-tailed \((H_1: \mu < \mu_0)\): \(p\)-value \(=P(Z \le z_{test})\) \(p\)-value \(=P(T_{n-1} \le t_{test})\)
Right-tailed \((H_1: \mu > \mu_0)\): \(p\)-value \(=P(Z \ge z_{test})\) \(p\)-value \(=P(T_{n-1} \ge t_{test})\)
Two-tailed \((H_1: \mu \ne \mu_0)\): \(p\)-value \(=2P(Z \ge \, \mid z_{test}\mid)\) \(p\)-value \(=2P(T_{n-1} \ge \, \mid t_{test} \mid)\)
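All of the pieces in this table can be wrapped into one small R function. The sketch below is our own (one_sample_mean_test() is not a built-in or package function); it reproduces the two worked examples from summary statistics.

## one-sample mean test from summary statistics (a sketch, not a built-in function)
one_sample_mean_test <- function(x_bar, mu_0, n, sd, sigma_known = FALSE,
                                 alternative = c("two.sided", "less", "greater"),
                                 alpha = 0.05) {
  alternative <- match.arg(alternative)
  se   <- sd / sqrt(n)                ## sd is sigma if known, s otherwise
  stat <- (x_bar - mu_0) / se
  ## tail probability: normal if sigma is known, t with n - 1 df otherwise
  tail_prob <- function(q, lower) {
    if (sigma_known) pnorm(q, lower.tail = lower) else pt(q, df = n - 1, lower.tail = lower)
  }
  p_value <- switch(alternative,
                    less      = tail_prob(stat, lower = TRUE),
                    greater   = tail_prob(stat, lower = FALSE),
                    two.sided = 2 * tail_prob(abs(stat), lower = FALSE))
  list(statistic = stat, p_value = p_value, reject_H0 = (p_value <= alpha))
}

## blood pressure example (sigma unknown, left-tailed)
one_sample_mean_test(x_bar = 147.2, mu_0 = 150, n = 25, sd = 5.5, alternative = "less")
## milk price example (sigma known, two-tailed)
one_sample_mean_test(x_bar = 2.80, mu_0 = 2.78, n = 25, sd = 0.10,
                     sigma_known = TRUE, alternative = "two.sided")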

15.5 Type I and Type II Errors

It is important to remember that hypothesis testing is not perfect, meaning that we may make a wrong decision or conclusion. After all, the collected evidence may not be able to present the full picture of what the true population distribution is. There are two types of errors we may commit when doing hypothesis testing: Type I error and Type II error.

If in fact \(H_0\) is true but we wrongly reject it, we commit a type I error: we shouldn’t have rejected it, but we did. If \(H_0\) is false but we don’t reject it, we commit a type II error: we should have figured out that \(H_0\) does not make sense. The following table tells us when we make a correct decision and when we don’t. In practice, we will not know for certain whether we made the correct decision or one of these two errors, because we never know the truth!

Decision \(H_0\) is true \(H_0\) is false
Reject \(H_0\) Type I error Correct decision
Do not reject \(H_0\) Correct decision Type II error

Back to the crime example where \(H_0:\) the person is not guilty vs. \(H_1:\) the person is guilty. We can build a decision table like

Decision Truth: the person is innocent Truth: the person is guilty
Jury decides the person guilty Type I error Correct decision
Jury decides the person not guilty Correct decision Type II error

Is it worse to wrongly convict an innocent person (type I error) or to let a perpetrator go free (type II error)? Both have hugely negative impacts on our society, and if possible, we should commit both errors as rarely as possible.

Figure 15.9: Example of Type I and Type II errors (https://www.statisticssolutions.com/wp-content/uploads/2017/12/rachnovblog.jpg)

If you still don’t get the idea of type I and type II errors, Figure 15.9 is a classic example of the two errors. Of course the null hypothesis is “not pregnant”, and the alternative hypothesis is “pregnant”. Claiming that an old man is expecting a baby is a type I error, and telling a pregnant woman she is not having a baby is a type II error.

In statistics, the probability of committing the type I error is in fact the significance level \(\alpha\).

\[\alpha = P(\text{type I error}) = P(\text{rejecting } H_0 \text{ when } H_0 \text{ is true})\]

Evidence occurring with probability lower than 5% will be considered sufficient to reject \(H_0\), even if \(H_0\) is actually the true mechanism giving rise to such evidence.
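A quick simulation illustrates this interpretation: if we repeatedly draw samples from a population where \(H_0\) is true and run the left-tailed \(t\) test, the long-run rejection rate is close to \(\alpha\). This sketch uses made-up population values based on the blood pressure setting, with a true mean of 150.

## simulate the type I error rate when H_0 is true (illustrative values)
set.seed(2024)
alpha <- 0.05; mu_0 <- 150; sigma <- 5.5; n <- 25
reject <- replicate(10000, {
  x <- rnorm(n, mean = mu_0, sd = sigma)   ## data generated with H_0 true
  t_test <- (mean(x) - mu_0) / (sd(x) / sqrt(n))
  t_test < qt(alpha, df = n - 1)           ## left-tailed rejection rule
})
mean(reject)                               ## proportion of wrong rejections, close to 0.05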

What is the probability of committing the type II error, the probability that we fail to reject \(H_0\) when \(H_0\) is a false statement? We call the probability \(\beta\):

\[\beta = P(\text{type II error}) = P(\text{failing to reject } H_0 \text{ when } H_0 \text{ is false})\]

\(\alpha\), \(\beta\), and the sample size \(n\) are related. If we choose any two of them, the third is automatically determined. We would of course prefer \(\alpha\) to be small, since we would not like to falsely conclude in favor of the research hypothesis. But given the sample size, a small \(\alpha\) leads to a large \(\beta\). On the other hand, an \(\alpha\) that is too small would most likely result in no discovery, because we would be too conservative, set the threshold too high, and fail to reject many \(H_0\)s that should be rejected. In practice, we specify \(\alpha\) beforehand, then select an \(n\) that is practical, and \(\beta\) is thereby determined.

It would be great if we correctly reject \(H_0\) when \(H_0\) is actually false, and we hope the probability of this is large. That probability is \(1-\beta\), which is called the power of the test. The power depends on the same factors as \(\beta\), including the size of \(\alpha\), the sample size, and the true value of the parameter.
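In R, power.t.test() from the built-in stats package computes the power, or the required sample size, for a one-sample \(t\) test. The sketch below uses the blood pressure setting; the assumed true difference of 2.8 (i.e., a true mean of 147.2) is our own illustrative choice, not part of the original example.

## power of the left-tailed one-sample t test if the true mean BP were 147.2
power.t.test(n = 25, delta = 2.8, sd = 5.5, sig.level = 0.05,
             type = "one.sample", alternative = "one.sided")
## sample size needed for 90% power under the same assumed difference
power.t.test(delta = 2.8, sd = 5.5, sig.level = 0.05, power = 0.9,
             type = "one.sample", alternative = "one.sided")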

15.6 Exercises

  1. Here are summary statistics for randomly selected weights of newborn boys: \(n =207\), \(\bar{x} = 30.2\)hg (1hg = 100 grams), \(s = 7.3\)hg.

    1. With significance level 0.01, use the critical value method to test the claim that the population mean of birth weights of newborn boys is greater than 30hg.
    2. Do the test in part 1 by using the p-value method.
  2. You are given the following hypotheses: \[\begin{align*} H_0&: \mu = 45 \\ H_A&: \mu \neq 45 \end{align*}\] We know that the sample standard deviation is 5 and the sample size is 24. For what sample mean would the p-value be equal to 0.05? Assume that all conditions necessary for inference are satisfied.

  3. Our one sample \(z\) test is \(H_0: \mu = \mu_0 \quad H_1: \mu < \mu_0\) with a significance level \(\alpha\).

    1. Describe how we reject \(H_0\) using the critical-value method and the \(p\)-value method.
    2. Why do the two methods lead to the same conclusion?

  1. Hypothesis testing is also called Null Hypothesis Statistical Testing (NHST), statistical testing or test of significance.↩︎