We finished the discussion of estimation, interval estimation in particular, in the previous chapter. The other part of statistical inference is hypothesis testing. In this chapter, we first discuss the meaning of a hypothesis in statistical analysis, followed by the testing procedures for the population mean $\mu$ when $\sigma$ is known and when $\sigma$ is unknown. Pay attention to the similarities and differences between estimation and testing.
16.1 Introduction
What is Hypothesis Testing?
In statistics, a hypothesis is a claim or statement about a property of a population, often the value of a population distribution parameter. For example,
The mean body temperature of humans is less than $98.6^{\circ}$F. Here the mean body temperature is a property or characteristic of the target population, human beings. We can turn the verbal claim into a brief mathematical expression $\mu < 98.6$.
Marquette students’ IQ scores have a standard deviation equal to 15. The IQ score standard deviation is a characteristic of the population of Marquette students. Mathematically, we can write the claim as $\sigma = 15$.
You can see that we usually focus on claims about a population distribution parameter.
The null hypothesis, denoted $H_0$, is a statement that the value of a parameter is equal to some claimed value, or the negation of the alternative hypothesis that will be discussed in a minute. Often $H_0$ represents a skeptical perspective or a claim to be tested, or the current status of the parameter. For example, the claim “the percentage of Marquette female students loving Japanese food is equal to 80%” is an $H_0$ claim because of the key word “equal”. Usually we are not very convinced that the claim is true, and in our analysis we want to test the claim and see whether the evidence and information we collect are strong enough to conclude that the percentage is not equal to 80%.
The alternative hypothesis, denoted $H_1$ or $H_a$, is a claim that the parameter is less than, greater than, or not equal to some value. It is usually our research hypothesis of some new scientific theory or finding. If we think the percentage of Marquette female students loving Japanese food is greater than 80%, this hypothesis is the $H_1$ claim. If after a formal testing procedure we conclude that the percentage is greater than 80%, we have, in a sense, made a new research discovery that overturns the previous claim or status quo that the percentage is equal to 80%.
Let’s do one more exercise. Is the statement “On average, Marquette students consume less than 3 drinks per week.” an $H_0$ or $H_1$ claim? Because of the key word “less than”, it is an $H_1$ claim.
So what is hypothesis testing? Hypothesis testing is a procedure to decide whether or not to reject $H_0$ based on how much evidence there is against $H_0$. If the evidence is strong enough, we reject $H_0$ in favor of $H_1$.
Example
Before we jump into the formal hypothesis testing procedure, let’s talk about a criminal charge example. How a criminal is convicted is similar to the formal testing procedure.
Suppose a person is charged with a crime, and a jury will decide whether the person is guilty or not. We all know the rule: even though the person is charged with the crime, at the beginning of the trial, the accused is assumed to be innocent until the jury declares otherwise. Only if overwhelming evidence of the person’s guilt can be shown is the jury expected to declare the person guilty; otherwise the person is considered not guilty.
If we want to make a claim about whether the person is guilty or not, what are our $H_0$ and $H_1$? Remember that the null hypothesis represents a skeptical perspective, a claim to be tested, or the current status of the parameter, so we have
$H_0$: The person is not guilty 🙂
This is how we write a hypothesis: start with $H_0$ followed by the statement. Being not guilty is the default status quo of anyone, although the jury may doubt or be skeptical of the person being not guilty. The prosecutors and police detectives are trying their best to collect evidence strong enough to prove guilt beyond a reasonable doubt to the jury. Therefore the alternative hypothesis is
$H_1$: The person is guilty 😟
In the example, the evidence could be photos, videos, witnesses, fingerprints, DNA, and so on. How do we decide to keep $H_0$ or to accept $H_1$? After all the evidence, including the defense attorney’s and prosecutor’s arguments, is presented to the jury, the decision rule is the jury’s vote. Finally, to close the case, we need a conclusion, which is the verdict: “guilty” or “not enough evidence to convict”.
Please go through the entire criminal charge process again:
$H_0$ and $H_1$ => Evidence => Decision rule => Conclusion
The process is quite similar to the formal procedure for hypothesis testing.
16.2 How to Formally Do a Statistical Hypothesis Testing
The entire hypothesis testing procedure can be wrapped up in the following six steps. No worries if you don’t have any idea of it yet. We will learn this step by step using a test for the population mean $\mu$.
Step 0: Check Method Assumptions
Step 1: Set the $H_0$ and $H_1$ in Symbolic Form from a Claim
Step 2: Set the Significance Level
Step 3: Calculate the Test Statistic (Evidence)
Decision Rule I: Critical Value Method
Step 4-c: Find the Critical Value
Step 5-c: Draw a Conclusion Using Critical Value Method
Decision Rule II: P-Value Method
Step 4-p: Find the P-Value
Step 5-p: Draw a Conclusion Using P-Value Method
Step 6: Restate the Conclusion in Nontechnical Terms, and Address the Original Claim
Let’s look at this example: Is the New Treatment Effective?
A population of patients with hypertension has normally distributed blood pressure (BP) with mean 150. After 6 months of treatment, the BP of 25 patients from this population was recorded. The sample mean BP is $\bar{x} = 147.2$ and the sample standard deviation is $s = 5.5$.
Source: https://unsplash.com/photos/i1iqQRLULlg
Our goal is to determine whether a new treatment is effective in reducing BP. Let’s learn the testing procedure step by step using this example.
Step 0: Check Method Assumptions
Any statistical method is based on some assumptions. To use the method, and analyze our data appropriately, we have to make sure that the assumptions are satisfied. In this book, most of the distribution-based methods require
Random sample
The population is normally distributed and/or the sample size $n \geq 30$.
From the problem description, the population of the hypertension group is normal, so the assumptions are satisfied.
Step 1: Set the $H_0$ and $H_1$ from a Claim
The first step of testing is to understand the $H_0$ and $H_1$ claims and express them mathematically using population parameters. The following provide three examples.
🧑🏫 The mean IQ score of statistics professors is higher than 120. ($H_1: \mu > 120$)
💵 The mean starting salary for Marquette graduates who didn’t take MATH 4720 is less than $60,000. ($H_1: \mu < 60000$)
📺 The mean time between uses of a TV remote control by males during commercials equals 5 sec. ($H_0: \mu = 5$)
Keep in mind that the equality sign is always put in $H_0$, and $H_0$ and $H_1$ are mutually exclusive. Also, the claims are about population parameters, not sample statistics. We are not sure of the value of the parameter being tested, but we want to collect evidence and see which claim about the parameter is supported by the evidence.
Example Step 1: Set the $H_0$ and $H_1$ from a Claim
The claim that the new treatment is effective in reducing BP means the mean BP is less than 150, which is an $H_1$ claim. So we can write our $H_0$ and $H_1$ as
$$H_0: \mu = 150 \quad \text{vs.} \quad H_1: \mu < 150,$$
where $\mu$ is the mean blood pressure.
Step 2: Set the Significance Level
Next, we set the significance level $\alpha$, which determines how rare or unlikely our evidence must be in order to represent sufficient evidence against $H_0$. It tells us how strong the collected evidence must be in order to overturn the current claim. An $\alpha$ level of 0.05 implies that evidence occurring with probability lower than 5% will be considered sufficient evidence to reject $H_0$. Mathematically, $\alpha = P(\text{reject } H_0 \mid H_0 \text{ is true})$. As a result, $\alpha = 0.05$ means that we incorrectly reject $H_0$ about 5 out of every 100 times we collect a sample and run the test.
Here is the idea. When we want to see if what we care about (the population parameter) is not as described in the null hypothesis $H_0$, we first assume or believe $H_0$ is right. Then, based on this, we see if there is sufficiently strong evidence to conclude that it is probably not the case, and that the alternative hypothesis is more reasonable.
Let’s explain $\alpha$ with an example. Suppose we would like to test the claim that “the mean IQ of statistics professors is greater than 120,” or in short, $H_0: \mu = 120$ vs. $H_1: \mu > 120$. With a large sample size, we can assume $\bar{X}$ follows a normal distribution. Now, to test whether the mean IQ is greater than 120, we need to first treat the mean as not being greater than 120 unless later we have sufficient evidence to say it is greater than 120. In particular, we need to do the test and analysis on the basis that the mean is the value under $H_0$. That is, we first assume the mean is 120, or $\mu = 120$, then see if the assumption really makes sense.
Because $\bar{X}$ is normal, we do the test under the assumption that $\bar{X} \sim N(120, \sigma^2/n)$ for some $\sigma$ and $n$. (Let’s focus on $\mu$ and ignore $\sigma$ at this moment.) If $\bar{X}$ has mean 120, and from our sample data we got a sample mean just slightly above 120, do you think the claim $H_1: \mu > 120$ makes sense? How about if we got a sample mean far above 120? Now $\alpha$ comes into play. Let me ask you a question. What is the threshold value of the sample mean beyond which you think it is too large to believe that $\mu = 120$ is a reasonable assumption or data-generating mechanism? What is the threshold that makes you start to believe that $H_1$ makes more sense than $H_0$? The significance level $\alpha$ sets such a threshold value. With $\alpha$ specified, we know the corresponding sample mean threshold $c$, which is the value such that $P(\bar{X} > c \mid \mu = 120) = \alpha$.
Figure 16.1 illustrates the significance level $\alpha$. Once we decide on $\alpha$, we determine how rare or unlikely our sample mean must be in order to represent sufficient evidence against $H_0$. In this example, if our $\bar{x}$ is greater than 125, we would think the evidence is strong enough to conclude that $H_0$ is not so reasonable, because the chance of such a value happening under $H_0$ is no larger than $\alpha$. We instead think $H_1$ makes more sense.
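The threshold can be computed directly from the normal distribution. Below is a minimal sketch; the values $\sigma = 15$ and $n = 25$ are our own illustrative assumptions (they are not stated in the text, but they reproduce a cutoff of about 125):

```python
from scipy.stats import norm

# Hypothetical setup for the IQ example: sigma = 15 and n = 25 are
# assumed values, not given in the text.
mu_0, sigma, n, alpha = 120, 15, 25, 0.05
se = sigma / n**0.5  # standard error of the sample mean

# Threshold c such that P(Xbar > c | mu = 120) = alpha
c = norm.ppf(1 - alpha, loc=mu_0, scale=se)
print(round(c, 2))  # 124.93, close to the 125 used in the text
```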
Figure 16.1: Illustration of significance level, alpha.
The entire rationale is the rare event rule.
Rare Event Rule
If, under a given assumption, the probability of a particular observed event is exceptionally small, we conclude that the assumption is probably not correct.
The $\alpha$ level is related to the $\alpha$ used in confidence intervals for defining a “critical value”.
Example Step 2: Set the Significance Level
There is no $\alpha$ mentioned in the question description. Usually $\alpha$ is set by the researchers themselves. Let’s set $\alpha = 0.05$. This means we are asking, “Is there sufficient evidence at $\alpha = 0.05$ that the new treatment is effective?”
Step 3: Calculate the Test Statistic
Setting $\alpha$ is, in a way, setting the threshold for determining whether our collected evidence is sufficient or strong enough. In this step, we collect our evidence. The evidence comes from the information we have, which is the sample data. Sample data are the only source we have for inference about the unknown parameter. So to do a test about the parameter, or decide whether a statement about the parameter makes sense, we let the data and evidence speak up.
The evidence used in hypothesis testing is called the test statistic: a sample statistic value used in making a decision about $H_0$. Suppose the test we are interested in is $H_0: \mu = \mu_0$ vs. $H_1: \mu \neq \mu_0$ (or $\mu < \mu_0$, or $\mu > \mu_0$), where $\mu_0$ is some population mean value that could be 150 lbs, 175 cm, 50 ounces, etc. When computing a test statistic, we assume $H_0$ is true. Remember, we are trying to see if there is any strong evidence against $H_0$. We should do our analysis in the world of $H_0$, the status quo. If the evidence is not sufficient, we stay in our current situation.
When $\sigma$ is known, the test statistic for tests about $\mu$ is
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}.$$
When $\sigma$ is unknown, the test statistic for tests about $\mu$ is
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}.$$
Familiar with them? Those are the $z$ score and the $t$ score: the sample statistics used for testing. When we calculate the test statistics, we need a value of $\mu$. What value should we use? You are right if you use the value $\mu_0$ assumed in $H_0$! The test statistics are the evidence we use in testing. The evidence is collected under the assumption that $\mu = \mu_0$. We collect any evidence to prove that a suspect committed a crime under the assumption that he is innocent, right? We shouldn’t look at any person through colored spectacles, or frame anyone by treating someone as a criminal, then make up a fake story about what he’s never done.
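The two formulas translate into one-liners. A small sketch (the helper names `z_stat` and `t_stat` are ours, not from the text), previewing the blood pressure numbers used below:

```python
import math

def z_stat(x_bar, mu_0, sigma, n):
    """z test statistic when sigma is known."""
    return (x_bar - mu_0) / (sigma / math.sqrt(n))

def t_stat(x_bar, mu_0, s, n):
    """t test statistic when sigma is unknown (s = sample sd)."""
    return (x_bar - mu_0) / (s / math.sqrt(n))

# Blood pressure example: x_bar = 147.2, s = 5.5, n = 25, mu_0 = 150
print(round(t_stat(147.2, 150, 5.5, 25), 2))  # -2.55
```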
Example Step 3: Calculate the Test Statistic
Since we don’t know the true $\sigma$, and only know $s$, we use the $t$ distribution, and the test statistic is
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{147.2 - 150}{5.5/\sqrt{25}} \approx -2.55.$$
So if the true mean blood pressure is 150, our test statistic or evidence is about 2.55 standard errors below the hypothesized mean. Is this number too weird or uncommon to believe that the mean blood pressure is really 150? We need a decision rule, and that is what we are going to learn in Step 4.
Step 4-c: Find the Critical Value
In this step, we set the decision rule. There are two methods in testing, the critical-value method and the p-value method. The two methods are equivalent, leading to the same decision and conclusion. Let’s first talk about the critical-value method.
In Step 2, we set the $\alpha$, and in Step 3, we collected the evidence. Now we need a way to decide whether the collected evidence is sufficient or not to reject the $H_0$ claim. The critical value(s) is a value determined by the significance level $\alpha$ that separates the rejection region (or critical region), where we reject $H_0$, from the values of the test statistic that do not lead to the rejection of $H_0$.
Which critical value is used depends on whether our test is right-tailed, left-tailed, or two-tailed. The right-tailed test, or right-sided test, is the test with $H_1: \mu > \mu_0$. When we are interested in whether $\mu$ is greater than some value, say $\mu_0$, we focus on the right-hand side of the sampling distribution, because the evidence, the test statistic calculated in Step 3, will usually be on the right-hand side of the distribution, so the critical value used in the decision rule is on the right as well. Similarly, the left-tailed test, or left-sided test, is the test with $H_1: \mu < \mu_0$. For a two-tailed or two-sided test, we have $H_1: \mu \neq \mu_0$. In this case, we wonder whether $\mu$ is larger or smaller than the assumed $\mu_0$, so we need to pay attention to both sides of the sampling distribution.
Figure 16.2 illustrates rejection regions for the different types of hypothesis tests. Let’s assume $\sigma$ is known, as the unknown-$\sigma$ case is similar: we just replace the $z$ score with the $t$ score. Given the significance level $\alpha$, for a right-tailed test, the critical value is $z_\alpha$, the standard normal quantile such that $P(Z > z_\alpha) = \alpha$, where $Z \sim N(0, 1)$. For a left-tailed test, the critical value is $-z_\alpha$, or in fact $z_{1-\alpha}$, the standard normal quantile such that $P(Z < -z_\alpha) = \alpha$. When the test is a two-tailed test, there are two critical values, one on the right-hand side, the other on the left-hand side of the distribution. Here, we split $\alpha$ equally into $\alpha/2$ on each side; the critical value on the right-hand side is $z_{\alpha/2}$ such that $P(Z > z_{\alpha/2}) = \alpha/2$, and the critical value on the left-hand side is $-z_{\alpha/2}$ such that $P(Z < -z_{\alpha/2}) = \alpha/2$. Note that by definition, $z_\alpha$ and $z_{\alpha/2}$ are always positive and on the right-hand side of the distribution.
Figure 16.2: Rejection regions for the different types of hypothesis tests. Source: https://towardsdatascience.com/everything-you-need-to-know-about-hypothesis-testing-part-i-4de9abebbc8a
The following table is a summary of the critical values under the different cases. When $\sigma$ is known, we use $z$ scores, and when $\sigma$ is unknown, we use $t$ scores.
| Condition | Right-tailed | Left-tailed | Two-tailed |
|---|---|---|---|
| $\sigma$ known | $z_{\alpha}$ | $-z_{\alpha}$ | $-z_{\alpha/2}$ and $z_{\alpha/2}$ |
| $\sigma$ unknown | $t_{\alpha,\,n-1}$ | $-t_{\alpha,\,n-1}$ | $-t_{\alpha/2,\,n-1}$ and $t_{\alpha/2,\,n-1}$ |
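These critical values can be obtained from `scipy.stats` rather than a printed table; a quick sketch using $\alpha = 0.05$ and $n = 25$ from the blood pressure example:

```python
from scipy.stats import norm, t

alpha, n = 0.05, 25

# sigma known: standard normal critical values
z_right = norm.ppf(1 - alpha)      # right-tailed
z_two   = norm.ppf(1 - alpha / 2)  # two-tailed (use +/- z_two)

# sigma unknown: Student's t critical values with n - 1 df
t_right = t.ppf(1 - alpha, df=n - 1)
t_two   = t.ppf(1 - alpha / 2, df=n - 1)

print(round(z_right, 3), round(z_two, 3))  # 1.645 1.96
print(round(t_right, 3), round(t_two, 3))  # 1.711 2.064
```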
Example Step 4-c: Find the Critical Value
Since the test is a left-tailed test, $\sigma$ is unknown, and $\alpha = 0.05$, the critical value is $-t_{\alpha,\,n-1} = -t_{0.05,\,24}$, that is, $-t_{0.05,\,24} \approx -1.711$.
Step 5-c: Draw a Conclusion Using Critical Value
The critical value separates the standard normal (or $t$) values into the rejection region and the non-rejection region. For a right-tailed test, the rejection region is any value greater than $z_\alpha$, and the non-rejection region is any value smaller than or equal to $z_\alpha$. For a left-tailed test, the rejection region is any value smaller than $-z_\alpha$, and the non-rejection region is any value greater than or equal to $-z_\alpha$. For a two-tailed test, the rejection region is the union of any value smaller than $-z_{\alpha/2}$ and any value greater than $z_{\alpha/2}$.
If the test statistic is in the rejection region, we reject $H_0$. If it is not in the rejection region, we do not (or fail to) reject $H_0$. Figure 16.3 is an example where we reject $H_0$ in a right-tailed test. The test statistic is 2.5, which is greater than the critical value 1.645, so the test statistic falls in the rejection region.
Figure 16.3: Test statistic inside of critical region. Source: https://www.thoughtco.com/example-of-a-hypothesis-test-3126398
The rejection region for each type of test is shown in the table below.

| Condition | Right-tailed | Left-tailed | Two-tailed |
|---|---|---|---|
| $\sigma$ known | $z > z_{\alpha}$ | $z < -z_{\alpha}$ | $\lvert z\rvert > z_{\alpha/2}$ |
| $\sigma$ unknown | $t > t_{\alpha,\,n-1}$ | $t < -t_{\alpha,\,n-1}$ | $\lvert t\rvert > t_{\alpha/2,\,n-1}$ |
Remember that the test statistic works as our evidence, and the critical value is a threshold to determine whether the evidence is strong enough. When the test statistic is more extreme than the critical value, it means that, from our point of view, the chance of our evidence happening is way too small given the current rules of the game, or under $H_0$. Therefore, we don’t think we live in the world of $H_0$; it probably makes more sense to think we live in the world of $H_1$, where it is commonplace to see such evidence.
Example Step 5-c: Draw a Conclusion Using Critical Value
We reject $H_0$ if $t < -t_{\alpha,\,n-1}$. Since $t = -2.55$ and $-t_{0.05,\,24} = -1.711$, we have $t < -t_{0.05,\,24}$, so we reject $H_0$.
Step 4-p: Find the P-Value
Another decision rule is the p-value method. The $p$-value measures the strength of the evidence against $H_0$ provided by the data. The smaller the $p$-value, the greater the evidence against $H_0$. As the name implies, the $p$-value is the probability of getting a test statistic value that is at least as extreme as the one obtained from the data, assuming that $H_0$ is true. For example, the $p$-value is $P(Z > z)$ for a right-tailed test when $\sigma$ is known. We are more likely to get a $p$-value near 0 when $H_0$ is false than when $H_0$ is true. When $H_0$ is true, the test statistic will be closer to zero, located around the center of the distribution (the value assumed in $H_0$), and its p-value will be around 0.5. On the other hand, when $H_0$ is false, or the true $\mu$ is not $\mu_0$, the test statistic will be farther away from zero and located in either tail of the distribution. Therefore, its p-value will be small.
P-Value Illustration
Since the p-value is a probability, in the sampling distribution it represents the area under the density curve for values that are at least as extreme as the test statistic’s value. Figure 16.4 shows the p-value for different tests. Note that the p-value for a two-tailed test depends on whether the test statistic is positive or negative. If the calculated test statistic is on the right (left) hand side, the p-value will be the right (left) tail area times two.
Figure 16.4: Illustration of p-values for different types of hypothesis tests
Mathematically, the p-value for each type of test is shown in the table below.

| Condition | Right-tailed | Left-tailed | Two-tailed |
|---|---|---|---|
| $\sigma$ known | $P(Z > z)$ | $P(Z < z)$ | $2\,P(Z > \lvert z\rvert)$ |
| $\sigma$ unknown | $P(T_{n-1} > t)$ | $P(T_{n-1} < t)$ | $2\,P(T_{n-1} > \lvert t\rvert)$ |
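The table can be wrapped into small helper functions; a sketch (the function names are ours, not from the text), checked against the blood pressure example:

```python
from scipy.stats import norm, t

def p_value_z(z, tail):
    """p-value when sigma is known (standard normal benchmark)."""
    if tail == "right":
        return norm.sf(z)        # P(Z > z)
    if tail == "left":
        return norm.cdf(z)       # P(Z < z)
    return 2 * norm.sf(abs(z))   # two-tailed: 2 P(Z > |z|)

def p_value_t(t_obs, df, tail):
    """p-value when sigma is unknown (t with n - 1 df)."""
    if tail == "right":
        return t.sf(t_obs, df)
    if tail == "left":
        return t.cdf(t_obs, df)
    return 2 * t.sf(abs(t_obs), df)

# Blood pressure example: left-tailed t test with t = -2.55, df = 24
print(round(p_value_t(-2.55, 24, "left"), 4))  # about 0.0088
```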
Example Step 4-p: Find the P-Value
This is a left-tailed test, so the $p$-value is $P(T_{24} < -2.55) \approx 0.009$.
Step 5-p: Draw a Conclusion Using P-Value Method
How do we use the p-value to make the decision? Well, here the p-value is like our evidence, and the significance level $\alpha$ is the cut-off for measuring the strength of the evidence. If the $p$-value $< \alpha$, we reject $H_0$. If instead the $p$-value $\ge \alpha$, we do not reject (or fail to reject) $H_0$.
Yes, it is a pretty simple decision rule, but the $p$-value has been misinterpreted and misused for a long time. When we do hypothesis testing, it is dangerous to simply compare the size of the $p$-value and $\alpha$, then jump to a conclusion. You can find more issues of the p-value at XXX.
Example Step 5-p: Draw a Conclusion Using P-Value Method
We reject $H_0$ if the $p$-value $< \alpha$. Since the $p$-value $\approx 0.009 < 0.05$, we reject $H_0$.
Both Methods Lead to the Same Conclusion
Remember I said both the critical-value method and the $p$-value method lead to the same conclusion? Figure 16.5 shows why. The test statistic and critical value are variable values, either $z$ or $t$ scores. The p-value and significance level $\alpha$ are probabilities, computed from the $z$ or $t$ distribution. The p-value is computed from the test statistic, and $\alpha$ defines the critical value. A more extreme test statistic implies a smaller p-value, and a smaller $\alpha$ means a more extreme critical value. When we reject $H_0$, the following three statements are equivalent:
the test statistic is in the rejection region;
the test statistic is more extreme than the critical value;
the p-value is smaller than $\alpha$.
Figure 16.5: The conclusion is the same regardless of the method used (Critical Value or P-Value).
The following distribution shows the equivalence of the critical-value method and the p-value method in the blood pressure example.
Step 6: Restate the Conclusion in Nontechnical Terms, and Address the Original Claim
The final step in the entire hypothesis testing procedure is to make a verbal conclusion, and address the original claim. Figure 16.6 gives you a guideline of how we make a conclusion.
Figure 16.6: Conclusions based on testing results. Source: https://www.drdawnwright.com/category/statistics/
Here is a reminder: we never say we accept $H_0$. Why can’t we say we “accept the null”? The reason is that we are assuming the null hypothesis is true, or the situation we are currently in. We are trying to see if there is evidence against it. Therefore, the conclusion should be in terms of rejecting the null. We don’t accept $H_0$ when we don’t have evidence against it because we are already in the world of $H_0$.
Figure 16.7: Meme about hypothesis testing conclusions. Source: https://www.pinterest.com/pin/287878601159173631/
Example Step 6: Restate the Conclusion in Nontechnical Terms, and Address the Original Claim
We have an $H_1$ claim and we reject $H_0$, so we conclude that there is sufficient evidence to support the claim that the new treatment is effective.
The critical value is $-t_{0.05,\,24}$, or the quantile $q$ such that $P(T_{24} < q) = 0.05$. Therefore, we use qt() to get the value. Notice that the p-value is the probability that a Student’s $t$ variable with $n - 1 = 24$ degrees of freedom is smaller (more extreme) than the test statistic. In R, we use pt() to get the probability. Without specifying the lower.tail argument in the function, by default both the qt() and pt() functions focus on the lower tail, or left tail, which is what we need in this left-tailed test.
Below is a demonstration of how to work through the blood pressure example using Python.
```python
import numpy as np
from scipy.stats import t

## create objects to be used
alpha = 0.05; mu_0 = 150
x_bar = 147.2; s = 5.5; n = 25

## Calculate the t-test statistic
t_test = (x_bar - mu_0) / (s / np.sqrt(n))
t_test
```

```
-2.5454545454545556
```

```python
## Calculate the critical t value
t_crit = t.ppf(alpha, df=n-1)
t_crit
```

```
-1.7108820799094282
```

```python
## Calculate the p-value
p_val = t.cdf(t_test, df=n-1)
p_val
```

```
0.008878157746280955
```
The critical value is $-t_{0.05,\,24}$, or the quantile $q$ such that $P(T_{24} < q) = 0.05$. Therefore, we use t.ppf() to get the value. Notice that the p-value is the probability that a Student’s $t$ variable with $n - 1 = 24$ degrees of freedom is smaller (more extreme) than the test statistic. In Python, we use t.cdf() to get the probability. Both the t.ppf() and t.cdf() functions focus on the lower tail, or left tail, which is what we need in this left-tailed test.
16.3 Example: Two-tailed z-test
The price of a gallon of 2% milk is normally distributed with a standard deviation of $0.10. Last week the mean price of a gallon of milk was $2.78. This week, based on a sample of size 25, the sample mean price of a gallon of milk was $\bar{x} = 2.80$. At $\alpha = 0.05$, determine if the mean price is different this week.
Source: https://unsplash.com/photos/BYlHH_1j2GA
Step-by-Step
Step 1: Set the $H_0$ and $H_1$ from a Claim
From the sentence “determine if the mean price is different this week,” we know the claim, or what we are interested in, is an $H_1$ claim. If we let $\mu$ be the mean milk price this week, we have the test
$$H_0: \mu = 2.78 \quad \text{vs.} \quad H_1: \mu \neq 2.78,$$
where 2.78 is the mean milk price last week.
Step 2: Set the Significance Level
We use the significance level $\alpha = 0.05$.
Step 3: Calculate the Test Statistic
From the question we know that the population is normally distributed and $\sigma = 0.10$ is known. So we use the $z$-test, and the test statistic is
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{2.80 - 2.78}{0.10/\sqrt{25}} = 1.$$
Step 4-c: Find the Critical Value
Since it is a two-tailed test, we have two potential critical values, $-z_{\alpha/2}$ and $z_{\alpha/2}$. Because $z = 1 > 0$ is on the right-hand side of the standard normal distribution, we compare it with the critical value on the right, which is $z_{\alpha/2} = z_{0.025} = 1.96$.
Step 5-c: Draw a Conclusion Using Critical Value
This is a two-tailed test, and we reject $H_0$ if $\lvert z\rvert > z_{\alpha/2}$. Since $\lvert z\rvert = 1 < 1.96$, we DO NOT reject $H_0$.
Step 4-p: Find the P-Value
This is a two-tailed test, and the test statistic $z = 1$ is on the right-hand side, so the $p$-value is $2\,P(Z > 1) \approx 0.317$.
Step 5-p: Draw a Conclusion Using P-Value Method
We reject $H_0$ if the $p$-value $< \alpha$. Since the $p$-value $= 0.317 > 0.05$, we DO NOT reject $H_0$.
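Both decision rules for this example can be verified in a few lines; a sketch, where $\bar{x} = 2.80$ is the value consistent with the reported $p$-value of 0.317:

```python
from math import sqrt
from scipy.stats import norm

# Milk price example (two-tailed z test); x_bar = 2.80 is inferred
# from the reported p-value of 0.317, not stated explicitly.
alpha, mu_0, sigma, n, x_bar = 0.05, 2.78, 0.10, 25, 2.80

z = (x_bar - mu_0) / (sigma / sqrt(n))  # test statistic
z_crit = norm.ppf(1 - alpha / 2)        # two-tailed critical value
p_val = 2 * norm.sf(abs(z))             # two-tailed p-value

print(round(z, 2), round(z_crit, 2), round(p_val, 3))  # 1.0 1.96 0.317
```

Since $\lvert z\rvert < z_{\alpha/2}$ and the p-value exceeds $\alpha$, both rules agree: do not reject $H_0$.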
The critical-value and p-value method are illustrated in Figure 16.8.
Figure 16.8: Illustration of Critical Value and P-Value methods
Step 6: Restate the Conclusion in Nontechnical Terms, and Address the Original Claim
There is insufficient evidence to support the claim that this week the mean price of milk is different from the price last week.
Below is a table that summarizes what we have learned about hypothesis testing in this chapter.
| | Numerical Data, $\sigma$ known | Numerical Data, $\sigma$ unknown |
|---|---|---|
| Parameter of Interest | Population mean $\mu$ | Population mean $\mu$ |
| Test Type | One-sample $z$ test | One-sample $t$ test |
| Confidence Interval | $\bar{x} \pm z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$ | $\bar{x} \pm t_{\alpha/2,\,n-1}\,\dfrac{s}{\sqrt{n}}$ |
| Test Stat under $H_0$ | $z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ | $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$ |
| $p$-value (right-tailed) | $P(Z > z)$ | $P(T_{n-1} > t)$ |
| $p$-value (left-tailed) | $P(Z < z)$ | $P(T_{n-1} < t)$ |
| $p$-value (two-tailed) | $2\,P(Z > \lvert z\rvert)$ | $2\,P(T_{n-1} > \lvert t\rvert)$ |
16.5 Type I and Type II Errors
It is important to remember that hypothesis testing is not perfect, meaning that we may make a wrong decision or conclusion. After all, the collected evidence may not be able to present the full picture of what the true population distribution is. There are two types of errors we may commit when doing hypothesis testing: Type I error and Type II error.
If in fact $H_0$ is true, but we wrongly reject it, we commit a Type I error: we shouldn’t have rejected it, but we did. If $H_0$ is false, but we don’t reject it, we commit a Type II error: we should have figured out that $H_0$ does not make sense. The following table tells us when we make a correct decision and when we don’t. In practice, we will not know for certain whether we made the correct decision or committed one of these two errors, because we never know the truth!
| Decision | $H_0$ is true | $H_0$ is false |
|---|---|---|
| Reject $H_0$ | Type I error | Correct decision |
| Do not reject $H_0$ | Correct decision | Type II error |
Back to the crime example, where $H_0$: the person is not guilty vs. $H_1$: the person is guilty. We can draw a decision table like
| Decision | Truth: the person is innocent | Truth: the person is guilty |
|---|---|---|
| Jury decides the person is guilty | Type I error | Correct decision |
| Jury decides the person is not guilty | Correct decision | Type II error |
Is it worse to wrongly convict an innocent person (Type I error) or to let a perpetrator go free (Type II error)? Both have hugely negative impacts on our society, and if possible, we should commit the two errors as rarely as possible.
Figure 16.9: Example of Type I and Type II errors (https://www.statisticssolutions.com/wp-content/uploads/2017/12/rachnovblog.jpg)
If you still don’t get the idea of Type I and Type II errors, Figure 16.9 is a classic example of the two errors. Of course the null hypothesis is “not pregnant”, and the alternative hypothesis is “pregnant”. Claiming that an old man is expecting a baby is a Type I error, and saying a pregnant woman is not having a baby is a Type II error.
In statistics, the probability of committing a Type I error is in fact the significance level $\alpha$: $\alpha = P(\text{Type I error}) = P(\text{reject } H_0 \mid H_0 \text{ is true})$.
If the evidence occurs with probability lower than 5%, it is considered sufficient evidence to reject $H_0$, even when $H_0$ is actually the true mechanism giving rise to such evidence.
What is the probability of committing a Type II error, the probability that we fail to reject $H_0$ when $H_0$ is a false statement? We call this probability $\beta$: $\beta = P(\text{Type II error}) = P(\text{do not reject } H_0 \mid H_0 \text{ is false})$.
$\alpha$, $\beta$, and the sample size $n$ are related. If we choose any two of them, the third is automatically determined. We would of course prefer $\alpha$ to be small, since we would not like to conclude in favor of the research hypothesis falsely. But given the sample size, a small $\alpha$ leads to a large $\beta$. On the other hand, too small an $\alpha$ would most likely result in no discovery, because we would be conservative, set the threshold too high, and fail to reject lots of $H_0$s that should be rejected. In practice, we specify $\alpha$ beforehand, and then select a sample size $n$ that is practical, so the $\beta$ is determined.
It would be great if we correctly reject $H_0$ when $H_0$ is actually false. We hope the probability of this happening is large. This probability is $1 - \beta$, which is called the power of the test. The power depends on the same factors as $\beta$ does, including the size of $\alpha$, the sample size, and the true value of the parameter.
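We can see that $\alpha$ really is the Type I error rate by simulation: repeatedly draw samples from a world where $H_0$ is true and count how often the test rejects. A sketch, with an arbitrary illustrative $\sigma = 10$ (not from the text):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Simulate the Type I error rate: draw samples with H0 true (mu = 150)
# and count how often a left-tailed z test rejects at alpha = 0.05.
alpha, mu_0, sigma, n, reps = 0.05, 150, 10, 25, 10_000
z_crit = norm.ppf(alpha)  # left-tailed critical value, about -1.645

rejections = 0
for _ in range(reps):
    sample = rng.normal(mu_0, sigma, n)  # H0 is true in every replicate
    z = (sample.mean() - mu_0) / (sigma / np.sqrt(n))
    rejections += z < z_crit

print(rejections / reps)  # close to 0.05
```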
16.6 Statistical Power and Choosing Sample Size*
16.6.1 Power of a Hypothesis Test
In statistics, $1 - \beta = P(\text{reject } H_0 \mid H_0 \text{ is false})$ is called the (statistical) power of the test. We hope a test has high power, because it means the test is able to correctly reject $H_0$ when $H_0$ is false, just as we hope all criminals get arrested! The power depends on the same factors as $\beta$ does, including the size of $\alpha$, the sample size, and the true value of the parameter.
Now let’s learn the Type I error, Type II error, and power through distributions. A right-tailed example $H_0: \mu = \mu_0$ vs. $H_1: \mu > \mu_0$ is illustrated in Figure 16.10.
The distribution on the top is the distribution under $H_0$, that is, the distribution with mean $\mu_0$. Since it is a right-tailed test, we reject $H_0$ in favor of $H_1$ if the test statistic is greater than the critical value on the right of the distribution, indicated by the green vertical line. If we reject $H_0$ when $H_0$ is actually true, we commit a Type I error, and the probability of committing a Type I error, which is $\alpha$, is the green area in the right tail of the distribution.
Figure 16.10: Example of Power. Source
Now, suppose the true population mean is $\mu_a > \mu_0$. That is, $H_0$ is false and $H_1$ is true, and the true distribution is the blue curve centered at $\mu_a$ at the bottom. However, in reality most of the time we do not know the exact value of $\mu_a$. Remember that we make our decision under the mechanism or assumption of $H_0$. So we actually do our hypothesis testing under the distribution depicted by the purple dashed curve. And we know this distribution, because $\mu_0$ is the hypothesized value of $\mu$ we specify when we do the testing. If we fail to reject the false $H_0$, we commit a Type II error. The yellow area is the probability of committing a Type II error: the probability that we fail to reject $H_0$ when $H_0$ is false and $H_1$ is true, which is $\beta$. Although we don’t know the true value $\mu_a$, for this right-tailed test, the region to the right of the critical value is always the rejection region, and the yellow area to its left under the true distribution is $\beta$. The yellow area, the chance of committing a Type II error, is smaller (larger) if $\mu_a$ is much (only slightly) greater than $\mu_0$.
Which part represents the power in the figure? The power is $1 - \beta$, which is the green area under the blue density curve at the bottom. Remember that the total area under a probability density curve is 1, and the power is the total area minus $\beta$, the yellow area.
(Interactive figure: the sampling distributions under $H_0$ and $H_1$, shading the Type I error probability $\alpha = P(\text{reject } H_0 \mid H_0 \text{ true})$, the Type II error probability $\beta = P(\text{not rejecting } H_0 \mid H_0 \text{ not true})$, and the power $1 - \beta = P(\text{reject } H_0 \mid H_0 \text{ not true})$.)
In this section we illustrate how to calculate the type II error rate and the power. Suppose we randomly sample 36 values from a normally distributed population with known standard deviation σ and unknown mean μ. At significance level α, we are going to conduct the right-tailed test of H0: μ = μ0 against HA: μ > μ0.
Suppose we are sampling from a normal distribution and σ is known. Then the test statistic is z = (x̄ − μ0)/(σ/√n), which follows the standard normal distribution N(0, 1) under H0. Since it is a right-tailed test, we reject H0 if z > z_α.
Now we are going to re-express the rejection region in terms of x̄, because this will help us calculate β. The question is: for which values of x̄ will we reject H0? Remember, we reject H0 if the observed test statistic satisfies (x̄ − μ0)/(σ/√n) > z_α. We can isolate x̄ in the expression by multiplying both sides by σ/√n and adding μ0 to both sides. Then we have x̄ > μ0 + z_α (σ/√n).
Therefore, having evidence z > z_α is equivalent to having evidence x̄ > μ0 + z_α (σ/√n). Both represent the same rejection region: one measures the evidence using the standard normal distribution as the benchmark, and the other uses the hypothesized distribution of x̄ as the benchmark. The figure below illustrates this idea. We use different scales to measure the same thing.
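To make the equivalence of the two scales concrete, here is a small numeric sketch. The numbers (μ0 = 50, σ = 12, α = 0.05, and the observed x̄) are ours for illustration; only n = 36 matches the setup above.

```python
from scipy.stats import norm

# Hypothetical values; only n = 36 matches the chapter's setup:
mu0, sigma, n, alpha = 50, 12, 36, 0.05
se = sigma / n ** 0.5            # standard error of the sample mean: 2.0

z_crit = norm.ppf(1 - alpha)     # z_alpha for the right-tailed test
xbar_crit = mu0 + z_crit * se    # the same rejection region on the x-bar scale

# A sample mean is extreme on either scale in exactly the same cases:
xbar = 54.2
z = (xbar - mu0) / se
print(z > z_crit, xbar > xbar_crit)  # True True
```

Whichever scale we use, the decision is identical: z exceeds z_α exactly when x̄ exceeds μ0 + z_α (σ/√n).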
We are not able to calculate β or the power without knowing or assuming the true value of μ, which is not μ0, because β is based on the premise that H0 is false. Suppose the true mean is some value μ1 > μ0. Let's calculate β.
Now, under the true mean μ1, the sampling distribution of the sample mean is X̄ ~ N(μ1, σ²/n).
Because we reject H0 if x̄ > μ0 + z_α (σ/√n), we do not reject H0 if x̄ ≤ μ0 + z_α (σ/√n). Plugging this information into the probability, we have
β = P(X̄ ≤ μ0 + z_α (σ/√n) | μ = μ1) = P(Z ≤ z_α + (μ0 − μ1)/(σ/√n)).
In general, if the hypothesized mean value is μ0 and the true mean value is μ1, we have the following formulas for calculating β for a one-sample z test. The derivation is left as an exercise.
For one-tailed tests (either left-tailed or right-tailed),
β = P(Z ≤ z_α − |μ0 − μ1| / (σ/√n)).
For two-tailed tests,
β = P(Z ≤ z_{α/2} − |μ0 − μ1| / (σ/√n)) − P(Z ≤ −z_{α/2} − |μ0 − μ1| / (σ/√n)).
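These β formulas can be wrapped in one small helper; a minimal sketch under the one-sample z-test assumptions (the function name and the example numbers are ours):

```python
from scipy.stats import norm

def type2_error(mu0, mu1, sigma, n, alpha, tails=1):
    """beta of a one-sample z test when the true mean is mu1."""
    shift = abs(mu0 - mu1) * n ** 0.5 / sigma
    if tails == 1:                        # left- or right-tailed
        return norm.cdf(norm.ppf(1 - alpha) - shift)
    z = norm.ppf(1 - alpha / 2)           # two-tailed
    return norm.cdf(z - shift) - norm.cdf(-z - shift)

# Illustrative numbers (ours, not the chapter's):
beta1 = type2_error(mu0=50, mu1=54, sigma=12, n=36, alpha=0.05)
beta2 = type2_error(mu0=50, mu1=54, sigma=12, n=36, alpha=0.05, tails=2)
print(round(beta1, 4), round(beta2, 4))  # two-tailed beta is larger here
```

With the same α, the two-tailed test splits its rejection probability between the two tails, so it has less power (a larger β) against this one-sided alternative.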
Note again that to compute β, we need the value of the true mean μ1, which in reality is usually unknown. But we can certainly check how the value of μ1 affects β through this formula.
16.6.3 Power Analysis
Back to the milk price example in Section 16.3, we have the two-tailed test of H0: μ = μ0 against HA: μ ≠ μ0,
and we do not reject H0.
The question here is
If we do not reject H0, is the conclusion that the price has not changed reasonable or acceptable? Let's see the chance of making a wrong decision.
We check the probability that we do not reject H0 when H0 is false, i.e., the probability that we conclude that the mean milk price has not changed when in fact it did.
Since it is a two-tailed test, the type II error rate can be quite high when the true mean is close to the hypothesized value. If the actual mean milk price this week is slightly above or below the hypothesized price, there is about a 3 in 10 chance of making a wrong conclusion. If we think the risk is too high, we need to collect more than 25 samples. With a fixed α, we can increase the testing power, or equivalently decrease the type II error rate, by increasing the sample size. This leaves a question: how do we choose a sample size that keeps β at some low level?
Let's go back to the formula for β.
What will increase the power (decrease β)?
μ0 and μ1 are farther apart: When μ0 and μ1 are far apart, the evidence that μ is not μ0 is stronger, and it is more likely that we reject H0 when it is false. The chance of making a type II error decreases. Look how β decreases from the blue area to the red area when the true mean value increases from 56 to 60.
Larger α: α and β trade off against each other. Everything else held constant, if you increase α, you use smaller critical values and allow more type I errors, or false discoveries. But at the same time, rejecting more often reduces the cases in which we fail to reject H0 when it is false. In other words, the type II error rate goes down. The figure below shows how β changes when α is increased from 0.05 to 0.2.
Smaller σ: When σ is small, given the same locations of the sampling distributions under H0 and HA, that is, the same μ0 and μ1, the two distributions are more separated and their overlapping region is smaller. The figure below shows how β shrinks from the blue area to the red area when σ becomes smaller, because the two distributions become peaked and thin-tailed.
Larger sample size n: When the sample size gets large, more information is collected, and therefore the sampling distribution of the sample mean becomes more concentrated around its mean. This has the same effect as having a smaller σ.
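Each of the four effects above can be checked numerically. A sketch for a right-tailed z test with made-up baseline values (μ0 = 56, μ1 = 58, σ = 10, n = 25, α = 0.05, all ours):

```python
from scipy.stats import norm

def beta_right(mu0, mu1, sigma, n, alpha):
    """beta of a right-tailed one-sample z test when the true mean is mu1."""
    return norm.cdf(norm.ppf(1 - alpha) - (mu1 - mu0) * n ** 0.5 / sigma)

# Made-up baseline values for illustration:
base = beta_right(56, 58, 10, 25, 0.05)

# Each change below shrinks beta, i.e., raises the power:
print(beta_right(56, 60, 10, 25, 0.05) < base)   # mu1 farther from mu0 -> True
print(beta_right(56, 58, 10, 25, 0.20) < base)   # larger alpha         -> True
print(beta_right(56, 58, 5, 25, 0.05) < base)    # smaller sigma        -> True
print(beta_right(56, 58, 10, 100, 0.05) < base)  # larger n             -> True
```

All four changes move β in the same direction the bullet points predict.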
To keep both α and β at specified levels, the only way is to increase the sample size n. With significance level α and type II error rate β, the formulas for the required sample size are shown below. The derivation is left as an exercise.
One-tailed test (either left-tailed or right-tailed):
n = ((z_α + z_β) σ / (μ0 − μ1))²
Two-tailed test:
n = ((z_{α/2} + z_β) σ / (μ0 − μ1))²
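A minimal sketch of these sample-size formulas in code (the function name and example values are ours); math.ceil rounds up because n must be an integer at least as large as the formula's value:

```python
import math
from scipy.stats import norm

def required_n(mu0, mu1, sigma, alpha, beta, tails=1):
    """Smallest n keeping the type I rate at alpha and the type II rate <= beta."""
    z_a = norm.ppf(1 - alpha / tails)  # z_alpha (tails=1) or z_{alpha/2} (tails=2)
    z_b = norm.ppf(1 - beta)           # z_beta
    return math.ceil(((z_a + z_b) * sigma / (mu0 - mu1)) ** 2)

# Illustrative numbers (ours, not the chapter's):
print(required_n(50, 54, 12, alpha=0.05, beta=0.10))           # one-tailed -> 78
print(required_n(50, 54, 12, alpha=0.05, beta=0.10, tails=2))  # two-tailed -> 95
```

As expected, the two-tailed test needs a larger sample to achieve the same power, because it uses the larger critical value z_{α/2}.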
Example: Sample size
A cereal company sells boxes of cereal with a labeled weight of 16 oz. The production is based on a mean weight of 16.37 oz, so that only a small portion of boxes weigh less than 16 oz. The box weight is normally distributed with standard deviation σ ≈ 0.225, so that the percentage of boxes weighing less than 16 oz is 5% (since 16.37 − 1.645 × 0.225 ≈ 16). The company suspects that, due to some production defect, the mean weight filled into the boxes is less than 16.37 oz, and would like to conduct a test at significance level α = 0.05.
This is an HA claim, and the test is H0: μ = 16.37 vs. HA: μ < 16.37.
Question: How many boxes should be sampled in order to correctly discover that the mean is less than 16.37 with power 0.99, if in fact the true mean weight is 16.27 oz or less?
The power is 0.99, so β = 1 − 0.99 = 0.01. With α = 0.05 and β = 0.01, we have z_α = z_{0.05} = 1.645 and z_β = z_{0.01} = 2.326.
The formula for the one-tailed test is
n = ((z_α + z_β) σ / (μ0 − μ1))²,
with μ0 = 16.37 and μ1 = 16.27.
Thus,
n = ((1.645 + 2.326)(0.225) / (16.37 − 16.27))² ≈ 79.8.
They need at least 80 samples to conduct the test under the specified conditions (α = 0.05, power 0.99).
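As a check, the whole example can be reproduced in a few lines; σ is recovered from the stated fact that 5% of boxes weigh less than 16 oz:

```python
import math
from scipy.stats import norm

z05 = norm.ppf(0.95)             # z_{0.05} ~ 1.645
z01 = norm.ppf(0.99)             # z_{0.01} ~ 2.326 (power 0.99 -> beta = 0.01)

# sigma recovered from "5% of boxes weigh less than 16 oz":
sigma = (16.37 - 16) / z05       # ~ 0.225

n = ((z05 + z01) * sigma / (16.37 - 16.27)) ** 2
print(round(sigma, 3), round(n, 1), math.ceil(n))  # 0.225 79.8 80
```

Rounding up gives the required sample size of 80 boxes.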
16.7 Exercises
Here are summary statistics for randomly selected weights of newborn boys: the sample size n, the sample mean x̄ in hg (1 hg = 100 grams), and the sample standard deviation s in hg.
With significance level 0.01, use the critical value method to test the claim that the population mean of birth weights of females is greater than 30hg.
Do the test in (c) by using the p-value method.
You are given the following hypotheses: We know that the sample standard deviation is 5 and the sample size is 24. For what sample mean would the p-value be equal to 0.05? Assume that all conditions necessary for inference are satisfied.
Our one-sample test is conducted with significance level α.
Describe how we reject H0 using the critical-value method and the p-value method.
Why do the two methods lead to the same conclusion?
Hypothesis testing is also called Null Hypothesis Statistical Testing (NHST), statistical testing or test of significance.↩︎