22  Analysis of Variance

Comparing More Than Two Population Means

In Chapter 16 we learned how to compare two population means. In many research settings, however, we want to compare three or more population means. For example,

  • Are there differences in the mean readings of 4 types of devices used to determine the pH of soil samples?

Source: Unsplash-Markus Spiske.

  • Do different treatments (None, Fertilizer, Irrigation, Fertilizer and Irrigation) affect the mean weights of poplar trees?

Source: Unsplash-Nareeta Martin.

In both cases, four means are compared. The two-sample \(z\) or \(t\) test cannot be used anymore. Our solution is the analysis of variance, or ANOVA.

22.1 ANOVA Rationale

One-Way Analysis of Variance

A factor (treatment) is a property or characteristic (categorical variable) that allows us to distinguish the different populations from one another. In fact, when we compare two population means, the two samples come from populations separated by the categories of a factor. For example, gender is a factor that separates the human population into male and female populations, and we then compare their mean salaries. In our previous examples, type of device and treatment of trees are factors.

One-way ANOVA examines the effect of a single categorical variable on the mean of a numerical variable. The model is called one-way because only one factor is considered. If two factors are examined to see how they affect the mean of a numerical variable, the model is called two-way ANOVA. The numerical variable in an ANOVA or regression model (Chapter 23) is called the response variable because we want to see how it responds to changes in the factors or other variables.

Interestingly, we analyze variances to test the equality of three or more population means 🤔. It sounds counterintuitive, but later you will see why this makes sense.

Requirements

The one-way ANOVA model has the following requirements or assumptions.

  • Each sub-population formed by the categories of the factor is normally distributed. In the two-sample \(z\) or \(t\) test, we have a similar assumption: the two samples are drawn independently from two normally distributed populations, or at least both sample sizes are sufficiently large.

  • The populations have the same variance \(\sigma^2\). ANOVA does not assume, unrealistically, that all the population variances are known. Instead, the model admits that the variances are unknown, but assumes they are all equal. Does that also sound unrealistic? It is one of the limitations of ANOVA, although equality of variances is a fairly loose requirement, and it is also possible to transform the data so that the transformed samples have variances of similar magnitude. Does this assumption remind you of something? The two-sample pooled \(t\)-test has the equal-variance assumption too. From this point of view, one-way ANOVA is a generalization of the two-sample pooled \(t\)-test. The next two requirements will not be surprising at all because they are also requirements of the two-sample pooled \(t\)-test.

  • The samples are random samples. Unless explicitly mentioned otherwise, the inference methods and statistical models introduced in this book are based on random samples.

  • The samples are independent of each other. They are not matched or paired in any way.

Data and Math Notation for ANOVA

Source: Table 8.5 of SMD.

  • \(y_{ij}\): \(j\)-th observation from population \(i\). \(\color{blue}{(y_{24}:\text{ 4-th observation from population 2})}\)

  • \(n_i\): number of values in the \(i\)-th group/sample. (\(i = 1, \dots, k = 5\)) \(\color{blue}{\small (n_1 = n_2 = \dots = n_5 = 4)}\)

  • \(\bar{y}_{i\cdot}\): the average of values in the \(i\)-th sample, \(\bar{y}_{i\cdot} = \frac{\sum_{j=1}^{n_i}y_{ij}}{n_i}.\)

  • \(N\): total number of values in all samples combined, \(\small N = n_1 + \dots + n_k = \sum_{i=1}^kn_i\). \(\color{blue}{\small (N = 4 + \dots + 4 = 20)}\)

  • \(\bar{y}_{\cdot\cdot}\): the grand sample mean that is the average of all sample values combined, \(\bar{y}_{\cdot\cdot} = \frac{\sum_{i=1}^{k}\sum_{j=1}^{n_i}y_{ij}}{N} = \frac{\sum_{i=1}^{k}n_i\bar{y}_{i\cdot}}{n_1 + \dots + n_k}\)
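To make the notation concrete, here is a minimal R sketch (with made-up numbers, not the data referenced above) that computes \(n_i\), \(\bar{y}_{i\cdot}\), \(N\), and \(\bar{y}_{\cdot\cdot}\):

# Hypothetical data: a list of k = 3 group samples
y <- list(g1 = c(3, 17, 12), g2 = c(19, 10, 9), g3 = c(25, 15, 12))

n_i    <- sapply(y, length)      # group sizes n_i
ybar_i <- sapply(y, mean)        # group means  y-bar_{i.}
N      <- sum(n_i)               # total number of observations
ybar   <- mean(unlist(y))        # grand mean   y-bar_{..}

# The grand mean is also the weighted average of the group means
all.equal(ybar, sum(n_i * ybar_i) / N)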

Mathematically, suppose there are \(k\) populations/samples, and for group \(i = 1, \dots, k\), \(n_i\) observations are drawn. Then

\[y_{ij} \stackrel{iid}{\sim} N(\mu_i, \sigma^2), ~~ j = 1, 2, \dots, n_i.\]

Our goal is to test whether or not \(\mu_1 = \mu_2 = \cdots = \mu_k.\) If we do not reject \(H_0: \mu_1 = \mu_2 = \cdots = \mu_k,\) then all samples are viewed as samples from a common normal distribution \(N(\mu, \sigma^2)\), where \(\mu = \mu_1 = \mu_2 = \cdots = \mu_k\):

\[y_{ij} \stackrel{iid}{\sim} N(\mu, \sigma^2), ~~ j = 1, 2, \dots, n_i.\]

Rationale

This section is quite important because we discuss in detail why we use the size of variances to test whether or not population means are equal.

Suppose we have two data sets, Data 1 and Data 2, and both have three groups to be compared. Interestingly, Data 1 and Data 2 have the same group sample means \(\bar{y}_1\), \(\bar{y}_2\) and \(\bar{y}_3\), denoted as red dots in Figure 22.1. However, they differ in the variance within each group. Data 1 has small variance within samples, while Data 2 has large variance within samples. Within groups, the data points in Data 1 are quite tight and close to each other, and therefore they are all close to their sample mean. On the contrary, the data points in Data 2 are quite distant from each other, even though they are in the same group. This pattern tells us that the populations from which the samples of Data 1 are drawn have small variance \(\sigma^2\), while the populations behind Data 2 have large variance.

Before we go into more details, let’s see if you have a great intuition.

For which data do you feel more confident in saying the population means \(\mu_1\), \(\mu_2\) and \(\mu_3\) are not all the same?

Figure 22.1: Boxplots illustrating the variance within samples.

If your answer is Data 1, congratulations, you are correct!

The difference in sample means in Data 1 is more likely due to a true difference in population means (\(\mu_1, \mu_2, \mu_3\) not all the same). Because of the small variation within groups, a value drawn from group 1 of Data 1 is very unlikely to have come from another group; under the normality assumption, the chance of that is tiny. Figure 22.2 clearly illustrates this idea. The samples of Data 1 are so well separated that we are more confident in saying they are drawn from three well-separated populations that do not overlap each other and have distinct population means \(\mu_1, \mu_2\), and \(\mu_3\).

If the variance within groups is large, as shown at the bottom of Figure 22.2, all three samples are mixed up together, as if they were all from one common population, even though they are in fact from three distinct populations \(N(\mu_1, \sigma^2), N(\mu_2, \sigma^2), N(\mu_3, \sigma^2)\) with \(\mu_1 \ne \mu_2 \ne \mu_3\). Note that the three populations still have their own distinct population means, identical to the means in the small-variance case, but it is hard for us to learn this fact from the mixed random samples. Because the three samples are indistinguishable, we don't have strong evidence to conclude that each corresponding population has its own population mean, and we tend to conclude that the three samples are from a common population \(N(\mu, \sigma^2)\).

Figure 22.2: Populations with small and large variance within samples.

The three samples in Figure 22.3 are from a common population \(N(\mu, \sigma^2)\). The samples look pretty similar to the samples with large variance in Figure 22.2. In other words, either one common population or three populations with large variance can produce quite similar samples. With the samples only, we cannot tell which data-generating mechanism actually produced the data.

Figure 22.3: Three samples drawn from a common normal population.
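If you would like to reproduce pictures like these yourself, the following sketch (with arbitrary means, variances, and sample sizes of my own choosing) simulates the three scenarios: distinct means with small variance, distinct means with large variance, and one common population.

set.seed(2022)
n  <- 30
mu <- c(10, 15, 20)   # distinct population means (arbitrary)

small_var <- lapply(mu, function(m) rnorm(n, mean = m, sd = 1))    # well separated
large_var <- lapply(mu, function(m) rnorm(n, mean = m, sd = 10))   # mixed together
common    <- lapply(1:3, function(i) rnorm(n, mean = 15, sd = 10)) # one population

# Quick visual check of the three scenarios
par(mfrow = c(3, 1))
stripchart(small_var, method = "jitter", main = "Distinct means, small variance")
stripchart(large_var, method = "jitter", main = "Distinct means, large variance")
stripchart(common,    method = "jitter", main = "One common population")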

Note

Always keep in mind that all we have are samples, and we never know the true means \(\mu_1\), \(\mu_2\), and \(\mu_3\). In the figures, we pretend we know the true populations and look at what the samples would be like. Statistical inference is trickier. We want a decision rule that tells us how well separated the samples must be before we can say they come from three different populations.


Variation Between Samples & Variation Within Samples

There are two types of variation we need to consider in order to determine whether population means are identical. They are variation between samples and variation within samples.

We have discussed the variation within samples, or variance within groups, which measures the variability of the data points in each group. This is actually a sample point estimate of the population variance \(\sigma^2\) because the data points in the \(i\)-th group are assumed to come from the \(i\)-th population \(N(\mu_i, \sigma^2).\) There are \(k\) sample variances, one for each group. Later, we will learn how to combine them into one single variance within samples as an estimate of \(\sigma^2\).

Variation between samples, on the other hand, measures the variability of the sample means. The farther away from each other the sample means are, the larger the variation between samples. In our Data 1 and Data 2 example, the variation between samples is very similar because the relative locations of the sample means are basically the same. Figure 22.4 illustrates the variance between samples. Data 3 and Data 4 have the same variance within samples, but Data 3 has small variation between samples and Data 4 has large variation between samples. Clearly, the sample means in Data 4 are farther away from each other than those in Data 3.

Figure 22.4: Boxplots illustrating the variance between samples.

Let me ask you the same question for Data 3 and 4.

For which data do you feel more confident in saying the population means \(\mu_1\), \(\mu_2\) and \(\mu_3\) are not all the same?

Your answer should be Data 4. The sample means \(\bar{y}_1\), \(\bar{y}_2\) and \(\bar{y}_3\) serve as unbiased point estimates of \(\mu_1\), \(\mu_2\) and \(\mu_3\), respectively, so when they are far away from each other, we tend to claim that \(\mu_1\), \(\mu_2\) and \(\mu_3\) are not identical.

We have an important finding. Whether or not there is a difference in population means depends on the relative size of the variation between samples and the variation within samples. When the variability between samples is large in comparison to the variation within samples, as in Data 1 and Data 4, we tend to conclude that the population means are not all the same. When the variation between samples is small relative to the variation within samples, as in Data 2 and Data 3, it is hard to exclude the possibility that all samples come from the same population.

22.2 ANOVA Procedures

ANOVA is usually done by providing the ANOVA table below.

In this section, we are going to learn the meaning of every cell in the table, and how to use the table to do the testing of equality of population means.

The hypotheses are \[\begin{align} &H_0: \mu_1 = \mu_2 = \cdots = \mu_k\\ &H_1: \text{Population means are not all equal} \end{align}\]

Note that the alternative hypothesis is not \(H_1: \mu_1 \ne \mu_2 \ne \cdots \ne \mu_k\). That is just one scenario in which \(H_0\) fails. Any \(\mu_i \ne \mu_j\), \(i \ne j\), violates \(H_0\), and is therefore a possibility covered by \(H_1\).

We learned that whether or not there is a difference in population means depends on the relative size of the variation between samples and the variation within samples. Statistician Ronald Fisher found a way to define these two variances so that their ratio follows the \(F\) distribution: \[\frac{\text{variance between samples}}{\text{variance within samples}} \sim F_{df_B,\, df_W}\] The degrees of freedom \(df_B\) is paired with the variance between samples, and \(df_W\) with the variance within samples. Be careful: the order matters.

ANOVA uses an \(F\) test. If the variance between samples is much larger than the variance within samples, the \(F\) test statistic \(F_{test}\) will be much greater than 1, possibly exceeding the \(F\) critical value, and \(H_0\) is rejected.

The key question is how variance between samples and variance within samples are defined so that the ratio is \(F\) distributed.

Variance Within Samples

One-way ANOVA is a generalization of the two-sample pooled \(t\)-test. In the two-sample pooled \(t\)-test with equal variance \(\sigma^2\), we have the pooled sample variance \[s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}\]

How about the pooled sample variance for \(k\) samples? ANOVA assumes the populations have the same variance, \(\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2 = \sigma^2\). With the same logic, we have the pooled sample variance from \(k\) samples \[\boxed{s_W^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2 + \cdots + (n_k-1)s_k^2}{n_1 + n_2 + \cdots + n_k - k}}\] where \(s_i^2\), \(i = 1, \dots ,k\), is the sample variance of the \(i\)-th group. \(s_W^2\) represents a combined estimate of the common variance \(\sigma^2\). It measures the variability of the observations within the \(k\) populations. Note that each \(s_i^2\) measures the variability within the \(i\)-th sample, so this pooled estimate is the variance within samples.
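As a quick numerical illustration, here is a minimal sketch of computing \(s_W^2\) directly from the formula, using made-up samples for \(k = 3\) groups:

# Hypothetical samples for k = 3 groups
y    <- list(c(3, 17, 12, 10), c(19, 10, 9, 7), c(25, 15, 12, 9))
n_i  <- sapply(y, length)
s2_i <- sapply(y, var)           # sample variances s_i^2

# Pooled variance within samples
s2_W <- sum((n_i - 1) * s2_i) / (sum(n_i) - length(y))
s2_W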

Variance Between Samples

The variance between samples measures variability among sample means for the \(k\) groups, which is defined as \[\boxed{s^2_{B} = \frac{\sum_{i=1}^k n_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2}{k-1}}\] where

  • \(\bar{y}_{i\cdot}\) is the \(i\)-th sample mean.
  • \(\bar{y}_{\cdot\cdot}\) is the grand sample mean with all data points in all groups combined.
  • \(n_i\) is the sample size of the \(i\)-th sample.

The variance between samples measures how far the group means deviate from the overall grand mean. When the \(\bar{y}_{i\cdot}\)s are far away from each other, \((\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2\) will be large, and so is \(s^2_{B}\).
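A companion sketch for \(s_B^2\), using the same made-up samples as above:

# Hypothetical samples for k = 3 groups
y      <- list(c(3, 17, 12, 10), c(19, 10, 9, 7), c(25, 15, 12, 9))
k      <- length(y)
n_i    <- sapply(y, length)
ybar_i <- sapply(y, mean)        # group means
ybar   <- mean(unlist(y))        # grand mean

# Variance between samples
s2_B <- sum(n_i * (ybar_i - ybar)^2) / (k - 1)
s2_B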

In fact, not only does \(s^2_{W}\) estimate \(\sigma^2\); when \(H_0\) is true, \(s^2_{B}\) is also an estimate of \(\sigma^2\). If \(H_0\) is true \((\mu_1 = \cdots = \mu_k = \mu)\), any variation in the sample means is due to chance and randomness, so it shouldn't be too large. \(\bar{y}_{1\cdot}, \cdots, \bar{y}_{k\cdot}\) should be close to each other and close to \(\bar{y}_{\cdot \cdot}\). This leads to a small \(s^2_{B}\) and a small ratio \(s^2_{B}/s^2_{W}\), which is our \(F\) test statistic. That's why when \(\mu_1 = \cdots = \mu_k = \mu\), we have a small \(F\) test statistic and tend not to reject \(H_0\).

ANOVA Table: Sum of Squares

In the ANOVA table, there is a column Sum of Squares. There are three types of sum of squares. Let’s learn what they are.

Total Sum of Squares (SST) measures the total variation around \(\bar{y}_{\cdot\cdot}\) in all of the sample data combined (ignoring the groups), which is defined as: \[\color{blue}{SST = \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_{\cdot\cdot}\right)^2}\] where \(y_{ij}\) is the \(j\)-th data point in the \(i\)-th group.

Sum of Squares Between Samples (SSB) measures the variation between sample means: \[\color{blue}{SSB = \sum_{i=1}^{k}n_i \left(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot}\right)^2}\]

Sum of Squares Within Samples (SSW) measures the variation of any value, \(y_{ij}\), about its sample mean, \(\bar{y}_{i\cdot}\): \[\color{blue}{SSW = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_{i\cdot}\right)^2 = \sum_{i=1}^{k} (n_i - 1)s_i^2}\]

Sum of Squares Identity

The three sum of squares are related by the identity \[SST = SSB + SSW.\]

Intuitively, for each data point \(y_{ij}\), its squared distance from the grand sample mean \(\bar{y}_{\cdot\cdot}\) can be decomposed into two parts: (1) the squared distance between its own group sample mean and the grand sample mean, and (2) its squared distance from its own group sample mean.
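A quick numerical check of the identity, again with made-up samples:

y      <- list(c(3, 17, 12, 10), c(19, 10, 9, 7), c(25, 15, 12, 9))
n_i    <- sapply(y, length)
ybar_i <- sapply(y, mean)
ybar   <- mean(unlist(y))

SST <- sum((unlist(y) - ybar)^2)
SSB <- sum(n_i * (ybar_i - ybar)^2)
SSW <- sum(sapply(y, function(g) sum((g - mean(g))^2)))
all.equal(SST, SSB + SSW)        # TRUE: SST = SSB + SSW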

Note that the sum of squares statistics have associated degrees of freedom. More interestingly, the three degrees of freedom also form an identity. So

\[df_{T} = df_{B} + df_{W}\] where

  • \(df_{T} = N-1\) is the degrees of freedom of \(SST\)
  • \(df_{B} = k - 1\) is the degrees of freedom of \(SSB\)
  • \(df_{W} = N - k\) is the degrees of freedom of \(SSW\).

When a sum of squares is divided by its degrees of freedom, we get its mean square (MS), i.e., \[\text{mean square} = \dfrac{\text{sum of squares}}{\text{degrees of freedom}}.\] We are particularly interested in the mean square between (MSB) and mean square within (MSW):

  • \(MSB = \frac{SSB}{k-1} = s^2_{B}\)
  • \(MSW = \frac{SSW}{N-k} = s^2_{W}\)

Please check the formulas of \(SSB\) and \(SSW\). You will find that \(MSB\) is our variance between samples and \(MSW\) is our variance within samples! So our \(F\) test statistic is \[F_{test} = \frac{MSB}{MSW}.\]

Under \(H_0\), \(s^2_{B}/s_W^2\) follows the \(F_{k-1, \, N-k}\) distribution. The first degrees of freedom is \(df_{B} = k - 1\), and the second is \(df_{W} = N - k\). They cannot be switched.

We reject \(H_0\) in favor of \(H_1\) if \(F_{test} > F_{\alpha, \, k - 1,\, N-k}\), or \(p\)-value \(P(F_{k - 1,\, N-k} > F_{test}) < \alpha\) for some significance level \(\alpha\).
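In R, the critical value and the \(p\)-value come from qf() and pf(). A minimal sketch with hypothetical values of \(F_{test}\), \(k\), and \(N\):

k      <- 3;  N <- 15;  alpha <- 0.05
F_test <- 0.83                   # hypothetical test statistic value

F_crit <- qf(1 - alpha, df1 = k - 1, df2 = N - k)    # critical value
p_val  <- 1 - pf(F_test, df1 = k - 1, df2 = N - k)   # P(F > F_test)
c(F_crit = F_crit, p_value = p_val)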

ANOVA Table

We are done! We’ve talked about every cell in the ANOVA table, and we can use the table to do the test and make a conclusion about the equality of population means.

22.3 ANOVA Example

We hypothesize that a nutrient called “isoflavones” varies among three types of food: (1) cereals and snacks, (2) energy bars and (3) veggie burgers.

A sample of 5 is taken from each type of food and the amount of isoflavones is measured. Is there a sufficient evidence to conclude that the mean isoflavone levels vary among these food items at \(\alpha = 0.05\)?


We are going to learn how to generate an ANOVA table using R.

Data

In order to use the R built-in function for ANOVA, we need to make sure the data are in the right format. The original data set may look like the object data below, where each column contains the five isoflavone measurements of one food type. It is not a "wrong" data format, just not the one we need for ANOVA. The data frame data_anova is the format needed for ANOVA. It has two columns: the response variable, the isoflavone measurement, labelled y, and the factor or categorical variable that may affect the response, the food type, labelled food. In this format there are a total of \(N = 15\) observations and rows. Please transform your data into this format before you do ANOVA; one way to do the transformation is sketched after the printout below.

data
   1  2  3
1  3 19 25
2 17 10 15
3 12  9 12
4 10  7  9
5  4  5  8
data_anova
    y    food
1   3 cereals
2  17 cereals
3  12 cereals
4  10 cereals
5   4 cereals
6  19  energy
7  10  energy
8   9  energy
9   7  energy
10  5  energy
11 25  veggie
12 15  veggie
13 12  veggie
14  9  veggie
15  8  veggie
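One way to do the transformation, sketched here with the wide data typed in by hand (the column names are my own labels), is the base R function stack(), which collapses the columns into one value column and one factor column:

# Wide format: one column of measurements per food type
data <- data.frame(cereals = c(3, 17, 12, 10, 4),
                   energy  = c(19, 10, 9, 7, 5),
                   veggie  = c(25, 15, 12, 9, 8))

# Long format needed for ANOVA: one response column, one factor column
data_anova <- stack(data)
names(data_anova) <- c("y", "food")   # rename the values/ind columns to y/food
str(data_anova)                       # 15 rows: response y plus factor food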

The boxplot gives us a rough picture of the isoflavone distribution by food type. It is hard to say whether the food type affects the isoflavone level or not, and we need ANOVA to help us reach a conclusion.

Figure 22.5: Boxplot of the Isoflavone Content in 3 Types of Food
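For completeness, a boxplot like the one above can be drawn with one line of base R (assuming data_anova from the previous step):

# Side-by-side boxplots of isoflavone level by food type
boxplot(y ~ food, data = data_anova,
        xlab = "Food type", ylab = "Isoflavones")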


Test Assumptions

Before implementing any statistical method, always check its method assumptions.

ANOVA requires

  • \(\sigma_1 = \sigma_2 = \sigma_3\)
  • Data are generated from a normal distribution for each type of food.

Well, we did not learn how to test the equality of more than two population variances, but believe me, I did the test, and the three variances are not significantly different from each other. 1 Even if the variances are not all equal, the performance of ANOVA will not be much worse. Equality of variances is not a strict requirement. George Box of UW-Madison showed that as long as the sample sizes are (nearly) equal, the largest variance can be up to 9 times the smallest one and the result of ANOVA will continue to be reliable. A general rule of thumb for checking equal variances is to compare the smallest and largest sample standard deviations, much like the rule of thumb used in the two-sample test for independent means. If the ratio of these two sample standard deviations falls within 0.5 to 2, the assumption is probably not violated.
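If you want to check the rule of thumb yourself, here is a minimal sketch (assuming data_anova from above; the commented line is the Brown-Forsythe test mentioned in the footnote and needs the onewaytests package installed):

# Ratio of the largest to the smallest sample standard deviation by food type
sds <- tapply(data_anova$y, data_anova$food, sd)
max(sds) / min(sds)               # within roughly 0.5 to 2 is reassuring

# Formal test of equal variances (Brown-Forsythe)
# onewaytests::bf.test(y ~ food, data = data_anova)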

To check normality, we can look at the QQ plots. 2 Figure 22.6 shows no obvious non-normal pattern, although two data points fall outside the blue 95% confidence region. The normality assumption is not very restrictive either. As long as the distributions are not very skewed, ANOVA works pretty well.

We say the ANOVA \(F\) test is robust to the violation of the two assumptions.

Figure 22.6: QQ plots for each type of food
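A plain base R version of these per-group QQ plots (without the blue confidence region shown in Figure 22.6) can be drawn as follows, assuming data_anova from above:

# Normal QQ plot for each food type
par(mfrow = c(1, 3))
for (g in levels(data_anova$food)) {
  qqnorm(data_anova$y[data_anova$food == g], main = g)
  qqline(data_anova$y[data_anova$food == g])
}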

ANOVA Testing

We are interested in the following test:

\[\begin{align} &H_0: \mu_1 = \mu_2 = \mu_3\\ &H_1: \mu_i\text{'s are not all equal} \end{align}\]

where \(\mu_1\), \(\mu_2\), \(\mu_3\) stand for the population mean level of isoflavones of food cereals and snacks, energy bars, and veggie burgers respectively.

We could follow the regular six-step testing procedure, but generating the ANOVA table is more straightforward.

In R, we can do all the calculations and generate an ANOVA table using just one line of code. We first use the popular function lm() to implement ANOVA. The first argument is formula, which has the form response ~ factor, where response is the numeric response vector and factor is the categorical factor vector. In our data data_anova, y is the response and food is the factor. Note that if we just write y and food in the formula, R will throw an error because it does not recognize y and food: they are not R objects but column names of the data frame data_anova. Therefore, we need to tell R where y and food come from by specifying the data set in the data argument. If you do not want to specify the data argument, you can access the response and factor vectors by extracting them directly with data_anova$y and data_anova$food.

The word lm stands for linear model, and ANOVA is a linear model. If you just run lm(formula = data_anova$y ~ data_anova$food), it will show the linear model output related to ANOVA. We don't need it at this moment; we will discuss linear models further in the regression chapter (Chapter 23).

To obtain the ANOVA table, we apply the function anova() to the lm object. In the output, food is the source of variation between samples, and Residuals is the source of variation within samples. F value is the \(F\) test statistic value, not the critical value. Pr(>F) is the \(p\)-value. Since the \(p\)-value > 0.05, we do not reject \(H_0\). The evidence is not sufficient to reject the equality of the population means.

anova(lm(formula = y ~ food, data = data_anova))
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value Pr(>F)
food       2   60.4  30.200  0.8282 0.4603
Residuals 12  437.6  36.467               
anova(lm(formula = data_anova$y ~ data_anova$food))
Analysis of Variance Table

Response: data_anova$y
                Df Sum Sq Mean Sq F value Pr(>F)
data_anova$food  2   60.4  30.200  0.8282 0.4603
Residuals       12  437.6  36.467               
Note

There is another way to generate the ANOVA table in R. We use summary() and aov() functions.

summary(aov(y ~ food, data = data_anova))
            Df Sum Sq Mean Sq F value Pr(>F)
food         2   60.4   30.20   0.828   0.46
Residuals   12  437.6   36.47               
# oneway.test(y ~ food, data = data_anova, var.equal = TRUE)

22.4 Unequal Variances

To be added.

22.5 ANOVA Mathematical Model

To be added.

22.6 Pairwise Comparison

To be added.

22.7 Two-Way ANOVA

To be added.


  1. The test I use is Brown-Forsythe test. It can be performed using the function onewaytests::bf.test().↩︎

  2. There are several tests for normality, such as Shapiro–Wilk test (stats::shapiro.test()), Anderson–Darling test (nortest::ad.test()), and Kolmogorov–Smirnov test (stats::ks.test()).↩︎