Two-way ANOVA with unequal numbers of observations per cell.
TFG Documentation.

The data. In previous discussions of two-way ANOVA, we assumed that there
were n observations in each cell of the a × b design. If the sample sizes are unequal,
we will denote the number of observations in cell A = i, B = j as $n_{ij}$. The total
number of observations in the dataset is:

\[ N = \sum_{i=1}^{a} \sum_{j=1}^{b} n_{ij} \]

Why data are unbalanced. When A and B are experimental factors that are
randomly assigned, the researchers will usually plan the experiment to achieve equal
n's per cell. But mishaps may occur, causing observations to be missing. When
imbalance is caused by missing data, the analyses described in this lecture will be
valid under an assumption that the missing values are missing at random (MAR).
In this context, MAR means that the probability that any given observation Yijk is
missing cannot depend on the actual value of Yijk , although it may be related to the
factors A or B . Formal discussions of MAR are found in texts dealing with missing
data, such as the book by Little and Rubin (2002, Wiley).
When one or both factors are not randomly assigned, as in an observational study,
the investigators may have no control over the nij's, and imbalance will be the rule
rather than the exception.
OLS estimates for cell means are still the sample means. Using cell-means
notation, the model is:

\[ Y_{ijk} = \mu_{ij} + \varepsilon_{ijk} \qquad (1) \]

for i = 1, ..., a, j = 1, ..., b and k = 1, ..., $n_{ij}$. This is no different from unbalanced one-way ANOVA with ab groups. The OLS estimates for the cell means
$\mu_{11}, \ldots, \mu_{ab}$ are simply the sample averages within the cells:

\[ \hat{\mu}_{ij} = \bar{Y}_{ij\cdot} = \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} Y_{ijk} \]

Thus the fitted values are the $\bar{Y}_{ij\cdot}$'s. The error sum of squares is:
\[ SS_{Err} = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n_{ij}} \left( Y_{ijk} - \bar{Y}_{ij\cdot} \right)^2 \]
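The cell means and the error sum of squares can be computed directly from the raw observations. The following is a minimal sketch, assuming a hypothetical unbalanced 2 × 2 data set whose values are invented purely for illustration; the names `cells`, `cell_means`, `ss_err` and `df_err` are not from the notes.

```python
from statistics import mean

# Hypothetical unbalanced 2 x 2 data set (values invented for illustration):
# cells[(i, j)] holds the observations Y_ijk for A = i, B = j.
cells = {
    (1, 1): [10.0, 12.0, 11.0],
    (1, 2): [14.0, 16.0],
    (2, 1): [9.0],
    (2, 2): [20.0, 22.0, 21.0, 21.0],
}

# OLS estimates of the cell means are the within-cell sample averages.
cell_means = {ij: mean(ys) for ij, ys in cells.items()}

# Error sum of squares: squared deviations of each observation
# from the mean of its own cell.
ss_err = sum((y - cell_means[ij]) ** 2 for ij, ys in cells.items() for y in ys)

# Error degrees of freedom: sum over cells of (n_ij - 1) = N - ab.
n_total = sum(len(ys) for ys in cells.values())
df_err = n_total - len(cells)

print(cell_means, ss_err, df_err)
```

A cell with a single observation, such as cell (2, 1) here, contributes nothing to the error sum of squares and no error degrees of freedom.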


The error degrees of freedom are

\[ \sum_{i=1}^{a} \sum_{j=1}^{b} (n_{ij} - 1) = N - ab. \]

In an intercept-only model, every fitted value would be equal to the grand mean

\[ \bar{Y}_{\cdot\cdot\cdot} = N^{-1} \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n_{ij}} Y_{ijk}. \]

Therefore, the regression sum of squares for the cell-means model is:

\[ SS_{Reg} = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n_{ij}} \left( \bar{Y}_{ij\cdot} - \bar{Y}_{\cdot\cdot\cdot} \right)^2 = \sum_{i=1}^{a} \sum_{j=1}^{b} n_{ij} \left( \bar{Y}_{ij\cdot} - \bar{Y}_{\cdot\cdot\cdot} \right)^2 \]

As always, the total sum of squares:

\[ SS_{Tot} = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n_{ij}} \left( Y_{ijk} - \bar{Y}_{\cdot\cdot\cdot} \right)^2 \]

can still be written as $SS_{Tot} = SS_{Reg} + SS_{Err}$. However, when the $n_{ij}$'s are unequal, we can no longer partition $SS_{Reg}$ into $SS_A$, $SS_B$ and $SS_{AB}$. The regression sum of squares no longer partitions into independent pieces corresponding to main effects and interactions. We can no longer write the ANOVA table with lines corresponding to the main effect of A (Line 1), the main effect of B (Line 2), and the interaction (Line 3) whose SS's add up to $SS_{Reg}$.

As a practical matter, this means that, for many hypothesis tests, we may have to explicitly fit two models and see how the SS's change in order to compute the numerator of the F-statistic for the desired test. We can still test the same hypotheses that we tested in the balanced case. For example, we can still test for additivity (no interactions). But these tests will have to be carried out as partial F-tests, which we learned about in Stat 511.

Recall that in a partial F-test, you need to fit two models, the smaller one (null model) and the larger one (alternative model). The numerator for the F-statistic will be the change in the fitted values,

\[ \sum \left( \hat{Y}_{null} - \hat{Y}_{alternative} \right)^2 \]

divided by the number of extra parameters (the number of parameters in the larger model minus the number of parameters in the smaller model). The denominator of the partial F-statistic is the MSE from the larger model. We will give examples of this shortly, after discussing how to estimate terms in the linear model.

Linear model formulation. Just as in the balanced-data case, we can re-express each cell mean $\mu_{ij}$ as the sum of main effects and interactions. That is, we can rewrite the cell-means model (1) as

\[ Y_{ijk} = \mu_{\cdot\cdot} + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk} \qquad (2) \]

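where

\[ \mu_{\cdot\cdot} = \frac{1}{ab} \sum_{i=1}^{a} \sum_{j=1}^{b} \mu_{ij} \]

is the mean of the cell means,

\[ \alpha_i = \mu_{i\cdot} - \mu_{\cdot\cdot} = \frac{1}{b} \sum_{j=1}^{b} \mu_{ij} - \mu_{\cdot\cdot} \]

is the main effect of A = i,

\[ \beta_j = \mu_{\cdot j} - \mu_{\cdot\cdot} = \frac{1}{a} \sum_{i=1}^{a} \mu_{ij} - \mu_{\cdot\cdot} \]

is the main effect of B = j, and

\[ (\alpha\beta)_{ij} = \mu_{ij} - \left( \mu_{\cdot\cdot} + \alpha_i + \beta_j \right) \]

is the interaction. Notice that these parameters are defined in exactly the same way as they were when the data were balanced. The definition of these parameters does not depend on the $n_{ij}$'s.

However, the estimates of these parameters are now a little different. The OLS estimates of the $\mu_{ij}$'s are still the $\bar{Y}_{ij\cdot}$'s. We can still substitute the $\bar{Y}_{ij\cdot}$'s for the $\mu_{ij}$'s into these expressions to obtain estimates for these parameters. But those expressions do not simplify as they did before. With unbalanced data, we can still write:

\[ \hat{\mu}_{\cdot\cdot} = \frac{1}{ab} \sum_{i=1}^{a} \sum_{j=1}^{b} \bar{Y}_{ij\cdot} \]

but now this expression does not simplify to $\bar{Y}_{\cdot\cdot\cdot}$. The reason is that $\mu_{\cdot\cdot}$ is defined as an unweighted average of the $\mu_{ij}$'s, but $\bar{Y}_{\cdot\cdot\cdot}$ is a weighted average of the $\bar{Y}_{ij\cdot}$'s, weighted by the sample sizes $n_{ij}$. Similarly, the estimated main effect of A = i is still

\[ \hat{\alpha}_i = \frac{1}{b} \sum_{j=1}^{b} \hat{\mu}_{ij} - \hat{\mu}_{\cdot\cdot} \]

but this no longer simplifies to $\bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot\cdot\cdot}$. The estimated main effect for B = j is still

\[ \hat{\beta}_j = \frac{1}{a} \sum_{i=1}^{a} \hat{\mu}_{ij} - \hat{\mu}_{\cdot\cdot} \]

but this no longer simplifies to $\bar{Y}_{\cdot j\cdot} - \bar{Y}_{\cdot\cdot\cdot}$. And the estimated interaction is still

\[ \widehat{(\alpha\beta)}_{ij} = \bar{Y}_{ij\cdot} - \left( \hat{\mu}_{\cdot\cdot} + \hat{\alpha}_i + \hat{\beta}_j \right). \]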
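To see the difference between the unweighted and weighted averages numerically, here is a minimal sketch that applies the definitions above to a hypothetical unbalanced 2 × 2 data set (the values are invented for illustration, and all variable names are assumptions of this sketch).

```python
from statistics import mean

# Hypothetical unbalanced 2 x 2 data set (values invented for illustration).
cells = {
    (1, 1): [10.0, 12.0, 11.0],
    (1, 2): [14.0, 16.0],
    (2, 1): [9.0],
    (2, 2): [20.0, 22.0, 21.0, 21.0],
}
a, b = 2, 2
ybar = {ij: mean(ys) for ij, ys in cells.items()}   # cell means

# Unweighted average of the cell means: estimate of mu_..
mu_hat = mean(list(ybar.values()))

# Grand mean of all observations: a weighted average of the cell means.
all_obs = [y for ys in cells.values() for y in ys]
grand_mean = sum(all_obs) / len(all_obs)

# Main effects and interactions from unweighted averages of cell means.
alpha_hat = {i: mean([ybar[(i, j)] for j in range(1, b + 1)]) - mu_hat
             for i in range(1, a + 1)}
beta_hat = {j: mean([ybar[(i, j)] for i in range(1, a + 1)]) - mu_hat
            for j in range(1, b + 1)}
inter_hat = {(i, j): ybar[(i, j)] - (mu_hat + alpha_hat[i] + beta_hat[j])
             for (i, j) in cells}

# The balanced-data shortcut (row mean minus grand mean) for A = 1.
row1 = cells[(1, 1)] + cells[(1, 2)]
shortcut_alpha1 = sum(row1) / len(row1) - grand_mean

print(mu_hat, grand_mean)              # 14.0 versus 15.6
print(alpha_hat[1], shortcut_alpha1)   # -1.0 versus roughly -3.0
```

In this example the shortcut that is valid for balanced data gives about −3.0 for the effect of A = 1, while the estimate based on unweighted cell means is −1.0.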

The estimated interaction, however, no longer simplifies to $\bar{Y}_{ij\cdot} - \bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot j\cdot} + \bar{Y}_{\cdot\cdot\cdot}$. Because all of these constraints

\[ \sum_{i=1}^{a} \alpha_i = 0, \qquad \sum_{j=1}^{b} \beta_j = 0, \qquad \text{and} \qquad \sum_{i=1}^{a} (\alpha\beta)_{ij} = \sum_{j=1}^{b} (\alpha\beta)_{ij} = 0 \]

still hold, we can still compute $\hat{\alpha}_a$ as $-\sum_{i=1}^{a-1} \hat{\alpha}_i$, and so on.

Another important change as we move from balanced to unbalanced data is that the estimates for one set of terms will change if others are removed from the model. For example, if we move from the full model (2) to an additive model

\[ Y_{ijk} = \mu_{\cdot\cdot} + \alpha_i + \beta_j + \varepsilon_{ijk} \qquad (3) \]

then the estimates of the $\alpha_i$'s and $\beta_j$'s will change. With balanced data, the estimated main effects are the same whether or not the interactions are present in the model. But with unbalanced data, we no longer have an orthogonal partition of $SS_{Reg}$ into $SS_A$, $SS_B$ and $SS_{AB}$, and the estimates for any set of terms may change as other terms are added to or removed from the model.

How to compute estimates. Despite these complications with unbalanced data, all of the computations are very easy to carry out using OLS software. First, consider the full model

\[ Y_{ijk} = \mu_{\cdot\cdot} + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk} \]

To fit this model, we set up a design matrix as follows. The first column of the design matrix will be a constant 1. The next a − 1 columns will be a set of effect codes to distinguish among the levels of factor A. The next b − 1 columns will be a set of effect codes to distinguish among the levels of factor B. The remaining (a − 1)(b − 1) columns are the products of each effect code for A with each effect code for B. The estimated coefficients for this model will then be

\[ \hat{\mu}_{\cdot\cdot},\ \hat{\alpha}_1, \ldots, \hat{\alpha}_{a-1},\ \hat{\beta}_1, \ldots, \hat{\beta}_{b-1},\ \widehat{(\alpha\beta)}_{11}, \ldots, \widehat{(\alpha\beta)}_{a-1,b-1}. \]

To fit the additive model (3), we would omit the (a − 1)(b − 1) product terms from the design matrix and re-fit the model. The estimates for the main effects will be different from what they were in the full model. The change in the regression sum of squares between these two models, divided by (a − 1)(b − 1), will become the numerator of the F-statistic for testing the null hypothesis of additivity.
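The effect-coded design matrix and the partial F-test for additivity can be sketched with NumPy as follows. This is an illustration under stated assumptions, not a prescribed implementation: the data are a hypothetical unbalanced 2 × 2 example with invented values, and `effect_codes` and `fit` are helper names introduced here.

```python
import numpy as np

# Hypothetical unbalanced 2 x 2 data (values invented for illustration).
A = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])   # level of factor A
B = np.array([1, 1, 1, 2, 2, 1, 2, 2, 2, 2])   # level of factor B
y = np.array([10, 12, 11, 14, 16, 9, 20, 22, 21, 21], dtype=float)
a, b = 2, 2


def effect_codes(levels, n_levels):
    """Effect codes for levels 1..n_levels; the last level is -1 in every column."""
    codes = np.zeros((len(levels), n_levels - 1))
    for row, level in enumerate(levels):
        if level == n_levels:
            codes[row, :] = -1.0
        else:
            codes[row, level - 1] = 1.0
    return codes


def fit(X, y):
    """OLS fit; returns the coefficients and the fitted values."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, X @ coef


ca, cb = effect_codes(A, a), effect_codes(B, b)
# Products of each A code with each B code: (a - 1)(b - 1) columns.
cab = np.column_stack([ca[:, [i]] * cb for i in range(a - 1)])
ones = np.ones((len(y), 1))

X_full = np.hstack([ones, ca, cb, cab])   # full model (2)
X_add = np.hstack([ones, ca, cb])         # additive model (3)

coef_full, fitted_full = fit(X_full, y)
coef_add, fitted_add = fit(X_add, y)

# Numerator: change in fitted values per extra parameter.
extra = (a - 1) * (b - 1)
numerator = np.sum((fitted_add - fitted_full) ** 2) / extra
# Denominator: MSE of the larger model, on N - ab degrees of freedom.
mse_full = np.sum((y - fitted_full) ** 2) / (len(y) - a * b)
F = numerator / mse_full

print(coef_full)   # mu, alpha_1, beta_1, (alpha beta)_11 from the full model
print(coef_add)    # mu, alpha_1, beta_1 from the additive model
print(F)
```

In this example the main-effect estimates change from (−1, −4) in the full model to about (−1.56, −3.6) in the additive model. The resulting F would be compared with an F distribution on (a − 1)(b − 1) and N − ab degrees of freedom.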

ANOVA table? Depending on the software being used, you may be given an ANOVA table that looks like this:

Line  Source  SS      df              MS
1.    A       SSA     a − 1           MSA
2.    B       SSB     b − 1           MSB
3.    AB      SSAB    (a − 1)(b − 1)  MSAB
4.    Error   SSErr   N − ab          MSErr

This looks very much like the ANOVA table from the balanced two-way ANOVA. When interpreting this table, however, you need to understand what is actually being printed in the SS column for Lines 1, 2 and 3.

They might be sequential (Type I) sums of squares computed by adding the terms to the model in the order listed. In that case, Line 1 will represent the improvement in fit when the $\alpha_i$'s are added to a model with no predictors (an intercept-only model), Line 2 will represent the improvement in fit when the $\beta_j$'s are added to a model with $\alpha_i$'s already present, and Line 3 will be the improvement in fit when the $(\alpha\beta)_{ij}$'s are added to a model with $\alpha_i$'s and $\beta_j$'s already present. The sequential SS's from Lines 1, 2 and 3 will add up to the overall $SS_{Reg}$.

Or they might be partial (Type III) sums of squares, which represent the contribution of each term when it is added last. The partial SS for Line 1 will represent the improvement in fit when the $\alpha_i$'s are added to a model that already contains $\beta_j$'s and $(\alpha\beta)_{ij}$'s. The partial SS for Line 2 will represent the improvement in fit when the $\beta_j$'s are added to a model that already contains $\alpha_i$'s and $(\alpha\beta)_{ij}$'s. And the partial SS for Line 3 will represent the improvement in fit when the $(\alpha\beta)_{ij}$'s are added to a model that already contains $\alpha_i$'s and $\beta_j$'s. The partial SS's from Lines 1, 2 and 3 will not add up to the overall $SS_{Reg}$.

For Line 3, the partial and sequential SS's will be the same. In either case, we can test for additivity by the F-statistic comparing Line 3 to Line 4. For other tests, however, we need to be careful and understand what we are doing.

If the table reports partial (Type III) sums of squares, then a comparison of Line 1 to Line 4 will be a valid test for no main effects of A,

\[ H_0 : \alpha_1 = \alpha_2 = \cdots = \alpha_a = 0 \]

and a comparison of Line 2 to Line 4 will be a valid test for no main effects of B,

\[ H_0 : \beta_1 = \beta_2 = \cdots = \beta_b = 0 \]

But as we discussed in a previous lecture, the interpretation of these tests is somewhat unusual when the interactions are not small and insignificant.
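The difference between sequential and partial sums of squares can be checked numerically by fitting the nested models directly. The sketch below assumes a hypothetical unbalanced 2 × 2 data set (invented values). Because each factor has only two levels, a single ±1 effect code per factor suffices. The helper `ss_reg` is a name introduced for this illustration.

```python
import numpy as np

# Hypothetical unbalanced 2 x 2 data (values invented for illustration).
A = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
B = np.array([1, 1, 1, 2, 2, 1, 2, 2, 2, 2])
y = np.array([10, 12, 11, 14, 16, 9, 20, 22, 21, 21], dtype=float)

one = np.ones_like(y)
xa = np.where(A == 1, 1.0, -1.0)   # effect code for A (two levels)
xb = np.where(B == 1, 1.0, -1.0)   # effect code for B (two levels)
xab = xa * xb                      # interaction column


def ss_reg(*cols):
    """Regression SS of the OLS fit on the given design columns."""
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((X @ coef - y.mean()) ** 2))


full = ss_reg(one, xa, xb, xab)

# Sequential (Type I): terms added in the order A, B, AB.
sequential = [
    ss_reg(one, xa) - ss_reg(one),
    ss_reg(one, xa, xb) - ss_reg(one, xa),
    full - ss_reg(one, xa, xb),
]

# Partial (Type III): each term added last, using effect-coded columns.
partial = [
    full - ss_reg(one, xb, xab),
    full - ss_reg(one, xa, xab),
    full - ss_reg(one, xa, xb),
]

print(full, sequential, partial)
print(sum(sequential), sum(partial))
```

Here the sequential SS's add up to the overall regression SS, the partial SS's do not, and the two versions agree on the interaction line.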

If interactions are present, however, the null hypothesis $\alpha_1 = \alpha_2 = \cdots = \alpha_a = 0$ does not mean that Factor A has no effects. Rather, it means that the effects of Factor A, when averaged over the levels of B (an equally weighted average), become zero. The correct way to test the null hypothesis that Factor A has no effects whatsoever is to compare the fit of the full model (2) to the model without $\alpha_i$'s or $(\alpha\beta)_{ij}$'s. With unbalanced data, you may need to fit both of these models to compute the change in the regression SS when the main effects for A and the AB interactions are introduced. Or, if your regression software prints out an ANOVA table with sequential SS's, you may be able to get the quantities that you need by fitting the full model with Factor B introduced first.

Defining the effects differently. Whether the data are balanced or unbalanced, we have defined the terms in the linear model (2) in the same way, by taking unweighted averages of the $\mu_{ij}$'s. This is usually the sensible thing to do. Under ordinary circumstances, the parameters of a statistical model should not be defined with respect to the $n_{ij}$'s. Most modern textbooks, including ours, do it this way. However, some textbooks (especially older ones) describe coding schemes that effectively weight the $\mu_{ij}$'s in proportions determined by the $n_{ij}$'s. Alternative definitions for the terms in the linear model will lead to different expressions for the sums of squares due to A, B, and the AB interaction. We will not cover those alternative definitions, because in practice they are not used much anymore.

Contrasts. We can define contrasts among the $\mu_{ij}$'s, among the $\mu_{i\cdot}$'s, and among the $\mu_{\cdot j}$'s. To derive the SS's for a contrast, we follow the same procedure as before. Consider an arbitrary contrast among the $\mu_{ij}$'s,

\[ L = \sum_{i=1}^{a} \sum_{j=1}^{b} c_{ij} \mu_{ij} \]

where $\sum_i \sum_j c_{ij} = 0$. The least-squares estimate of the contrast is

\[ \hat{L} = \sum_{i=1}^{a} \sum_{j=1}^{b} c_{ij} \bar{Y}_{ij\cdot} \]

and the variance of this estimate is

\[ V(\hat{L}) = \sum_{i=1}^{a} \sum_{j=1}^{b} c_{ij}^2 \left( \frac{\sigma^2}{n_{ij}} \right) = \sigma^2 \sum_{i=1}^{a} \sum_{j=1}^{b} \frac{c_{ij}^2}{n_{ij}} \]

The F-statistic for testing $H_0 : L = 0$ is

\[ F = T^2 = \frac{\hat{L}^2}{S^2 \sum_{i=1}^{a} \sum_{j=1}^{b} c_{ij}^2 / n_{ij}} = \frac{SS_L}{S^2} \]

Here

\[ SS_L = \frac{\hat{L}^2}{\sum_{i=1}^{a} \sum_{j=1}^{b} c_{ij}^2 / n_{ij}} \]

is the sum of squares due to the contrast. Similarly, we can derive the sums of squares for a contrast among the $\mu_{i\cdot}$'s or a contrast among the $\mu_{\cdot j}$'s. This will be left as an exercise. A contrast among the $\mu_{i\cdot}$'s will describe the effect of Factor A averaged over the levels of B (unweighted average), and a contrast among the $\mu_{\cdot j}$'s will describe the effect of Factor B averaged over the levels of A (unweighted average). At the end of Chapter 23, the textbook gives some examples of contrasts involving weighted averages where the weights are determined by the context of the problem.
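The contrast calculations for the $\mu_{ij}$'s can be sketched in a few lines. The example below is an illustration under stated assumptions: the data are a hypothetical unbalanced 2 × 2 set with invented values, the contrast is the interaction contrast for that design, and $S^2$ is taken to be the mean squared error from the cell-means model.

```python
from statistics import mean

# Hypothetical unbalanced 2 x 2 data set (values invented for illustration).
cells = {
    (1, 1): [10.0, 12.0, 11.0],
    (1, 2): [14.0, 16.0],
    (2, 1): [9.0],
    (2, 2): [20.0, 22.0, 21.0, 21.0],
}
# Interaction contrast for a 2 x 2 design; the coefficients sum to zero.
c = {(1, 1): 1.0, (1, 2): -1.0, (2, 1): -1.0, (2, 2): 1.0}

ybar = {ij: mean(ys) for ij, ys in cells.items()}
n = {ij: len(ys) for ij, ys in cells.items()}

# Least-squares estimate of the contrast.
L_hat = sum(c[ij] * ybar[ij] for ij in cells)

# Variance of L_hat is sigma^2 times this factor.
var_factor = sum(c[ij] ** 2 / n[ij] for ij in cells)

# Sum of squares due to the contrast.
ss_L = L_hat ** 2 / var_factor

# S^2: assumed here to be the MSE of the cell-means model (N - ab df).
ss_err = sum((y - ybar[ij]) ** 2 for ij, ys in cells.items() for y in ys)
s2 = ss_err / (sum(n.values()) - len(cells))

F = ss_L / s2
print(L_hat, var_factor, ss_L, F)
```

For a 2 × 2 design the interaction has a single degree of freedom, so this contrast SS coincides with the sum of squares for the AB interaction in the same data.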