# Two-way ANOVA with unequal numbers of observations per cell
TFG Documentation.

The data. In previous discussions of two-way ANOVA, we assumed that there were $n$ observations in each cell of the $a \times b$ design. If the sample sizes are unequal, we will denote the number of observations in cell $A = i$, $B = j$ as $n_{ij}$. The total number of observations in the dataset is:

$$N = \sum_{i=1}^{a} \sum_{j=1}^{b} n_{ij}$$

Why data are unbalanced. When $A$ and $B$ are experimental factors that are randomly assigned, the researchers will usually plan the experiment to achieve equal $n$'s per cell. But mishaps may occur, causing observations to be missing. When imbalance is caused by missing data, the analyses described in this lecture will be valid under an assumption that the missing values are missing at random (MAR). In this context, MAR means that the probability that any given observation $Y_{ijk}$ is missing cannot depend on the actual value of $Y_{ijk}$, although it may be related to the factors $A$ or $B$. Formal discussions of MAR are found in texts dealing with missing data, such as the book by Little and Rubin (2002, Wiley).

When one or both factors are not randomly assigned, as in an observational study, the investigators may have no control over the $n_{ij}$'s, and imbalance will be the rule rather than the exception.
OLS estimates for cell means are still the sample means. Using cell-means notation, the model is:

$$Y_{ijk} = \mu_{ij} + \epsilon_{ijk} \tag{1}$$

for $i = 1, \ldots, a$, $j = 1, \ldots, b$ and $k = 1, \ldots, n_{ij}$. This is no different from unbalanced one-way ANOVA with $ab$ groups. The OLS estimates for the cell means $\mu_{11}, \ldots, \mu_{ab}$ are simply the sample averages within the cells:

$$\hat{\mu}_{ij} = \bar{Y}_{ij\cdot} = \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} Y_{ijk}$$

Thus the fitted values are the $\bar{Y}_{ij\cdot}$'s. The error sum of squares is:

$$SS_{Err} = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n_{ij}} \left( Y_{ijk} - \bar{Y}_{ij\cdot} \right)^2$$
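As a small numerical check of the cell-means fit, the sketch below builds an indicator design matrix with one column per cell, fits it by OLS, and confirms that the estimates are the within-cell sample averages. The data set is made up purely for illustration, and the variable names (`cell`, `X`, `mu_hat`, and so on) are assumptions of this sketch rather than notation from the lecture. It also computes $SS_{Err}$ and the error degrees of freedom, $N - ab$.

```python
import numpy as np

# Hypothetical unbalanced 2 x 3 data set (illustration only): columns are
# level of A (0..a-1), level of B (0..b-1), and the response Y.
data = np.array([
    [0, 0, 4.0], [0, 0, 6.0],
    [0, 1, 5.0], [0, 1, 7.0], [0, 1, 9.0],
    [0, 2, 8.0],
    [1, 0, 3.0], [1, 0, 5.0], [1, 0, 4.0],
    [1, 1, 6.0], [1, 1, 8.0],
    [1, 2, 9.0], [1, 2, 11.0],
])
A = data[:, 0].astype(int)
B = data[:, 1].astype(int)
y = data[:, 2]
a, b = 2, 3
N = len(y)                       # N is the sum of the n_ij

# Cell-means design matrix: one indicator column per cell (i, j).
cell = A * b + B                 # cell index 0..ab-1
X = np.zeros((N, a * b))
X[np.arange(N), cell] = 1.0

# OLS estimates of the cell means mu_ij.
mu_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Direct within-cell sample averages for comparison.
n_ij = np.bincount(cell, minlength=a * b)
cell_means = np.bincount(cell, weights=y, minlength=a * b) / n_ij

# Error sum of squares and its degrees of freedom.
fitted = X @ mu_hat
ss_err = float(np.sum((y - fitted) ** 2))
df_err = N - a * b

print("n_ij      :", n_ij.reshape(a, b))
print("cell means:", cell_means.reshape(a, b))
print("SSErr     :", ss_err, "on", df_err, "df")
```

Because the design matrix has one column per cell, the fit is the same as an unbalanced one-way ANOVA on the $ab$ cells, which is why the OLS coefficients and the sample averages agree.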

The error degrees of freedom are

$$\sum_{i=1}^{a} \sum_{j=1}^{b} (n_{ij} - 1) = N - ab.$$

In an intercept-only model, every fitted value would be equal to the grand mean

$$\bar{Y}_{\cdot\cdot\cdot} = N^{-1} \sum_{i} \sum_{j} \sum_{k} Y_{ijk}.$$

Therefore, the regression sum of squares for the cell-means model (1) is:

$$SS_{Reg} = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n_{ij}} \left( \bar{Y}_{ij\cdot} - \bar{Y}_{\cdot\cdot\cdot} \right)^2 = \sum_{i=1}^{a} \sum_{j=1}^{b} n_{ij} \left( \bar{Y}_{ij\cdot} - \bar{Y}_{\cdot\cdot\cdot} \right)^2$$

As always, the total sum of squares

$$SS_{Tot} = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n_{ij}} \left( Y_{ijk} - \bar{Y}_{\cdot\cdot\cdot} \right)^2$$

can still be written as $SS_{Tot} = SS_{Reg} + SS_{Err}$.

The regression sum of squares no longer partitions into independent pieces corresponding to main effects and interactions. That is, when the $n_{ij}$'s are unequal, we can no longer partition $SS_{Reg}$ into $SS_A$, $SS_B$ and $SS_{AB}$. We can no longer write the ANOVA table with lines corresponding to the main effect of A (Line 1), the main effect of B (Line 2), and the interaction (Line 3) whose SS's add up to $SS_{Reg}$.

We can still test the same hypotheses that we tested in the balanced case. For example, we can still test for additivity (no interactions). However, these tests will have to be carried out as partial F-tests, which we learned about in Stat 511. Recall that in a partial F-test, you need to fit two models, the smaller one (null model) and the larger one (alternative model). The numerator for the F-statistic will be the change in the fitted values,

$$\sum \left( \hat{Y}_{null} - \hat{Y}_{alternative} \right)^2,$$

divided by the number of extra parameters (the number of parameters in the larger model minus the number of parameters in the smaller model). The denominator of the partial F-statistic is the MSE from the larger model. As a practical matter, this means that, for many hypothesis tests, we may have to explicitly fit two models and see how the SS's change in order to compute the numerator of the F-statistic for the desired test. We will give examples of this shortly, after discussing how to estimate terms in the linear model.

Linear model formulation. Just as in the balanced-data case, we can re-express each cell mean $\mu_{ij}$ as the sum of main effects and interactions. That is, we can rewrite the cell-means model (1) as

$$Y_{ijk} = \mu_{\cdot\cdot} + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk} \tag{2}$$

where

$$\mu_{\cdot\cdot} = \frac{1}{ab} \sum_{i=1}^{a} \sum_{j=1}^{b} \mu_{ij}$$

is the mean of the cell means,

$$\alpha_i = \mu_{i\cdot} - \mu_{\cdot\cdot} = \frac{1}{b} \sum_{j=1}^{b} \mu_{ij} - \mu_{\cdot\cdot}$$

is the main effect of $A = i$,

$$\beta_j = \mu_{\cdot j} - \mu_{\cdot\cdot} = \frac{1}{a} \sum_{i=1}^{a} \mu_{ij} - \mu_{\cdot\cdot}$$

is the main effect of $B = j$, and

$$(\alpha\beta)_{ij} = \mu_{ij} - (\mu_{\cdot\cdot} + \alpha_i + \beta_j)$$

is the interaction. Notice that these parameters are defined in exactly the same way as they were when the data were balanced. The definition of these parameters does not depend on the $n_{ij}$'s.

With unbalanced data, however, the estimates of these parameters are now a little different. The OLS estimates of the $\mu_{ij}$'s are still the $\bar{Y}_{ij\cdot}$'s. We can still substitute the $\bar{Y}_{ij\cdot}$'s for the $\mu_{ij}$'s into these expressions to obtain estimates for these parameters. But those expressions do not simplify as they did before. For example, we can still write:

$$\hat{\mu}_{\cdot\cdot} = \frac{1}{ab} \sum_{i=1}^{a} \sum_{j=1}^{b} \bar{Y}_{ij\cdot}$$

but now this expression does not simplify to $\bar{Y}_{\cdot\cdot\cdot}$. The reason is that $\mu_{\cdot\cdot}$ is defined as an unweighted average of the $\mu_{ij}$'s, but $\bar{Y}_{\cdot\cdot\cdot}$ is a weighted average of the $\bar{Y}_{ij\cdot}$'s, weighted by the sample sizes $n_{ij}$. Similarly, the estimated main effect of $A = i$ is still

$$\hat{\alpha}_i = \frac{1}{b} \sum_{j=1}^{b} \hat{\mu}_{ij} - \hat{\mu}_{\cdot\cdot}$$

but this no longer simplifies to $\bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot\cdot\cdot}$. The estimated main effect for $B = j$ is still

$$\hat{\beta}_j = \frac{1}{a} \sum_{i=1}^{a} \hat{\mu}_{ij} - \hat{\mu}_{\cdot\cdot}$$

but this no longer simplifies to $\bar{Y}_{\cdot j\cdot} - \bar{Y}_{\cdot\cdot\cdot}$. And the estimated interaction is still

$$\widehat{(\alpha\beta)}_{ij} = \bar{Y}_{ij\cdot} - (\hat{\mu}_{\cdot\cdot} + \hat{\alpha}_i + \hat{\beta}_j)$$

but this no longer simplifies to $\bar{Y}_{ij\cdot} - \bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot j\cdot} + \bar{Y}_{\cdot\cdot\cdot}$.

Another important change as we move from balanced to unbalanced data is that the estimates for one set of terms will change if others are removed from the model. For example, if we move from the full model (2) to an additive model

$$Y_{ijk} = \mu_{\cdot\cdot} + \alpha_i + \beta_j + \epsilon_{ijk} \tag{3}$$

then the estimates of the $\alpha_i$'s and $\beta_j$'s will change. With balanced data, the estimated main effects are the same whether or not the interactions are present in the model. But with unbalanced data, we no longer have an orthogonal partition of $SS_{Reg}$ into $SS_A$, $SS_B$ and $SS_{AB}$, and the estimates for any set of terms may change as other terms are added to or removed from the model.

How to compute estimates. Despite these complications with unbalanced data, all of the computations are very easy to carry out using OLS software. First, consider the full model

$$Y_{ijk} = \mu_{\cdot\cdot} + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk}$$

To fit this model, we set up a design matrix as follows. The first column of the design matrix will be a constant 1. The next $a - 1$ columns will be a set of effect codes to distinguish among the levels of factor A. The next $b - 1$ columns will be a set of effect codes to distinguish among the levels of factor B. The remaining $(a - 1)(b - 1)$ columns are the products of each effect code for A with each effect code for B. The estimated coefficients for this model will then be

$$\hat{\mu}_{\cdot\cdot},\ \hat{\alpha}_1, \ldots, \hat{\alpha}_{a-1},\ \hat{\beta}_1, \ldots, \hat{\beta}_{b-1},\ \widehat{(\alpha\beta)}_{11}, \ldots, \widehat{(\alpha\beta)}_{a-1,b-1}.$$

Because all of these constraints

$$\sum_{i=1}^{a} \alpha_i = 0, \qquad \sum_{j=1}^{b} \beta_j = 0, \qquad \sum_{i=1}^{a} (\alpha\beta)_{ij} = \sum_{j=1}^{b} (\alpha\beta)_{ij} = 0$$

still hold, we can still compute $\hat{\alpha}_a$ as $-\sum_{i=1}^{a-1} \hat{\alpha}_i$, and so on.

To fit the additive model (3), we would omit the $(a - 1)(b - 1)$ product terms from the design matrix and re-fit the model. The estimates for the main effects will be different from what they were in the full model. The change in the regression sum of squares between these two models, divided by $(a - 1)(b - 1)$, will become the numerator of the F-statistic for testing the null hypothesis of additivity.
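The design-matrix recipe and the additivity test can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data set is made up, and the helpers `effect_codes` and `fit_ols` are hypothetical names introduced for this sketch, not functions from any particular package. The effect codes follow the usual sum-to-zero convention, with the last level of each factor coded as -1 in every column.

```python
import numpy as np

# Hypothetical unbalanced 2 x 3 data set (illustration only): columns are
# level of A (0..a-1), level of B (0..b-1), and the response Y.
data = np.array([
    [0, 0, 4.0], [0, 0, 6.0],
    [0, 1, 5.0], [0, 1, 7.0], [0, 1, 9.0],
    [0, 2, 8.0],
    [1, 0, 3.0], [1, 0, 5.0], [1, 0, 4.0],
    [1, 1, 6.0], [1, 1, 8.0],
    [1, 2, 9.0], [1, 2, 11.0],
])
A = data[:, 0].astype(int)
B = data[:, 1].astype(int)
y = data[:, 2]
a, b, N = 2, 3, len(y)


def effect_codes(levels, n_levels):
    """Effect codes: column m is +1 for level m, -1 for the last level, else 0."""
    codes = np.zeros((len(levels), n_levels - 1))
    for m in range(n_levels - 1):
        codes[levels == m, m] = 1.0
    codes[levels == n_levels - 1, :] = -1.0
    return codes


def fit_ols(X, y):
    """Return OLS coefficients, fitted values, and the error sum of squares."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    return coef, fitted, float(np.sum((y - fitted) ** 2))


ones = np.ones((N, 1))
EA = effect_codes(A, a)                      # a - 1 columns
EB = effect_codes(B, b)                      # b - 1 columns
EAB = np.column_stack([EA[:, i] * EB[:, j]   # (a - 1)(b - 1) product columns
                       for i in range(a - 1) for j in range(b - 1)])

X_full = np.hstack([ones, EA, EB, EAB])      # full model (2)
X_add = np.hstack([ones, EA, EB])            # additive model (3)

coef_full, fit_full, sse_full = fit_ols(X_full, y)
coef_add, fit_add, sse_add = fit_ols(X_add, y)

# Regression sums of squares, each relative to the intercept-only model.
ss_tot = float(np.sum((y - y.mean()) ** 2))
ssreg_full = ss_tot - sse_full
ssreg_add = ss_tot - sse_add

# Partial F-statistic for additivity: change in SSReg over the number of
# extra parameters, divided by the MSE from the larger (full) model.
df_num = (a - 1) * (b - 1)
df_den = N - a * b
F_additivity = ((ssreg_full - ssreg_add) / df_num) / (sse_full / df_den)

# The same numerator, written as the squared change in the fitted values.
change_in_fits = float(np.sum((fit_add - fit_full) ** 2))

print("intercept (unweighted mean of cell means):", coef_full[0])
print("grand mean of Y (weighted by n_ij)       :", y.mean())
print("alpha_1 in full vs additive model        :", coef_full[1], coef_add[1])
print("F for additivity:", F_additivity, "on", df_num, "and", df_den, "df")
```

With these made-up data, the intercept of the full model is the unweighted mean of the six cell means, which differs from the grand mean of the observations, and the estimated main effect of A changes once the product columns are dropped. Comparing `F_additivity` with an F distribution on `df_num` and `df_den` degrees of freedom would give the p-value; that step is left out here to keep the sketch free of extra dependencies.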