You are on page 1of 4

Biostatistics (2005), 6, 2, pp.

183186
doi:10.1093/biostatistics/kxi001

A note on joint versus gene-specific mixed model analysis


of microarray gene expression data
INA HOESCHELE , HUA LI
Virginia Bioinformatics Institute and Department of Statistics, Virginia Tech,
Blacksburg, VA 24061-0477, USA
inah@vt.edu

S UMMARY
Currently, linear mixed model analyses of expression microarray experiments are performed either in a
gene-specific or global mode. The joint analysis provides more flexibility in terms of how parameters are
fitted and estimated and tends to be more powerful than the gene-specific analysis. Here we show how to
implement the gene-specific linear mixed model analysis as an exact algorithm for the joint linear mixed
model analysis. The gene-specific algorithm is exact, when the mixed model equations can be partitioned
into unrelated components: One for all global fixed and random effects and the others for the gene-specific
fixed and random effects for each gene separately. This unrelatedness holds under three conditions: (1) any
gene must have the same number of replicates or probes on all arrays, but these numbers can differ among
genes; (2) the residual variance of the (transformed) expression data must be homogeneous or constant
across genes (other variance components need not be homogeneous) and (3) the number of genes in the
experiment is large. When these conditions are violated, the gene-specific algorithm is expected to be
nearly exact.
Keywords: Differential gene expression; Microarrays; Mixed model analysis.

1. I NTRODUCTION
Microarray gene expression experiments produce expression profiles of thousands or ten thousands of
genes simultaneously. The focus of this paper is on the application of linear mixed model methodology
to the detection and estimation of differential gene expression or, more generally, to the identification of
which factors of interest influence the expression of which of the arrayed genes and to the estimation of
their effects. Linear mixed model analysis (LMMA) of microarray data is either gene-specific (Wolfinger
et al., 2001) or global (e.g. Kerr and Churchill, 2001), i.e. the analysis is performed separately for each
gene or jointly for all genes, respectively. Gene-specific analyses can have low power as noticed, e.g. by
Wu et al. (2003) and Pfister-Genskow et al. (2004). Reasons for lower power of gene-specific analyses,
relative to the joint analysis, include the difference in degrees of freedom, joint versus gene-specific
estimation of the error variance and other variance components, and the use of different contrasts.
Conditions for equivalence between gene-specific and joint analyses have been stated in previous
contributions for the case of fixed linear models (e.g. Kerr, 2003; Wu et al., 2003). Here we discuss the
use of the gene-specific analysis as an algorithm for the joint analysis under a mixed linear model.
To whom correspondence should be addressed.

c The Author 2005. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oupjournals.org.


184

I. H OESCHELE AND H. L I
2. M ETHODS

We write a general linear mixed model for microarray data as follows


y = X1 1 + Z1 u1 + X2 2 + Z2 u2 + e,

(2.1)

where 1 and u1 contain the effects of the fixed (e.g. treatment) and random global factors, respectively, 2
and u2 contain the effects of the fixed and random gene-specific factors, respectively, and X1 , Z1 and X2 ,
Z2 are design matrices for the global and gene-specific factors, respectively. Gene-specific effects include
the gene main effects (fixed) and interactions between the global fixed and random factors with the gene
factor. It is usually appropriate to treat array effects as random, so e.g. let u1 = A be the vector of random
global array effects and let u2 = A G be the vector of random gene-specific array effects. For the mixed
model analysis, we need to specify the variancecovariance structure of the random factors. We typically
2
specify that E(A) = 0, Var(A) = G11 = I A2 , E(A G) = 0, E(e) = 0, Var(A G) = G22 = I AG
2
and Var(e) = R = Ie
, under the assumption that gene-specific
are constant across genes, or
g variances
g
2
2 , under the assumption that geneand Var(e) = R = i=1 Ini e(i)
Var(A G) = G22 = i=1 In a AG(i)
specific variances differ among genes, where n a is the number of arrays, n i is the number of observations
2 and 2 are the gene-specific array variance or variance
for gene i, A2 is the global array variance and AG
e
due to array-by-gene interaction and the residual variance, respectively. Inferences about the unknown
parameters (fixed effects and variance components) are obtained by utilizing the mixed model equations
(MME) (Goldberger, 1962; Henderson, 1963). The MME for the uncentered mixed model in (2.1) are
 1
 1
1
1
1
X
X
X1 R X1 X
X1 R y
1
1 R X2
1 R Z1
1 R Z2
X R1 X X R1 X

X R1 y


1
1

X
R
Z
X
R
Z

1
2
1
2
2

2 2
2
2
2
(2.2)
 1
=  1 ,
 R1 Z + G11 Z R1 Z + G12
1 X
Z1 R X1 Z

R
Z
R
y
Z

u
2
1
2
1
1
1
1
1
 1
 1
1
21 Z R1 Z + G22
1
Z
Z
u 2
2
2 R X1 Z2 R X2 Z2 R Z1 + G
2
2R y
where
1

G11
=
G21

G12
G22

G11
=
G21


G12
,
G22

where G12 = Cov(A, A G) = 0. Given known values of the variance components, -solutions
to the

MME are best linear unbiased estimates and u-solutions


are best linear unbiased predictions.
Wolfinger et al. (2001) proposed to implement the gene-specific analysis with a two-step procedure.
First, the data y are analyzed with a sub-model (normalization model) containing only global effects
(1 and u1 ), and then the predicted residuals from this model are analyzed with the second sub-model
(gene-model) containing only gene-specific effects (2 and u2 ). Fitting model (2.1), with centering of the
columns of X2 and Z2 (Rencher, 2000) or without 2 and u2 , implies a reparameterization from 1 to 1
and u1 to u1 , where an element in 1 or u1 now contains the average of the interactions of this factor
with the gene factor (G). We denote the centered matrices by X2 and Z2 . After centering, or equivalently, after fitting a mixed (normalization) model containing only global factors, the vector of global
array effects becomes A with elements Ak = Ak + (A G)k and with Var(A ) = G11 = I A2 and
Cov(A , A G) = G12 . The array variance component now is A2 = Var(Ak ) = Var(Ak + A G k ),
which in the simplest case of equal replicate or probe numbers across all genes simplifies to A2 =
2 , where n is the number of genes. Matrix G has non-zero elements between any A and
A2 + n1G AG
G
k
12
2 . The MME for the
(A G)kg for all g and a given k, which in the simplest case are equal to n1G AG
centered mixed model are obtained by replacing X2 and Z2 by X2 and Z2 and replacing G1 with
1  11


G12
G11 G12
G
1
.
G
=
=
G21 G22
G21 G22

Joint model analysis

185

It can be shown that there are non-zero elements in matrix G12 between Ak and (A G)kg for all
g, and that there are non-zero elements in matrix G22 between (A G)kg and (A G)kg for all g, g 
with g = g  . Therefore, the offdiagonal block in the coefficient matrix of the centered MME between
u1 = A and u2 = A G is no longer strictly zero, and the same holds for offdiagonal blocks between
any sub-vectors of u2 corresponding to two different genes. As a consequence, the global and the genespecific MME are not exactly unrelated so that the gene-specific analysis is no longer an exact algorithm
to perform the joint analysis. The degree of approximation approaches exactness when the number of
genes becomes large so that G approaches G. For the centered MME, all offdiagonal sub-matrices of

1
the coefficient matrix in (2.2) involving global and gene-specific design matrices (e.g. X
1 R X2 ) are
2
zero when R = Ie , but are not (exactly) zero for other R. Separating the MME (2.2) into n G + 1
(approximately) unrelated components greatly improves the efficiency of LMMA with joint estimation of
global and gene-specific variances.
3. R ESULTS
The gene-specific algorithm for the joint analysis is illustrated with a worked example using a small,
artificial data set and fixed and mixed models, which is presented as supplementary data available at
Biostatistics online. We also performed the joint analysis on a larger real data set and on data sets simulated with structure and parameters similar to those of the real data. The real data included 2640 cDNAs
printed in duplicate on 100 arrays, two cell populations and 10 biological replicates from each cell population [the analysis of this data set is described in detail by Pfister-Genskow et al. (2004)]. We analyzed
the real and simulated data with the ASReml software (Gilmour et al., 2002) and with our gene-specific
algorithm for the joint analysis and confirmed that the estimates of the variance components and the test
statistics were identical up to numerical accuracy [differences in the results between the joint analysis
and the original gene-specific analysis of Wolfinger et al. (2001) were reported in Pfister-Genskow et al.
(2004)].
4. D ISCUSSION
Joint analysis of the expression data on all genes in a microarray experiment has multiple advantages
over the common practice of analyzing the data on individual genes separately: (1) There is great flexibility on how the variancecovariance structure of the data is modeled, and different formulations can
be compared (homogeneous within-gene variance components, heterogeneous within-gene variance components estimated by shrinkage methods or by grouping of genes, etc.). (2) It allows us to evaluate the
significance of global treatment and technical factors and to evaluate gene-specific treatments contrasts
which include main effects (Black and Doerge, 2002). (3) As pointed out by Kerr (2003), joint analysis
also permits model evaluation and residual analysis. We note, however, that while for linear regression
models, there is well-established theory on residuals and influence, and outlier detection statistics or deletion diagnostics are commonly available in regression software packages, residual analysis and deletion
diagnostics are not yet performed routinely in mixed model analysis and are still underdeveloped for
mixed models despite recent progress (e.g. Haslett and Dillane, 2004).
We do not consider the joint and gene-specific analyses as alternative methods, but we view the genespecific analysis merely as an efficient algorithm for performing the joint analysis. We have shown that
the gene-specific algorithm provides (essentially) exact inferences for the joint LMMA under three conditions: (1) equal number of replicate spots or probes for any gene across all arrays (these numbers can
differ among genes), (2) homogeneous residual variance across genes (other variance components may be
heterogeneous) and (3) sufficiently large numbers of genes in the experiment (this condition is needed for
mixed but not for fixed models).

186

I. H OESCHELE AND H. L I
ACKNOWLEDGMENTS

This work was supported by NSF Plant Genome Cooperative Agreement DBI-0211863.
R EFERENCES
B LACK , M. A. AND D OERGE , R. W. (2002). Calculation of the minimum number of replicate spots required for
detection of significant gene expression fold change in microarray experiments. Bioinformatics 18, 16091616.
G ILMOUR , A. R., C ULLIS , B. R., W ELHAM , S. J. AND T HOMPSON , R. (2002). ASREML Reference Manual.
NSW, Australia: NSW Agriculture, Orange Agricultural Institute.
G OLDBERGER, A. S. (1962). Best linear unbiased prediction in the generalized linear regression model. Journal of
the American Statistical Association 57, 369375.
H ASLETT, J. AND D ILLANE , D. (2004). Application of delete = replace to deletion diagnostics for variance component estimation in the linear mixed model. Journal of the Royal Statistical Society, Series B 66, 131143.
H ENDERSON, C. R. (1963). Selection index and expected genetic advance. In Statistical Genetics and Plant Breeding
(NRC publication 982). Washington, DC: National Academy of Sciences, pp. 141163.
K ERR, M. K. (2003). Linear models for microarray data analysis: hidden similarities and differences. Journal of
Computational Biology 10, 891901.
K ERR , M. K.
183201.

AND

C HURCHILL , G. (2001). Experimental design for gene expression microarrays. Biostatistics 2,

P FISTER -G ENSKOW, M., C HILDS , L., M YERS , C., L ACSON , J., B ETTHAUSER , J., G OULEKE , P., Z HENG , Y.,
L ENO , G., F ORSBERG , E., YANG , X., H OESCHELE , I. AND E ILERTSEN , K. J. (2004). Analysis of individual
pre-implantation cattle embryos using cDNA microarrays: comparison of NT and IVF produced embryos using
an interwoven loop design and linear mixed model analysis. Biology of Reproduction 72. In press.
R ENCHER, A. C. (2000) Linear Models in Statistics. New York: John Wiley & Sons.
W OLFINGER , R. D., G IBSON , G., W OLFINGER , E. D., B ENNETT, L., H AMADEH , H., B USHEL , P., A FSHARI , C.
AND PAULES , R. S. (2001). Assessing gene significance from cDNA microarray expression data via mixed
models. Journal of Computational Biology 8, 625637.
W U , H., K ERR , M. K., C UI , X. Q. AND C HURCHILL , G. A. (2003). MAANOVA: A Software Package for the Analysis of Spotted cDNA Microarray Experiments. The Analysis of Gene Expression Data: Methods and Software.
New York: Springer. http://www.jax.org/research/churchill/.
[Received February 18, 2004; first revision June 17, 2004; second revision September 14, 2004;
accepted for publication October 11, 2004]