ABSTRACT

Motivation: Most biological traits may be correlated with the under-
lying gene expression patterns that are partially determined by DNA
sequence variation. The correlations between gene expressions and
quantitative traits are essential for understanding the functions of
genes and dissecting gene regulatory networks.
Results: In the present study, we adopted a novel statistical
method, called the stochastic expectation and maximization (SEM)
algorithm, to analyze the associations between gene expression
levels and quantitative trait values and identify genetic loci control-
ling the gene expression variations. In the first step, gene expres-
sion levels measured from microarray experiments were assigned to
two different clusters based on the strengths of their association with
the phenotypes of a quantitative trait under investigation. In the sec-
ond step, genes associated with the trait were mapped to genetic
loci of the genome. Because gene expressions are quantitative, the
genetic loci controlling the expression traits are called expression
quantitative trait loci (eQTL). We applied the same SEM algorithm to
a real dataset collected from a barley genetic experiment with both
quantitative traits and gene expression traits. For the first time, we
identified genes associated with eight agronomic traits of barley.
These genes were then mapped to seven chromosomes of the bar-
ley genome. The SEM algorithm and the result of the barley data
analysis are useful to scientists in the areas of bioinformatics and
plant breeding.
Availability and implementation: The R program for the SEM
algorithm can be downloaded from our website:
http://www.statgen.ucr.edu
Contact: shizhong.xu@ucr.edu
Supplementary information: Supplementary data are available in
the Bioinformatics online system.
1 INTRODUCTION

Differential expression analysis often applies to discrete pheno-
types (primarily dichotomous phenotypes). The phenotype is often
defined as "normal" or "affected". If a phenotype is measured
quantitatively, it is often converted into two or a few discrete ordered
phenotypes so that a differential expression analysis or an
analysis of variances (ANOVA) method can be applied (Cui, et al.,
2005; Kerr, et al., 2000; Wernisch, et al., 2003; Wolfinger, et al.,
2001). It is obvious that such discretization is subject to informa-
tion loss. The current microarray data analysis technique has not
been able to efficiently analyze the association of gene expression
with a continuous phenotype (Blalock et al., 2004; Jia and Xu
2005). Pearson correlation between gene expression and a quanti-
tative trait has been proposed (Blalock, et al., 2004; Kraft, et al.,
2003; Quackenbush, 2001). Blalock et al. (2004) ranked genes
according to the correlation coefficients of gene expression with
MMSE, a quantitative measurement of the severity of Alzheimer
disease (AD), and detected many genes that are associated with
AD. Kraft et al. (2003) used a within family correlation analysis to
remove the effect of family stratification. Pearson correlation is
intuitive and easy to calculate. However, it may not be optimal
because (1) the correlation coefficient may not be the best indicator
of the association, (2) higher order association cannot be detected,
(3) data are analyzed individually with one gene at a time, and (4)
the method cannot be extended to association of gene expression
with multiple continuous phenotypes. Potokina et al. (2004) inves-
tigated the association of gene expression with six malting quality
phenotypes (quantitative traits) of 10 barley cultivars. They com-
pared the distance matrix of each gene expression among the 10
cultivars with the distance matrix calculated from the phenotypes
of all six traits using the G-test statistic. The G-test statistic was
designed to measure the similarity between two matrices. For each
gene, there is a distance matrix (based on the expression levels).
For the phenotypes of six traits, there is another distance matrix.
The two matrices are compared for the similarity. If the similarity
is high, the gene is associated with all the six phenotypes. Eventu-
ally, the associations of the phenotypes with all the genes are
evaluated. The distance matrix comparison approach may have the
same flaws as the correlation analysis.
Recently, we proposed to use the regression coefficient of the
expression on a continuous phenotype as the indicator of the
strength of association (Jia and Xu, 2005). Instead of analyzing one
gene at a time, we took a model-based clustering approach to stud-
ying all genes simultaneously. Qu and Xu (2006) extended the
model-based clustering algorithm to capture genes having higher
order association with the phenotype. The model-based clustering
analysis (Jia and Xu, 2005; Qu and Xu, 2006) classifies genes into
several clusters and all clusters share the same variance-covariance
structure. The analysis is implemented via the expectation-maximization
(EM) algorithm (Dempster, et al., 1977). Several problems have been
encountered with this method. One is the identifiability problem, where
the cluster labels can be exchanged among different clusters. The other
occurs when two or more clusters have the same cluster mean. These two
problems can be avoided by introducing a small noise to the cluster means
in every iterative step of the EM algorithm (Qu and Xu, 2006). This ad hoc
modification of the EM algorithm lacks a strong theoretical foundation and
is not guaranteed to produce the optimal result. In this study, we propose
an alternative method with a rigorous theoretical basis to solve the
problem. We used this new method to detect expressed genes that are
associated with multiple quantitative traits of barley.

A Stochastic Expectation and Maximization (SEM) Algorithm for Detecting Quantitative Trait Associated Genes
Haimao Zhan 1, Xin Chen 2 and Shizhong Xu 1,*
1 Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA
2 Department of Statistics, University of California, Riverside, CA 92521, USA
* To whom correspondence should be addressed.
Associate Editor: Dr. Joaquin Dopazo
© The Author (2010). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
Bioinformatics Advance Access published October 29, 2010
Downloaded from bioinformatics.oxfordjournals.org at Oklahoma State University (GWLA) on April 21, 2011

The gene expression levels are quantitative traits (Morley, et al.,
2004). Finding the genetic loci controlling the expressions can help
identify gene regulation networks (Cookson, et al., 2009). Trait
specific gene networks can be inferred by studying the genetic loci
controlling the expressed genes only associated with the trait under
investigation. In this study, we focus on a new method called the
stochastic expectation and maximization (SEM) algorithm (Celeux
and Diebolt, 1986) and its application to both eQTL mapping and
phenotype associated microarray data analysis. This new method is
then compared with the existing EM algorithm (Jia and Xu, 2005)
to demonstrate its superiority. A real dataset collected in the North
American Barley Genome Project (NABGP) is used for the dem-
onstration.
2 MATERIALS AND METHODS
2.1 Experimental data
The gene expression data were published by Luo et al. (2007) and are
downloadable from ArrayExpress (http://www.ebi.ac.uk/microarray-as/aer/entry)
with accession number E-TABM-112. The phenotypic values of eight
quantitative traits of barley were published by Hayes et al. (1993) and are
downloadable from the following website:
http://wheat.pw.usda.gov/ggpages/SxM/phenotypes.html. A detailed
description of the experiment can be found in the original study (Hayes, et
al., 1993). The experiment involved 150 doubled haploid (DH) lines derived
from the cross of two spring barley varieties, Morex and Steptoe. All the
150 DH lines were microarrayed for 22840 transcripts. The eight traits are
α-amylase, diastatic power, grain protein, grain yield, heading date, plant
height, lodging and malt extract. The phenotypes of the traits were
measured in different environments (locations and years). The number of
replicated measurements ranged from 6 to 16 depending on the trait. Both
the single trait association and multiple trait joint association analyses
were conducted for all the eight traits using the average trait values
across all environments.
The original (raw) microarray data were normalized using the RMA algorithm
(Irizarry, et al., 2003) implemented in the GeneSpring GX 11 software
package (Agilent Technologies, Santa Clara, CA). ArrayExpress also provides
the preprocessed dataset without log transformation. The phenotypic values
of each trait were rescaled so that each trait ranges between -1 and +1.
The formula for the rescaling is
$$Z_k = 2\,\frac{X_k - X_{\min}}{X_{\max} - X_{\min}} - 1 \qquad (1)$$
where $X_k$ is the original phenotypic value for the $k$th line, $X_{\min}$
and $X_{\max}$ are the minimum and maximum values of the phenotypic value,
respectively, and $Z_k$ is the rescaled phenotypic value for
$k = 1, \ldots, n$, where $n$ is the sample size (number of DH lines).
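As a minimal illustration of equation (1), the rescaling can be sketched in a few lines of Python. The function name `rescale_trait` and the trait values below are hypothetical and are not taken from the R program or the barley dataset.

```python
def rescale_trait(values):
    """Rescale phenotypic values to the range [-1, +1] following equation (1)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Equation (1) is undefined when every line has the same phenotype.
        raise ValueError("all phenotypic values are identical; range is zero")
    return [2.0 * (x - x_min) / (x_max - x_min) - 1.0 for x in values]


# Hypothetical trait values for five DH lines (illustrative numbers only).
heading_date = [170.0, 175.5, 181.0, 178.25, 192.0]
z = rescale_trait(heading_date)
print(z)
```

The smallest observed value maps to -1, the largest to +1, and the midpoint of the range maps to 0.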
2.2 Linear Model
Denote the microarray data by a data matrix $Y$ with $n$ rows and $m$
columns, where $n$ is the number of individuals subject to the microarray
analysis and $m$ is the number of microarrayed genes. Let $y_j$ be the
$j$th column of matrix $Y$, i.e., an $n \times 1$ vector for the expression
levels of gene $j$ for all the $n$ subjects ($j = 1, \ldots, m$). Let $Z$ be
an $n \times q$ matrix for the rescaled phenotypic values of $q$
quantitative traits measured from all $n$ individuals. Let $X$ be an
$n \times p$ matrix for some factors not directly relevant to the
quantitative traits, for example, gender effect, age effect and so on. We
now have three sources of data: (1) $Y$, the microarray data, (2) $Z$, the
phenotypic data, and (3) $X$, the cofactors not directly relevant to the
association study. The cofactors are not of interest themselves, but may
affect the gene expressions. They are included in the model to reduce or
eliminate their interference with the association between $Y$ and $Z$. The
expression levels of gene $j$ can be described by the following linear
model,
$$y_j = X\beta_j + Z\gamma_j + \varepsilon_j \qquad (2)$$
where $\beta_j$ is a $p \times 1$ vector for the effects of cofactors and
$\gamma_j$ is a $q \times 1$ vector for the regression coefficients of gene
$j$ on all the $q$ traits. The residual error $\varepsilon_j$ is an
$n \times 1$ vector with an assumed $N(0, I\sigma^2)$ distribution. This
assumption is very common in linear model analysis. The DH plants are
indeed independent samples from the line cross of the barley experiment. In
the special case of one phenotype with no cofactors, $p = q = 1$ and $X$ is
a vector of unity with dimension $n \times 1$.
We now assign prior distributions to the parameters included in the linear
model. The prior distribution for $\beta_j$ is
$$\beta_j \sim N(u_\beta, \Sigma_\beta) \qquad (3)$$
where $u_\beta$ is a $p \times 1$ mean vector and $\Sigma_\beta$ is an
unknown $p \times p$ positive definite variance matrix. The prior
distribution of $\gamma_j$ is a Gaussian mixture with two components,
$$\gamma_j \sim \pi N(0, \Sigma_1) + (1 - \pi) N(0, \Sigma_0) \qquad (4)$$
In the above Gaussian mixture, $\Sigma_0 = \omega I_q$ is a known diagonal
matrix with a common $\omega = 10^{-8}$ across all the diagonal elements. In
other words, $\Sigma_0$ is a known positive definite matrix with values
close to zero, and the value can be changed according to the investigator's
preference. For the other cluster, $\Sigma_1$ is an unknown positive
definite variance matrix. This Gaussian mixture prior divides all the genes
into two clusters, one (cluster 1) associated with the phenotypes and the
other (cluster 0) not associated with the phenotypes. The variable $\pi$
($0 < \pi < 1$) is the prior probability that a gene randomly selected from
the pool belongs to cluster 1. A gene classified into cluster 1 is claimed
to be associated with the traits, while genes classified into cluster 0 are
not associated with the traits. The parameters involved in the problem are
denoted by
$$\theta = \{u_\beta, \Sigma_\beta, \Sigma_1, \sigma^2, \pi\} \qquad (5)$$
We are also interested in $\rho_j$, the posterior probability of gene $j$
being associated with the traits. The relationship between the posterior
$\rho_j$ and the prior $\pi$ will be presented later.
Let $\delta_j \sim \mathrm{Bernoulli}(\rho_j)$, for $0 < \rho_j < 1$, be an
indicator variable for the cluster membership of gene $j$. It is defined as
$$\delta_j = \begin{cases} 1 & \text{if gene } j \text{ belongs to cluster 1} \\ 0 & \text{if gene } j \text{ belongs to cluster 0} \end{cases} \qquad (6)$$
Given $\delta_j$, the genetic effect $\gamma_j$ has the following
distribution,
$$\gamma_j \sim \delta_j N(0, \Sigma_1) + (1 - \delta_j) N(0, \Sigma_0) \qquad (7)$$

Table 1. Numbers of genes associated with individual traits in the barley
microarray data analysis using the two-cluster SEM algorithm and the
three-cluster EM algorithm.

Trait              SEM algorithm           EM algorithm
                   Number   Proportion     Number   Proportion
Alpha-amylase      644      2.82%          233      1.02%
Diastatic power    467      2.04%          195      0.85%
Grain protein      888      3.89%          257      1.13%
Grain yield        457      2.00%          139      0.61%
Heading date       401      1.76%          166      0.73%
Height             784      3.43%          264      1.16%
Lodging            650      2.85%          173      0.76%
Malt extract       526      2.30%          216      0.95%

Note that the conditional distribution in equation (7) is not a Gaussian
mixture because the membership is already known. The indicator variable
$\delta_j$ tells which Gaussian component $\gamma_j$ belongs to. Also note
that the parameter vector does not include $\beta_j$ and $\gamma_j$. Under
the random model framework, $\beta_j$ and $\gamma_j$ are treated as missing
values. If they are integrated out, the density of $y_j$ given the cluster
membership is normal with the mean and variance shown in the following
normal density,
$$p(y_j \mid \theta, \delta_j) = N\{y_j \mid X u_\beta,\; X\Sigma_\beta X^T + Z\Theta_j Z^T + I\sigma^2\} \qquad (8)$$
where
$$\Theta_j = \delta_j \Sigma_1 + (1 - \delta_j)\Sigma_0 \qquad (9)$$
and the notation $N\{y \mid a, b\}$ stands for the normal density of
variable $y$ with mean $a$ and variance $b$. Given the cluster membership,
$\delta = \{\delta_j\}$, the log likelihood function for the entire dataset
is
$$L(\theta \mid \delta) = \sum_{j=1}^{m} \ln p(y_j \mid \theta, \delta_j) \qquad (10)$$
The MLEs of the parameters are obtained through a two-step approach. The
first step is to estimate the parameters by maximizing the above log
likelihood function given $\delta = \{\delta_j\}$ through the regular
expectation-maximization (EM) algorithm (Dempster, et al., 1977). This step
is called the EM step. The second step is to stochastically simulate
$\delta = \{\delta_j\}$ from its conditional posterior distribution. This
step is called the stochastic step. The two steps are repeated iteratively
until a stationary distribution is reached for each parameter, an algorithm
called the stochastic expectation and maximization (SEM) algorithm (Celeux
and Diebolt, 1986). The stochastic step and the EM step are performed
sequentially, not in parallel.
2.3 Stochastic sampling
The density of $y_j$ defined in equation (8) can be split into the
following two densities,
$$p_1(y_j \mid \theta) = p(y_j \mid \theta, \delta_j = 1) = N(y_j \mid X u_\beta,\; X\Sigma_\beta X^T + Z\Sigma_1 Z^T + I\sigma^2) \qquad (11)$$
and
$$p_0(y_j \mid \theta) = p(y_j \mid \theta, \delta_j = 0) = N(y_j \mid X u_\beta,\; X\Sigma_\beta X^T + Z\Sigma_0 Z^T + I\sigma^2) \qquad (12)$$
The posterior probability that $\delta_j = 1$ is
$$\rho_j = E(\delta_j \mid y_j, \theta) = \frac{\pi\, p_1(y_j \mid \theta)}{\pi\, p_1(y_j \mid \theta) + (1 - \pi)\, p_0(y_j \mid \theta)} \qquad (13)$$
Because $\delta_j$ is a Bernoulli variable, it is sampled from the
$$p(\delta_j) = \mathrm{Bernoulli}(\delta_j \mid \rho_j) \qquad (14)$$
distribution. Once $\delta = \{\delta_j\}$ are sampled for all genes in the
stochastic step, we can proceed with the EM algorithm described below.
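The stochastic step of equations (13) and (14) can be sketched as follows. This is only an illustration under stated assumptions: the two log densities $\ln p_1(y_j \mid \theta)$ and $\ln p_0(y_j \mid \theta)$ are taken as already computed, the numbers are made up, and the function names `posterior_prob` and `sample_membership` are hypothetical. Working on the log scale is a choice made here to avoid numerical underflow; it is not described in the text.

```python
import math
import random


def posterior_prob(log_p1, log_p0, pi):
    """Equation (13), evaluated on the log scale to avoid underflow."""
    a = math.log(pi) + log_p1          # log of pi * p1(y_j | theta)
    b = math.log(1.0 - pi) + log_p0    # log of (1 - pi) * p0(y_j | theta)
    top = max(a, b)                    # subtract the larger term before exponentiating
    return math.exp(a - top) / (math.exp(a - top) + math.exp(b - top))


def sample_membership(rho, rng):
    """Equation (14): draw delta_j ~ Bernoulli(rho_j) for every gene."""
    return [1 if rng.random() < r else 0 for r in rho]


# Hypothetical log densities for three genes (illustrative numbers only).
log_p1 = [-120.0, -95.0, -300.0]
log_p0 = [-118.0, -140.0, -300.0]
pi = 0.1

rho = [posterior_prob(a, b, pi) for a, b in zip(log_p1, log_p0)]
rng = random.Random(1)                 # fixed seed so the draw is reproducible
delta = sample_membership(rho, rng)
print(rho, delta)
```

When the two densities are equal, as for the third gene, the posterior probability reduces to the prior $\pi$, which is a quick sanity check on equation (13).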
2.4 EM algorithm
The EM algorithm for the Gaussian mixture model is standard (Dempster, et
al., 1977) and thus we only provide the EM steps without proof. Denote the
variance-covariance matrix of $\gamma_j$ conditional on $\delta_j$ by
$$\Theta_j = \mathrm{var}(\gamma_j \mid \delta_j) = \delta_j \Sigma_1 + (1 - \delta_j)\Sigma_0 \qquad (15)$$
Let us define
$$V_j = X\Sigma_\beta X^T + Z\Theta_j Z^T + I\sigma^2 \qquad (16)$$
We now provide the formulas for updating each parameter using the EM
algorithm. Given $\delta_j$, the updated proportion of genes coming from
cluster 1 is
$$\pi = \frac{1}{m}\sum_{j=1}^{m} \delta_j \qquad (17)$$
The population mean $u_\beta$ is updated using
$$u_\beta = \left[\sum_{j=1}^{m} X^T V_j^{-1} X\right]^{-1} \left[\sum_{j=1}^{m} X^T V_j^{-1} y_j\right] \qquad (18)$$
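The variance-covariance matrix of $\beta_j$ is denoted by $\Sigma_\beta$ and
updated using
$$\Sigma_\beta = \frac{1}{m}\sum_{j=1}^{m} E(\beta_j \beta_j^T) = \frac{1}{m}\sum_{j=1}^{m} \left[E(\beta_j)E(\beta_j^T) + \mathrm{var}(\beta_j)\right] \qquad (19)$$
where
$$E(\beta_j) = \Sigma_\beta X^T V_j^{-1} (y_j - X u_\beta) \qquad (20)$$
and
$$\mathrm{var}(\beta_j) = \Sigma_\beta - \Sigma_\beta X^T V_j^{-1} X \Sigma_\beta \qquad (21)$$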
Given $\delta_j$, the variance-covariance matrix of $\gamma_j$ is
$\Theta_j = \delta_j \Sigma_1 + (1 - \delta_j)\Sigma_0$. However, the matrix
$\Sigma_0$ is a constant. Therefore, we only need to update $\Sigma_1$ using
all $\gamma_j$ that come from cluster 1. The updating equation for
$\Sigma_1$ is
$$\Sigma_1 = \frac{1}{m\pi}\sum_{j=1}^{m} \delta_j E(\gamma_j \gamma_j^T) = \frac{1}{m\pi}\sum_{j=1}^{m} \delta_j \left[E(\gamma_j)E(\gamma_j^T) + \mathrm{var}(\gamma_j)\right] \qquad (22)$$
where
$$E(\gamma_j) = \Sigma_1 Z^T V_j^{-1} (y_j - X u_\beta) \qquad (23)$$
and
$$\mathrm{var}(\gamma_j) = \Sigma_1 - \Sigma_1 Z^T V_j^{-1} Z \Sigma_1 \qquad (24)$$


Fig. 1. ROC curve comparing the SEM and EM algorithms in the
simulation study.
The residual error variance is updated using
$$\sigma^2 = \frac{1}{mn}\sum_{j=1}^{m} (y_j - X u_\beta)^T \left[y_j - X u_\beta - X E(\beta_j) - Z E(\delta_j \gamma_j)\right] \qquad (25)$$
The E-step of the EM algorithm consists of calculating $E(\beta_j)$,
$E(\gamma_j)$, $\mathrm{var}(\beta_j)$ and $\mathrm{var}(\gamma_j)$. The
M-step consists of calculating
$\theta = \{u_\beta, \Sigma_\beta, \Sigma_1, \sigma^2, \pi\}$. So far all
parameters have been updated. We can now combine the stochastic steps with
the EM steps to complete the analysis. Note again that the stochastic and EM
steps are performed sequentially and repeated many times until a stationary
distribution for each parameter is reached.
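The alternation of the stochastic step and the update step can be illustrated on a deliberately simplified toy problem. The sketch below is not the model of equation (2) and is not the authors' R program: it assumes that a scalar effect $x_j$ is observed directly for each gene and follows the two-component zero-mean mixture of equation (4) with a known small variance `omega`, so that the update given the sampled memberships has a closed form. The function names, starting values, seeds and simulated data are all hypothetical.

```python
import math
import random


def log_normal_pdf(x, var):
    """Log density of N(0, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + x * x / var)


def toy_sem(x, omega, n_iter=200, burn_in=100, seed=0):
    """Alternate a stochastic step and an update step; average after burn-in."""
    rng = random.Random(seed)
    pi, var1 = 0.5, 1.0                      # arbitrary starting values
    pi_draws, var1_draws = [], []
    for it in range(n_iter):
        # Stochastic step: delta_j ~ Bernoulli(rho_j), with rho_j as in equation (13).
        delta = []
        for xj in x:
            a = math.log(pi) + log_normal_pdf(xj, var1)
            b = math.log(1.0 - pi) + log_normal_pdf(xj, omega)
            top = max(a, b)
            rho = math.exp(a - top) / (math.exp(a - top) + math.exp(b - top))
            delta.append(1 if rng.random() < rho else 0)
        # Update step given delta: closed form in this toy model.
        n1 = sum(delta)
        if 0 < n1 < len(x):                  # keep pi strictly inside (0, 1)
            pi = n1 / len(x)                 # analogue of equation (17)
            var1 = sum(d * xj * xj for d, xj in zip(delta, x)) / n1
        if it >= burn_in:                    # collect the sample after burn-in
            pi_draws.append(pi)
            var1_draws.append(var1)
    return sum(pi_draws) / len(pi_draws), sum(var1_draws) / len(var1_draws)


# Simulated data: about 30% of 2000 effects come from the wide cluster (sd = 2).
data_rng = random.Random(42)
omega = 0.05 ** 2
x = [data_rng.gauss(0.0, 2.0) if data_rng.random() < 0.3 else data_rng.gauss(0.0, 0.05)
     for _ in range(2000)]
pi_hat, var1_hat = toy_sem(x, omega)
print(round(pi_hat, 3), round(var1_hat, 3))
```

In this toy setting the averaged estimates should land near the simulated values (a proportion of 0.3 and a variance of 4), while individual iterations fluctuate around them rather than settling on a single point.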
2.5 SEM estimate
The SEM algorithm differs from the classical EM algorithm in that the
parameters do not converge to fixed values; rather, they converge to a
stationary distribution due to the stochastic sampling of
$\delta = \{\delta_j\}$. We can monitor the convergence of each parameter.
Once all parameters have converged, we start to collect the posterior
sample for $\theta$. Unlike the posterior sample in a fully Bayesian
analysis, the observations within the posterior sample of the SEM algorithm
are not correlated. The posterior sample size, denoted by $T$, does not have
to be large; $T = 100$ seems to be sufficient. Let
$\theta^{(t)} = \{u_\beta^{(t)}, \Sigma_\beta^{(t)}, \Sigma_1^{(t)}, \sigma^{2(t)}, \pi^{(t)}\}$
be the $t$-th observation in the posterior sample (after convergence). The
SEM estimate of the parameter vector is
$$\hat{\theta} = \frac{1}{T}\sum_{t=1}^{T} \theta^{(t)} \qquad (26)$$
The most important quantity of the SEM analysis is not the entire vector
$\theta$; rather, it is
$$\hat{\rho}_j = \frac{1}{T}\sum_{t=1}^{T} \rho_j^{(t)} \approx \frac{1}{T}\sum_{t=1}^{T} \delta_j^{(t)} \qquad (27)$$
because it represents the posterior probability that gene $j$ belongs to
cluster 1, i.e., gene $j$ is associated with the quantitative traits. These
$\hat{\rho}_j$'s are ranked in descending order. The top proportion of genes
is selected as candidate genes associated with the phenotypes. The
proportion is defined by the investigator in an arbitrary manner. Some
objective criteria, e.g., false discovery rate (FDR), may be used, but this
is not the focus of this study. Here, we simply chose an arbitrary cut-off
value of $\rho_j \ge 0.9$ to declare significant association.
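The averaging in equation (27) and the cut-off rule can be sketched as follows. The membership draws are made-up numbers and the function names `posterior_membership` and `select_genes` are hypothetical; only the averaging and the 0.9 threshold come from the text.

```python
def posterior_membership(delta_draws):
    """Average the T retained membership draws per gene, as in equation (27)."""
    T = len(delta_draws)
    m = len(delta_draws[0])
    return [sum(draw[j] for draw in delta_draws) / T for j in range(m)]


def select_genes(rho_hat, cutoff=0.9):
    """Return indices of genes with rho_hat >= cutoff, in descending order of rho_hat."""
    ranked = sorted(range(len(rho_hat)), key=lambda j: rho_hat[j], reverse=True)
    return [j for j in ranked if rho_hat[j] >= cutoff]


# Hypothetical draws: T = 10 retained iterations (rows) for m = 4 genes (columns).
# Gene 0 is always in cluster 1, gene 1 never, gene 2 in 9 of 10 draws, gene 3 in 5 of 10.
delta_draws = [[1, 0, 1 if t < 9 else 0, t % 2] for t in range(10)]

rho_hat = posterior_membership(delta_draws)
selected = select_genes(rho_hat)
print(rho_hat, selected)
```

With these draws only genes 0 and 2 reach the 0.9 cut-off; replacing the fixed cut-off by an FDR-based rule would only change `select_genes`.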
2.6 Expression quantitative trait locus (eQTL) mapping
The gene expression levels can be treated as quantitative traits and QTL
mapping can be performed on each transcript, the so-called eQTL mapping
(Kendziorski, et al., 2006). The problem with eQTL analysis is that the
large number of expression traits makes eQTL mapping very difficult. The
SEM algorithm developed for the quantitative trait associated microarray
data analysis can be extended to eQTL mapping with limited modification.
This section describes the application of the SEM algorithm to eQTL
mapping.
Consider $Q$ markers with known map positions and known genotypes for all
the $n$ individuals. We will study the association of all the $m$
transcripts simultaneously with the $k$th marker for $k = 1, \ldots, Q$. The
approach is similar to interval mapping, in which one marker is studied at
a time (Lander and Botstein, 1989). The entire eQTL mapping will take $Q$
separate analyses. Using the same model as given in equation (2), we now
replace the phenotype $Z$ by the numerically coded genotype of marker $k$,
denoted by $Z_k$, so that
$$y_j = X\beta_j + Z_k\gamma_{jk} + \varepsilon_j \qquad (28)$$
where $\gamma_{jk}$ is the QTL effect for transcript $j$ at marker $k$. The
$Z_k$ variable is defined as
$$Z_k = \begin{cases} +1 & \text{for } A_1A_1 \\ -1 & \text{for } A_2A_2 \end{cases} \qquad (29)$$
where $A_1A_1$ is the first genotype and $A_2A_2$ is the second genotype of
marker $k$. The barley population under study is a doubled haploid
population and thus only two genotypes exist for each locus.
We now have to analyze the data $Q$ times, once for each marker. Previously,
we had a single $\pi$ for the proportion of genes associated with the
phenotype. We now have $Q$ such $\pi$'s to indicate the proportions of
transcripts associated with all markers, denoted by a vector
$$\pi = [\pi_1 \; \cdots \; \pi_Q] \qquad (30)$$
In addition, the posterior probability of gene $j$ associated with marker
$k$ is denoted by $\rho_{jk}$. These parameters $\{\pi_k, \rho_{jk}\}$ are
important to the eQTL mapping. The SEM algorithm remains the same as before
except that we must analyze the data $Q$ times (once for each marker).
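The genotype coding of equation (29) can be sketched as below. The genotype labels `"A1A1"` and `"A2A2"`, the function name `code_marker` and the small marker table are hypothetical; missing or heterozygous calls are not handled in this sketch and would raise a `KeyError`.

```python
def code_marker(genotypes):
    """Code a doubled haploid marker as +1 (A1A1) or -1 (A2A2), as in equation (29)."""
    coding = {"A1A1": 1, "A2A2": -1}
    return [coding[g] for g in genotypes]


# Hypothetical genotypes: Q = 3 markers (rows) scored on n = 4 DH lines (columns).
markers = [
    ["A1A1", "A2A2", "A2A2", "A1A1"],
    ["A1A1", "A1A1", "A2A2", "A1A1"],
    ["A2A2", "A2A2", "A2A2", "A1A1"],
]

# Z[k] plays the role of Z_k in equation (28); one SEM analysis would be run per row.
Z = [code_marker(row) for row in markers]
for k, z_k in enumerate(Z, start=1):
    print(f"marker {k}: Z_k = {z_k}")
```

Each coded row would then replace $Z$ in one of the $Q$ separate analyses.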

3 RESULTS
3.1 Single trait association
Since each trait was measured in multiple environments, we took the average
of the phenotypic values across the environments as the phenotypic values
that entered the linear model for analysis. Therefore, the $Z$ matrix for
each trait analysis was an $n \times 1 = 150 \times 1$ vector of the average
phenotypic values (rescaled between -1 and +1). The number of transcripts
(genes) measured in the experiment was $m = 22840$. Since no other cofactor
existed except the intercept, $\beta_j$ is a scalar with dimension $p = 1$.
For the single trait analysis, $q = 1$ for each trait analysis. Each of the
five parameters in
$\theta = \{u_\beta, \Sigma_\beta, \Sigma_1, \sigma^2, \pi\}$ is of single
dimension.
Both the (two-cluster) SEM algorithm developed here and the three-cluster
EM algorithm of Jia and Xu (2005) were used for the single trait
association study. The EM algorithm of Jia and Xu (2005) classified each
gene into one of three clusters. The three clusters shared a common
variance of the regression coefficient but had three different means. The
three cluster means were restricted to a negative value for the first
cluster, zero for the second cluster and a positive value for the third
cluster. Genes classified into either the first or the third cluster are
differentially expressed genes. The criterion of detection for each gene
was that the posterior probability of being in cluster 2 (the neutral
cluster) was less than 0.1. This is equivalent to the criterion of
$\rho_j \ge 0.9$ in the SEM analysis.

Table 2. The estimated regression coefficients for the top ten genes jointly
associated with all eight traits. The partial regression coefficients of
each gene on the eight traits are denoted by $\gamma_1, \ldots, \gamma_8$.

Rank  Probe set ID             γ1     γ2     γ3     γ4     γ5     γ6     γ7      γ8   F-test   p-value
1     Contig1132_s_at       -2.99   0.99   3.42  -1.99  -0.25   0.75  -0.40   -0.05  945.744  <0.0001
2     Contig124_at           2.32  -0.68  -2.93   1.76   0.38  -0.55   0.41    0.12  581.689  <0.0001
3     Contig2163_at         -0.12  -0.77  -1.88   0.62  -0.89   0.45   2.15    1.62  543.998  <0.0001
4     Contig11524_at         0.71  -0.83   1.18  -0.10   2.30   0.61   1.62   -1.63  492.936  <0.0001
5     Contig2279_at          0.10   0.35  -0.72   0.80  -0.82  -2.05  -1.51    4.94  481.062  <0.0001
6     Contig4769_at          2.67  -0.18  -2.32   1.23   0.55  -0.82   0.28    0.21  435.547  <0.0001
7     Contig23772_s_at       1.81  -0.67  -2.53   1.50   0.13  -0.55   0.59   0.003  421.314  <0.0001
8     HVSMEf0020F06r2_at     0.55  -0.74   0.92  -0.16   2.16   0.75   1.33   -1.45  396.722  <0.0001
9     Contig11126_at        -0.44  -0.21   0.62  -0.86   0.88   1.91   1.09   -4.10  348.061  <0.0001
10    Contig4526_at         -0.45  -1.17   1.11   0.88   1.27   1.38   0.58   -1.35  347.999  <0.0001

The numbers of genes associated with each of the eight traits are listed in
Table 1 for the SEM and the EM algorithms. We can see that the SEM
algorithm consistently detected more genes than the EM algorithm. The
result of the SEM algorithm shows that more genes are associated with
height and grain protein than with the other traits. The heading date trait
has the smallest number of associated genes. The estimated parameters for
all the eight traits obtained from the separate SEM analyses are listed in
Table S1. Lists of all the detected genes associated with the traits and
gene annotations are given in Table S2 (Sheet1-Sheet8) for interested
readers.
In order to compare the performance of the SEM model and the
EM model, we carried out a simulation experiment based on the
SEM model by using the estimated parameters obtained from the
barley data (grain yield) analysis. The ROC curve (Figure 1) shows
that both the SEM and the EM models have high sensitivity and
specificity, but the SEM model performs better and has higher
sensitivity. It is well known that the EM algorithm tends to con-
verge to some sub-optimal values which are close to the initial
values. We set different parameters and simulated several datasets
based on the EM model to test the convergence of parameters.
When parameters are precisely estimated, the EM model is able to
identify most of the associated genes. If parameters converge to
some local optimal values that are different from the true values,
the sensitivity is quite low. However, the SEM model can identify
most associated genes in both cases and has high sensitivity and
specificity, which are similar to the EM results when parameters
were estimated well. We also performed a permutation test on the real data
by permuting the phenotypes (grain yield) to test the specificity of the
two methods. After 100 permutations of grain yield, we took the average of
the posterior probabilities generated from the 100 permutations for each
gene and found that no gene had a probability exceeding the cut-off point
(0.9), which indicates that the SEM method also has good specificity in
real data analysis.
We chose grain yield as an example to demonstrate the
converging process of the estimated parameters. The trace plots
(parameters against the iteration) are depicted in Figure 2 for all
the five parameters and the regression coefficient of gene
AF250937_s_at. From the trace plots, we can see that all parame-
ters have converged in about 10 iterations. After the parameters
converged to their stationary distributions, the parameters fluctu-
ated around the mean values and the means are the SEM estimates
of the parameters.
We also present the predicted regression coefficients obtained with the SEM
analysis and the scatter plots of the observed gene expressions for the
genes associated with each trait (Figure S1).
Some genes have positive associations with the traits, e.g.,
AF250937_s_at and Contig6445_at, and some have negative asso-
ciations with the traits, e.g., Contig23592_at and Contig3295_at.
3.2 Multiple traits association study
For the joint association study of eight traits, the dimensionality of the
parameters increased to $p = 8$ and $q = 8$. Therefore, $u_\beta$ is an
$8 \times 1$ vector, $\Sigma_\beta$ is an $8 \times 8$ matrix and $\Sigma_1$
is an $8 \times 8$ matrix. The rest of the parameters, $\pi$ and
$\sigma^2$, remain scalars. The $Z$ matrix is a $150 \times 8$ matrix for
all the eight traits measured from all the 150 DH lines. Only the SEM
algorithm was used in this analysis because the EM algorithm of Jia and Xu
(2005) cannot be applied to a multiple trait association study.
Fig. 2. The SEM convergence processes of five parameters for the grain yield
trait and the regression coefficient ($\gamma$) of gene AF250937_s_at (the
gene having strong association with grain yield).

Using the same criterion of $\rho_j \ge 0.9$ to declare significant
association, we detected a total of 1646 genes that are jointly associated
with the eight traits, accounting for 7.2% of all the 22840 genes included
in the analysis. The list of associated genes is given in the supplemental
data (Table S3) for interested readers. The estimated parameters are
$\hat{\pi} = 0.089$ and $\hat{\sigma}^2 = 0.067$ for the proportion and
residual error variance, respectively. The estimated $u_\beta$ and
$\Sigma_\beta$ are $\hat{u}_\beta = 7.978$ and $\hat{\Sigma}_\beta = 4.652$,
respectively. The estimated variance matrix of the differentially expressed
cluster is
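$$\hat{\Sigma}_1 = \begin{bmatrix}
0.107 & 0.002 & 0.027 & 0.023 & 0.027 & 0.074 & 0.033 & 0.023 \\
0.002 & 0.052 & 0.026 & 0.009 & 0.007 & 0.002 & 0.013 & 0.005 \\
0.027 & 0.026 & 0.204 & 0.095 & 0.0003 & 0.017 & 0.026 & 0.012 \\
0.023 & 0.009 & 0.095 & 0.084 & 0.015 & 0.009 & 0.026 & 0.008 \\
0.027 & 0.007 & 0.0003 & 0.015 & 0.096 & 0.039 & 0.025 & 0.024 \\
0.074 & 0.002 & 0.017 & 0.009 & 0.039 & 0.198 & 0.097 & 0.048 \\
0.033 & 0.013 & 0.026 & 0.026 & 0.025 & 0.097 & 0.193 & 0.054 \\
0.023 & 0.005 & 0.012 & 0.008 & 0.024 & 0.048 & 0.054 & 0.151
\end{bmatrix}$$
(Off-diagonal entries are shown in absolute value; several of them are
negative in the original, but the positions of the minus signs cannot be
recovered from this copy.)
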

Note that the proportion of genes detected (7.2%) in the experiment is not
the same as $\hat{\pi} = 8.9\%$ because the former depends on the cut-off
point ($\rho_j \ge 0.9$) used for gene declaration whereas the latter
represents the probability that a randomly selected gene belongs to the
associated cluster and does not depend on the cut-off point. In the
simulation study, we used parameters estimated from the barley data to
simulate 5000 genes (100 individuals) based on the SEM model. The SEM
algorithm identified all associated genes, which indicated that SEM does
have high sensitivity and specificity. By randomly permuting the eight
traits, we tested the specificity of SEM in real data analysis. Among the
total of 22840 genes, 521 genes still had significant effects in the
permuted data. The false positive rate is 521/22840 = 0.022811, which is
reasonably low.
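The two proportions quoted above follow directly from the gene counts, as the short check below shows. The variable names are hypothetical; the counts are those reported in the text.

```python
m = 22840                 # transcripts included in the analysis
detected = 1646           # genes with rho_j >= 0.9 in the joint analysis
permuted_hits = 521       # genes still declared significant after permuting the traits

detected_proportion = detected / m
false_positive_rate = permuted_hits / m
print(f"{detected_proportion:.3f}", f"{false_positive_rate:.6f}")
```

The first ratio rounds to the 7.2% reported for the joint analysis and the second to the 0.022811 reported for the permuted data.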
Interestingly, all genes associated with the eight traits in the single
trait analysis were also detected in the joint analysis, demonstrating the
high efficiency of the joint analysis. The estimated regression
coefficients for the top ten genes jointly associated with all traits are
listed in Table 2 along with the F-test statistics and the p-values. The
F-test statistic was calculated using
$$\hat{F}_j = \frac{1}{q}\,\hat{\gamma}_j^T \left(\hat{\Sigma}_1 - \hat{\Sigma}_1 Z^T \hat{V}_j^{-1} Z \hat{\Sigma}_1\right)^{-1} \hat{\gamma}_j \qquad (31)$$
where
$$\hat{\gamma}_j = \hat{\Sigma}_1 Z^T \hat{V}_j^{-1} (y_j - X\hat{u}_\beta) \qquad (32)$$
and
$$\hat{V}_j = X\hat{\Sigma}_\beta X^T + Z\hat{\Sigma}_1 Z^T + I\hat{\sigma}^2 \qquad (33)$$
The p-value was calculated using
$$p\text{-value} = 1 - F_{8,\infty}(\hat{F}_j) \qquad (34)$$
where $F_{8,\infty}(x)$ is the cumulative distribution function of the
central F distribution with numerator degrees of freedom $q = 8$ and
denominator degrees of freedom $\infty$.
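Equation (34) can be evaluated without a statistics library by using a standard relation: when the denominator degrees of freedom are infinite, $q\hat{F}_j$ follows a chi-square distribution with $q$ degrees of freedom under the null hypothesis, and the chi-square survival function has a closed form when $q$ is even. The sketch below relies on that relation, which is an assumption added here and not a description of how the authors computed the p-values; the function names are hypothetical.

```python
import math


def chi2_sf_even(x, df):
    """Survival function of a chi-square variable with even df (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("the closed form used here requires a positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i               # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


def f_inf_p_value(f_stat, q):
    """p-value of equation (34): q * F is chi-square(q) when the denominator df is infinite."""
    return chi2_sf_even(q * f_stat, q)


# F statistics of the first and last genes in Table 2, with q = 8 traits.
for f_stat in (945.744, 347.999):
    print(f_stat, f_inf_p_value(f_stat, 8) < 0.0001)
```

For F statistics of the size reported in Table 2 the p-value underflows to zero in double precision, which is consistent with the "<0.0001" entries in the table.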
3.3 Expression quantitative trait locus (eQTL) mapping
The purpose of the eQTL mapping is to identify the locations of
the genome that control the expressions of the transcripts. We used
the results of the previous analysis to reduce the number of tran-
scripts for the eQTL analysis. For example, out of the 22840 tran-
scripts, we identified 888 genes that are associated with the grain
protein trait. The eQTL mapping was then targeted to these 888
transcripts. This has dramatically reduced the number of transcripts
for eQTL mapping related to the grain protein trait. Recall that
Table 1 gives the number of transcripts associated with each of the
eight traits. The eQTL mapping for each trait was only conducted
on the identified transcripts. The barley genome contains seven
chromosomes. The total number of SNP markers investigated was
Q = 495 with an average marker interval less than 2 centiMorgan,
covering the entire barley genome.
The entire eQTL analysis took about nine hours on an HP Pavilion dv4
computer with an Intel Core Duo CPU P8400 at 2.27 GHz. Figure 3 (a-d)
shows the plots of the proportions of transcripts associated with
markers for four of the eight traits, and Figure S2 (e-h) shows the
plots for the remaining four traits. Because the eQTL analysis
produced too much information to describe in full, we use the grain
protein and yield traits as examples to describe the plots. For the
grain protein trait, three chromosomes (2, 3 and 5) appear to control
more genes than the other chromosomes. For example, the central
region of chromosome 2 is associated with almost 50% of the 888
transcripts; this region is considered a hot spot. For the yield
trait, chromosome 3 is the only chromosome associated with a notably
large share of the transcripts. Its hot spot is located in the middle
of the chromosome and controls the expression of about 80% of the 457
transcripts.
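One way to read the hot spot description is as a per-marker summary of
the posterior probabilities. The Python sketch below illustrates that
reading under stated assumptions: rho[j][k] is taken to be the
posterior probability that transcript j is associated with marker k, a
transcript counts as associated when that probability reaches 0.9, and
the function names and toy values are hypothetical. The quantity
plotted in Figure 3 may instead be the estimated proportion for each
marker produced by the SEM algorithm itself.

```python
def marker_proportions(rho, cutoff=0.9):
    """For each marker, the share of transcripts whose posterior
    probability of association meets the cut-off."""
    n_transcripts = len(rho)
    n_markers = len(rho[0])
    return [
        sum(1 for j in range(n_transcripts) if rho[j][k] >= cutoff)
        / n_transcripts
        for k in range(n_markers)
    ]

def hot_spot(proportions):
    """Index and value of the marker with the largest proportion."""
    best = max(range(len(proportions)), key=lambda k: proportions[k])
    return best, proportions[best]

# Toy 4-transcript x 3-marker matrix (made-up values).
rho = [
    [0.05, 0.95, 0.10],
    [0.02, 0.97, 0.91],
    [0.01, 0.40, 0.03],
    [0.92, 0.99, 0.08],
]
props = marker_proportions(rho)
best_marker, best_share = hot_spot(props)
```

In this toy example the second marker is associated with three of the
four transcripts and would be flagged as the hot spot.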

Downloaded from bioinformatics.oxfordjournals.org at Oklahoma State University (GWLA) on April 21, 2011

Other information about this eQTL analysis is provided in Table
S4 (Sheet1-Sheet8) and Table S5 (Sheet1-Sheet8). This additional
information includes the posterior probability of each transcript
being associated with each marker (Table S4) and the eQTL effects for
each transcript across the genome (Table S5). The supplemental tables
can serve as a reference database for barley biologists to further
study the gene networks for the eight quantitative traits.
4 DISCUSSION
We adopted a new statistical method (SEM) for quantitative trait
associated microarray data analysis. We used the method to analyze
the association of 22840 microarrayed genes with eight quantitative
traits in barley. Many genes were identified as associated with these
traits. The actual functions of these genes in barley were not known
prior to this study. These genes are provided in Table S2
(Sheet1-Sheet8) and Table S3. For example, among the 22840 genes, 888
are related to grain protein content. This dataset provides much
information for barley biologists to further study these genes. The
functions of some genes are known in rice. For example, according to
BLASTX results, transcript 22767 (rbah35f01_s_at) encodes the
Cyclopropane-fatty-acyl-phospholipid synthase in rice. This gene is
strongly associated with grain protein in barley, with an F-test
statistic of 1436.718 and a p-value of zero.
The same SEM algorithm for phenotype associated microarray
data analysis has been applied to eQTL mapping with virtually no
modification. The eQTL mapping conducted was still an interval
mapping approach in which one marker is analyzed at a time. However,
all the transcripts were analyzed simultaneously. This is already a
significant improvement over the traditional interval mapping for QTL,
in which one transcript is analyzed at a time (Kendziorski, et al.,
2006). Results of the eQTL analysis are provided in Table S4
(Sheet1-Sheet8) and Table S5 (Sheet1-Sheet8). This dataset will help
barley biologists to infer gene networks for these quantitative
traits. Transcripts simultaneously associated with one marker belong
to the same network (or pathway) because they are all controlled by
the segregation of the same locus. For example, marker ABC152D
(83.1 cM) on chromosome 2 controls the expression of about 50% of the
transcripts associated with grain protein. These transcripts are
presumably in the same pathway for the development of grain protein.
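The idea that transcripts sharing a marker form a candidate network
can be sketched as a grouping step over the posterior probabilities in
Table S4. The Python snippet below is only an illustration: the
function name, identifiers and probabilities are made up, and the 0.9
cut-off is an assumption carried over from the trait association
analysis.

```python
from collections import defaultdict

def group_by_marker(rho, transcript_ids, marker_ids, cutoff=0.9):
    """Map each marker to the transcripts whose posterior probability
    of association with it meets the cut-off."""
    groups = defaultdict(list)
    for tid, row in zip(transcript_ids, rho):
        for mid, prob in zip(marker_ids, row):
            if prob >= cutoff:
                groups[mid].append(tid)
    return dict(groups)

# Made-up identifiers and probabilities, for illustration only.
transcripts = ["tx_a", "tx_b", "tx_c"]
markers = ["marker_1", "marker_2"]
rho = [
    [0.96, 0.10],
    [0.93, 0.95],
    [0.20, 0.30],
]
candidate_networks = group_by_marker(rho, transcripts, markers)
```

Each resulting list is a set of transcripts that share a controlling
locus; whether they truly belong to one pathway would still need
biological confirmation.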

Fig. 3. Proportions of transcripts associated with markers for the first four traits (Amylase, Diastatic power, Grain protein and Heading date).
The chromosomes are separated by the dotted vertical reference lines.

Theoretically, the method can analyze all markers simultaneously
using a single model, the all-transcript-and-all-marker model.
Practically, however, it is difficult to handle a matrix of that
dimensionality repeatedly in the SEM algorithm. The theory is
identical to that of the joint analysis of all eight traits, where
the corresponding quantity is also a matrix. Further study of this
simultaneous analysis is needed for the SEM algorithm. The
MCMC-implemented Bayesian eQTL mapping (Jia and Xu, 2007) can be
adopted here, but it is a sampling-based method and is computationally
time consuming. This study focused on the SEM algorithm for phenotype
associated microarray analysis, with eQTL mapping as an example of its
extension to other problems.
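For concreteness, the all-marker model can be written by analogy with
the joint trait model (2) and the single-marker model (28). The
following is a sketch under that analogy, not a formulation taken from
the authors' software; stacking the marker codes into one matrix Z and
the marker effects into one vector is an assumption made here for
illustration.

```latex
y_j = X\beta_j + \sum_{k=1}^{Q} Z_k \gamma_{jk} + \varepsilon_j
    = X\beta_j + Z\gamma_j + \varepsilon_j,
\qquad
Z = [\, Z_1 \;\cdots\; Z_Q \,], \quad
\gamma_j = [\, \gamma_{j1} \;\cdots\; \gamma_{jQ} \,]^T
```

Under this form the variance matrix of the marker effects would have
one row and one column per marker, so with Q = 495 it would be far
larger than the 8 x 8 matrix of the joint trait analysis, and it would
have to be updated in every iteration.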
Finally, the data were analyzed using an R program, which can
be downloaded from our laboratory website (www.statgen.ucr.edu)
under the "Phenotype Associated Microarray" section. A sample
dataset (subset of the barley data) is also provided on the website
for interested users to test the method. Users may customize the
code to analyze their own data using the SEM algorithm.
ACKNOWLEDGEMENTS
Funding: This project was supported by the Agriculture and Food
Research Initiative (AFRI) of the USDA National Institute of Food
and Agriculture under the Plant Genome, Genetics and Breeding
Program 2007-35300-18285 to SX.

REFERENCES
Blalock, E.M., Geddes, J.W., Chen, K.C., Porter, N.M., Markesbery, W.R. and Land-
field, P.W. (2004) Incipient Alzheimer's disease: microarray correlation analyses
reveal major transcriptional and tumor suppressor responses, Proc Natl Acad Sci
U S A, 101, 2173-2178.
Celeux, G. and Diebolt, J. (1986) The SEM algorithm: A probabilistic teacher algo-
rithm derived from the EM algorithm for the mixture problem, Comput. Statist.
Quart., 2, 73-82.
Cookson, W., Liang, L., Abecasis, G., Moffatt, M. and Lathrop, M. (2009) Mapping
complex disease traits with global gene expression, Nat Rev Genet, 10, 184-194.
Cui, X., Hwang, J.T., Qiu, J., Blades, N.J. and Churchill, G.A. (2005) Improved
statistical tests for differential gene expression by shrinking variance components
estimates, Biostatistics, 6, 59-75.
Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) Maximum Likelihood from
Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society,
Series B, 39, 1-38.
Hayes, P.M., Liu, B.H., Knapp, S.J., Chen, F., Jones, B., Blake, T., Franckowiak, J.,
Rasmusson, D., Sorrells, M., Ullrich, S.E., Wesenberg, D. and Kleinhofs, A. (1993)
Quantitative trait locus effects and environmental interaction in a sample of North
American barley germ plasm, Theor Appl Genet, 87, 392-401.
Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf,
U. and Speed, T.P. (2003) Exploration, normalization, and summaries of high
density oligonucleotide array probe level data, Biostatistics, 4, 249-264.
Jia, Z. and Xu, S. (2005) Clustering expressed genes on the basis of their association
with a quantitative phenotype, Genet Res, 86, 193-207.
Jia, Z. and Xu, S. (2007) Mapping quantitative trait loci for expression abundance,
Genetics, 176, 611-623.
Kendziorski, C.M., Chen, M., Yuan, M., Lan, H. and Attie, A.D. (2006) Statistical
methods for expression quantitative trait loci (eQTL) mapping, Biometrics, 62,
19-27.
Kerr, M.K., Martin, M. and Churchill, G.A. (2000) Analysis of variance for gene
expression microarray data, J Comput Biol, 7, 819-837.
Kraft, P., Schadt, E., Aten, J. and Horvath, S. (2003) A family-based test for correla-
tion between gene expression and trait values, Am J Hum Genet, 72, 1323-1330.
Lander, E.S. and Botstein, D. (1989) Mapping Mendelian Factors Underlying Quanti-
tative Traits Using RFLP Linkage Maps, Genetics, 121, 185-199.
Luo, Z.W., Potokina, E., Druka, A., Wise, R., Waugh, R. and Kearsey, M.J. (2007)
SFP genotyping from affymetrix arrays is robust but largely detects cis-acting ex-
pression regulators, Genetics, 176, 789-800.
Morley, M., Molony, C.M., Weber, T.M., Devlin, J.L., Ewens, K.G., Spielman, R.S.
and Cheung, V.G. (2004) Genetic analysis of genome-wide variation in human
gene expression, Nature, 430, 743-747.
Potokina, E., Caspers, M., Prasad, M., Kota, R., Zhang, H., Sreenivasulu, N., Wang,
M. and Graner, A. (2004) Functional association between malting quality trait
components and cDNA array based expression patterns in barley (Hordeum vul-
gare L.), Molecular Breeding, 14, 153-170.
Qu, Y. and Xu, S. (2006) Quantitative trait associated microarray gene expression data
analysis, Mol Biol Evol, 23, 1558-1573.
Quackenbush, J. (2001) Computational analysis of microarray data, Nat Rev Genet, 2,
418-427.
Wernisch, L., Kendall, S.L., Soneji, S., Wietzorrek, A., Parish, T., Hinds, J., Butcher,
P.D. and Stoker, N.G. (2003) Analysis of whole-genome microarray replicates us-
ing mixed models, Bioinformatics, 19, 53-61.
Wolfinger, R.D., Gibson, G., Wolfinger, E.D., Bennett, L., Hamadeh, H., Bushel, P.,
Afshari, C. and Paules, R.S. (2001) Assessing gene significance from cDNA mi-
croarray expression data via mixed models, J Comput Biol, 8, 625-637.

gender effect.problems have been encountered for this method. 2009). The cofactors are not something of interest themselves. Σ1 ) + (1 − δ j ) N (0. γ j ~ δ j N (0. 2011 β j ~ N (µβ . A gene classified into cluster one is claimed to be associated with the traits while genes classified into cluster zero are not associated with the traits.gov/ggpages/SxM/phenotypes. They are included in the model to reduce or eliminate the interference on the association between Y and Z. Let X be a n × p matrix for some factors not directly relevant to the quantitative traits. one (cluster 1) being associated with the phenotypes and the other (cluster 0) not associated with the phenotype. Detailed description of the experiment can be found from the original study (Hayes. et al. Σ1 is an unknown positive definite variance matrix. diastatic power. Σ1 . The residual error ε j is an n × 1 vector with an assumed N (0. n . et al. L ..uk/microarrayas/aer/entry with accession number: E-TABM-112. The phenotypic values of each trait were rescaled so that the range of each trait is between -1 and +1. We now have three sources of data. In this study. heading date. et al. CA). π } (5) We are also interested in ρ j . 1986) and its application to both eQTL mapping and phenotype associated microarray data analysis. One is the identifiability problem where the cluster labels can be exchanged among different clusters. This ad hoc modification of the EM algorithm lacks strong theoretical foundation and cannot guarantee to produce the optimal result. p = q = 1 and X is a vector of unity with a dimensionality of n × 1 .. These two problems can be avoided by introducing a small noise to the cluster means in every iterative step of the EM algorithm (Qu and Xu. The eight traits are α -amylase. 2003) implemented in the GeneSpring GX 11 software package (Agilent Technologies. Given δ j . be an indicator variable for the cluster membership of gene j. 
The expressed levels of gene j can be expressed using the following linear model. We used this new method to detect expressed genes that are associated with multiple quantitative traits of barley.. (1993) and downloadable from the following website: http://wheat. γ j is a q × 1 vector for the regression coefficients of gene j on all the q traits.usda. For the other cluster. but may affect the gene expressions. This Gaussian mixture prior divides all the genes into two clusters. respectively.ebi. The gene expression data were published by Luo et al. Iσ 2 ) distribution. X min and X max are the minimum and maximum values of the phenotypic value.pw. Let δ j ~ Burnoulli( ρ j ) . where n is the sample size (number of DH lines). and Z k is the rescaled phenotypic value for k = 1. The number of replicated measurements ranged from 6 to 16 depending on different traits. Trait specific gene networks can be inferred by studying the genetic loci controlling the expressed genes only associated with the trait under investigation.. age effect and so on. Both the single trait association and multiple trait joint association analyses were conducted for all the eight traits using the average trait values across all environments. The experiment involved 150 double haploid (DH) lines derived from the cross of two spring barley varieties. Σ0 = ω I q is a known diagonal matrix with a common ω = 10 −8 across all the diagonal elements. et al. (2007) and downloadable from the ArrayExpress: http://www. The actual parameters involved in the problem are denoted by θ = {µ β . the posterior probability of gene j being associated with the traits. 2. In other words. 
It is defined as (1) δj = 1 if j belongs to cluster 1 0 if j belongs to cluster 0 (6) where X k is the original phenotypic value for the kth line..1 MATERIALS AND METHODS Experimental data where µβ is a p × 1 vector of mean and Σ β is an unknown p × p positive definite variance matrix.2 Linear Model Denote the microarray data by a data matrix Y with n rows and m columns. The phenotypic values of eight quantitative traits of barley were published by Hayes et al. The prior distribution of γ j is a Gaussian mixture with two components. (1) Y the microarray data. Σ0 ) (7) .. The DH plants are indeed independent samples from the line cross of the barley experiment. we proposed an alternative method with a rigorous theoretical basis to solve the problem.. Santa Clara.ac. 1993). Σ β ) (3) 2 2. Σ 0 is a known positive definite matrix with values close to zero and the value can be changed according to the investigator's preference. A real dataset collected in the North American Barley Genome Project (NABGP) is used for the demonstration.html. i. for example.. This new method is then compared with the existing EM algorithm (Jia and Xu. ArrayExpress also provides the preprocessed dataset without log transformation. Finding the genetic loci controlling the expressions can help identify gene regulation networks (Cookson. 2005) to demonstrate its superiority. 2004). The phenotypes of the traits were measured in different environments (locations and years). and (3) X the cofactors not directly relevant to the association study. Let y j be the jth column of matrix Y.org at Oklahoma State University (GWLA) on April 21. we focus on a new method called the stochastic expectation and maximization (SEM) algorithm (Celeux and Diebolt.. The gene expression levels are quantitative traits (Morley. Variable π ( 0 < π < 1 ) is a prior probability that a gene randomly selected from the pool belongs to cluster 1. 
The relationship between the posterior ρ j and the prior π will be presented later. plant height. lodging and malt extract. The other problem occurs when two or more clusters often have the same cluster mean. The original (raw) microarray data were normalized using the RMA algorithm (Irizarry. σ 2 .e. Σ 0 ) (4) In the above Gaussian mixture. The prior distribution for β j is Downloaded from bioinformatics. All the 150 DH lines were microarrayed for 22840 transcripts.oxfordjournals. for 0 < ρ j < 1 . The formula for the rescaling is Zk = 2 X k − X min −1 X max − X min γ j ~ π N (0. Let Z be an n × q matrix for the rescaled phenotypic values of q quantitative traits measured from all n individuals. Morex and Steptoe. where n is the number of individuals subject to the microarray analysis and m is the number of microarrayed genes. Σ β . y j = X β j + Zγ j + ε j (2) where β j is a p × 1 vector for the effects of cofactors. 2006). m ). (2) Z the phenotypic data. This assumption is very common in the linear model analysis. We now assign prior distributions to the parameters included in the linear model. the genetic effect γ j has the following distribution. grain yield. In the special case of one phenotype with no cofactors. an n × 1 vector for the expression levels of gene j for all the n subjects ( j = 1. grain protein. In this study. Σ1 ) + (1 − π ) N (0.

2011 The MLE of parameters are obtained through a two-step approach. an algorithm called the stochastic expectation and maximization (SEM) algorithm (Celeux and Diebolt. δ j = 0) = N ( y j | X µβ .76% 3. δ j )    j =1 m (10) where Downloaded from bioinformatics. Under the random model framework.43% 2. the updated proportion of genes coming from cluster 1 is π= (8) 1 m ∑δ j m j =1 (17) The population mean µβ is updated using (9) µ β =  ∑ X TV j−1 X   ∑ X TV j−1 y j   j =1   j =1   m   −1 m  (18) and the notation N { y | a.85% 2. et al. the log likelihood function for the entire dataset is ) The variance-covariance matrix of β j is denoted by Σ β and updated using Σβ = 1 m 1 m ∑ E (β j β T ) = m ∑  E (β j ) E (β Tj ) + var(β j )  j   m j =1 j =1 (19) L(θ | δ ) = ∑ ln  p ( y j | θ . 1977) and thus we only provide the EM steps without proof. δ = {δ j } . we can proceed with the EM algorithm described below. Once δ = {δ j } are sampled for all genes in the stochastic process. The two steps are repeated iteratively until a stationary distribution is reached for each parameter. However. not in parallel.org at Oklahoma State University (GWLA) on April 21.3 Stochastic sampling where The density of y j defined in equation (8) can be split into the following two densities. X Σ β X T + Z Θ j Z T + Iσ 2 } where Θ j = δ j Σ1 + (1 − δ j )Σ 0 V j = X Σβ X T + Z Θ j Z T + Iσ 2 (16) We now provide the formulas for updating each parameter using the EM algorithm.76% 0. πm ∑δ j =1 m j E (γ j γ T ) = j 1 πm ∑δ j =1 m j  E (γ j ) E (γ T ) + var(γ j )  j   (22) p1 ( y j | θ ) = p( y j | θ . Proportion 1. y j ) = π p1 ( y j | θ ) π p1 ( y j | θ ) + (1 − π ) p0 ( y j | θ ) (13) Because δ j is a Bernoulli variable. Therefore. b} stands for the normal density of variable y with mean a and variance b. Also note that the parameter vector does not include β j and γ j . 1977). (14) Trait SEM Algorithm Number Proportion 2. 
the density of y j given the cluster membership is normal with mean and variance shown in the following normal density p ( y j | θ . et al.04% 3.00% 1. β j and γ j are treated as missing values.89% 2.02% 0.13% 0.85% 1. This step is called the stochastic step. δ j = 1) = N ( y j | X µβ . If they are integrated out. This step is called the EM step.61% 0.oxfordjournals. The first step is to estimate the parameters by maximizing the above log likelihood function given δ = {δ j } through the regular expectation-maximization (EM) algorithm (Dempster. the corresponding matrix Σ 0 is a constant. X Σβ X + Z Σ1Z + Iσ ) T T 2 (11) and E(γ j ) = Σ1Z TV j−1 ( y j − X µβ ) (23) and p0 ( y j | θ ) = p( y j | θ . The indicator variable δ j tells which Gaussian component γ j belongs to.82% 2. The updated equation for Σ1 is Σ1 = 1 2. The stochastic step and the EM step are performed sequentially.Note that this conditional distribution is not a Gaussian mixture because the membership is already known. Denote the variance covariance matrix of γ j conditional on δ j by Θ j = var(γ j | δ j ) = δ j Σ1 + (1 − δ j )Σ 0 (15) Let us define .95% Alpha-amylase Diastatic power Grain protein Grain yield Heading date Height Lodging Malt extract 644 467 888 457 401 784 650 526 2. Given δ j . X Σβ X T + Z Σ0 Z T + Iσ 2 ) The posterior probability that δ j = 1 is (12) var(γ j ) = Σ1 − Σ1Z T Vj−1Z Σ1 (24) ρ j = E(δ j | θ . it is sampled from p(δ j ) = Bernoulli(δ j | ρ j ) Table 1. ) ) E(β j ) = Σ β X TV j−1 ( y j − X µβ ) and (20) var(β j ) = Σβ − Σβ X TV j−1 X Σβ (21) Given δ j .30% EM Algorithm Number 233 195 257 139 166 264 173 216 distribution.4 EM algorithm The EM algorithm for the Gaussian mixture model is standard (Dempster. The second step is to stochastically simulate δ = {δ j } from its conditional posterior distribution. 
Numbers of genes associated with individual traits in the barley microarray data analysis using the two-cluster SEM algorithm and the three-cluster EM algorithm...16% 0.73% 1. δ j ) = N { y j | X µ β . we only need to update Σ1 using all γ j that come from cluster one. Given the cluster membership. the unknown variance-covariance matrix of γ j is Θ j = δ j Σ1 + (1 − δ j )Σ 0 . 1986).

The problem with the eQTL analysis is that the large number of expression traits make eQTL mapping very difficult. may be used. The proportion is defined by the investigator in an arbitrary manner. Note again that the stochastic and EM steps are performed sequentially and repeated many times until a stationary distribution for each parameter is reached. one for each marker. We now have to analyze the data Q times. This section describes the application of the SEM algorithm to eQTL mapping. We can monitor the converging process for each parameter. So far all parameter have been updated.org at Oklahoma State University (GWLA) on April 21. the observations with the posterior sample for the SEM algorithm are not correlated. The SEM algorithm remains the same as before except that we must analyze the data Q times (one for each marker). The number of transcripts (genes) measured in the experiment was m = 22840 . We can now combine the stochastic steps with the EM steps to compete the analysis.9 to declare significant association. σ 2(t ) . we start to collect the posterior sample for θ . The approach is similar to the interval mapping in which one marker is studied at a time (Lander and Botstein. (29) 2. we simply chose an arbitrary cut-off value of ρ j ≥ 0. These parameters {π k . T = 100 seems to be ( ( sufficient. Downloaded from bioinformatics. The top proportion of genes is selected as candidate genes associated with the phenotypes. they converge to a stationary distribution due to the stochastic process of δ = {δ j } . Since no other cofactor existed except the intercept. e. gene j is associated with the quantitative ˆ traits. the posterior probability of gene j associated with marker k is denoted by ρ jk . These ρ j ’s are ranked in a descending order.6 Expression quantitative trait locus (eQTL) mapping The gene expression levels can be treated as quantitative traits and QTL mapping can be performed on each transcript. 
The EM algorithm of Jia and Xu (2005) classified each gene into one of three clusters. false discovery rate (FDR). .e. Q . The M-step consists of calculating θ = {µ β . we now replace the phenotype Z by the numerically coded genotype of marker k denoted by Z k so that Fig. Each of the five parameters θ = {µ β Σ β Σ1 σ 2 π } is of single dimension. E (γ j ).5 SEM estimate where A1 A1 is first genotype and A1 A1 is the second genotype of marker k. so called eQTL mapping (Kendziorski. The barley population under study is a doubled haploid population and thus only two genotypes exist for each locus. et al. denoted by T. σ 2 . Σ(βt ) . 1989). ρ jk } are important to the eQTL mapping.. Σ β . 2011 3 3. rather. π (t ) } be the t-th observation in the posterior sample (after convergence). it is ˆ ρj = 1 T (t ) 1 T ( t ) ∑ ρ j ≈ T ∑δ j T t =1 t =1 (27) that is most important because it represents the posterior probability that gene j belongs to cluster 1. Unlike the posterior sample for the fully Bayesian analysis. Consider Q markers with known map positions and the genotypes for all the n individuals. rather. The three 2. Let θ ( t ) = {µ βt ) . q = 1 for each trait analysis. denoted by a vector ) The SEM algorithm differs from the classical EM algorithm in that the parameters do not converge to some fixed values. the Z matrix for each trait analysis was an n × 1 = 150 × 1 vector for the average phenotypic values (rescaled between -1 and +1). the estimate parameter vector of the SEM algorithm is ˆ θ= ) π = [π 1 L π Q ] (30) In addition. Some objective criteria. Since each trait was measured from multiple environments. Here. Using the same model as given in equation (2). For the single trait analysis. does not have to be large. 
The SEM algorithm developed for the quantitative trait associated microarray data analysis can be extended to eQTL mapping with limited modification.1 RESULTS Single trait association 1 T (t ) ∑θ T t =1 (26) The most important quantity of the SEM analysis is not the entire vector of θ . π } . Therefore.L . Previously. Σ1t ) . we have a single π for the proportion of genes associated with the phenotype. var( β j ) and var(γ j ).The residual error variance is updated using y j = X β j + Z k γ jk + ε j (28) σ2 = T 1 ∑ ( y j − X µ β ) ( y j − X µ β − XE ( β j ) − δ j ZE (γ j ) ) mn j =1 m (25) where γ jk is the QTL effect for transcript j at marker k. We now have Q such π 's to indicate the proportions of transcripts associated with all markers. Once all parameters have converged. i.. Both the (two-cluster) SEM algorithm developed here and the three-cluster EM algorithm of Jia and Xu (2005) were used for the single trait association study.g. ROC curve comparing the SEM and EM algorithms in the simulation study. we took the average of the phenotypic values across the environments as the phenotypic values that entered the linear model for analysis. but it is not the focus of this study. β j is a scalar with dimension p = 1 . The posterior sample size. 2006). Σ1 . We will study the association of all the m transcripts simultaneously with the kth marker for k = 1. The entire eQTL mapping will take Q separate analyses. The Z k variable is defined as  +1 for A1 A1 Zk =   −1 for A2 A2 The E-step of the EM algorithm consists of calculating E ( β j )..oxfordjournals. 1.

but the SEM model performs better and has higher sensitivity.75 1.71 0.067 for the proportion and residual error variance.88 1.77 -0. 3.45 -4. The result of the SEM algorithm shows that more genes are associated with the height and grain protein than other traits. µβ is an 8 × 1 vector.72 -2.35 -0.10 2.62 1. This is equivalent to the criterion of ρ j ≥ 0.0001 <0. the SEM model can identify most associated genes in both cases and has high sensitivity and specificity.12 1. we detected a total of 1646 genes that are jointly associated with the eight traits.936 481. the EM model is able to identify most of the associated genes.003 -1. We choose the grain yield as an example to demonstrate the converging process of the estimated parameters. When parameters are precisely estimated. which indicates that the SEM method also has good specificity in real data analysis.0001 <0. The estimated parameters for all the eight traits obtained from the separate SEM analyses are listed in Table S1.28 0.722 348. the sensitivity is quite low. The estimated regression coefficients for the top ten genes jointly associated with all eight traits.2% of all the 22840 genes included in the analysis.59 1. We also performed permutation for the real data analysis by permuting the phenotypes (grain yield) to test the specificity of the two methods. and some have negative associations with the traits. The estimated µβ ˆ ˆ and Σ β are µ β = 7.42 -2.18 -0. Only the SEM algorithm was used in this analysis because the EM algorithm of Jia and Xu (2005) cannot be applied to multiple trait association study. If parameters converge to some local optimal values that are different from the true values.99 1.23 1. The list of associated genes is given in the supplemental data (Table S3) for interested readers.75 -0. We also presented the predicted regression coefficients obtained with the SEM analysis and the scatter plots of the observed genes expressions for the genes associated with each trait (Figure S1). 
remain scalars.40 0. The ROC curve (Figure 1) shows that both the SEM and the EM models have high sensitivity and specificity. The criterion of detection for each gene was that the posterior probability of being in cluster 2 (the neutral cluster) was less than 0. The Z matrix is an 150 × 8 matrix for all the eight traits measured from all the 150 DH lines.61 -2. which are similar to the EM results when parameters were estimated well. After the parameters converged to their stationary distributions. Contig23592_at and Contig3295_at.0001 <0. 2011 clusters shared a common variance of the regression coefficient but with three different means.80 1.50 -0. Some genes have positive associations with the traits.13 2.82 -0.0001 <0.998 492.g.58 γ8 -0.55 0.21 0.314 396.35 F-test 945. From the trace plots.089 and σ 2 = 0. Genes classified into either the first or the third cluster are differentially expressed genes.88 γ5 -0.45 γ2 0.91 1. L. the parameters fluctuated around the mean values and the means are the SEM estimates of the parameters.0001 <0.0001 <0.9 in the SEM analysis. γ 8 . Rank 1 2 3 4 5 6 7 8 9 10 Probe set ID Contig1132_s_at Contig124_at Contig2163_at Contig11524_at Contig2279_at Contig4769_at Contig23772_s_at HVSMEf0020F06r2_at Contig11126_at Contig4526_at γ1 -2.86 0.689 543.44 -0.38 γ7 -0.547 421.93 -1.88 1. The partial regression coefficients of gene on the eight traits are denoted by γ 1 .g.83 0. In order to compare the performance of the SEM model and the EM model. After 100 permutations by the grain yield. zero for the second cluster and positive value for the third cluster.2 Multiple traits association study For the joint association study of eight traits.62 -0.33 1.18 -0.45 0.Lists of all the detected genes associated with the traits and gene annotations are given in Table S2 (Sheet1-Sheet8) for interested readers. The numbers of genes associated with each of the eight traits are listed in Table 1 for the SEM and the EM algorithms. 
It is well known that the EM algorithm tends to converge to some sub-optimal values which are close to the initial values.09 0.67 1.55 0. The rest of the parameters.oxfordjournals.. The heading date trait has the least number of associated genes.55 -0. π and σ 2 .744 581.9 to declare significance association.05 0. e.0001 <0.51 0.68 -0.38 -0..99 2.41 2.82 0.32 -2. Using the same criterion of ρ j ≥ 0. respectively.1. The trace plots (parameters against the iteration) are depicted in Figure 2 for all the five parameters and the regression coefficient of gene AF250937_s_at.92 0.74 -0.9).81 0.0001 <0. respectively.53 0.10 -1.21 -1. We can see that the SEM algorithm consistently detected more genes than the EM algorithm.org at Oklahoma State University (GWLA) on April 21.999 p-value <0.16 0. The three cluster means were restricted with negative value for the first cluster.62 -1.17 γ3 3.32 -0.63 4.89 2. The ˆ ˆ estimated parameters are π = 0.94 0. However.061 347.76 0.30 -0.67 -0.16 -0. the dimensionality of the parameters increased to p = 8 and q = 8 . e.62 -1. The estimated variance matrix of the differentially expressed cluster is ) .062 435. AF250937_s_at and Contig6445_at. Σ β is a 8 × 8 matrix and Σ1 is an 8 × 8 matrix.978 and Σ β = 4.0001 Downloaded from bioinformatics.15 1. accounting for 7.12 0.27 γ6 0.55 0.99 -0. We set different parameters and simulated several datasets based on the EM model to test the convergence of parameters. we can see that all parameters have converged in about 10 iterations.25 0.652 . Therefore.05 -0. we carried out a simulation experiment based on the SEM model by using the estimated parameters obtained from the barley data (grain yield) analysis. we took average of the posterior probabilities generated from the 100 permutations for each gene and we found that no gene had probabilities exceeding the cut-off point (0.0001 <0.11 γ4 -1.Table 2.10 0.

 0.026 0.023  −0.026 0.015 −0.9 ) used for gene declaration whereas the latter represents the probability that a randomly selected gene belongs to the associated cluster and it does not depend on the cut-off point.025 −0. We used the results of the previous analysis to reduce the number of transcripts for the eQTL analysis.017 −0.097 −0.oxfordjournals.107 0.27GHz in an Hp Pavilion dv4 computer.008  −0. 3. we tested the specificity of SEM in real data analysis.0003 0. we used parameters estimated from the barley data to simulate 5000 genes (100 individuals) based on the SEM model.024   −0.023 0.193 −0. The total number of SNP markers investigated was Q = 495 with an average marker interval less than 2 centiMorgan. There is too much information obtained from the eQTL analysis. out of the 22840 transcripts.013 0. For example.084 0. we identified 888 genes that are associated with the grain protein trait.org at Oklahoma State University (GWLA) on April 21. In the simulation study. The eQTL mapping for each trait was only conducted on the identified transcripts. Figure S2 (e-h) shows the plots of the remaining four traits.204 −0. Among the total of 22840 genes. The false positive is 521 / 22840 = 0.027 0.002 0. .054  −0.002   −0. the central region of chromosome 2 contains almost 50% of the 888 transcripts.Interestingly.048   0.096 0.023 0.009 0.025 −0.026 −0. This region is considered as a hot spot. ) The purpose of the eQTL mapping is to identify the locations of the genome that control the expressions of the transcripts.007 0. all genes associated with the 8 traits in the single trait analysis were also detected in the joint analysis.048  0.097  0.039 0. we used grain protein and yield traits as examples to describe the plots. covering the entire barley genome.005   −0.095 0.024 −0.002 −0.054 0.012 −0. 3. The barley genome contains seven chromosomes. For example. 5) seem to control more genes than other chromosomes.026 −0. 
521 genes still had significant effects in the permuted data. 2. The SEM algorithm identified all associated genes.052 −0.198   0.9% because the former depends on the cut-off point ( ρ j ≥ 0.007 0. The F-test statistic was calculated using ˆ 1 ˆ ˆ ˆ ˆ Fj = γˆT Σ1 − Σ1Z TV j−1Z Σ1 j q where ( ) −1 γˆ j (31) ˆ ˆ ˆ γˆ j = Σ1Z TV j−1 ( y j − X µβ ) and (32) ˆ ˆ ˆ ˆ V j = X Σ β X T + Z Σ1Z T + Iσ 2 The p-value was calculated using (33) Downloaded from bioinformatics. three chromosomes (2.095 0. The entire eQTL analysis took about nine hours of Intel Core Duo CPU P8400.009 ˆ Σ1 =   0. The SEM convergence processes of five parameters for the grain yield trait and the regression coefficient ( γ ) of gene ) Expression quantitative trait locus (eQTL) mapping AF250937_s_at (the gene having strong association with grain yield).033 −0.∞ ( x) is the cumulative distribution function of the central F distribution with numerator degree of freedom q = 8 and the denominator degree of freedom ∞ .027 −0.039   −0. 2011 ˆ p − value = 1 − F8.015 0.026 −0.002 0.017  0.151   Note that the proportion of genes detected (7. demonstrating the high efficiency of the joint analysis.009 0.033 0.012   0. reasonably low. Here. chromosome 3 is the only one containing more transcripts. Figure 3 (a-d) shows the plots of the proportions of transcripts associated with markers for four of the eight traits.008 −0.074 0.027 −0. For the grain protein trait.026 0. ∞ ( Fj ) (34) where F8. The hot spot is located in the middle of the chromosome and it controls the expression of about 80% of the 457 transcripts. which indicated that SEM does have high sensitivity and specificity. For the yield trait.022811 . Recall that Table 1 gives the number of transcripts associated with each of the eight traits. By randomly permuting the eight traits.005 −0.013 −0. 
This has dramatically reduced the number of transcripts for eQTL mapping related to the grain protein trait.3 Fig.074  0.2%) in the experiˆ ment is not the same as π = 8.0003 0.009 −0.023 0. 2. The eQTL mapping was then targeted to these 888 transcripts. The estimated regression coefficients for the top ten genes jointly associated with all traits are listed in Table 2 along with the F-test statistics and the pvalues.027 0.
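Returning to the F-test of equations (31)-(34), the computation for a single gene can be sketched in dependency-free Python. This is only an illustration under stated assumptions, not the authors' R program: the function names (`f_test`, `chi2_sf_even`, `solve`, `matmul`, `transpose`) and the list-of-rows matrix layout are hypothetical, and the p-value uses the fact that if $F \sim F_{q,\infty}$ then $qF \sim \chi^2_q$, together with the closed-form chi-square tail that holds only for even $q$ (such as $q = 8$).

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [[float(v) for v in A[i]] + [float(v) for v in B[i]] for i in range(n)]
    for c in range(n):
        best = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[best] = M[best], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def chi2_sf_even(x, df):
    """Upper tail of a chi-square with even df:
    exp(-x/2) * sum_{i < df/2} (x/2)^i / i!"""
    if df % 2:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

def f_test(y, X, Z, mu_beta, Sigma_beta, Sigma_1, sigma2):
    """Return (gamma_hat, F, p) for one gene, following equations (31)-(34).

    y: length-n list; X: n x p; Z: n x q; mu_beta: length-p list;
    Sigma_beta: p x p; Sigma_1: q x q; sigma2: residual variance.
    """
    n, q = len(y), len(Sigma_1)
    Zt = transpose(Z)
    XSX = matmul(matmul(X, Sigma_beta), transpose(X))
    ZSZ = matmul(matmul(Z, Sigma_1), Zt)
    # Equation (33): V = X Sigma_beta X' + Z Sigma_1 Z' + I sigma^2
    V = [[XSX[i][k] + ZSZ[i][k] + (sigma2 if i == k else 0.0) for k in range(n)]
         for i in range(n)]
    fitted = matmul(X, [[m] for m in mu_beta])
    resid = [[y[i] - fitted[i][0]] for i in range(n)]
    S1Zt = matmul(Sigma_1, Zt)
    # Equation (32): gamma_hat = Sigma_1 Z' V^{-1} (y - X mu_beta)
    gamma = matmul(S1Zt, solve(V, resid))
    # Matrix inside the inverse of equation (31)
    SZVZS = matmul(S1Zt, solve(V, matmul(Z, Sigma_1)))
    C = [[Sigma_1[i][k] - SZVZS[i][k] for k in range(q)] for i in range(q)]
    quad = sum(g[0] * c[0] for g, c in zip(gamma, solve(C, gamma)))
    F = quad / q                      # equation (31)
    p = chi2_sf_even(q * F, q)        # equation (34) with infinite denominator df
    return [g[0] for g in gamma], F, p
```

In practice a numerical library would replace the hand-written `solve`; the pure-Python version is used here only to keep the sketch self-contained. For very large statistics the tail probability underflows to exactly 0.0 in double precision, which is one way a reported p-value can appear as zero.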

Results of the eQTL analysis are provided in Table S4 (Sheet1-Sheet8) and Table S5 (Sheet1-Sheet8).

Fig. 3. Proportions of transcripts associated with markers for the first four traits (Amylase, Diastatic power, Grain protein and Heading date). The chromosomes are separated by the dotted vertical reference lines.

4 DISCUSSION

We adopted a new statistical method (SEM) for quantitative trait associated microarray data analysis. We used the method to analyze 22840 microarrayed genes associated with eight quantitative traits in barley. Many genes have been identified to be associated with these traits. These genes are provided in Table S2 (Sheet1-Sheet8) and Table S3. For example, among the 22840 genes, 888 are related to grain protein content. The actual functions of these genes in barley were not known prior to this study. The functions of some genes are known in rice. For example, according to BLASTX results, transcript 22767 (rbah35f01_s_at) codes for cyclopropane-fatty-acyl-phospholipid synthase in rice. This gene is strongly associated with grain protein in barley, with an F-test statistic of 1436.718 and a p-value of zero. This dataset provides much information for barley biologists to further study these genes.

The same SEM algorithm for phenotype associated microarray data analysis has been applied to eQTL mapping with virtually no modification. The eQTL mapping conducted was still an interval mapping approach where one marker is analyzed at a time. However, all the transcripts were analyzed simultaneously. This is already a significant improvement over the traditional interval mapping for QTL where one transcript was analyzed at a time (Kendziorski et al., 2006).

Transcripts simultaneously associated with one marker belong to the same network (or pathway) because they are all controlled by the segregation of the same locus. For example, marker ABC152D (83.1 cM) on chromosome 2 controls the expression of about 50% of the transcripts associated with grain protein. These transcripts are presumably in the same pathway for the development of grain protein. Other information about this eQTL analysis is provided in Table S4 (Sheet1-Sheet8) and Table S5 (Sheet1-Sheet8). The additional information includes the posterior probability of each transcript associated with each marker (Table S4) and the eQTL effects for each transcript across the genome (Table S5). This dataset will help barley biologists to infer gene networks for these quantitative traits. The supplemental tables can serve as a reference database for barley biologists to further study the gene networks for the eight quantitative traits.
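The hot-spot and network reading of the eQTL results can be illustrated with a small Python sketch that starts from a table of posterior probabilities (one row per transcript, one value per marker), in the spirit of Table S4. The function names, the toy transcript and marker labels, and the reuse of the 0.9 cut-off for transcript-marker association are all assumptions made for illustration; the source does not state which cut-off was used at this step.

```python
def marker_proportions(post_prob, cutoff=0.9):
    """Fraction of transcripts declared associated with each marker.

    post_prob maps a transcript name to a list of posterior probabilities,
    one per marker (hypothetical layout). cutoff=0.9 is an assumption.
    """
    if not post_prob:
        raise ValueError("post_prob must contain at least one transcript")
    n_markers = len(next(iter(post_prob.values())))
    counts = [0] * n_markers
    for probs in post_prob.values():
        for k, p in enumerate(probs):
            if p >= cutoff:
                counts[k] += 1
    return [c / len(post_prob) for c in counts]

def hot_spot(proportions, marker_names):
    """Marker associated with the largest proportion of transcripts."""
    k = max(range(len(proportions)), key=lambda i: proportions[i])
    return marker_names[k], proportions[k]

def networks(post_prob, marker_names, cutoff=0.9):
    """Group transcripts by the marker(s) they are associated with.

    Transcripts sharing a marker are candidates for the same network.
    """
    groups = {}
    for transcript, probs in post_prob.items():
        for k, p in enumerate(probs):
            if p >= cutoff:
                groups.setdefault(marker_names[k], []).append(transcript)
    return groups

# Toy data with hypothetical labels (not taken from the barley dataset).
toy = {
    "t1": [0.95, 0.10, 0.20],
    "t2": [0.99, 0.92, 0.00],
    "t3": [0.30, 0.91, 0.10],
    "t4": [0.97, 0.20, 0.10],
}
markers = ["m1", "m2", "m3"]
props = marker_proportions(toy)
print(hot_spot(props, markers))
print(networks(toy, markers))
```

Plotting the output of `marker_proportions` against marker position gives the kind of profile shown in Figure 3, and a marker whose proportion is far above the rest corresponds to what the text calls a hot spot.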

Theoretically, the method can analyze all markers simultaneously using a single model. This is the all-transcript-and-all-marker model. The theory is identical to the joint analysis of all the eight traits (a [...] matrix). Practically, however, it is difficult to handle the large matrix with a dimensionality [...] repeatedly in the SEM algorithm. The MCMC implemented Bayesian eQTL mapping (Jia and Xu, 2007) can be adopted here, but it is a sampling based method and is time consuming. Further study on this simultaneous analysis is needed for the SEM algorithm.

This study focused on the SEM algorithm for phenotype associated microarray analysis, with the eQTL mapping as an example of extension to other problems. Finally, the data were analyzed using an R program, which can be downloaded from our laboratory website (www.statgen.ucr.edu) under the "Phenotype Associated Microarray" section. A sample dataset (a subset of the barley data) is also provided on the website for interested users to test the method. Users may customize the code to analyze their own data using the SEM algorithm.

ACKNOWLEDGEMENTS

Funding: This project was supported by the Agriculture and Food Research Initiative (AFRI) of the USDA National Institute of Food and Agriculture under the Plant Genome, Genetics and Breeding Program 2007-35300-18285 to SX.

REFERENCES

Blalock, E.M., Geddes, J.W., Chen, K.C., Porter, N.M., Markesbery, W.R. and Landfield, P.W. (2004) Incipient Alzheimer's disease: microarray correlation analyses reveal major transcriptional and tumor suppressor responses. Proc Natl Acad Sci U S A, 101, 2173-2178.
Celeux, G. and Diebolt, J. (1986) The SEM algorithm: A probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput. Statist. Quart., 2, 73-82.
Cookson, W., Liang, L., Abecasis, G., Moffatt, M. and Lathrop, M. (2009) Mapping complex disease traits with global gene expression. Nat Rev Genet, 10, 184-194.
Cui, X., Hwang, J.T., Qiu, J., Blades, N.J. and Churchill, G.A. (2005) Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics, 6, 59-75.
Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, 39, 1-38.
Hayes, P.M., Liu, B.H., Knapp, S.J., Chen, F., Jones, B., Blake, T., Franckowiak, J., Rasmusson, D., Sorrells, M., Ullrich, S.E., Wesenberg, D. and Kleinhofs, A. (1993) Quantitative trait locus effects and environmental interaction in a sample of North American barley germ plasm. Theor Appl Genet, 87, 392-401.
Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U. and Speed, T.P. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, 249-264.
Jia, Z. and Xu, S. (2005) Clustering expressed genes on the basis of their association with a quantitative phenotype. Genet Res, 86, 193-207.
Jia, Z. and Xu, S. (2007) Mapping quantitative trait loci for expression abundance. Genetics, 176, 611-623.
Kendziorski, C.M., Chen, M., Yuan, M., Lan, H. and Attie, A.D. (2006) Statistical methods for expression quantitative trait loci (eQTL) mapping. Biometrics, 62, 19-27.
Kerr, M.K., Martin, M. and Churchill, G.A. (2000) Analysis of variance for gene expression microarray data. J Comput Biol, 7, 819-837.
Kraft, P., Schadt, E., Aten, J. and Horvath, S. (2003) A family-based test for correlation between gene expression and trait values. Am J Hum Genet, 72, 1323-1330.
Lander, E.S. and Botstein, D. (1989) Mapping Mendelian Factors Underlying Quantitative Traits Using RFLP Linkage Maps. Genetics, 121, 185-199.
Luo, Z.W., Potokina, E., Druka, A., Wise, R., Waugh, R. and Kearsey, M.J. (2007) SFP genotyping from affymetrix arrays is robust but largely detects cis-acting expression regulators. Genetics, 176, 789-800.
Morley, M., Molony, C.M., Weber, T.M., Devlin, J.L., Ewens, K.G., Spielman, R.S. and Cheung, V.G. (2004) Genetic analysis of genome-wide variation in human gene expression. Nature, 430, 743-747.
Potokina, E., Caspers, M., Prasad, M., Kota, R., Zhang, H., Sreenivasulu, N., Wang, M. and Graner, A. (2004) Functional association between malting quality trait components and cDNA array based expression patterns in barley (Hordeum vulgare L.). Molecular Breeding, 14, 153-170.
Qu, Y. and Xu, S. (2006) Quantitative trait associated microarray gene expression data analysis. Mol Biol Evol, 23, 1558-1573.
Quackenbush, J. (2001) Computational analysis of microarray data. Nat Rev Genet, 2, 418-427.
Wernisch, L., Kendall, S.L., Soneji, S., Wietzorrek, A., Parish, T., Hinds, J., Butcher, P.D. and Stoker, N.G. (2003) Analysis of whole-genome microarray replicates using mixed models. Bioinformatics, 19, 53-61.
Wolfinger, R.D., Gibson, G., Wolfinger, E.D., Bennett, L., Hamadeh, H., Bushel, P., Afshari, C. and Paules, R.S. (2001) Assessing gene significance from cDNA microarray expression data via mixed models. J Comput Biol, 8, 625-637.
