
# ABSTRACT

**Motivation:** Most biological traits may be correlated with the underlying gene expression patterns that are partially determined by DNA sequence variation. The correlations between gene expressions and quantitative traits are essential for understanding the functions of genes and dissecting gene regulatory networks.

**Results:** In the present study, we adopted a novel statistical method, called the stochastic expectation and maximization (SEM) algorithm, to analyze the associations between gene expression levels and quantitative trait values and to identify genetic loci controlling the gene expression variations. In the first step, gene expression levels measured from microarray experiments were assigned to two different clusters based on the strengths of their association with the phenotypes of a quantitative trait under investigation. In the second step, genes associated with the trait were mapped to genetic loci of the genome. Because gene expressions are quantitative, the genetic loci controlling the expression traits are called expression quantitative trait loci (eQTL). We applied the same SEM algorithm to a real dataset collected from a barley genetic experiment with both quantitative traits and gene expression traits. For the first time, we identified genes associated with eight agronomy traits of barley. These genes were then mapped to seven chromosomes of the barley genome. The SEM algorithm and the result of the barley data analysis are useful to scientists in the areas of bioinformatics and plant breeding.

**Availability and implementation:** The R program for the SEM algorithm can be downloaded from our website: http://www.statgen.ucr.edu

**Contact:** shizhong.xu@ucr.edu

**Supplementary information:** Supplementary data are available in the Bioinformatics online system.

1 INTRODUCTION

Differential expression analysis often applies to discrete phenotypes (primarily dichotomous phenotypes). The phenotype is often defined as "normal" or "affected". If a phenotype is measured quantitatively, it is often converted into two or a few discrete ordered phenotypes so that a differential expression analysis or an analysis of variance (ANOVA) method can be applied (Cui, et al., 2005; Kerr, et al., 2000; Wernisch, et al., 2003; Wolfinger, et al., 2001). Such discretization is subject to information loss. The current microarray data analysis technique has not been able to efficiently analyze the association of gene expression with a continuous phenotype (Blalock et al., 2004; Jia and Xu, 2005). Pearson correlation between gene expression and a quantitative trait has been proposed (Blalock, et al., 2004; Kraft, et al., 2003; Quackenbush, 2001). Blalock et al. (2004) ranked genes according to the correlation coefficients of gene expression with MMSE, a quantitative measurement of the severity of Alzheimer disease (AD), and detected many genes that are associated with AD. Kraft et al. (2003) used a within-family correlation analysis to remove the effect of family stratification. Pearson correlation is intuitive and easy to calculate. However, it may not be optimal because (1) the correlation coefficient may not be the best indicator of the association, (2) higher order association cannot be detected, (3) data are analyzed individually with one gene at a time, and (4) the method cannot be extended to association of gene expression with multiple continuous phenotypes. Potokina et al. (2004) investigated the association of gene expression with six malting quality phenotypes (quantitative traits) of 10 barley cultivars. They compared the distance matrix of each gene expression among the 10 cultivars with the distance matrix calculated from the phenotypes of all six traits using the G-test statistic. The G-test statistic was designed to measure the similarity between two matrices. For each gene, there is a distance matrix (based on the expression levels). For the phenotypes of six traits, there is another distance matrix. The two matrices are compared for similarity. If the similarity is high, the gene is associated with all six phenotypes. Eventually, the associations of the phenotypes with all the genes are evaluated. The distance matrix comparison approach may have the same flaws as the correlation analysis.

* To whom correspondence should be addressed.

A Stochastic Expectation and Maximization (SEM) Algorithm for Detecting Quantitative Trait Associated Genes

Haimao Zhan¹, Xin Chen² and Shizhong Xu¹,*

¹Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA

²Department of Statistics, University of California, Riverside, CA 92521, USA

Associate Editor: Dr. Joaquin Dopazo

© The Author (2010). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Bioinformatics Advance Access published October 29, 2010

Downloaded from bioinformatics.oxfordjournals.org at Oklahoma State University (GWLA) on April 21, 2011

Recently, we proposed to use the regression coefficient of the expression on a continuous phenotype as the indicator of the strength of association (Jia and Xu, 2005). Instead of analyzing one gene at a time, we took a model-based clustering approach to studying all genes simultaneously. Qu and Xu (2006) extended the model-based clustering algorithm to capture genes having higher order association with the phenotype. The model-based clustering analysis (Jia and Xu, 2005; Qu and Xu, 2006) classifies genes into several clusters and all clusters share the same variance-covariance structure. The analysis is implemented via the expectation-maximization (EM) algorithm (Dempster, et al., 1977). Several problems have been encountered for this method. One is the identifiability problem, where the cluster labels can be exchanged among different clusters. The other problem is that two or more clusters often have the same cluster mean. These two problems can be avoided by introducing a small amount of noise to the cluster means in every iterative step of the EM algorithm (Qu and Xu, 2006). This ad hoc modification of the EM algorithm lacks a strong theoretical foundation and cannot be guaranteed to produce the optimal result. In this study, we propose an alternative method with a rigorous theoretical basis to solve the problem. We used this new method to detect expressed genes that are associated with multiple quantitative traits of barley.

The gene expression levels are quantitative traits (Morley, et al., 2004). Finding the genetic loci controlling the expressions can help identify gene regulation networks (Cookson, et al., 2009). Trait-specific gene networks can be inferred by studying the genetic loci controlling only the expressed genes associated with the trait under investigation. In this study, we focus on a new method called the stochastic expectation and maximization (SEM) algorithm (Celeux and Diebolt, 1986) and its application to both eQTL mapping and phenotype-associated microarray data analysis. This new method is then compared with the existing EM algorithm (Jia and Xu, 2005) to demonstrate its superiority. A real dataset collected in the North American Barley Genome Project (NABGP) is used for the demonstration.

2 MATERIALS AND METHODS

2.1 Experimental data

The gene expression data were published by Luo et al. (2007) and are downloadable from ArrayExpress (http://www.ebi.ac.uk/microarray-as/aer/entry) with accession number E-TABM-112. The phenotypic values of eight quantitative traits of barley were published by Hayes et al. (1993) and are downloadable from the following website: http://wheat.pw.usda.gov/ggpages/SxM/phenotypes.html. A detailed description of the experiment can be found in the original study (Hayes, et al., 1993). The experiment involved 150 doubled haploid (DH) lines derived from the cross of two spring barley varieties, Morex and Steptoe. All 150 DH lines were microarrayed for 22840 transcripts. The eight traits are α-amylase, diastatic power, grain protein, grain yield, heading date, plant height, lodging and malt extract. The phenotypes of the traits were measured in different environments (locations and years). The number of replicated measurements ranged from 6 to 16, depending on the trait. Both the single trait association and multiple trait joint association analyses were conducted for all eight traits using the average trait values across all environments.

The original (raw) microarray data were normalized using the RMA algorithm (Irizarry, et al., 2003) implemented in the GeneSpring GX 11 software package (Agilent Technologies, Santa Clara, CA). ArrayExpress also provides the preprocessed dataset without log transformation. The phenotypic values of each trait were rescaled so that the range of each trait is between -1 and +1. The formula for the rescaling is

$$Z_k = 2\,\frac{X_k - X_{\min}}{X_{\max} - X_{\min}} - 1 \tag{1}$$

where $X_k$ is the original phenotypic value for the kth line, $X_{\min}$ and $X_{\max}$ are the minimum and maximum values of the phenotypic value, respectively, and $Z_k$ is the rescaled phenotypic value for $k = 1, \ldots, n$, where n is the sample size (number of DH lines).
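The rescaling in equation (1) can be sketched in a few lines of Python. The function name `rescale` and the example trait values are illustrative assumptions, not part of the published R program.

```python
def rescale(values):
    """Map phenotypic values linearly onto [-1, +1], as in equation (1)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Equation (1) is undefined when the trait has zero range.
        raise ValueError("all phenotypic values are identical")
    return [2.0 * (x - x_min) / (x_max - x_min) - 1.0 for x in values]


# Hypothetical plant heights for three DH lines
heights = [80.0, 95.0, 110.0]
scaled = rescale(heights)
print(scaled)  # [-1.0, 0.0, 1.0]
```

The minimum always maps to -1 and the maximum to +1, so traits measured in different units enter the linear model on a common scale.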

2.2 Linear Model

Denote the microarray data by a data matrix Y with n rows and m columns, where n is the number of individuals subject to the microarray analysis and m is the number of microarrayed genes. Let $y_j$ be the jth column of matrix Y, i.e., an $n \times 1$ vector for the expression levels of gene j for all the n subjects ($j = 1, \ldots, m$). Let Z be an $n \times q$ matrix for the rescaled phenotypic values of q quantitative traits measured from all n individuals. Let X be an $n \times p$ matrix for some factors not directly relevant to the quantitative traits, for example, gender effect, age effect and so on. We now have three sources of data: (1) Y, the microarray data, (2) Z, the phenotypic data, and (3) X, the cofactors not directly relevant to the association study. The cofactors are not of interest themselves, but may affect the gene expressions. They are included in the model to reduce or eliminate the interference on the association between Y and Z. The expression levels of gene j can be described using the following linear model,

$$y_j = X\beta_j + Z\gamma_j + \varepsilon_j \tag{2}$$

where $\beta_j$ is a $p \times 1$ vector for the effects of cofactors and $\gamma_j$ is a $q \times 1$ vector for the regression coefficients of gene j on all the q traits. The residual error $\varepsilon_j$ is an $n \times 1$ vector with an assumed $N(0, I\sigma^2)$ distribution. This assumption is very common in linear model analysis. The DH plants are indeed independent samples from the line cross of the barley experiment. In the special case of one phenotype with no cofactors, $p = q = 1$ and X is a vector of unity with a dimensionality of $n \times 1$.

We now assign prior distributions to the parameters included in the linear model. The prior distribution for $\beta_j$ is

$$\beta_j \sim N(u_\beta, \Sigma_\beta) \tag{3}$$

where $u_\beta$ is a $p \times 1$ vector of means and $\Sigma_\beta$ is an unknown $p \times p$ positive definite variance matrix. The prior distribution of $\gamma_j$ is a Gaussian mixture with two components,

$$\gamma_j \sim \pi N(0, \Sigma_1) + (1 - \pi) N(0, \Sigma_0) \tag{4}$$

In the above Gaussian mixture, $\Sigma_0 = I_q\omega$ is a known diagonal matrix with a common $\omega = 10^{-8}$ across all the diagonal elements. In other words, $\Sigma_0$ is a known positive definite matrix with values close to zero, and the value can be changed according to the investigator's preference. For the other cluster, $\Sigma_1$ is an unknown positive definite variance matrix. This Gaussian mixture prior divides all the genes into two clusters, one (cluster 1) being associated with the phenotypes and the other (cluster 0) not associated with the phenotypes. Variable $\pi$ ($0 < \pi < 1$) is the prior probability that a gene randomly selected from the pool belongs to cluster 1. A gene classified into cluster 1 is claimed to be associated with the traits, while genes classified into cluster 0 are not associated with the traits. The actual parameters involved in the problem are denoted by

$$\theta = \{u_\beta, \Sigma_\beta, \Sigma_1, \sigma^2, \pi\} \tag{5}$$

We are also interested in $\rho_j$, the posterior probability of gene j being associated with the traits. The relationship between the posterior $\rho_j$ and the prior $\pi$ will be presented later.

Let $\delta_j \sim \text{Bernoulli}(\rho_j)$, for $0 < \rho_j < 1$, be an indicator variable for the cluster membership of gene j. It is defined as

$$\delta_j = \begin{cases} 1 & \text{if } j \text{ belongs to cluster 1} \\ 0 & \text{if } j \text{ belongs to cluster 0} \end{cases} \tag{6}$$

Given $\delta_j$, the genetic effect $\gamma_j$ has the following distribution,

$$\gamma_j \sim \delta_j N(0, \Sigma_1) + (1 - \delta_j) N(0, \Sigma_0) \tag{7}$$


Table 1. Numbers of genes associated with individual traits in the barley microarray data analysis using the two-cluster SEM algorithm and the three-cluster EM algorithm.

| Trait | SEM: Number | SEM: Proportion | EM: Number | EM: Proportion |
|---|---|---|---|---|
| Alpha-amylase | 644 | 2.82% | 233 | 1.02% |
| Diastatic power | 467 | 2.04% | 195 | 0.85% |
| Grain protein | 888 | 3.89% | 257 | 1.13% |
| Grain yield | 457 | 2.00% | 139 | 0.61% |
| Heading date | 401 | 1.76% | 166 | 0.73% |
| Height | 784 | 3.43% | 264 | 1.16% |
| Lodging | 650 | 2.85% | 173 | 0.76% |
| Malt extract | 526 | 2.30% | 216 | 0.95% |

Note that this conditional distribution is not a Gaussian mixture because the membership is already known. The indicator variable $\delta_j$ tells which Gaussian component $\gamma_j$ belongs to. Also note that the parameter vector does not include $\beta_j$ and $\gamma_j$. Under the random model framework, $\beta_j$ and $\gamma_j$ are treated as missing values. If they are integrated out, the density of $y_j$ given the cluster membership is normal with mean and variance shown in the following normal density

$$p(y_j \mid \theta, \delta_j) = N\{y_j \mid Xu_\beta,\; X\Sigma_\beta X^T + Z\Theta_j Z^T + I\sigma^2\} \tag{8}$$

where

$$\Theta_j = \delta_j\Sigma_1 + (1 - \delta_j)\Sigma_0 \tag{9}$$

and the notation $N\{y \mid a, b\}$ stands for the normal density of variable y with mean a and variance b. Given the cluster membership, $\delta = \{\delta_j\}$, the log likelihood function for the entire dataset is

$$L(\theta \mid \delta) = \sum_{j=1}^{m} \ln p(y_j \mid \theta, \delta_j) \tag{10}$$

The maximum likelihood estimates (MLE) of the parameters are obtained through a two-step approach. The first step is to estimate the parameters by maximizing the above log likelihood function given $\delta = \{\delta_j\}$ through the regular expectation-maximization (EM) algorithm (Dempster, et al., 1977). This step is called the EM step. The second step is to stochastically simulate $\delta = \{\delta_j\}$ from its conditional posterior distribution. This step is called the stochastic step. The two steps are repeated iteratively until a stationary distribution is reached for each parameter, an algorithm called the stochastic expectation and maximization (SEM) algorithm (Celeux and Diebolt, 1986). The stochastic step and the EM step are performed sequentially, not in parallel.
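The mean and variance in equation (8) follow from the linear model (2) and the priors (3) and (7). The short derivation below is a sketch that assumes $\beta_j$, $\gamma_j$ and $\varepsilon_j$ are mutually independent, which the model implies but the text does not state explicitly.

```latex
\begin{aligned}
E(y_j \mid \theta, \delta_j)
  &= X\,E(\beta_j) + Z\,E(\gamma_j \mid \delta_j) + E(\varepsilon_j)
   = X u_\beta, \\
\operatorname{var}(y_j \mid \theta, \delta_j)
  &= X \operatorname{var}(\beta_j) X^T
   + Z \operatorname{var}(\gamma_j \mid \delta_j) Z^T
   + \operatorname{var}(\varepsilon_j) \\
  &= X \Sigma_\beta X^T + Z \Theta_j Z^T + I\sigma^2 .
\end{aligned}
```

Because $y_j$ is a linear combination of independent Gaussian vectors, it is itself Gaussian, which gives the normal density in equation (8).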

2.3 Stochastic sampling

The density of $y_j$ defined in equation (8) can be split into the following two densities,

$$p_1(y_j \mid \theta) = p(y_j \mid \theta, \delta_j = 1) = N(y_j \mid Xu_\beta,\; X\Sigma_\beta X^T + Z\Sigma_1 Z^T + I\sigma^2) \tag{11}$$

and

$$p_0(y_j \mid \theta) = p(y_j \mid \theta, \delta_j = 0) = N(y_j \mid Xu_\beta,\; X\Sigma_\beta X^T + Z\Sigma_0 Z^T + I\sigma^2) \tag{12}$$

The posterior probability that $\delta_j = 1$ is

$$\rho_j = E(\delta_j \mid y_j, \theta) = \frac{\pi\, p_1(y_j \mid \theta)}{\pi\, p_1(y_j \mid \theta) + (1 - \pi)\, p_0(y_j \mid \theta)} \tag{13}$$

Because $\delta_j$ is a Bernoulli variable, it is sampled from the

$$p(\delta_j) = \text{Bernoulli}(\delta_j \mid \rho_j) \tag{14}$$

distribution. Once $\delta = \{\delta_j\}$ are sampled for all genes in the stochastic process, we can proceed with the EM algorithm described below.
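The stochastic step of equations (13) and (14) can be sketched as follows. This is a minimal illustration, not the authors' R code: the function names are hypothetical, and the two log densities are assumed to have been computed elsewhere from equations (11) and (12). Working with log densities is a design choice made here because multivariate normal densities of n = 150 observations easily underflow.

```python
import math
import random


def posterior_prob(log_p1, log_p0, pi):
    """Equation (13), evaluated from log densities.

    log_p1, log_p0: ln p1(y_j | theta) and ln p0(y_j | theta)
    pi: prior probability of cluster 1, with 0 < pi < 1
    """
    a = math.log(pi) + log_p1
    b = math.log(1.0 - pi) + log_p0
    top = max(a, b)  # subtract the larger term so exp() cannot overflow
    wa, wb = math.exp(a - top), math.exp(b - top)
    return wa / (wa + wb)


def sample_delta(rho, rng=random):
    """Equation (14): draw delta_j from Bernoulli(rho_j)."""
    return 1 if rng.random() < rho else 0


rng = random.Random(1)  # fixed seed, used only to make the sketch repeatable
rho = posterior_prob(log_p1=-1200.0, log_p0=-1203.0, pi=0.1)  # hypothetical values
delta = sample_delta(rho, rng)
```

When the two densities are equal, the posterior probability reduces to the prior $\pi$, which is a quick sanity check on any implementation of equation (13).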

2.4 EM algorithm

The EM algorithm for the Gaussian mixture model is standard (Dempster, et al., 1977) and thus we only provide the EM steps without proof. Denote the variance-covariance matrix of $\gamma_j$ conditional on $\delta_j$ by

$$\Theta_j = \operatorname{var}(\gamma_j \mid \delta_j) = \delta_j\Sigma_1 + (1 - \delta_j)\Sigma_0 \tag{15}$$

Let us define

$$V_j = X\Sigma_\beta X^T + Z\Theta_j Z^T + I\sigma^2 \tag{16}$$

We now provide the formulas for updating each parameter using the EM algorithm. Given $\delta_j$, the updated proportion of genes coming from cluster 1 is

$$\pi = \frac{1}{m}\sum_{j=1}^{m}\delta_j \tag{17}$$

The population mean $u_\beta$ is updated using

$$u_\beta = \left[\sum_{j=1}^{m} X^T V_j^{-1} X\right]^{-1}\left[\sum_{j=1}^{m} X^T V_j^{-1} y_j\right] \tag{18}$$

The variance-covariance matrix of $\beta_j$ is denoted by $\Sigma_\beta$ and updated using

$$\Sigma_\beta = \frac{1}{m}\sum_{j=1}^{m} E(\beta_j\beta_j^T) = \frac{1}{m}\sum_{j=1}^{m}\left[E(\beta_j)E(\beta_j^T) + \operatorname{var}(\beta_j)\right] \tag{19}$$

where

$$E(\beta_j) = \Sigma_\beta X^T V_j^{-1}(y_j - Xu_\beta) \tag{20}$$

and

$$\operatorname{var}(\beta_j) = \Sigma_\beta - \Sigma_\beta X^T V_j^{-1} X\Sigma_\beta \tag{21}$$

Given $\delta_j$, the unknown variance-covariance matrix of $\gamma_j$ is $\Theta_j = \delta_j\Sigma_1 + (1 - \delta_j)\Sigma_0$. However, the corresponding matrix $\Sigma_0$ is a constant. Therefore, we only need to update $\Sigma_1$ using all $\gamma_j$ that come from cluster 1. The updating equation for $\Sigma_1$ is

$$\Sigma_1 = \frac{1}{m\pi}\sum_{j=1}^{m}\delta_j E(\gamma_j\gamma_j^T) = \frac{1}{m\pi}\sum_{j=1}^{m}\delta_j\left[E(\gamma_j)E(\gamma_j^T) + \operatorname{var}(\gamma_j)\right] \tag{22}$$

where

$$E(\gamma_j) = \Sigma_1 Z^T V_j^{-1}(y_j - Xu_\beta) \tag{23}$$

and

$$\operatorname{var}(\gamma_j) = \Sigma_1 - \Sigma_1 Z^T V_j^{-1} Z\Sigma_1 \tag{24}$$


Fig. 1. ROC curve comparing the SEM and EM algorithms in the simulation study.

The residual error variance is updated using

$$\sigma^2 = \frac{1}{mn}\sum_{j=1}^{m}(y_j - Xu_\beta)^T\left[y_j - Xu_\beta - XE(\beta_j) - Z\delta_j E(\gamma_j)\right] \tag{25}$$

The E-step of the EM algorithm consists of calculating $E(\beta_j)$, $E(\gamma_j)$, $\operatorname{var}(\beta_j)$ and $\operatorname{var}(\gamma_j)$. The M-step consists of calculating $\theta = \{u_\beta, \Sigma_\beta, \Sigma_1, \sigma^2, \pi\}$. So far all parameters have been updated. We can now combine the stochastic steps with the EM steps to complete the analysis. Note again that the stochastic and EM steps are performed sequentially and repeated many times until a stationary distribution for each parameter is reached.
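The alternation of the stochastic step and the maximization step can be illustrated on a deliberately simplified scalar analogue. The sketch below is not the model of equation (2): it assumes each gene is summarized by a single observed effect $g_j \sim \pi N(0, s_1) + (1 - \pi) N(0, s_0)$ with $s_0$ known, so no $\beta_j$ or $\gamma_j$ has to be integrated out and the maximization step has a closed form. All names, starting values and simulated data are hypothetical.

```python
import math
import random


def log_normal0(x, var):
    """Log density of N(0, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + x * x / var)


def sem_toy(g, s0, n_burn=50, n_keep=100, seed=0):
    """Toy SEM for g_j ~ pi*N(0, s1) + (1 - pi)*N(0, s0) with s0 known.

    Returns pi and s1 averaged over the n_keep iterations after burn-in,
    together with the averaged membership probability of every gene.
    """
    rng = random.Random(seed)
    m = len(g)
    pi = 0.5                                        # crude starting values
    s1 = max(sum(x * x for x in g) / m, 2.0 * s0)
    pi_sum = s1_sum = 0.0
    rho_sum = [0.0] * m
    for it in range(n_burn + n_keep):
        # Stochastic step: membership probability, then a Bernoulli draw
        rho = []
        for x in g:
            a = math.log(pi) + log_normal0(x, s1)
            b = math.log(1.0 - pi) + log_normal0(x, s0)
            top = max(a, b)
            wa, wb = math.exp(a - top), math.exp(b - top)
            rho.append(wa / (wa + wb))
        delta = [1 if rng.random() < r else 0 for r in rho]
        # Maximization step given the sampled memberships
        n1 = sum(delta)
        if 0 < n1 < m:  # keep the previous values if one cluster is empty
            pi = n1 / m
            s1 = max(sum(x * x for x, d in zip(g, delta) if d) / n1, 2.0 * s0)
        if it >= n_burn:  # collect the sample after the burn-in period
            pi_sum += pi
            s1_sum += s1
            for j, r in enumerate(rho):
                rho_sum[j] += r
    return pi_sum / n_keep, s1_sum / n_keep, [r / n_keep for r in rho_sum]


# Simulated data: 200 associated genes (variance 1) and 1800 unassociated (variance 0.01)
sim = random.Random(42)
g = [sim.gauss(0.0, 1.0) for _ in range(200)] + [sim.gauss(0.0, 0.1) for _ in range(1800)]
pi_hat, s1_hat, rho_hat = sem_toy(g, s0=0.01)
```

In this toy setting the parameter values keep fluctuating after burn-in because the memberships are redrawn in every iteration, which is why the estimates are averages over the retained iterations rather than the values at the last iteration.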

2.5 SEM estimate

The SEM algorithm differs from the classical EM algorithm in that the parameters do not converge to some fixed values; rather, they converge to a stationary distribution due to the stochastic process of $\delta = \{\delta_j\}$. We can monitor the converging process for each parameter. Once all parameters have converged, we start to collect the posterior sample for $\theta$. Unlike the posterior sample for the fully Bayesian analysis, the observations within the posterior sample for the SEM algorithm are not correlated. The posterior sample size, denoted by T, does not have to be large; $T = 100$ seems to be sufficient. Let $\theta^{(t)} = \{u_\beta^{(t)}, \Sigma_\beta^{(t)}, \Sigma_1^{(t)}, \sigma^{2(t)}, \pi^{(t)}\}$ be the t-th observation in the posterior sample (after convergence). The estimated parameter vector of the SEM algorithm is

$$\hat\theta = \frac{1}{T}\sum_{t=1}^{T}\theta^{(t)} \tag{26}$$

The most important quantity of the SEM analysis is not the entire vector $\theta$; rather, it is

$$\hat\rho_j = \frac{1}{T}\sum_{t=1}^{T}\rho_j^{(t)} \approx \frac{1}{T}\sum_{t=1}^{T}\delta_j^{(t)} \tag{27}$$

because it represents the posterior probability that gene j belongs to cluster 1, i.e., gene j is associated with the quantitative traits. These $\hat\rho_j$'s are ranked in descending order. The top proportion of genes is selected as candidate genes associated with the phenotypes. The proportion is defined by the investigator in an arbitrary manner. Some objective criteria, e.g., false discovery rate (FDR), may be used, but this is not the focus of this study. Here, we simply chose an arbitrary cut-off value of $\rho_j \geq 0.9$ to declare significant association.
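The averaging in equation (27), the ranking and the 0.9 cut-off can be sketched as below. The function name `sem_summary` and the small input are hypothetical; the input is assumed to hold the per-gene probabilities collected over the T retained iterations.

```python
def sem_summary(rho_samples, cutoff=0.9):
    """Average per-gene probabilities over T iterations (equation 27) and
    return the genes at or above the cut-off, ranked in descending order.

    rho_samples: list of T lists, each holding rho_j^(t) for all m genes.
    """
    T = len(rho_samples)
    m = len(rho_samples[0])
    rho_hat = [sum(sample[j] for sample in rho_samples) / T for j in range(m)]
    ranked = sorted(range(m), key=lambda j: rho_hat[j], reverse=True)
    selected = [j for j in ranked if rho_hat[j] >= cutoff]
    return rho_hat, selected


# Hypothetical values for m = 4 genes over T = 2 retained iterations
samples = [[1.0, 0.25, 1.0, 0.5],
           [1.0, 0.75, 0.875, 0.5]]
rho_hat, selected = sem_summary(samples)
print(rho_hat)   # [1.0, 0.5, 0.9375, 0.5]
print(selected)  # [0, 2]
```

Replacing the fixed cut-off with an FDR-based threshold would only change the line that builds `selected`.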

2.6 Expression quantitative trait locus (eQTL) mapping

The gene expression levels can be treated as quantitative traits and QTL mapping can be performed on each transcript, so-called eQTL mapping (Kendziorski, et al., 2006). The problem with the eQTL analysis is that the large number of expression traits makes eQTL mapping very difficult. The SEM algorithm developed for the quantitative trait associated microarray data analysis can be extended to eQTL mapping with limited modification. This section describes the application of the SEM algorithm to eQTL mapping.

Consider Q markers with known map positions and the genotypes for all the n individuals. We will study the association of all the m transcripts simultaneously with the kth marker for $k = 1, \ldots, Q$. The approach is similar to interval mapping, in which one marker is studied at a time (Lander and Botstein, 1989). The entire eQTL mapping will take Q separate analyses. Using the same model as given in equation (2), we now replace the phenotype Z by the numerically coded genotype of marker k, denoted by $Z_k$, so that

$$y_j = X\beta_j + Z_k\gamma_{jk} + \varepsilon_j \tag{28}$$

where $\gamma_{jk}$ is the QTL effect for transcript j at marker k. The $Z_k$ variable is defined as

$$Z_k = \begin{cases} +1 & \text{for } A_1A_1 \\ -1 & \text{for } A_2A_2 \end{cases} \tag{29}$$

where $A_1A_1$ is the first genotype and $A_2A_2$ is the second genotype of marker k. The barley population under study is a doubled haploid population and thus only two genotypes exist for each locus.
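The genotype coding of equation (29) and the one-marker-at-a-time layout can be sketched as follows. The string labels `"A1A1"` and `"A2A2"`, the function name and the small genotype table are assumptions made for illustration; real marker files will use their own encoding.

```python
def code_marker(genotypes):
    """Equation (29): +1 for genotype A1A1, -1 for genotype A2A2."""
    codes = {"A1A1": 1, "A2A2": -1}  # hypothetical string labels
    return [codes[g] for g in genotypes]  # any other label raises KeyError


# Hypothetical genotypes: rows are DH lines, columns are Q = 3 markers
genotype_table = [
    ["A1A1", "A2A2", "A1A1"],
    ["A2A2", "A2A2", "A1A1"],
]

# One coded vector Z_k per marker, so the analysis can loop over k = 1, ..., Q
z_by_marker = [code_marker(column) for column in zip(*genotype_table)]
print(z_by_marker)  # [[1, -1], [-1, -1], [1, 1]]
```

Each element of `z_by_marker` plays the role of $Z_k$ in equation (28) for one of the Q separate analyses.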

We now have to analyze the data Q times, once for each marker. Previously, we had a single $\pi$ for the proportion of genes associated with the phenotype. We now have Q such $\pi$'s to indicate the proportions of transcripts associated with all markers, denoted by a vector

$$\pi = [\pi_1 \; \cdots \; \pi_Q] \tag{30}$$

In addition, the posterior probability of gene j being associated with marker k is denoted by $\rho_{jk}$. These parameters $\{\pi_k, \rho_{jk}\}$ are important to the eQTL mapping. The SEM algorithm remains the same as before except that we must analyze the data Q times (once for each marker).

3 RESULTS

3.1 Single trait association

Since each trait was measured in multiple environments, we took the average of the phenotypic values across the environments as the phenotypic values that entered the linear model for analysis. Therefore, the Z matrix for each trait analysis was an $n \times 1 = 150 \times 1$ vector of the average phenotypic values (rescaled between -1 and +1). The number of transcripts (genes) measured in the experiment was $m = 22840$. Since no other cofactor existed except the intercept, $\beta_j$ is a scalar with dimension $p = 1$. For the single trait analysis, $q = 1$ for each trait analysis. Each of the five parameters $\theta = \{u_\beta, \Sigma_\beta, \Sigma_1, \sigma^2, \pi\}$ is of single dimension.

Both the (two-cluster) SEM algorithm developed here and the three-cluster EM algorithm of Jia and Xu (2005) were used for the single trait association study. The EM algorithm of Jia and Xu (2005) classified each gene into one of three clusters. The three clusters shared a common variance of the regression coefficient but had three different means. The three cluster means were restricted to a negative value for the first cluster, zero for the second cluster and a positive value for the third cluster. Genes classified into either the first or the third cluster are differentially expressed genes. The criterion of detection for each gene was that the posterior probability of being in cluster 2 (the neutral cluster) was less than 0.1. This is equivalent to the criterion of $\rho_j \geq 0.9$ in the SEM analysis.

Table 2. The estimated regression coefficients for the top ten genes jointly associated with all eight traits. The partial regression coefficients of each gene on the eight traits are denoted by $\gamma_1, \ldots, \gamma_8$.

| Rank | Probe set ID | $\gamma_1$ | $\gamma_2$ | $\gamma_3$ | $\gamma_4$ | $\gamma_5$ | $\gamma_6$ | $\gamma_7$ | $\gamma_8$ | F-test | p-value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Contig1132_s_at | -2.99 | 0.99 | 3.42 | -1.99 | -0.25 | 0.75 | -0.40 | -0.05 | 945.744 | <0.0001 |
| 2 | Contig124_at | 2.32 | -0.68 | -2.93 | 1.76 | 0.38 | -0.55 | 0.41 | 0.12 | 581.689 | <0.0001 |
| 3 | Contig2163_at | -0.12 | -0.77 | -1.88 | 0.62 | -0.89 | 0.45 | 2.15 | 1.62 | 543.998 | <0.0001 |
| 4 | Contig11524_at | 0.71 | -0.83 | 1.18 | -0.10 | 2.30 | 0.61 | 1.62 | -1.63 | 492.936 | <0.0001 |
| 5 | Contig2279_at | 0.10 | 0.35 | -0.72 | 0.80 | -0.82 | -2.05 | -1.51 | 4.94 | 481.062 | <0.0001 |
| 6 | Contig4769_at | 2.67 | -0.18 | -2.32 | 1.23 | 0.55 | -0.82 | 0.28 | 0.21 | 435.547 | <0.0001 |
| 7 | Contig23772_s_at | 1.81 | -0.67 | -2.53 | 1.50 | 0.13 | -0.55 | 0.59 | 0.003 | 421.314 | <0.0001 |
| 8 | HVSMEf0020F06r2_at | 0.55 | -0.74 | 0.92 | -0.16 | 2.16 | 0.75 | 1.33 | -1.45 | 396.722 | <0.0001 |
| 9 | Contig11126_at | -0.44 | -0.21 | 0.62 | -0.86 | 0.88 | 1.91 | 1.09 | -4.10 | 348.061 | <0.0001 |
| 10 | Contig4526_at | -0.45 | -1.17 | 1.11 | 0.88 | 1.27 | 1.38 | 0.58 | -1.35 | 347.999 | <0.0001 |

The numbers of genes associated with each of the eight traits are listed in Table 1 for the SEM and the EM algorithms. We can see that the SEM algorithm consistently detected more genes than the EM algorithm. The result of the SEM algorithm shows that more genes are associated with height and grain protein than with other traits. The heading date trait has the fewest associated genes. The estimated parameters for all the eight traits obtained from the separate SEM analyses are listed in Table S1. Lists of all the detected genes associated with the traits and gene annotations are given in Table S2 (Sheet1-Sheet8) for interested readers.

In order to compare the performance of the SEM model and the EM model, we carried out a simulation experiment based on the SEM model by using the estimated parameters obtained from the barley data (grain yield) analysis. The ROC curve (Figure 1) shows that both the SEM and the EM models have high sensitivity and specificity, but the SEM model performs better and has higher sensitivity. It is well known that the EM algorithm tends to converge to some sub-optimal values which are close to the initial values. We set different parameters and simulated several datasets based on the EM model to test the convergence of parameters. When parameters are precisely estimated, the EM model is able to identify most of the associated genes. If parameters converge to some local optimal values that are different from the true values, the sensitivity is quite low. However, the SEM model can identify most associated genes in both cases and has high sensitivity and specificity, which are similar to the EM results when parameters were estimated well. We also performed a permutation test for the real data analysis by permuting the phenotypes (grain yield) to test the specificity of the two methods. After 100 permutations of the grain yield, we took the average of the posterior probabilities generated from the 100 permutations for each gene and found that no gene had a probability exceeding the cut-off point (0.9), which indicates that the SEM method also has good specificity in real data analysis.

We chose the grain yield as an example to demonstrate the converging process of the estimated parameters. The trace plots (parameters against the iteration) are depicted in Figure 2 for all the five parameters and the regression coefficient of gene AF250937_s_at. From the trace plots, we can see that all parameters converged in about 10 iterations. After the parameters converged to their stationary distributions, they fluctuated around the mean values, and the means are the SEM estimates of the parameters.

We also present the predicted regression coefficients obtained with the SEM analysis and the scatter plots of the observed gene expressions for the genes associated with each trait (Figure S1). Some genes have positive associations with the traits, e.g., AF250937_s_at and Contig6445_at, and some have negative associations with the traits, e.g., Contig23592_at and Contig3295_at.

3.2 Multiple traits association study

For the joint association study of eight traits, the dimensionality of the parameters increased to $p = 8$ and $q = 8$. Therefore, $u_\beta$ is an $8 \times 1$ vector, $\Sigma_\beta$ is an $8 \times 8$ matrix and $\Sigma_1$ is an $8 \times 8$ matrix. The rest of the parameters, $\pi$ and $\sigma^2$, remain scalars. The Z matrix is a $150 \times 8$ matrix for all the eight traits measured from all the 150 DH lines. Only the SEM algorithm was used in this analysis because the EM algorithm of Jia and Xu (2005) cannot be applied to a multiple trait association study.

Using the same criterion of $\rho_j \geq 0.9$ to declare significant association, we detected a total of 1646 genes that are jointly associated with the eight traits, accounting for 7.2% of all the 22840 genes included in the analysis. The list of associated genes is given in the supplemental data (Table S3) for interested readers. The estimated parameters are $\hat\pi = 0.089$ and $\hat\sigma^2 = 0.067$ for the proportion and residual error variance, respectively. The estimated $u_\beta$ and $\Sigma_\beta$ are $\hat u_\beta = 7.978$ and $\hat\Sigma_\beta = 4.652$, respectively. The estimated variance matrix of the differentially expressed cluster, $\hat\Sigma_1$, has the element magnitudes shown below; the minus signs of its negative off-diagonal elements were detached from their entries in this copy and cannot be reassigned.

$$\begin{bmatrix}
0.107 & 0.002 & 0.027 & 0.023 & 0.027 & 0.074 & 0.033 & 0.023 \\
0.002 & 0.052 & 0.026 & 0.009 & 0.007 & 0.002 & 0.013 & 0.005 \\
0.027 & 0.026 & 0.204 & 0.095 & 0.0003 & 0.017 & 0.026 & 0.012 \\
0.023 & 0.009 & 0.095 & 0.084 & 0.015 & 0.009 & 0.026 & 0.008 \\
0.027 & 0.007 & 0.0003 & 0.015 & 0.096 & 0.039 & 0.025 & 0.024 \\
0.074 & 0.002 & 0.017 & 0.009 & 0.039 & 0.198 & 0.097 & 0.048 \\
0.033 & 0.013 & 0.026 & 0.026 & 0.025 & 0.097 & 0.193 & 0.054 \\
0.023 & 0.005 & 0.012 & 0.008 & 0.024 & 0.048 & 0.054 & 0.151
\end{bmatrix}$$

Fig. 2. The SEM convergence processes of five parameters for the grain yield trait and the regression coefficient ($\gamma$) of gene AF250937_s_at (the gene having strong association with grain yield).

Note that the proportion of genes detected (7.2%) in the experiment is not the same as $\hat\pi = 8.9\%$ because the former depends on the cut-off point ($\rho_j \geq 0.9$) used for gene declaration, whereas the latter represents the probability that a randomly selected gene belongs to the associated cluster and does not depend on the cut-off point. In the simulation study, we used parameters estimated from the barley data to simulate 5000 genes (100 individuals) based on the SEM model. The SEM algorithm identified all associated genes, which indicated that SEM does have high sensitivity and specificity. By randomly permuting the eight traits, we tested the specificity of SEM in real data analysis. Among the total of 22840 genes, 521 genes still had significant effects in the permuted data. The false positive rate is $521/22840 = 0.022811$, which is reasonably low.
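The two proportions quoted in this paragraph can be checked with a few lines of arithmetic; the variable names are illustrative only.

```python
m = 22840          # transcripts included in the analysis
detected = 1646    # genes declared jointly associated with the eight traits
false_hits = 521   # genes still declared significant after permutation

detected_pct = round(100.0 * detected / m, 1)
false_positive_rate = round(false_hits / m, 6)
print(detected_pct, false_positive_rate)  # 7.2 0.022811
```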

Interestingly, all genes associated with the 8 traits in the single

trait analysis were also detected in the joint analysis, demonstrating

the high efficiency of the joint analysis. The estimated regression

coefficients for the top ten genes jointly associated with all traits

are listed in Table 2 along with the F-test statistics and the p-

values. The F-test statistic was calculated using

$$
\hat{F}_j = \frac{1}{q}\,\hat{\gamma}_j^{T}\left(\hat{\Sigma}_1 - \hat{\Sigma}_1 Z^{T}\hat{V}_j^{-1} Z\,\hat{\Sigma}_1\right)^{-1}\hat{\gamma}_j \tag{31}
$$

where

$$
\hat{\gamma}_j = \hat{\Sigma}_1 Z^{T}\hat{V}_j^{-1}\left(y_j - X\hat{\mu}_\beta\right) \tag{32}
$$

and

$$
\hat{V}_j = X\hat{\Sigma}_\beta X^{T} + Z\hat{\Sigma}_1 Z^{T} + I\hat{\sigma}^2 \tag{33}
$$

The p-value was calculated using

$$
p\text{-value} = 1 - F_{8,\infty}(\hat{F}_j) \tag{34}
$$

where $F_{8,\infty}(x)$ is the cumulative distribution function of the central F distribution with numerator degrees of freedom q = 8 and denominator degrees of freedom ∞.
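Equation (34) can be evaluated without special software, as the following sketch suggests. It relies on a standard fact that is not stated in the text: if F follows an F distribution with q and ∞ degrees of freedom, then qF follows a chi-square distribution with q degrees of freedom, and for even q the chi-square upper tail is a finite sum. The function name `p_value_f_inf` is hypothetical and the code is an illustration, not the authors' R program.

```python
import math


def p_value_f_inf(f_stat: float, q: int = 8) -> float:
    """Upper-tail probability 1 - F_{q,inf}(f_stat), valid for even q only."""
    if q <= 0 or q % 2 != 0:
        raise ValueError("this closed form requires a positive even q")
    if f_stat <= 0:
        return 1.0
    half_x = q * f_stat / 2.0  # x / 2, where x = q * F is chi-square with q df
    # For even q: P(chi2_q > x) = exp(-x/2) * sum_{i=0}^{q/2-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, q // 2):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


print(p_value_f_inf(1.0))        # a moderate F statistic with q = 8
print(p_value_f_inf(1436.718))   # underflows to 0.0 in double precision
```

For very large statistics the exponential underflows in double precision, so the computed p-value is reported as zero even though the exact value is positive.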

3.3 Expression quantitative trait locus (eQTL) mapping

The purpose of the eQTL mapping is to identify the locations of

the genome that control the expressions of the transcripts. We used

the results of the previous analysis to reduce the number of tran-

scripts for the eQTL analysis. For example, out of the 22840 tran-

scripts, we identified 888 genes that are associated with the grain

protein trait. The eQTL mapping was then targeted to these 888

transcripts. This has dramatically reduced the number of transcripts

for eQTL mapping related to the grain protein trait. Recall that

Table 1 gives the number of transcripts associated with each of the

eight traits. The eQTL mapping for each trait was only conducted

on the identified transcripts. The barley genome contains seven

chromosomes. The total number of SNP markers investigated was

Q = 495 with an average marker interval less than 2 centiMorgan,

covering the entire barley genome.

The entire eQTL analysis took about nine hours on an HP Pavilion dv4 computer with an Intel Core Duo P8400 CPU (2.27 GHz). The eQTL analysis produced too much information to present in full. Figure 3 (a-d) shows the plots of the proportions of transcripts associated with markers for four of the eight traits, and Figure S2 (e-h) shows the plots for the remaining four traits. Here, we used the grain protein and yield traits as examples to describe the plots. For the grain protein trait, three chromosomes (2, 3 and 5) seem to control more genes than the other chromosomes. For example, the central region of chromosome 2 is associated with almost 50% of the 888 transcripts. This region is considered a hot spot. For the yield trait, chromosome 3 is the only one with a clear excess of associated transcripts. The hot spot is located in the middle of the chromosome and it controls the expression of about 80% of the 457 transcripts.
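Reading hot spots off the plots amounts to picking the markers with the largest proportions of associated transcripts. A minimal sketch of that step is given below. The marker names, positions and proportions are hypothetical, and the threshold is an assumption made for illustration, since the text does not define a numerical criterion for a hot spot.

```python
from typing import List, Tuple

# (marker name, chromosome, position in cM, proportion of associated transcripts)
Marker = Tuple[str, int, float, float]


def find_hot_spots(markers: List[Marker], threshold: float = 0.5) -> List[Marker]:
    """Return markers with proportion >= threshold, largest proportion first."""
    hits = [m for m in markers if m[3] >= threshold]
    return sorted(hits, key=lambda m: m[3], reverse=True)


# Hypothetical values, not taken from the barley results.
markers = [
    ("m1", 2, 10.0, 0.03),
    ("m2", 2, 80.0, 0.48),
    ("m3", 3, 75.0, 0.81),
    ("m4", 5, 40.0, 0.12),
]

hot = find_hot_spots(markers, threshold=0.4)
for name, chrom, pos, prop in hot:
    print(f"{name}: chromosome {chrom}, {pos} cM, proportion {prop:.2f}")
```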


Other information about this eQTL analysis is provided in Table S4 (Sheet1-Sheet8) and Table S5 (Sheet1-Sheet8). The additional information includes the eQTL effects for each transcript across the genome (Table S5) and the posterior probability of each transcript being associated with each marker (Table S4). The supplemental tables can serve as a reference database for barley biologists to further study the gene networks for the eight quantitative traits.

4 DISCUSSION

We adopted a new statistical method (SEM) for quantitative trait associated microarray data analysis. We used the method to analyze 22840 microarrayed genes for association with eight quantitative traits in barley. Many genes were identified as associated with these traits. The actual functions of these genes in barley were not known prior to this study. These genes are provided in Table S2 (Sheet1-Sheet8) and Table S3. For example, among the 22840 genes, 888 are related to grain protein content. This dataset provides much information for barley biologists to further study these genes. The functions of some genes are known in rice. For example, according to BLASTX results, transcript 22767 (rbah35f01_s_at) corresponds to the gene encoding cyclopropane-fatty-acyl-phospholipid synthase in rice. This gene is strongly associated with grain protein in barley, with an F-test statistic of 1436.718 and a p-value indistinguishable from zero.

The same SEM algorithm for phenotype associated microarray data analysis has been applied to eQTL mapping with virtually no modification. The eQTL mapping conducted was still an interval mapping approach where one marker is analyzed at a time. However, all the transcripts were analyzed simultaneously. This is already a significant improvement over the traditional interval mapping for QTL where one transcript was analyzed at a time (Kendziorski, et al., 2006). Results of the eQTL analysis are provided in Table S4 (Sheet1-Sheet8) and Table S5 (Sheet1-Sheet8). This dataset will help barley biologists to infer gene networks for these quantitative traits. Transcripts simultaneously associated with one marker belong to the same network (or pathway) because they are all controlled by the segregation of the same locus. For example, marker ABC152D (83.1 cM) on chromosome 2 controls the expression of about 50% of the transcripts associated with grain protein. These transcripts are presumably in the same pathway for the development of grain protein.
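The grouping of transcripts by shared marker can be sketched as follows. The sketch assumes that the posterior probabilities of association between transcripts and markers are available as a mapping, and it reuses the 0.9 cut-off from the trait association analysis as an assumption. The identifiers and values are hypothetical and are not read from Table S4.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


def transcripts_by_marker(
    rho: Dict[Tuple[str, str], float], cutoff: float = 0.9
) -> Dict[str, Set[str]]:
    """Group transcripts under each marker where the posterior probability >= cutoff."""
    groups = defaultdict(set)
    for (transcript, marker), prob in rho.items():
        if prob >= cutoff:
            groups[marker].add(transcript)
    return dict(groups)


# Hypothetical (transcript, marker) -> posterior probability values.
rho_jk = {
    ("t1", "mA"): 0.97,
    ("t2", "mA"): 0.93,
    ("t3", "mA"): 0.40,
    ("t3", "mB"): 0.95,
}

candidate_networks = transcripts_by_marker(rho_jk)
for marker, transcripts in sorted(candidate_networks.items()):
    print(marker, sorted(transcripts))
```

Each resulting set is only a candidate network in the sense used above: its members share a controlling locus, which by itself does not establish how they interact.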

Fig. 3. Proportions of transcripts associated with markers for the first four traits (Amylase, Diastatic power, Grain protein and Heading date).

The chromosomes are separated by the dotted vertical reference lines.


Theoretically, the method can analyze all markers simultaneously using a single model. This is the all-transcript-and-all-marker model. Practically, however, it is difficult to handle the large matrix repeatedly in the SEM algorithm, because its dimensionality grows with the number of markers. The theory is identical to that of the joint analysis of all eight traits (an 8 × 8 matrix). Further study on this simultaneous analysis is needed for the SEM algorithm. The MCMC implemented Bayesian eQTL mapping (Jia and Xu, 2007) can be adopted here, but it is a sampling based method and is computationally time consuming. This study focused on the SEM algorithm for phenotype associated microarray analysis, with the eQTL mapping as an example of extension to other problems.

Finally, the data were analyzed using an R program, which can be downloaded from our laboratory website (www.statgen.ucr.edu) under the "Phenotype Associated Microarray" section. A sample dataset (a subset of the barley data) is also provided on the website for interested users to test the method. Users may customize the code to analyze their own data using the SEM algorithm.

ACKNOWLEDGEMENTS

Funding: This project was supported by the Agriculture and Food

Research Initiative (AFRI) of the USDA National Institute of Food

and Agriculture under the Plant Genome, Genetics and Breeding

Program 2007-35300-18285 to SX.

REFERENCES

Blalock, E.M., Geddes, J.W., Chen, K.C., Porter, N.M., Markesbery, W.R. and Landfield, P.W. (2004) Incipient Alzheimer's disease: microarray correlation analyses reveal major transcriptional and tumor suppressor responses, Proc Natl Acad Sci U S A, 101, 2173-2178.

Celeux, G. and Diebolt, J. (1986) The SEM algorithm: A probabilistic teacher algorithm derived from the EM algorithm for the mixture problem, Comput. Statist. Quart., 2, 73-82.

Cookson, W., Liang, L., Abecasis, G., Moffatt, M. and Lathrop, M. (2009) Mapping complex disease traits with global gene expression, Nat Rev Genet, 10, 184-194.

Cui, X., Hwang, J.T., Qiu, J., Blades, N.J. and Churchill, G.A. (2005) Improved statistical tests for differential gene expression by shrinking variance components estimates, Biostatistics, 6, 59-75.

Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, 39, 1-38.

Hayes, P.M., Liu, B.H., Knapp, S.J., Chen, F., Jones, B., Blake, T., Franckowiak, J., Rasmusson, D., Sorrells, M., Ullrich, S.E., Wesenberg, D. and Kleinhofs, A. (1993) Quantitative trait locus effects and environmental interaction in a sample of North American barley germ plasm, Theor Appl Genet, 87, 392-401.

Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U. and Speed, T.P. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data, Biostatistics, 4, 249-264.

Jia, Z. and Xu, S. (2005) Clustering expressed genes on the basis of their association with a quantitative phenotype, Genet Res, 86, 193-207.

Jia, Z. and Xu, S. (2007) Mapping quantitative trait loci for expression abundance, Genetics, 176, 611-623.

Kendziorski, C.M., Chen, M., Yuan, M., Lan, H. and Attie, A.D. (2006) Statistical methods for expression quantitative trait loci (eQTL) mapping, Biometrics, 62, 19-27.

Kerr, M.K., Martin, M. and Churchill, G.A. (2000) Analysis of variance for gene expression microarray data, J Comput Biol, 7, 819-837.

Kraft, P., Schadt, E., Aten, J. and Horvath, S. (2003) A family-based test for correlation between gene expression and trait values, Am J Hum Genet, 72, 1323-1330.

Lander, E.S. and Botstein, D. (1989) Mapping Mendelian Factors Underlying Quantitative Traits Using RFLP Linkage Maps, Genetics, 121, 185-199.

Luo, Z.W., Potokina, E., Druka, A., Wise, R., Waugh, R. and Kearsey, M.J. (2007) SFP genotyping from Affymetrix arrays is robust but largely detects cis-acting expression regulators, Genetics, 176, 789-800.

Morley, M., Molony, C.M., Weber, T.M., Devlin, J.L., Ewens, K.G., Spielman, R.S. and Cheung, V.G. (2004) Genetic analysis of genome-wide variation in human gene expression, Nature, 430, 743-747.

Potokina, E., Caspers, M., Prasad, M., Kota, R., Zhang, H., Sreenivasulu, N., Wang, M. and Graner, A. (2004) Functional association between malting quality trait components and cDNA array based expression patterns in barley (Hordeum vulgare L.), Molecular Breeding, 14, 153-170.

Qu, Y. and Xu, S. (2006) Quantitative trait associated microarray gene expression data analysis, Mol Biol Evol, 23, 1558-1573.

Quackenbush, J. (2001) Computational analysis of microarray data, Nat Rev Genet, 2, 418-427.

Wernisch, L., Kendall, S.L., Soneji, S., Wietzorrek, A., Parish, T., Hinds, J., Butcher, P.D. and Stoker, N.G. (2003) Analysis of whole-genome microarray replicates using mixed models, Bioinformatics, 19, 53-61.

Wolfinger, R.D., Gibson, G., Wolfinger, E.D., Bennett, L., Hamadeh, H., Bushel, P., Afshari, C. and Paules, R.S. (2001) Assessing gene significance from cDNA microarray expression data via mixed models, J Comput Biol, 8, 625-637.


gender effect.problems have been encountered for this method. 2009). The cofactors are not something of interest themselves. Σ1 ) + (1 − δ j ) N (0. γ j ~ δ j N (0. 2011 β j ~ N (µβ . A gene classified into cluster one is claimed to be associated with the traits while genes classified into cluster zero are not associated with the traits.gov/ggpages/SxM/phenotypes. They are included in the model to reduce or eliminate the interference on the association between Y and Z. Let X be a n × p matrix for some factors not directly relevant to the quantitative traits. one (cluster 1) being associated with the phenotypes and the other (cluster 0) not associated with the phenotype. Detailed description of the experiment can be found from the original study (Hayes. et al. Σ1 is an unknown positive definite variance matrix. diastatic power. Σ1 . The residual error ε j is an n × 1 vector with an assumed N (0. n . et al. L ..uk/microarrayas/aer/entry with accession number: E-TABM-112. The phenotypic values of each trait were rescaled so that the range of each trait is between -1 and +1. We now have three sources of data. In this study. heading date. et al. CA). π } (5) We are also interested in ρ j . 1986) and its application to both eQTL mapping and phenotype associated microarray data analysis. One is the identifiability problem where the cluster labels can be exchanged among different clusters. This ad hoc modification of the EM algorithm lacks strong theoretical foundation and cannot guarantee to produce the optimal result. p = q = 1 and X is a vector of unity with a dimensionality of n × 1 .. These two problems can be avoided by introducing a small noise to the cluster means in every iterative step of the EM algorithm (Qu and Xu. The eight traits are α -amylase. 2003) implemented in the GeneSpring GX 11 software package (Agilent Technologies. Given δ j . be an indicator variable for the cluster membership of gene j. 
The expressed levels of gene j can be expressed using the following linear model. We used this new method to detect expressed genes that are associated with multiple quantitative traits of barley.. (1993) and downloadable from the following website: http://wheat. γ j is a q × 1 vector for the regression coefficients of gene j on all the q traits.usda. For the other cluster. but may affect the gene expressions. This Gaussian mixture prior divides all the genes into two clusters. respectively.ebi. The gene expression data were published by Luo et al. Iσ 2 ) distribution. X min and X max are the minimum and maximum values of the phenotypic value.pw. Let δ j ~ Burnoulli( ρ j ) . where n is the sample size (number of DH lines). and Z k is the rescaled phenotypic value for k = 1. The number of replicated measurements ranged from 6 to 16 depending on different traits. Trait specific gene networks can be inferred by studying the genetic loci controlling the expressed genes only associated with the trait under investigation.. age effect and so on. Both the single trait association and multiple trait joint association analyses were conducted for all the eight traits using the average trait values across all environments. The experiment involved 150 double haploid (DH) lines derived from the cross of two spring barley varieties. Σ0 = ω I q is a known diagonal matrix with a common ω = 10 −8 across all the diagonal elements. et al. (2007) and downloadable from the ArrayExpress: http://www. The actual parameters involved in the problem are denoted by θ = {µ β . the posterior probability of gene j being associated with the traits. 2. In other words. 
It is defined as (1) δj = 1 if j belongs to cluster 1 0 if j belongs to cluster 0 (6) where X k is the original phenotypic value for the kth line..1 MATERIALS AND METHODS Experimental data where µβ is a p × 1 vector of mean and Σ β is an unknown p × p positive definite variance matrix.2 Linear Model Denote the microarray data by a data matrix Y with n rows and m columns. The phenotypic values of eight quantitative traits of barley were published by Hayes et al. The prior distribution of γ j is a Gaussian mixture with two components. (1) Y the microarray data. Σ0 ) (7) .. The DH plants are indeed independent samples from the line cross of the barley experiment. we proposed an alternative method with a rigorous theoretical basis to solve the problem.. Santa Clara.ac. 1993). Σ β ) (3) 2 2. Σ 0 is a known positive definite matrix with values close to zero and the value can be changed according to the investigator's preference. A real dataset collected in the North American Barley Genome Project (NABGP) is used for the demonstration.html. i. for example.. This new method is then compared with the existing EM algorithm (Jia and Xu. ArrayExpress also provides the preprocessed dataset without log transformation. Finding the genetic loci controlling the expressions can help identify gene regulation networks (Cookson. 2005) to demonstrate its superiority. 2004). The phenotypes of the traits were measured in different environments (locations and years). and (3) X the cofactors not directly relevant to the association study. Let y j be the jth column of matrix Y.org at Oklahoma State University (GWLA) on April 21. we focus on a new method called the stochastic expectation and maximization (SEM) algorithm (Celeux and Diebolt.. The gene expression levels are quantitative traits (Morley. Variable π ( 0 < π < 1 ) is a prior probability that a gene randomly selected from the pool belongs to cluster 1. 
The relationship between the posterior ρ j and the prior π will be presented later. plant height. lodging and malt extract. The other problem occurs when two or more clusters often have the same cluster mean. The original (raw) microarray data were normalized using the RMA algorithm (Irizarry. σ 2 .e. Σ 0 ) (4) In the above Gaussian mixture. The prior distribution for β j is Downloaded from bioinformatics. All the 150 DH lines were microarrayed for 22840 transcripts.oxfordjournals. for 0 < ρ j < 1 . The formula for the rescaling is Zk = 2 X k − X min −1 X max − X min γ j ~ π N (0. Let Z be an n × q matrix for the rescaled phenotypic values of q quantitative traits measured from all n individuals. Morex and Steptoe. where n is the number of individuals subject to the microarray analysis and m is the number of microarrayed genes. Σ β . y j = X β j + Zγ j + ε j (2) where β j is a p × 1 vector for the effects of cofactors. 2006). m ). (2) Z the phenotypic data. This assumption is very common in the linear model analysis. We now assign prior distributions to the parameters included in the linear model. the genetic effect γ j has the following distribution. grain yield. In the special case of one phenotype with no cofactors. an n × 1 vector for the expression levels of gene j for all the n subjects ( j = 1. grain protein. In this study. Σ1 ) + (1 − π ) N (0.

2011 The MLE of parameters are obtained through a two-step approach. an algorithm called the stochastic expectation and maximization (SEM) algorithm (Celeux and Diebolt. δ j = 0) = N ( y j | X µβ .76% 3. δ j ) j =1 m (10) where Downloaded from bioinformatics. Under the random model framework.43% 2. the updated proportion of genes coming from cluster 1 is π= (8) 1 m ∑δ j m j =1 (17) The population mean µβ is updated using (9) µ β = ∑ X TV j−1 X ∑ X TV j−1 y j j =1 j =1 m −1 m (18) and the notation N { y | a.85% 2. et al. the log likelihood function for the entire dataset is ) The variance-covariance matrix of β j is denoted by Σ β and updated using Σβ = 1 m 1 m ∑ E (β j β T ) = m ∑ E (β j ) E (β Tj ) + var(β j ) j m j =1 j =1 (19) L(θ | δ ) = ∑ ln p ( y j | θ . 1977) and thus we only provide the EM steps without proof. δ = {δ j } . we can proceed with the EM algorithm described below. Once δ = {δ j } are sampled for all genes in the stochastic process. The two steps are repeated iteratively until a stationary distribution is reached for each parameter. However. not in parallel.org at Oklahoma State University (GWLA) on April 21.3 Stochastic sampling where The density of y j defined in equation (8) can be split into the following two densities. X Σ β X T + Z Θ j Z T + Iσ 2 } where Θ j = δ j Σ1 + (1 − δ j )Σ 0 V j = X Σβ X T + Z Θ j Z T + Iσ 2 (16) We now provide the formulas for updating each parameter using the EM algorithm.76% 0. πm ∑δ j =1 m j E (γ j γ T ) = j 1 πm ∑δ j =1 m j E (γ j ) E (γ T ) + var(γ j ) j (22) p1 ( y j | θ ) = p( y j | θ . Proportion 1. y j ) = π p1 ( y j | θ ) π p1 ( y j | θ ) + (1 − π ) p0 ( y j | θ ) (13) Because δ j is a Bernoulli variable. Therefore. b} stands for the normal density of variable y with mean a and variance b. Also note that the parameter vector does not include β j and γ j . 1977). (14) Trait SEM Algorithm Number Proportion 2. 
the density of y j given the cluster membership is normal with mean and variance shown in the following normal density p ( y j | θ . et al.04% 3.00% 1. β j and γ j are treated as missing values.89% 2.02% 0.13% 0.85% 1. This step is called the stochastic step. δ j = 1) = N ( y j | X µβ . If they are integrated out. This step is called the EM step.61% 0.oxfordjournals. The first step is to estimate the parameters by maximizing the above log likelihood function given δ = {δ j } through the regular expectation-maximization (EM) algorithm (Dempster. the corresponding matrix Σ 0 is a constant. X Σβ X + Z Σ1Z + Iσ ) T T 2 (11) and E(γ j ) = Σ1Z TV j−1 ( y j − X µβ ) (23) and p0 ( y j | θ ) = p( y j | θ . The indicator variable δ j tells which Gaussian component γ j belongs to.82% 2. The updated equation for Σ1 is Σ1 = 1 2. The stochastic step and the EM step are performed sequentially.Note that this conditional distribution is not a Gaussian mixture because the membership is already known. Denote the variance covariance matrix of γ j conditional on δ j by Θ j = var(γ j | δ j ) = δ j Σ1 + (1 − δ j )Σ 0 (15) Let us define .95% Alpha-amylase Diastatic power Grain protein Grain yield Heading date Height Lodging Malt extract 644 467 888 457 401 784 650 526 2. Given δ j . X Σβ X T + Z Σ0 Z T + Iσ 2 ) The posterior probability that δ j = 1 is (12) var(γ j ) = Σ1 − Σ1Z T Vj−1Z Σ1 (24) ρ j = E(δ j | θ . it is sampled from p(δ j ) = Bernoulli(δ j | ρ j ) Table 1. ) ) E(β j ) = Σ β X TV j−1 ( y j − X µβ ) and (20) var(β j ) = Σβ − Σβ X TV j−1 X Σβ (21) Given δ j .30% EM Algorithm Number 233 195 257 139 166 264 173 216 distribution.4 EM algorithm The EM algorithm for the Gaussian mixture model is standard (Dempster. The second step is to stochastically simulate δ = {δ j } from its conditional posterior distribution. 
Numbers of genes associated with individual traits in the barley microarray data analysis using the two-cluster SEM algorithm and the three-cluster EM algorithm...16% 0.73% 1. δ j ) = N { y j | X µ β . we only need to update Σ1 using all γ j that come from cluster one. Given the cluster membership. the unknown variance-covariance matrix of γ j is Θ j = δ j Σ1 + (1 − δ j )Σ 0 . 1986).

The problem with the eQTL analysis is that the large number of expression traits make eQTL mapping very difficult. may be used. The proportion is defined by the investigator in an arbitrary manner. Note again that the stochastic and EM steps are performed sequentially and repeated many times until a stationary distribution for each parameter is reached. one for each marker. We now have to analyze the data Q times. This section describes the application of the SEM algorithm to eQTL mapping. We can monitor the converging process for each parameter. So far all parameter have been updated.org at Oklahoma State University (GWLA) on April 21. the observations with the posterior sample for the SEM algorithm are not correlated. The SEM algorithm remains the same as before except that we must analyze the data Q times (one for each marker). The number of transcripts (genes) measured in the experiment was m = 22840 . We can now combine the stochastic steps with the EM steps to compete the analysis.9 to declare significant association. σ 2(t ) . we start to collect the posterior sample for θ . The approach is similar to the interval mapping in which one marker is studied at a time (Lander and Botstein. (29) 2. we simply chose an arbitrary cut-off value of ρ j ≥ 0. These parameters {π k . T = 100 seems to be ( ( sufficient. Downloaded from bioinformatics. The top proportion of genes is selected as candidate genes associated with the phenotypes. they converge to a stationary distribution due to the stochastic process of δ = {δ j } . Since no other cofactor existed except the intercept. e. gene j is associated with the quantitative ˆ traits. the posterior probability of gene j associated with marker k is denoted by ρ jk . These ρ j ’s are ranked in a descending order.6 Expression quantitative trait locus (eQTL) mapping The gene expression levels can be treated as quantitative traits and QTL mapping can be performed on each transcript. 
The EM algorithm of Jia and Xu (2005) classified each gene into one of three clusters. false discovery rate (FDR). .e. Q . The M-step consists of calculating θ = {µ β . we now replace the phenotype Z by the numerically coded genotype of marker k denoted by Z k so that Fig. Each of the five parameters θ = {µ β Σ β Σ1 σ 2 π } is of single dimension. E (γ j ).5 SEM estimate where A1 A1 is first genotype and A1 A1 is the second genotype of marker k. so called eQTL mapping (Kendziorski. The barley population under study is a doubled haploid population and thus only two genotypes exist for each locus. et al. denoted by T. σ 2 . Σ(βt ) . 1989). ρ jk } are important to the eQTL mapping.. Σ β . 2011 3 3. rather. π (t ) } be the t-th observation in the posterior sample (after convergence). it is ˆ ρj = 1 T (t ) 1 T ( t ) ∑ ρ j ≈ T ∑δ j T t =1 t =1 (27) that is most important because it represents the posterior probability that gene j belongs to cluster 1. Unlike the posterior sample for the fully Bayesian analysis. Consider Q markers with known map positions and the genotypes for all the n individuals. rather. The three 2. Let θ ( t ) = {µ βt ) . q = 1 for each trait analysis. denoted by a vector ) The SEM algorithm differs from the classical EM algorithm in that the parameters do not converge to some fixed values. the Z matrix for each trait analysis was an n × 1 = 150 × 1 vector for the average phenotypic values (rescaled between -1 and +1). the estimate parameter vector of the SEM algorithm is ˆ θ= ) π = [π 1 L π Q ] (30) In addition. Some objective criteria. Since each trait was measured from multiple environments. Here. Using the same model as given in equation (2). For the single trait analysis. does not have to be large. 
The SEM algorithm developed for the quantitative trait associated microarray data analysis can be extended to eQTL mapping with limited modification.1 RESULTS Single trait association 1 T (t ) ∑θ T t =1 (26) The most important quantity of the SEM analysis is not the entire vector of θ . π } . Therefore.L . Previously. Σ1t ) . we have a single π for the proportion of genes associated with the phenotype. var( β j ) and var(γ j ).The residual error variance is updated using y j = X β j + Z k γ jk + ε j (28) σ2 = T 1 ∑ ( y j − X µ β ) ( y j − X µ β − XE ( β j ) − δ j ZE (γ j ) ) mn j =1 m (25) where γ jk is the QTL effect for transcript j at marker k. We now have Q such π 's to indicate the proportions of transcripts associated with all markers. Once all parameters have converged. i.. Both the (two-cluster) SEM algorithm developed here and the three-cluster EM algorithm of Jia and Xu (2005) were used for the single trait association study.g. ROC curve comparing the SEM and EM algorithms in the simulation study. we took the average of the phenotypic values across the environments as the phenotypic values that entered the linear model for analysis. but it is not the focus of this study. β j is a scalar with dimension p = 1 . The posterior sample size. 2006). Σ1 . We will study the association of all the m transcripts simultaneously with the kth marker for k = 1. The entire eQTL mapping will take Q separate analyses. The Z k variable is defined as +1 for A1 A1 Zk = −1 for A2 A2 The E-step of the EM algorithm consists of calculating E ( β j )..oxfordjournals. 1.

but the SEM model performs better and has higher sensitivity.75 1.71 0.067 for the proportion and residual error variance.88 1.77 -0. 3.45 -4. The result of the SEM algorithm shows that more genes are associated with the height and grain protein than other traits. µβ is an 8 × 1 vector.72 -2.35 -0.10 2.62 1. This is equivalent to the criterion of ρ j ≥ 0.0001 <0. the SEM model can identify most associated genes in both cases and has high sensitivity and specificity.12 1. we detected a total of 1646 genes that are jointly associated with the eight traits.936 481. the EM model is able to identify most of the associated genes.003 -1. We choose the grain yield as an example to demonstrate the converging process of the estimated parameters. When parameters are precisely estimated. which indicates that the SEM method also has good specificity in real data analysis.0001 <0. The estimated parameters for all the eight traits obtained from the separate SEM analyses are listed in Table S1.28 0.722 348. the sensitivity is quite low. The estimated regression coefficients for the top ten genes jointly associated with all eight traits.2% of all the 22840 genes included in the analysis.59 1. We also performed permutation for the real data analysis by permuting the phenotypes (grain yield) to test the specificity of the two methods. and some have negative associations with the traits. The estimated µβ ˆ ˆ and Σ β are µ β = 7.42 -2.18 -0. Only the SEM algorithm was used in this analysis because the EM algorithm of Jia and Xu (2005) cannot be applied to multiple trait association study. If parameters converge to some local optimal values that are different from the true values.99 1.23 1. The list of associated genes is given in the supplemental data (Table S3) for interested readers.75 -0. We also presented the predicted regression coefficients obtained with the SEM analysis and the scatter plots of the observed genes expressions for the genes associated with each trait (Figure S1). 
remain scalars.40 0. The ROC curve (Figure 1) shows that both the SEM and the EM models have high sensitivity and specificity. The criterion of detection for each gene was that the posterior probability of being in cluster 2 (the neutral cluster) was less than 0. The Z matrix is an 150 × 8 matrix for all the eight traits measured from all the 150 DH lines.61 -2. which are similar to the EM results when parameters were estimated well. After the parameters converged to their stationary distributions. Contig23592_at and Contig3295_at.0001 <0. 2011 clusters shared a common variance of the regression coefficient but with three different means.80 1.50 -0. Some genes have positive associations with the traits.13 2.82 -0.0001 <0.998 492.g.58 γ8 -0.55 0.21 0.314 396.35 F-test 945. From the trace plots.089 and σ 2 = 0. Genes classified into either the first or the third cluster are differentially expressed genes.88 γ5 -0.45 γ2 0.91 1. L. the parameters fluctuated around the mean values and the means are the SEM estimates of the parameters.0001 <0.0001 <0.9 in the SEM analysis. γ 8 . Rank 1 2 3 4 5 6 7 8 9 10 Probe set ID Contig1132_s_at Contig124_at Contig2163_at Contig11524_at Contig2279_at Contig4769_at Contig23772_s_at HVSMEf0020F06r2_at Contig11126_at Contig4526_at γ1 -2.86 0.689 543.44 -0.38 γ7 -0.547 421.93 -1.88 1. The partial regression coefficients of gene on the eight traits are denoted by γ 1 .g.83 0. In order to compare the performance of the SEM model and the EM model. After 100 permutations by the grain yield. zero for the second cluster and positive value for the third cluster.2 Multiple traits association study For the joint association study of eight traits.62 -0.33 1.18 -0.45 0.Lists of all the detected genes associated with the traits and gene annotations are given in Table S2 (Sheet1-Sheet8) for interested readers. The numbers of genes associated with each of the eight traits are listed in Table 1 for the SEM and the EM algorithms. 
It is well known that the EM algorithm tends to converge to some sub-optimal values which are close to the initial values.09 0.67 1.55 0. The rest of the parameters.oxfordjournals.. The heading date trait has the least number of associated genes.55 -0. π and σ 2 .744 581.9 to declare significance association.05 0. e.0001 <0.51 0.68 -0.38 -0..99 2.41 2.82 0.32 -2. Using the same criterion of ρ j ≥ 0. respectively.1. The trace plots (parameters against the iteration) are depicted in Figure 2 for all the five parameters and the regression coefficient of gene AF250937_s_at.92 0.74 -0.9).81 0.0001 <0. respectively.53 0.10 -1.21 -1. We can see that the SEM algorithm consistently detected more genes than the EM algorithm.org at Oklahoma State University (GWLA) on April 21.999 p-value <0.16 0. The three cluster means were restricted with negative value for the first cluster.62 -1.17 γ3 3.32 -0.63 4.89 2. The ˆ ˆ estimated parameters are π = 0.94 0. However.061 347.76 0.30 -0.67 -0.16 -0. the dimensionality of the parameters increased to p = 8 and q = 8 . e.62 -1. The estimated variance matrix of the differentially expressed cluster is ) .062 435. AF250937_s_at and Contig6445_at. Σ β is a 8 × 8 matrix and Σ1 is an 8 × 8 matrix.978 and Σ β = 4.0001 Downloaded from bioinformatics.15 1. accounting for 7.12 0.27 γ6 0.55 0.99 -0. We set different parameters and simulated several datasets based on the EM model to test the convergence of parameters. we can see that all parameters have converged in about 10 iterations.25 0.652 . Therefore.05 -0. we carried out a simulation experiment based on the SEM model by using the estimated parameters obtained from the barley data (grain yield) analysis. we took average of the posterior probabilities generated from the 100 permutations for each gene and we found that no gene had probabilities exceeding the cut-off point (0.0001 <0.11 γ4 -1.Table 2.10 0.

[Estimated 8 × 8 variance matrix $\hat{\Sigma}_1$: the individual entries are not recoverable from the source layout.]

Note that the proportion of genes detected (7.2%) in the experiment is not the same as $\hat{\pi} = 8.9\%$, because the former depends on the cut-off point ($\rho_j \ge 0.9$) used for gene declaration, whereas the latter represents the probability that a randomly selected gene belongs to the associated cluster and does not depend on the cut-off point.

The estimated regression coefficients for the top ten genes jointly associated with all traits are listed in Table 2 along with the F-test statistics and the p-values. The F-test statistic was calculated using

$$\hat{F}_j = \frac{1}{q}\, \hat{\gamma}_j^{T} \left( \hat{\Sigma}_1 - \hat{\Sigma}_1 Z^{T} \hat{V}_j^{-1} Z \hat{\Sigma}_1 \right)^{-1} \hat{\gamma}_j \tag{31}$$

where

$$\hat{\gamma}_j = \hat{\Sigma}_1 Z^{T} \hat{V}_j^{-1} \left( y_j - X \hat{\mu}_\beta \right) \tag{32}$$

and

$$\hat{V}_j = X \hat{\Sigma}_\beta X^{T} + Z \hat{\Sigma}_1 Z^{T} + I \hat{\sigma}^2 \tag{33}$$

The p-value was calculated using

$$p\text{-value} = 1 - F_{8,\infty}(\hat{F}_j) \tag{34}$$

where $F_{8,\infty}(x)$ is the cumulative distribution function of the central F distribution with numerator degrees of freedom q = 8 and denominator degrees of freedom $\infty$. Interestingly, all genes associated with the 8 traits in the single trait analysis were also detected in the joint analysis, demonstrating the high efficiency of the joint analysis.

In the simulation study, we used parameters estimated from the barley data to simulate 5000 genes (100 individuals) based on the SEM model. The SEM algorithm identified all associated genes, which indicated that SEM does have high sensitivity and specificity. By randomly permuting the eight traits, we tested the specificity of SEM in real data analysis. Among the total of 22840 genes, 521 genes still had significant effects in the permuted data. The false positive rate is 521/22840 = 0.022811, which is reasonably low.

Fig. 2. The SEM convergence processes of five parameters for the grain yield trait and the regression coefficient ($\gamma$) of gene AF250937_s_at (the gene having strong association with grain yield).

3.3 Expression quantitative trait locus (eQTL) mapping

The purpose of the eQTL mapping is to identify the locations of the genome that control the expressions of the transcripts. The barley genome contains seven chromosomes. The total number of SNP markers investigated was Q = 495 with an average marker interval less than 2 centiMorgan, covering the entire barley genome.

We used the results of the previous analysis to reduce the number of transcripts for the eQTL analysis. Recall that Table 1 gives the number of transcripts associated with each of the eight traits. The eQTL mapping for each trait was only conducted on the identified transcripts. For example, out of the 22840 transcripts, we identified 888 genes that are associated with the grain protein trait. The eQTL mapping was then targeted to these 888 transcripts. This has dramatically reduced the number of transcripts for eQTL mapping related to the grain protein trait. The entire eQTL analysis took about nine hours on an Intel Core Duo CPU P8400 (2.27 GHz) in an HP Pavilion dv4 computer.

The eQTL analysis produced too much information to present in full. Figure 3 (a-d) shows the plots of the proportions of transcripts associated with markers for four of the eight traits. Figure S2 (e-h) shows the plots of the remaining four traits. Here, we used grain protein and yield traits as examples to describe the plots. For the grain protein trait, three chromosomes (2, 3, 5) seem to control more genes than other chromosomes. For example, the central region of chromosome 2 contains almost 50% of the 888 transcripts. This region is considered as a hot spot. For the yield trait, chromosome 3 is the only one that contains more transcripts than the others. The hot spot is located in the middle of the chromosome and it controls the expression of about 80% of the 457 transcripts.
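The p-value in equation (34) and the permutation false positive rate quoted in the text can be checked with a few lines of Python. The sketch below relies on a standard fact rather than on anything specific to the authors' R program: when the denominator degrees of freedom go to infinity, q times the F statistic follows a chi-square distribution with q degrees of freedom, and for even q the chi-square upper tail has a closed form. The function names are hypothetical.

```python
import math


def f_inf_pvalue(f_stat, q=8):
    """Upper-tail probability 1 - F_{q,inf}(f_stat) for even q.

    Uses P(chi2_q > x) = exp(-x/2) * sum_{k=0}^{q/2-1} (x/2)^k / k!
    with x = q * f_stat.
    """
    if q <= 0 or q % 2 != 0:
        raise ValueError("this closed form needs a positive even q")
    if f_stat < 0:
        raise ValueError("an F statistic cannot be negative")
    half = q * f_stat / 2.0
    term = 1.0   # current term (x/2)^k / k!, starting at k = 0
    total = 1.0  # running sum of the series
    for k in range(1, q // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def false_positive_rate(n_flagged, n_total):
    """Fraction of genes still declared significant after permutation."""
    return n_flagged / n_total


# Counts quoted in the text: 521 of 22840 genes in the permuted data.
rate = false_positive_rate(521, 22840)
```

For statistics in the hundreds, such as those in Table 2, the exponential factor underflows in double precision and the function returns 0.0, so such values can only be reported as smaller than some threshold (for example <0.0001).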

Fig. 3. Proportions of transcripts associated with markers for the first four traits (Amylase, Diastatic power, Grain protein and Heading date). The chromosomes are separated by the dotted vertical reference lines.

Results of the eQTL analysis are provided in Table S4 (Sheet1-Sheet8) and Table S5 (Sheet1-Sheet8). The supplemental tables can serve as a reference database for barley biologists to further study the gene networks for the eight quantitative traits.

4 DISCUSSION

We adopted a new statistical method (SEM) for quantitative trait associated microarray data analysis. We used the method to analyze 22840 microarrayed genes associated with eight quantitative traits in barley. Many genes have been identified to be associated with these traits. These genes are provided in Table S2 (Sheet1-Sheet8) and Table S3. The actual functions of these genes in barley were not known prior to this study. The functions of some genes are known in rice. For example, according to BLASTX results, transcript 22767 (rbah35f01_s_at) codes for the Cyclopropane-fatty-acylphospholipid synthase in rice. This gene is strongly associated with grain protein in barley, with an F-test statistic of 1436.718 and a p-value of zero. This dataset provides much information for barley biologists to further study these genes.

The same SEM algorithm for phenotype associated microarray data analysis has been applied to eQTL mapping with virtually no modification. For example, among the 22840 genes, 888 are related to grain protein content. Other information about this eQTL analysis is provided in Table S4 (Sheet1-Sheet8) and Table S5 (Sheet1-Sheet8). The additional information includes the posterior probability of each transcript associated with each marker (Table S4) and the eQTL effects for each transcript across the genome (Table S5). For example, marker ABC152D (83.1 cM) on chromosome 2 controls the expression of about 50% of the transcripts associated with grain protein. Transcripts simultaneously associated with one marker belong to the same network (or pathway) because they are all controlled by the segregation of the same locus. These transcripts are presumed to be in the same pathway for the development of grain protein. This dataset will help barley biologists to infer gene networks for these quantitative traits.

The eQTL mapping conducted was still an interval mapping approach where one marker is analyzed at a time. However, all the transcripts were analyzed simultaneously. This is already a significant improvement over the traditional interval mapping for QTL where one transcript was analyzed at a time (Kendziorski et al., 2006).
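The marker-level summary behind Figure 3 and the hot-spot discussion (the share of a trait's transcripts attributed to one marker, such as ABC152D for grain protein) can be sketched as below. This is a minimal illustration, assuming a table of posterior probabilities like the one described for Table S4; the function names, the reuse of the 0.9 cut-off for marker association, and the 0.5 hot-spot threshold are assumptions made for the example, not values taken from the authors' program.

```python
def marker_proportions(post, cutoff=0.9):
    """Proportion of transcripts associated with each marker.

    post[t][m] is the posterior probability that transcript t is
    associated with marker m; a transcript counts for a marker when
    that probability is at least `cutoff` (assumed rule).
    """
    n_transcripts = len(post)
    n_markers = len(post[0])
    return [
        sum(1 for t in range(n_transcripts) if post[t][m] >= cutoff) / n_transcripts
        for m in range(n_markers)
    ]


def hot_spots(proportions, min_proportion=0.5):
    """Indices of markers controlling at least `min_proportion` of transcripts
    (hypothetical threshold)."""
    return [m for m, p in enumerate(proportions) if p >= min_proportion]


# Toy table: four hypothetical transcripts scored against two markers.
toy_post = [[0.95, 0.10],
            [0.92, 0.30],
            [0.20, 0.99],
            [0.91, 0.00]]
toy_props = marker_proportions(toy_post)
toy_hot = hot_spots(toy_props)
```

Plotting such proportions against marker position, chromosome by chromosome, would give a profile of the kind shown in Figure 3, with hot spots appearing as tall peaks.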

Theoretically, the method can analyze all markers simultaneously using a single model. This is the all-transcript-and-all-marker model. The theory is identical to the joint analysis of all the eight traits (a matrix). Practically, however, it is difficult to handle the large matrix repeatedly in the SEM algorithm. The MCMC implemented Bayesian eQTL mapping (Jia and Xu, 2007) can be adopted here, but it is a sampling based method and is time consuming in terms of computing time. Further study on this simultaneous analysis is needed for the SEM algorithm.

This study focused on the SEM algorithm for phenotype associated microarray analysis with the eQTL mapping as an example of extension to other problems. Finally, the data were analyzed using an R program, which can be downloaded from our laboratory website (www.statgen.ucr.edu) under the "Phenotype Associated Microarray" section. A sample dataset (subset of the barley data) is also provided on the website for interested users to test the method. Users may customize the code to analyze their own data using the SEM algorithm.

ACKNOWLEDGEMENTS

Funding: This project was supported by the Agriculture and Food Research Initiative (AFRI) of the USDA National Institute of Food and Agriculture under the Plant Genome, Genetics and Breeding Program 2007-35300-18285 to SX.

REFERENCES

Blalock, E.M., Geddes, J.W., Chen, K.C., Porter, N.M., Markesbery, W.R., and Landfield, P.W. (2004) Incipient Alzheimer's disease: microarray correlation analyses reveal major transcriptional and tumor suppressor responses. Proc Natl Acad Sci U S A, 101, 2173-2178.
Celeux, G. and Diebolt, J. (1986) The SEM algorithm: A probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput. Statist. Quart., 2, 73-82.
Cookson, W., Liang, L., Abecasis, G., Moffatt, M., and Lathrop, M. (2009) Mapping complex disease traits with global gene expression. Nat Rev Genet, 10, 184-194.
Cui, X., Hwang, J.T., Qiu, J., Blades, N.J., and Churchill, G.A. (2005) Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics, 6, 59-75.
Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977) Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B, 39, 1-38.
Hayes, P.M., Liu, B.H., Knapp, S.J., Chen, F., Jones, B., Blake, T., Franckowiak, J., Rasmusson, D., Sorrells, M., Ullrich, S.E., Wesenberg, D., and Kleinhofs, A. (1993) Quantitative trait locus effects and environmental interaction in a sample of North American barley germ plasm. Theor Appl Genet, 87, 392-401.
Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U., and Speed, T.P. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, 249-264.
Jia, Z. and Xu, S. (2005) Clustering expressed genes on the basis of their association with a quantitative phenotype. Genet Res, 86, 193-207.
Jia, Z. and Xu, S. (2007) Mapping quantitative trait loci for expression abundance. Genetics, 176, 611-623.
Kendziorski, C.M., Chen, M., Yuan, M., Lan, H., and Attie, A.D. (2006) Statistical methods for expression quantitative trait loci (eQTL) mapping. Biometrics, 62, 19-27.
Kerr, M.K., Martin, M., and Churchill, G.A. (2000) Analysis of variance for gene expression microarray data. J Comput Biol, 7, 819-837.
Kraft, P., Schadt, E., Aten, J., and Horvath, S. (2003) A family-based test for correlation between gene expression and trait values. Am J Hum Genet, 72, 1323-1330.
Lander, E.S. and Botstein, D. (1989) Mapping Mendelian Factors Underlying Quantitative Traits Using RFLP Linkage Maps. Genetics, 121, 185-199.
Luo, Z.W., Potokina, E., Druka, A., Wise, R., Waugh, R., and Kearsey, M.J. (2007) SFP genotyping from affymetrix arrays is robust but largely detects cis-acting expression regulators. Genetics, 176, 789-800.
Morley, M., Molony, C.M., Weber, T.M., Devlin, J.L., Ewens, K.G., Spielman, R.S., and Cheung, V.G. (2004) Genetic analysis of genome-wide variation in human gene expression. Nature, 430, 743-747.
Potokina, E., Caspers, M., Prasad, M., Kota, R., Zhang, H., Sreenivasulu, N., Wang, M., and Graner, A. (2004) Functional association between malting quality trait components and cDNA array based expression patterns in barley (Hordeum vulgare L.). Molecular Breeding, 14, 153-170.
Qu, Y. and Xu, S. (2006) Quantitative trait associated microarray gene expression data analysis. Mol Biol Evol, 23, 1558-1573.
Quackenbush, J. (2001) Computational analysis of microarray data. Nat Rev Genet, 2, 418-427.
Wernisch, L., Kendall, S.L., Soneji, S., Wietzorrek, A., Parish, T., Hinds, J., Butcher, P.D., and Stoker, N.G. (2003) Analysis of whole-genome microarray replicates using mixed models. Bioinformatics, 19, 53-61.
Wolfinger, R.D., Gibson, G., Wolfinger, E.D., Bennett, L., Hamadeh, H., Bushel, P., Afshari, C., and Paules, R.S. (2001) Assessing gene significance from cDNA microarray expression data via mixed models. J Comput Biol, 8, 625-637.