Integration of Biological Knowledge in the
Mixture-of-Gaussians Analysis of Genomic Clustering

S. Sfakianakis (1,2), M. Zervakis (2), M. Tsiknakis (1), D. Kafetzopoulos (3)

(1) Institute of Computer Science, Foundation for Research and Technology - Hellas
(2) Department of Electronic and Computer Engineering, Technical University of Crete
(3) Institute of Molecular Biology, Foundation for Research and Technology - Hellas

November 3, 2010
Sfakianakis et al. (ECE) Integration of Biological Knowledge.. November 3, 2010 1 / 15
Introduction
Objective
- Early work in bioinformatics focused on identifying a small number of informative genes that discriminate between phenotypes or experimental conditions
- We now see a shift toward more integrated analysis of gene expression data, in the direction of systems biology
- In this work we use existing biological knowledge to guide the analysis of gene expression data
Methods Finite Models and the EM
Finite Mixture Models
- Mixture models [McLachlan, 2000] provide a probabilistic framework both for building complex probability distributions (e.g. for density estimation) as linear combinations of simpler ones and for clustering data (unsupervised learning)
- They define a “generative” model: a sample $\mathbf{x}_j$ is assumed to have been generated by one of the $g$ clusters or groups:

$$f(\mathbf{x}_j; \boldsymbol{\Theta}) = \sum_{i=1}^{g} \pi_i \, f_i(\mathbf{x}_j; \boldsymbol{\theta}_i) \tag{1}$$

  where $\boldsymbol{\Theta}$ collects the unknown parameters: the $\pi_i$, usually referred to as “mixing coefficients”, and the $\boldsymbol{\theta}_i$, which are the parameters of the component densities $f_i$.
- The mixing coefficients satisfy $0 \le \pi_i \le 1$ and $\sum_{i=1}^{g} \pi_i = 1$.
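
The generative reading of Eq. (1) can be sketched as a two-stage draw: first pick a component with probability $\pi_i$, then sample from that component. The univariate Gaussian components and all numeric values below are assumptions chosen for illustration; the name `sample_mixture` is hypothetical.

```python
import random

def sample_mixture(n, weights, means, stds, seed=0):
    """Draw n samples from a 1-D mixture: choose component i with
    probability weights[i], then draw from that component."""
    rng = random.Random(seed)
    samples, labels = [], []
    for _ in range(n):
        # Stage 1: latent component index, distributed according to pi_i.
        i = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        # Stage 2: observation from the chosen component density f_i
        # (assumed Gaussian here).
        samples.append(rng.gauss(means[i], stds[i]))
        labels.append(i)
    return samples, labels

# Illustrative parameters only (g = 2 components).
samples, labels = sample_mixture(1000, [0.3, 0.7], [-2.0, 3.0], [1.0, 0.5])
```

In clustering the labels are hidden, and the task is to recover them from the samples alone.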
Gaussian Mixture Models
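- In Gaussian Mixture Models (GMM) each component probability distribution is Gaussian, with its own parameters:

$$f_i(\mathbf{x}_j; \boldsymbol{\theta}_i) = \mathcal{N}(\mathbf{x}_j; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) \equiv \frac{1}{\sqrt{(2\pi)^p \, |\boldsymbol{\Sigma}_i|}} \exp\!\left(-\frac{1}{2} (\mathbf{x}_j - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x}_j - \boldsymbol{\mu}_i)\right) \tag{2}$$

$$f(\mathbf{x}_j; \boldsymbol{\Theta}) = \sum_{i=1}^{g} \pi_i \, \mathcal{N}(\mathbf{x}_j; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) \tag{3}$$

- The parameters of a GMM can be computed with an efficient iterative algorithm called Expectation Maximization (EM) [Dempster, 1977]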
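
Eqs. (2) and (3) can be evaluated numerically as in the following minimal sketch, which assumes NumPy and samples stored as rows of an N x p array. The function names are hypothetical; working with the log-density and `slogdet` is an implementation choice made here for numerical stability, not something stated in the slides.

```python
import numpy as np

def gaussian_logpdf(X, mean, cov):
    """Log of N(x_j; mean, cov), Eq. (2), for every row x_j of X (N x p)."""
    p = X.shape[1]
    diff = X - mean
    # Mahalanobis terms via a linear solve rather than an explicit inverse.
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (p * np.log(2.0 * np.pi) + logdet + maha)

def gmm_density(X, weights, means, covs):
    """Mixture density of Eq. (3) for every row of X."""
    comp = np.stack(
        [np.exp(gaussian_logpdf(X, m, S)) for m, S in zip(means, covs)], axis=1
    )
    return comp @ np.asarray(weights)

# Illustrative check: a 2-D standard normal at the origin has density 1 / (2 pi).
origin = np.zeros((1, 2))
value = gmm_density(origin, [0.4, 0.6], [np.zeros(2)] * 2, [np.eye(2)] * 2)[0]
```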
EM for GMM
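- Initialize the model parameters $\boldsymbol{\mu}_i$, $\boldsymbol{\Sigma}_i$, $\pi_i$
- E-step. Compute the support (or “responsibility”) each sample provides to a given component density as the conditional probability

$$\tau_{ji} \equiv \Pr(z_{ji} = 1 \mid \mathbf{x}_j; \boldsymbol{\Theta}^{cur}) = \frac{\pi_i^{cur} \, \mathcal{N}(\mathbf{x}_j; \boldsymbol{\mu}_i^{cur}, \boldsymbol{\Sigma}_i^{cur})}{\sum_{c=1}^{g} \pi_c^{cur} \, \mathcal{N}(\mathbf{x}_j; \boldsymbol{\mu}_c^{cur}, \boldsymbol{\Sigma}_c^{cur})} \tag{4}$$

- M-step. Update the parameters using the responsibilities:

$$\begin{aligned}
\boldsymbol{\mu}_i^{new} &= \frac{\sum_{j=1}^{N} \tau_{ji} \, \mathbf{x}_j}{\sum_{j=1}^{N} \tau_{ji}} \\
\boldsymbol{\Sigma}_i^{new} &= \frac{\sum_{j=1}^{N} \tau_{ji} \, (\mathbf{x}_j - \boldsymbol{\mu}_i^{new})(\mathbf{x}_j - \boldsymbol{\mu}_i^{new})^T}{\sum_{j=1}^{N} \tau_{ji}} \\
\pi_i^{new} &= \frac{1}{N} \sum_{j=1}^{N} \tau_{ji}
\end{aligned} \tag{5}$$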
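
The E-step and M-step of Eqs. (4) and (5) can be sketched as follows, assuming NumPy. The synthetic two-cluster data, the initialization from two data points, and the fixed 20 iterations are illustrative assumptions; this is not the implementation used for the reported results.

```python
import numpy as np

def log_gauss(X, mean, cov):
    """Row-wise log N(x_j; mean, cov) for X of shape (N, p)."""
    p = X.shape[1]
    diff = X - mean
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (p * np.log(2.0 * np.pi) + logdet + maha)

def e_step(X, weights, means, covs):
    """Eq. (4): responsibilities tau[j, i], plus the current log-likelihood."""
    log_joint = np.stack(
        [np.log(w) + log_gauss(X, m, S) for w, m, S in zip(weights, means, covs)],
        axis=1,
    )
    # Normalize in the log domain to avoid underflow.
    log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
    return np.exp(log_joint - log_norm), float(log_norm.sum())

def m_step(X, tau):
    """Eq. (5): weighted means, covariances and mixing coefficients."""
    n_eff = tau.sum(axis=0)                      # sum_j tau_ji per component
    means = (tau.T @ X) / n_eff[:, None]
    covs = []
    for i in range(tau.shape[1]):
        diff = X - means[i]
        covs.append((tau[:, i, None] * diff).T @ diff / n_eff[i])
    return n_eff / X.shape[0], means, np.array(covs)

# Synthetic two-cluster data (illustrative only, not the data sets of the slides).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(5.0, 1.0, (100, 2))])
weights = np.array([0.5, 0.5])
means = X[[0, 100]].copy()                       # one starting point per cluster
covs = np.array([np.eye(2), np.eye(2)])
history = []
for _ in range(20):
    tau, loglik = e_step(X, weights, means, covs)
    history.append(loglik)
    weights, means, covs = m_step(X, tau)
```

The recorded log-likelihood values in `history` should not decrease from one iteration to the next, which is a convenient sanity check for an EM implementation.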
Methods Stratiﬁed Model
Integrating Biological Knowledge: the Stratiﬁed model
- The genes can be classified into $K$ “functional groups”, e.g. based on the Gene Ontology or on pathways
- We assume that genes categorized into the same functional group are dependent, whereas genes in different groups are independent
- For jointly Gaussian variables independence is equivalent to uncorrelatedness, and therefore we introduce the following “stratified” model for the covariance matrix:

$$\tilde{\boldsymbol{\Sigma}} = \begin{pmatrix}
\boldsymbol{\Sigma}^{(1)} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{0} \\
\mathbf{0} & \boldsymbol{\Sigma}^{(2)} & \cdots & \mathbf{0} & \mathbf{0} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\mathbf{0} & \mathbf{0} & \cdots & \boldsymbol{\Sigma}^{(K)} & \mathbf{0} \\
\mathbf{0} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{D}^{(r)}
\end{pmatrix} \tag{6}$$

  where each $\boldsymbol{\Sigma}^{(k)}$ is the (unconstrained) covariance (sub)matrix of the genes belonging to the $k$-th group, and $\mathbf{D}^{(r)}$ is the diagonal covariance matrix of the $r$ genes that do not belong to any group.
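
One way to impose the structure of Eq. (6) on a given covariance matrix is to zero every entry that links genes from different groups, as in the sketch below. It assumes NumPy, disjoint groups given as lists of gene indices, and a hypothetical function name; the mask-based approach avoids having to reorder the genes by group.

```python
import numpy as np

def stratified_covariance(S, groups):
    """Keep within-group blocks of S and the diagonal; zero everything else.

    S      : (p, p) covariance matrix
    groups : list of disjoint index lists, one per functional group
    Genes in no group keep only their variance (the diagonal block D).
    """
    p = S.shape[0]
    mask = np.eye(p, dtype=bool)            # diagonal is always retained
    for idx in groups:
        idx = np.asarray(idx)
        mask[np.ix_(idx, idx)] = True       # full block for each group
    return np.where(mask, S, 0.0)

# Illustrative example: 5 genes, two hypothetical groups, gene 4 ungrouped.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
S = A @ A.T + np.eye(5)                     # a symmetric positive definite matrix
S_strat = stratified_covariance(S, [[0, 1], [2, 3]])
```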
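The sparse structure of the covariance matrix is imposed on every component of the mixture model, so that the component densities are rewritten as

$$f_i(\mathbf{x}_j; \boldsymbol{\theta}_i) = \mathcal{N}(\mathbf{x}_j; \boldsymbol{\mu}_i, \tilde{\boldsymbol{\Sigma}}_i) \tag{7}$$

Taking the block-diagonal structure into account, each component density factorizes as

$$f_i(\mathbf{x}_j; \boldsymbol{\theta}_i) = \mathcal{N}(\mathbf{x}_j^{(r)}; \boldsymbol{\mu}_i^{(r)}, \mathbf{D}_i^{(r)}) \prod_{k=1}^{K} \mathcal{N}(\mathbf{x}_j^{(k)}; \boldsymbol{\mu}_i^{(k)}, \boldsymbol{\Sigma}_i^{(k)}) \tag{8}$$

and the mixture density becomes

$$\begin{aligned}
f(\mathbf{x}_j; \boldsymbol{\Theta}) &= \sum_{i=1}^{g} \pi_i \, \mathcal{N}(\mathbf{x}_j^{(r)}; \boldsymbol{\mu}_i^{(r)}, \mathbf{D}_i^{(r)}) \cdot \prod_{k=1}^{K} \mathcal{N}(\mathbf{x}_j^{(k)}; \boldsymbol{\mu}_i^{(k)}, \boldsymbol{\Sigma}_i^{(k)}) \\
&= \sum_{i=1}^{g} \pi_i \prod_{k=1}^{K+1} \mathcal{N}(\mathbf{x}_j^{(k)}; \boldsymbol{\mu}_i^{(k)}, \boldsymbol{\Sigma}_i^{(k)})
\end{aligned} \tag{9}$$

where, in the second line, the block $k = K+1$ denotes the $r$ ungrouped genes, i.e. $\boldsymbol{\Sigma}_i^{(K+1)} = \mathbf{D}_i^{(r)}$.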
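
The factorization in Eq. (8) means the log-density of one component is a sum of per-block log-densities, so only small matrices need to be inverted. The sketch below, assuming NumPy and hypothetical function names, evaluates the factorized form and compares it with the full Gaussian on a small block-diagonal example.

```python
import numpy as np

def log_gauss(X, mean, cov):
    """Row-wise log N(x_j; mean, cov) for X of shape (N, p)."""
    p = X.shape[1]
    diff = X - mean
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (p * np.log(2.0 * np.pi) + logdet + maha)

def stratified_log_density(X, mean, cov, groups):
    """Log of Eq. (8): sum of block log-densities plus the diagonal part."""
    p = X.shape[1]
    grouped = np.concatenate(groups) if groups else np.array([], dtype=int)
    rest = np.setdiff1d(np.arange(p), grouped)   # the r ungrouped genes
    total = np.zeros(X.shape[0])
    for idx in groups:
        idx = np.asarray(idx)
        total += log_gauss(X[:, idx], mean[idx], cov[np.ix_(idx, idx)])
    if rest.size:
        # Diagonal block D: a product of independent univariate Gaussians.
        var = np.diag(cov)[rest]
        diff = X[:, rest] - mean[rest]
        total += -0.5 * np.sum(np.log(2.0 * np.pi * var) + diff**2 / var, axis=1)
    return total

# Illustrative block-diagonal covariance: two groups of two genes, one ungrouped gene.
cov = np.zeros((5, 5))
cov[:2, :2] = [[2.0, 0.5], [0.5, 1.0]]
cov[2:4, 2:4] = [[1.0, -0.3], [-0.3, 1.5]]
cov[4, 4] = 0.7
mean = np.array([0.0, 1.0, -1.0, 0.5, 2.0])
X = np.random.default_rng(0).normal(size=(10, 5))
factorized = stratified_log_density(X, mean, cov, [[0, 1], [2, 3]])
full = log_gauss(X, mean, cov)
```

Here `factorized` and `full` agree up to rounding, because the covariance passed in already has the block-diagonal form of Eq. (6).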
EM for the stratiﬁed model
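- In the E-step the “responsibilities” are updated based on the current model parameters as

$$\tau_{ji} = \frac{\pi_i^{cur} \prod_{k=1}^{K+1} \mathcal{N}(\mathbf{x}_j^{(k)}; \boldsymbol{\mu}_i^{(k),cur}, \boldsymbol{\Sigma}_i^{(k),cur})}{\sum_{s=1}^{g} \pi_s^{cur} \prod_{k=1}^{K+1} \mathcal{N}(\mathbf{x}_j^{(k)}; \boldsymbol{\mu}_s^{(k),cur}, \boldsymbol{\Sigma}_s^{(k),cur})} \tag{10}$$

- In the M-step the new model parameters can be computed separately per functional group as

$$\boldsymbol{\mu}_i^{(k)} = \frac{\sum_{j=1}^{N} \tau_{ji} \, \mathbf{x}_j^{(k)}}{\sum_{j=1}^{N} \tau_{ji}} \tag{11}$$

$$\boldsymbol{\Sigma}_i^{(k)} = \frac{\sum_{j=1}^{N} \tau_{ji} \, (\mathbf{x}_j^{(k)} - \boldsymbol{\mu}_i^{(k)})(\mathbf{x}_j^{(k)} - \boldsymbol{\mu}_i^{(k)})^T}{\sum_{j=1}^{N} \tau_{ji}} \tag{12}$$

$$\pi_i = \frac{\sum_{j=1}^{N} \tau_{ji}}{N} \tag{13}$$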
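
The per-group M-step of Eqs. (11) to (13) can be sketched as below, assuming NumPy, disjoint groups given as index lists, and a hypothetical function name. Keeping only the weighted variances for the ungrouped genes is how this sketch realizes the diagonal block; the responsibilities used in the example are random placeholders rather than the output of an E-step.

```python
import numpy as np

def stratified_m_step(X, tau, groups):
    """M-step for the stratified model.

    Returns mixing coefficients, means, and for each component a list of
    covariance blocks: one per functional group, then the diagonal block
    for the ungrouped genes.
    """
    N, p = X.shape
    grouped = np.concatenate(groups) if groups else np.array([], dtype=int)
    rest = np.setdiff1d(np.arange(p), grouped)
    n_eff = tau.sum(axis=0)
    weights = n_eff / N                                   # Eq. (13)
    means = (tau.T @ X) / n_eff[:, None]                  # Eq. (11), all blocks at once
    blocks = []
    for i in range(tau.shape[1]):
        diff = X - means[i]
        comp = []
        for idx in groups:
            d = diff[:, idx]
            comp.append((tau[:, i, None] * d).T @ d / n_eff[i])   # Eq. (12)
        # Ungrouped genes: weighted variances only, stored as a diagonal matrix.
        comp.append(np.diag((tau[:, i] @ diff[:, rest] ** 2) / n_eff[i]))
        blocks.append(comp)
    return weights, means, blocks

# Illustrative input: 50 samples, 5 genes, two hypothetical groups, gene 4 ungrouped.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
tau = rng.random((50, 2))
tau /= tau.sum(axis=1, keepdims=True)                     # placeholder responsibilities
groups = [[0, 1], [2, 3]]
weights, means, blocks = stratified_m_step(X, tau, groups)
```

Each block only involves the genes of one group, so the cost of the covariance update grows with the group sizes rather than with the total number of genes.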
Evaluation Data sets
Evaluation
To evaluate our method we use two data sets:
- A Breast Cancer data set [Huang, 2003] with 52 samples, of which 18 exhibit tumor recurrence and 34 do not.
- A Prostate Cancer data set [Singh, 2002] with 52 tumor samples and 50 normal samples.
Both are based on the Affymetrix™ hgu95av2 platform, which contains 12625 probesets, and are preprocessed using the GCRMA normalization and summarization methods.
Deﬁning functional groups: KEGG Pathways
Table: The KEGG pathways used in the tests

 #    Pathway id   Pathway name
 1    04115        p53 signaling pathway
 2    04210        Apoptosis
 3    04370        VEGF signaling pathway
 4    05010        Alzheimer’s disease
 5    05012        Parkinson’s disease
 6    05014        Amyotrophic lateral sclerosis (ALS)
 7    05016        Huntington’s disease
 8    05200        Pathways in cancer
 9    05210        Colorectal cancer
 10   05212        Pancreatic cancer
 11   05213        Endometrial cancer
 12   05215        Prostate cancer
 13   05222        Small cell lung cancer
 14   05223        Non-small cell lung cancer
 15   05416        Viral myocarditis
Comparison of clustering results
- Algorithms: k-means, PAM, and our stratified EM
- “Hard” clustering is obtained by assigning each sample $\mathbf{x}_j$ to the cluster $i$ with the largest responsibility $\tau_{ji}$
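
The hard assignment described above amounts to an arg-max over each row of the responsibility matrix. A minimal sketch in plain Python follows; the function name and the example values are hypothetical, and ties are resolved in favor of the lowest cluster index.

```python
def hard_assign(tau):
    """Map each sample j to the cluster i with the largest tau[j][i]."""
    return [max(range(len(row)), key=row.__getitem__) for row in tau]

# Illustrative responsibilities for three samples and g = 2 clusters.
assignments = hard_assign([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
```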
Comparison of clustering results
- The “true” underlying clusters are unknown
- We use the class labels to validate and evaluate the clustering results
- Because both data sets have a binary classification, we request the identification of $g = 2$ clusters
Evaluation Results
Biological Homogeneity Index [Datta, 2006]
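$$\mathrm{BHI} = \frac{1}{g} \sum_{i=1}^{g} \frac{1}{N_i (N_i - 1)} \sum_{\substack{x \ne y \\ x, y \in D_i}} \mathbb{1}\big(C(x) = C(y)\big) \tag{14}$$

where $D_i$ is the set of samples assigned to cluster $i$, $N_i$ is its size, and $C(x)$ is the class label of sample $x$.

- Checks the homogeneity of the clusters based on the class labels
- Ideally BHI = 1, e.g. if all the tumor samples are assigned to one cluster and all the normal ones to the other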
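
Eq. (14) can be computed directly from cluster assignments and class labels, as in the sketch below. The function name is hypothetical, and treating clusters with fewer than two samples as contributing zero is an assumption, since the formula is undefined for them. The usage example reads the k-means breast cancer cluster entries (12/11 and 22/7) as per-class sample counts, which is an interpretation of the results table rather than something the slides state.

```python
from collections import Counter, defaultdict

def bhi(cluster_ids, class_labels):
    """Biological Homogeneity Index of Eq. (14)."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, class_labels):
        members[cluster].append(label)
    total = 0.0
    for labels in members.values():
        n = len(labels)
        if n < 2:
            continue  # assumption: singleton clusters contribute 0
        # Number of ordered pairs (x, y), x != y, sharing the same class label.
        same = sum(m * (m - 1) for m in Counter(labels).values())
        total += same / (n * (n - 1))
    return total / len(members)

# Cluster 1 holds 12 + 11 samples, cluster 2 holds 22 + 7 samples.
cluster_ids = [1] * 23 + [2] * 29
class_labels = ["neg"] * 12 + ["pos"] * 11 + ["neg"] * 22 + ["pos"] * 7
kmeans_breast_bhi = bhi(cluster_ids, class_labels)
```

Under that reading the value rounds to 0.55, in line with the k-means entry reported for the breast cancer data.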
Table: BHI Results

Algorithm   BHI (Breast Cancer)   BHI (Prostate Cancer)
k-means     0.55                  0.52
PAM         0.56                  0.51
our EM      0.56                  0.49
A detailed look into clustering results
Table: Classification results (Breast Cancer)

Algorithm   Cluster #1   Cluster #2   Miscl. rate   Sensitivity   Specificity
k-means     12/11        22/7         0.346         0             1
PAM         12/12        22/6         0.346         0.667         0.647
our EM      14/12        20/6         0.346         0             1

Table: Classification results (Prostate Cancer)

Algorithm   Cluster #1   Cluster #2   Miscl. rate   Sensitivity   Specificity
k-means     19/10        31/42        0.402         0.808         0.380
PAM         21/12        29/40        0.402         0.769         0.420
our EM      22/18        28/34        0.451         0.654         0.440

Cluster entries appear to be sample counts per class label: their sums match non-recurrent/recurrent (34/18) for Breast Cancer and normal/tumor (50/52) for Prostate Cancer.
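
The rates in these tables can be reproduced from the cluster counts once one cluster is taken as the "positive" prediction. The sketch below uses a hypothetical helper and assumes, for the PAM breast cancer row, that cluster #1 is the recurrent (positive) cluster and that each entry lists negatives before positives; both are inferences from the numbers, not statements from the slides.

```python
def binary_metrics(tp, fn, tn, fp):
    """Misclassification rate, sensitivity and specificity from confusion counts."""
    total = tp + fn + tn + fp
    return {
        "misclassification": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# PAM, Breast Cancer: cluster #1 = 12 non-recurrent / 12 recurrent,
# cluster #2 = 22 non-recurrent / 6 recurrent.
pam_breast = binary_metrics(tp=12, fn=6, tn=22, fp=12)

# k-means, Prostate Cancer: cluster #1 = 19 normal / 10 tumor,
# cluster #2 = 31 normal / 42 tumor, with cluster #2 taken as "tumor".
kmeans_prostate = binary_metrics(tp=42, fn=10, tn=19, fp=31)
```

With these assumptions the computed values round to the tabulated rates for both rows.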
Conclusions
- Integration of biological knowledge is a “hot” area
- Such knowledge can be used to overcome computational deficiencies and to improve the results and the validity of the methods
- The stratified model can be seen as a middle ground between the full sample covariance matrix, which can lead to an ill-posed inverse problem, and a lower-dimensional diagonal covariance matrix.
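
The "middle ground" can be made concrete by counting the free covariance parameters per mixture component under each model. The sketch below uses a hypothetical helper and assumes disjoint groups; the group sizes in the example are invented for illustration, and only the probeset count of 12625 comes from the slides.

```python
def covariance_parameters(p, group_sizes):
    """Free covariance parameters per component for p genes.

    group_sizes lists the sizes of the (disjoint) functional groups; the
    remaining r = p - sum(group_sizes) genes get one variance each.
    """
    r = p - sum(group_sizes)
    if r < 0:
        raise ValueError("group sizes exceed the number of genes")
    return {
        "full": p * (p + 1) // 2,
        "stratified": sum(s * (s + 1) // 2 for s in group_sizes) + r,
        "diagonal": p,
    }

# 12625 probesets with three hypothetical groups of 60, 80 and 70 genes.
counts = covariance_parameters(p=12625, group_sizes=[60, 80, 70])
```

With only tens of samples available, the full model has far more parameters than observations, while the stratified count stays close to the diagonal one.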
Limitations and future work
- The outcome of the experiments is not very informative about the validity of the described approach; further testing will be conducted in the future.
- The stratified model that we defined assumes the independence of the uncategorized genes, an assumption that is definitely far from the truth.
- Certain genes can have more than one functional annotation or participate in more than one category or pathway.
- We plan to improve the performance and robustness of the method and to study its convergence properties.