

MICROARRAY DATA

Represented by an N × M matrix (y_1, ..., y_M), where y_j contains the gene expressions for the N genes of the jth tissue sample (j = 1, ..., M).

N = no. of genes (10^3 - 10^4)
M = no. of tissue samples (10 - 10^2)

STANDARD STATISTICAL METHODOLOGY APPROPRIATE FOR M >> N; HERE N >> M.

Microarray Data represented as N x M Matrix


(Schematic of the N × M expression matrix: the M columns (samples, ~10^2) run from Sample 1 to Sample M and the N rows (genes, ~10^4) from Gene 1 to Gene N; each column is an expression signature and each row is an expression profile.)

Two Clustering Problems:

- Clustering of genes on the basis of tissues: the genes are not independent.
- Clustering of tissues on the basis of genes: the latter is a nonstandard problem in cluster analysis (n << p).

UNSUPERVISED CLASSIFICATION (CLUSTER ANALYSIS)

Infer the class labels z_1, ..., z_n of y_1, ..., y_n.

Initially, hierarchical distance-based methods of cluster analysis were used to cluster the tissues and the genes
Eisen, Spellman, Brown, & Botstein (1998, PNAS)

The notion of a cluster is not easy to define. There is a very large literature devoted to clustering when there is a metric known in advance; e.g. k-means. Usually, there is no a priori metric (or equivalently a user-defined distance matrix) for a cluster analysis. That is, the difficulty is that the shape of the clusters is not known until the clusters have been identified, and the clusters cannot be effectively identified unless the shapes are known.

In this case, one attractive feature of adopting mixture models with elliptically symmetric components, such as the normal or t densities, is that the implied clustering is invariant under affine transformations of the data (that is, under operations relating to changes in location, scale, and rotation of the data). Thus the clustering process does not depend on irrelevant factors such as the units of measurement or the orientation of the clusters in space.

"Hierarchical clustering methods for the analysis of gene expression data caught on like the hula hoop. I, for one, will be glad to see them fade."
Gary Churchill (The Jackson Laboratory), contribution to the discussion of the paper by Sebastiani, Gussoni, Kohane, and Ramoni. Statistical Science (2003) 18, 64-69.

"Hierarchical (agglomerative) clustering algorithms are largely heuristically motivated and there exist a number of unresolved issues associated with their use, including how to determine the number of clusters. In the absence of a well-grounded statistical model, it seems difficult to define what is meant by a good clustering algorithm or the right number of clusters."
(Yeung et al., 2001, Model-Based Clustering and Data Transformations for Gene Expression Data, Bioinformatics 17)

McLachlan and Khan (2004). On a resampling approach for tests on the number of clusters with mixture model-based clustering of the tissue samples. Special issue of the Journal of Multivariate Analysis 90 (2004) edited by Mark van der Laan and Sandrine Dudoit (UC Berkeley).

Attention is now turning towards a model-based approach to the analysis of microarray data.
For example:
Broët, Richardson, and Radvanyi (2002). Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. Journal of Computational Biology 9.
Ghosh and Chinnaiyan (2002). Mixture modelling of gene expression data from microarray experiments. Bioinformatics 18.
Liu, Zhang, Palumbo, and Lawrence (2003). Bayesian clustering with variable and transformation selection. In Bayesian Statistics 7.
Pan, Lin, and Le (2002). Model-based cluster analysis of microarray gene expression data. Genome Biology 3.
Yeung et al. (2001). Model-based clustering and data transformations for gene expression data. Bioinformatics 17.


For example, with y = (Height, Weight, BP)^T, the implied clustering is unchanged if the data are transformed to (Height + Weight, Height - Weight, BP)^T.


McLachlan and Peel (2000), Finite Mixture Models. Wiley.

Mixture Software: EMMIX


EMMIX for UNIX

McLachlan, Peel, Adams, and Basford


http://www.maths.uq.edu.au/~gjm/emmix/emmix.html

Basic Definition
We let Y_1, ..., Y_n denote a random sample of size n, where Y_j is a p-dimensional random vector with probability density function f(y_j),

f(y_j) = π_1 f_1(y_j) + ... + π_g f_g(y_j),

where the f_i(y_j) are densities and the π_i are nonnegative quantities that sum to one.
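As a minimal sketch of this definition (the normal component densities and the parameter values below are illustrative assumptions, not from the slides), the mixture density can be evaluated directly:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(y, pis, components):
    """Evaluate f(y) = pi_1 f_1(y) + ... + pi_g f_g(y) with normal component densities."""
    return sum(pi * multivariate_normal(mean=mu, cov=cov).pdf(y)
               for pi, (mu, cov) in zip(pis, components))

# Hypothetical g = 2 mixture in p = 2 dimensions
pis = [0.3, 0.7]
components = [(np.zeros(2), np.eye(2)),
              (np.array([3.0, 0.0]), 2.0 * np.eye(2))]
print(mixture_density(np.array([1.0, 0.5]), pis, components))
```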

Mixture distributions are applied to data with two main purposes in mind:
- To provide an appealing semiparametric framework in which to model unknown distributional shapes, as an alternative to, say, the kernel density method.
- To use the mixture model to provide a model-based clustering.
(In both situations, there is the question of how many components to include in the mixture.)

Shapes of Some Univariate Normal Mixtures


Consider

f(y_j) = π_1 φ(y_j; μ_1, σ^2) + π_2 φ(y_j; μ_2, σ^2),

where

φ(y_j; μ, σ^2) = (2πσ^2)^(-1/2) exp{-(y_j - μ)^2 / (2σ^2)}

denotes the univariate normal density with mean μ and variance σ^2.
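A small sketch that evaluates this two-component density over a grid; the particular means, proportions, and separations are placeholders chosen to mimic plots like Figures 1 and 2 below:

```python
import numpy as np
from scipy.stats import norm

def two_component_density(y, pi1, mu1, mu2, sigma2=1.0):
    """f(y) = pi1 * phi(y; mu1, sigma2) + (1 - pi1) * phi(y; mu2, sigma2)."""
    sd = np.sqrt(sigma2)
    return pi1 * norm.pdf(y, mu1, sd) + (1.0 - pi1) * norm.pdf(y, mu2, sd)

y = np.linspace(-4.0, 8.0, 400)
for delta in (1, 2, 3, 4):                       # separation between the two means
    f = two_component_density(y, pi1=0.5, mu1=0.0, mu2=float(delta))
    print(delta, round(f.max(), 3))              # swap the print for a matplotlib plot
```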

Figure 1: Plot of a mixture density of two univariate normal components in equal proportions with common variance σ^2 = 1 (four panels, for increasing separation Δ = 1, 2, 3, 4 between the component means).

Figure 2: Plot of a mixture density of two univariate normal components in proportions 0.75 and 0.25 with common variance (four panels, Δ = 1, 2, 3, 4).

Normal Mixtures
- Computationally convenient for multivariate data.
- Provide an arbitrarily accurate estimate of the underlying density with g sufficiently large.
- Provide a probabilistic clustering of the data into g clusters; an outright clustering is obtained by assigning a data point to the component to which it has the greatest posterior probability of belonging.
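A sketch of fitting such a model in practice; scikit-learn's GaussianMixture and the simulated data are stand-ins here for EMMIX and real expression data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Simulated data from two well-separated bivariate normal groups
y = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(4.0, 1.0, size=(100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(y)
tau = gmm.predict_proba(y)   # posterior probabilities of component membership
labels = gmm.predict(y)      # outright clustering: highest posterior probability
print(tau[:3].round(3), labels[:5])
```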

Synthetic Data Set 1

Synthetic Data Set 2


Parameter        True values           Initial values    Estimates by EM
π_1, π_2, π_3    0.333, 0.333, 0.333   0.333, 0.333, 0.333   0.294, 0.337, 0.370
μ_1              (0, 2)^T              (-1, 0)^T         (-0.154, 1.961)^T
μ_2              (0, 0)^T              (0, 0)^T          (0.360, 0.115)^T
μ_3              (0, 2)^T              (1, 0)^T          (-0.004, 2.027)^T
Σ_1              [2 0; 0 0.2]          [1 0; 0 1]        [1.961 0.016; 0.016 0.218]
Σ_2              [2 0; 0 0.2]          [1 0; 0 1]        [2.346 0.553; 0.553 0.218]
Σ_3              [2 0; 0 0.2]          [1 0; 0 1]        [2.339 0.042; 0.042 0.206]

Figure 7

Figure 8

MIXTURE OF g NORMAL COMPONENTS

f(y) = π_1 φ(y; μ_1, Σ_1) + ... + π_g φ(y; μ_g, Σ_g),

where

-2 log φ(y; μ, Σ) = (y - μ)^T Σ^(-1) (y - μ) + constant.

The quadratic form (y - μ)^T Σ^(-1) (y - μ) is the squared MAHALANOBIS DISTANCE, in contrast to the squared EUCLIDEAN DISTANCE (y - μ)^T (y - μ).
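A small numpy sketch (the covariance and the point are illustrative) contrasting the two distances; with an elongated cluster, a point lying along the long axis is close in the Mahalanobis sense even when its Euclidean distance is large:

```python
import numpy as np

mu = np.array([0.0, 0.0])
sigma = np.array([[4.0, 0.0],
                  [0.0, 0.25]])   # elongated cluster: large spread in x, small in y
y = np.array([2.0, 0.5])

diff = y - mu
euclidean_sq = diff @ diff                              # (y - mu)^T (y - mu)
mahalanobis_sq = diff @ np.linalg.solve(sigma, diff)    # (y - mu)^T Sigma^{-1} (y - mu)
print(euclidean_sq, mahalanobis_sq)                     # 4.25 versus 2.0
```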

MIXTURE OF g NORMAL COMPONENTS

f(y) = π_1 φ(y; μ_1, Σ_1) + ... + π_g φ(y; μ_g, Σ_g)

k-means corresponds to the constraint

Σ_1 = ... = Σ_g = σ^2 I

(equal spherical covariance matrices), that is, SPHERICAL CLUSTERS.

With a mixture model-based approach to clustering, an observation is assigned outright to the ith cluster if its density in the ith component of the mixture distribution (weighted by the prior probability of that component) is greater than in the other (g-1) components.
f(y) = π_1 φ(y; μ_1, Σ_1) + ... + π_i φ(y; μ_i, Σ_i) + ... + π_g φ(y; μ_g, Σ_g)
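A sketch of that assignment rule from scratch (the component parameters below are placeholders):

```python
import numpy as np
from scipy.stats import multivariate_normal

def assign_cluster(y, pis, means, covs):
    """Assign y to the component i maximizing pi_i * phi(y; mu_i, Sigma_i),
    equivalently the component of highest posterior probability."""
    weighted = np.array([pi * multivariate_normal(mu, cov).pdf(y)
                         for pi, mu, cov in zip(pis, means, covs)])
    tau = weighted / weighted.sum()          # posterior probabilities
    return int(np.argmax(tau)), tau

label, tau = assign_cluster(np.array([1.0, 1.0]),
                            pis=[0.5, 0.5],
                            means=[np.zeros(2), np.array([3.0, 3.0])],
                            covs=[np.eye(2), np.eye(2)])
print(label, tau.round(3))
```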

Figure 7: Contours of the fitted component densities on the 2nd & 3rd variates for the blue crab data set.

Estimation of Mixture Distributions


It was the publication of the seminal paper of Dempster, Laird, and Rubin (1977) on the EM algorithm that greatly stimulated interest in the use of finite mixture distributions to model heterogeneous data. McLachlan and Krishnan (1997, Wiley)

If need be, the normal mixture model can be made less sensitive to outlying observations by using t component densities. With this t mixture model-based approach, the normal distribution for each component in the mixture is embedded in a wider class of elliptically symmetric distributions with an additional parameter called the degrees of freedom.

The advantage of the t mixture model is that, although the number of outliers needed for breakdown is almost the same as with the normal mixture model, the outliers have to be much larger.
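For intuition, a minimal sketch of the downweighting that the t model produces in the EM fit: each observation receives a weight u = (ν + p)/(ν + δ), where δ is its squared Mahalanobis distance from the component mean (this is the standard E-step weight for t components; the parameter values here are assumptions, not taken from the slides):

```python
import numpy as np

def t_weight(y, mu, sigma, nu):
    """EM weight u = (nu + p) / (nu + delta); outlying points get small weights,
    so they have little influence on the updated location and scale estimates."""
    p = len(y)
    diff = y - mu
    delta = diff @ np.linalg.solve(sigma, diff)   # squared Mahalanobis distance
    return (nu + p) / (nu + delta)

mu, sigma = np.zeros(2), np.eye(2)
print(t_weight(np.array([0.5, 0.5]), mu, sigma, nu=4.0))   # near the mean: weight > 1
print(t_weight(np.array([5.0, 5.0]), mu, sigma, nu=4.0))   # outlier: weight << 1
```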

In exploring high-dimensional data sets for group structure, it is typical to rely on principal component analysis.

Two Groups in Two Dimensions. All cluster information would be lost by collapsing to the first principal component. The principal ellipses of the two groups are shown as solid curves.

Mixtures of Factor Analyzers


A normal mixture model without restrictions on the component-covariance matrices may be viewed as too general for many situations in practice, in particular with high-dimensional data. One approach for reducing the number of parameters is to work in a lower dimensional space by using principal components; another is to use mixtures of factor analyzers.

Mixtures of Factor Analyzers


Principal components or a single-factor analysis model provides only a global linear model. A global nonlinear approach can be obtained by postulating a mixture of linear submodels:

f(y_j) = Σ_{i=1}^{g} π_i φ(y_j; μ_i, Σ_i),

where

Σ_i = B_i B_i^T + D_i    (i = 1, ..., g),

B_i is a p × q matrix and D_i is a diagonal matrix.

Single-Factor Analysis Model


Y_j = μ + B U_j + e_j    (j = 1, ..., n),

where U_j is a q-dimensional (q < p) vector of latent or unobservable variables called factors and B is a p × q matrix of factor loadings.

The U_j are iid N(0, I_q), independently of the errors e_j, which are iid N(0, D), where D is a diagonal matrix,

D = diag(σ_1^2, ..., σ_p^2).

Conditional on ith component membership of the mixture,

Y_j = μ_i + B_i U_ij + e_ij    (i = 1, ..., g),

where U_i1, ..., U_in are independent, identically distributed (iid) N(0, I_q), independently of the e_ij, which are iid N(0, D_i), where D_i is a diagonal matrix (i = 1, ..., g).

There is an infinity of choices for B_i, since the model still holds if B_i is replaced by B_i C_i, where C_i is an orthogonal matrix. Choose C_i so that B_i^T D_i^(-1) B_i is diagonal. The number of free parameters (per component-covariance matrix) is then

pq + p - q(q - 1)/2.

The reduction in the number of parameters is then

(1/2){(p - q)^2 - (p + q)}.
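A quick numerical check of these counts (p and q below are arbitrary choices):

```python
def factor_cov_params(p, q):
    """Free parameters in Sigma_i = B_i B_i^T + D_i after fixing the rotation of B_i."""
    return p * q + p - q * (q - 1) // 2

def reduction(p, q):
    """Saving relative to an unrestricted covariance matrix with p(p+1)/2 parameters."""
    return ((p - q) ** 2 - (p + q)) // 2

p, q = 50, 4
print(factor_cov_params(p, q))                         # 244
print(p * (p + 1) // 2 - factor_cov_params(p, q))      # 1031
print(reduction(p, q))                                 # 1031, matching the formula
```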

We can fit the mixture of factor analyzers model using an alternating ECM algorithm.

1st cycle: declare the missing data to be the component-indicator vectors. Update the estimates of π_i and μ_i.

2nd cycle: declare the missing data to be also the factors. Update the estimates of B_i and D_i.

M-step on 1st cycle:

π_i^(k+1) = Σ_{j=1}^{n} τ_ij^(k) / n,

μ_i^(k+1) = Σ_{j=1}^{n} τ_ij^(k) y_j / Σ_{j=1}^{n} τ_ij^(k),

for i = 1, ..., g.
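A sketch of this first-cycle M-step; tau is the n × g matrix of current posterior probabilities, and the data below are placeholders:

```python
import numpy as np

def m_step_cycle1(y, tau):
    """Update mixing proportions and component means from posterior probabilities."""
    n_i = tau.sum(axis=0)               # effective number of observations per component
    pis = n_i / len(y)                  # pi_i^(k+1) = sum_j tau_ij / n
    mus = (tau.T @ y) / n_i[:, None]    # mu_i^(k+1) = sum_j tau_ij y_j / sum_j tau_ij
    return pis, mus

rng = np.random.default_rng(1)
y = rng.normal(size=(6, 3))                 # n = 6 observations in p = 3 dimensions
tau = rng.dirichlet(np.ones(2), size=6)     # g = 2 components; each row sums to one
print(m_step_cycle1(y, tau))
```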

M-step on 2nd cycle:

B_i^(k+1) = V_i^(k+1/2) γ_i^(k) (γ_i^(k)T V_i^(k+1/2) γ_i^(k) + ω_i^(k))^(-1),

D_i^(k+1) = diag{V_i^(k+1/2) - V_i^(k+1/2) γ_i^(k) B_i^(k+1)T},

where

γ_i^(k) = (B_i^(k) B_i^(k)T + D_i^(k))^(-1) B_i^(k),
ω_i^(k) = I_q - γ_i^(k)T B_i^(k),

and V_i^(k+1/2) is given by

V_i^(k+1/2) = Σ_{j=1}^{n} τ_i(y_j; Ψ^(k+1/2)) (y_j - μ_i^(k+1))(y_j - μ_i^(k+1))^T / Σ_{j=1}^{n} τ_i(y_j; Ψ^(k+1/2)).

Work in q-dim space:


(B_i B_i^T + D_i)^(-1) = D_i^(-1) - D_i^(-1) B_i (I_q + B_i^T D_i^(-1) B_i)^(-1) B_i^T D_i^(-1),

|B_i B_i^T + D_i| = |D_i| / |I_q - B_i^T (B_i B_i^T + D_i)^(-1) B_i|.
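A quick numerical check of both identities with random B_i and D_i (a sketch only); the practical point is that only a q × q matrix ever needs to be inverted:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 6, 2
B = rng.normal(size=(p, q))
D = np.diag(rng.uniform(0.5, 2.0, size=p))
Sigma = B @ B.T + D

D_inv = np.linalg.inv(D)                                  # trivial: D is diagonal
small = np.linalg.inv(np.eye(q) + B.T @ D_inv @ B)        # only a q x q inversion needed
Sigma_inv = D_inv - D_inv @ B @ small @ B.T @ D_inv
print(np.allclose(Sigma_inv, np.linalg.inv(Sigma)))       # True

lhs = np.linalg.det(Sigma)
rhs = np.linalg.det(D) / np.linalg.det(np.eye(q) - B.T @ np.linalg.inv(Sigma) @ B)
print(np.isclose(lhs, rhs))                               # True
```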

The diagonal matrix D_i is updated as

D_i = diag(V_i - B_i B_i^T),

where

V_i = Σ_{j=1}^{n} τ_i(y_j; Ψ)(y_j - μ_i)(y_j - μ_i)^T / Σ_{j=1}^{n} τ_i(y_j; Ψ).
With EM:

D_i^(k+1) = diag(V_i^(k+1) - B_i^(k+1) W_i^(k) B_i^(k+1)T),

where

W_i^(k) = Σ_{j=1}^{n} τ_ij^(k) E{U_j U_j^T | y_j} / Σ_{j=1}^{n} τ_ij^(k).

To avoid potential computational problems with small-sized clusters, we impose the constraint

D_i = D    (i = 1, ..., g).

μ_i^(k+1) = Σ_{j=1}^{n} τ_ij^(k) u_ij^(k) y_j / Σ_{j=1}^{n} τ_ij^(k) u_ij^(k),

and V_i^(k+1/2) is given by

V_i^(k+1/2) = Σ_{j=1}^{n} τ_i(y_j; Ψ^(k+1/2)) u_ij^(k) (y_j - μ_i^(k+1))(y_j - μ_i^(k+1))^T / Σ_{j=1}^{n} τ_i(y_j; Ψ^(k+1/2)) u_ij^(k),

where the u_ij^(k) are the current observation weights.

Number of Components in a Mixture Model


Testing for the number of components, g, in a mixture is an important but very difficult problem which has not been completely resolved.

Order of a Mixture Model


A mixture density with g components might be empirically indistinguishable from one with either fewer than g components or more than g components. It is therefore sensible in practice to approach the question of the number of components in a mixture model in terms of an assessment of the smallest number of components in the mixture compatible with the data.

Likelihood Ratio Test Statistic


An obvious way of approaching the problem of testing for the smallest value of the number of components in a mixture model is to use the LRTS, -2 log λ. Suppose we wish to test the null hypothesis

H_0: g = g_0 versus H_1: g = g_1

for some g_1 > g_0.

We let Ψ_i denote the MLE of Ψ calculated under H_i (i = 0, 1). Then the evidence against H_0 will be strong if λ is sufficiently small, or equivalently, if -2 log λ is sufficiently large, where

-2 log λ = 2{log L(Ψ_1) - log L(Ψ_0)}.
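A sketch of computing the statistic; scikit-learn's GaussianMixture stands in for EMMIX, and the simulated data are an assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def lrts(y, g0, g1, random_state=0):
    """-2 log lambda = 2 * {log L(Psi_1) - log L(Psi_0)} for g1 versus g0 components."""
    loglik = {}
    for g in (g0, g1):
        gmm = GaussianMixture(n_components=g, n_init=5,
                              random_state=random_state).fit(y)
        loglik[g] = gmm.score(y) * len(y)   # score() returns the mean log-likelihood
    return 2.0 * (loglik[g1] - loglik[g0])

rng = np.random.default_rng(0)
y = np.vstack([rng.normal(0, 1, (80, 2)), rng.normal(4, 1, (80, 2))])
print(lrts(y, g0=1, g1=2))
```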

Bootstrapping the LRTS


McLachlan (1987) proposed a resampling approach to the assessment of the P-value of the LRTS in testing

H_0: g = g_0 versus H_1: g = g_1

for a specified value of g_0.
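A parametric-bootstrap sketch of that resampling idea, reusing the lrts function from the previous sketch; testing g_0 against g_0 + 1 and the small number of replications are assumptions made to keep the example short:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def bootstrap_pvalue(y, g0, n_boot=19, random_state=0):
    """Fit under H0, simulate bootstrap samples from that fit, and compare the
    observed LRTS with its bootstrap null distribution."""
    observed = lrts(y, g0, g0 + 1, random_state)
    h0_fit = GaussianMixture(n_components=g0, n_init=5,
                             random_state=random_state).fit(y)
    exceed = 0
    for _ in range(n_boot):
        y_star, _ = h0_fit.sample(len(y))               # data generated under H0
        if lrts(y_star, g0, g0 + 1, random_state) >= observed:
            exceed += 1
    return (exceed + 1) / (n_boot + 1)                  # bootstrap P-value
```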

Bayesian Information Criterion


The Bayesian information criterion (BIC) of Schwarz (1978) is given by

-2 log L + d log n,

where d is the number of free parameters; it is minimized (equivalently, the penalized log likelihood is maximized) in model selection, including in the present situation for the number of components g in a mixture model.
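A sketch of choosing g with BIC; scikit-learn's bic() computes exactly -2 log L + d log n, and the simulated data are an assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
y = np.vstack([rng.normal(0, 1, (80, 2)), rng.normal(4, 1, (80, 2))])

bic = {g: GaussianMixture(n_components=g, n_init=5, random_state=0).fit(y).bic(y)
       for g in range(1, 6)}
print(min(bic, key=bic.get), bic)   # the smallest BIC indicates the selected g
```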

Gap statistic (Tibshirani et al., 2001)

Clest (Dudoit and Fridlyand, 2002)

PROVIDES A MODEL-BASED APPROACH TO CLUSTERING


McLachlan, Bean, and Peel, 2002, A Mixture Model-Based Approach to the Clustering of Microarray Expression Data, Bioinformatics 18, 413-422
http://www.bioinformatics.oupjournals.org/cgi/screen pdf/18/3/413.pdf

Example: Microarray Data Colon Data of Alon et al. (1999)


M = 62 (40 tumours; 22 normals) tissue samples of N = 2,000 genes, in a 2,000 × 62 matrix.

Mixture of 2 normal components

Mixture of 2 t components

The t distribution does not have substantially better breakdown behavior than the normal (Tyler, 1994). The advantage of the t mixture model is that, although the number of outliers needed for breakdown is almost the same as with the normal mixture model, the outliers have to be much larger. This point is made more precise in Hennig (2002), who has provided an excellent account of breakdown points for ML estimation of location-scale mixtures with a fixed number of components g. Of course, as explained in Hennig (2002), mixture models can be made more robust by allowing the number of components g to grow with the number of outliers.

For normal mixtures, breakdown begins with an additional point at about 15.2. For a mixture of t3-distributions, the outlier must lie at about 800; t1-mixtures need the outlier at about 3.8 × 10^7, and a normal mixture with an additional noise component breaks down with an additional point at about 3.5 × 10^7.

Clustering of COLON Data Genes using EMMIX-GENE

Grouping for Colon Data (heat maps of gene groups 1 to 20).

Clustering of COLON Data Tissues using EMMIX-GENE

Grouping for Colon Data (heat maps of the tissue clusterings based on gene groups 1 to 20).

Heat Map Displaying the Reduced Set of 4,869 Genes on the 98 Breast Cancer Tumours


Heat Map of Top 1867 Genes


i   mi   Ui       i   mi   Ui       i   mi   Ui       i   mi   Ui
1   146  112.98   11  66   25.72    21  44   13.77    31  53   9.84
2   93   74.95    12  38   25.45    22  30   13.28    32  36   8.95
3   61   46.08    13  28   25.00    23  25   13.10    33  36   8.89
4   55   35.20    14  53   21.33    24  67   13.01    34  38   8.86
5   43   30.40    15  47   18.14    25  12   12.04    35  44   8.02
6   92   29.29    16  23   18.00    26  58   12.03    36  56   7.43
7   71   28.77    17  27   17.62    27  27   11.74    37  46   7.21
8   20   28.76    18  45   17.51    28  64   11.61    38  19   6.14
9   23   28.44    19  80   17.28    29  38   11.38    39  29   4.64
10  23   27.73    20  55   13.79    30  21   10.72    40  35   2.44

where i = group number, mi = number in group i, and Ui = -2 log λ_i.

Heat Map of Genes in Group G1

Heat Map of Genes in Group G2

Heat Map of Genes in Group G3

Clustering of gene expression profiles:
- Longitudinal (with or without replication, for example time-course) data
- Cross-sectional data

EMMIX-WIRE: EM-based MIXture analysis With Random Effects
A Mixture Model with Random-Effects Components for Clustering Correlated Gene-Expression Profiles. S.K. Ng, G. J. McLachlan, K. Wang, L. Ben-Tovim Jones, S-W. Ng.

Clustering of Correlated Gene Profiles

y_j = Xβ_h + U b_hj + V c_h + ε_hj


τ_h(y_j, c; Ψ) = pr{Z_hj = 1 | y_j, c}
= π_h f(y_j | z_hj = 1, c_h; θ_h) / Σ_{i=1}^{g} π_i f(y_j | z_ij = 1, c_i; θ_i),

where f(y_j | z_hj = 1, c_h; θ_h) is the N(μ_h, B_h) density, with

μ_h = Xβ_h + V c_h,
B_h = A_h + θ_bh U U^T.

Yeast Cell Cycle: X is an 18 × 2 matrix with (l + 1)th row (l = 0, ..., 17)

(cos(2π(7l)/ω + Φ),  sin(2π(7l)/ω + Φ)).

The yeast data are from Spellman (1998); the 18 rows represent the 18 time points of the α-factor (pheromone) synchronization, in which the yeast cells were sampled at 7-minute intervals for 119 minutes. ω is the period of the cell cycle and Φ is the phase offset, estimated using least squares to be ω = 53 and Φ = 0.
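A sketch of constructing this design matrix with the estimates quoted above (ω = 53, Φ = 0):

```python
import numpy as np

omega, phi = 53.0, 0.0        # period of the cell cycle (minutes) and phase offset
l = np.arange(18)             # sampling every 7 minutes: t = 7*l, l = 0, ..., 17
X = np.column_stack([np.cos(2 * np.pi * (7 * l) / omega + phi),
                     np.sin(2 * np.pi * (7 * l) / omega + phi)])
print(X.shape)                # (18, 2)
```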

Clustering Results for Spellman Yeast Cell Cycle Data

Plots of First versus Second Principal Components

(a) Our clustering

(b) Muro clustering

