
Discriminant Analysis

Cluster Analysis

Classification and clustering


Classification
- known number of classes
- based on a training set
- used to classify future observations
- Classification is a form of supervised learning

Clustering
- unknown number of classes
- no prior knowledge
- used to understand (explore) data
- Clustering is a form of unsupervised learning

Ahmed Rebai

Bioinformatics and Comparative Genome Analysis March 2007

Linear Discriminant Analysis

Classification
- In classification you do have a class label (o and x), each defined in terms of G1 and G2 values.
- You are trying to find a model that splits the data elements into their existing classes.
- You then assume that this model can be used to assign new data points x and y to the right class.

[Figure: supervised learning. Scatter plots of G1 against G2 showing the two labelled classes (* and o) and two new, unlabelled points marked x? and y?.]

Linear Discriminant Analysis


Proposed by Fisher (1936) for classifying an observation into one of two possible groups based on many measurements $x_1, x_2, \ldots, x_p$. Seek a linear transformation of the variables, $Y = a_1 x_1 + a_2 x_2 + \cdots + a_p x_p$, such that the separation between the group means on the transformed scale is as large as possible.

How to calculate the $a_i$?

The coefficients maximize the ratio of the between-group sum of squares to the within-group sum of squares, that is,

$$\frac{a^T B a}{a^T W a}$$

where $B$ and $W$ are the between-group and within-group sum-of-squares matrices. The vector $a$ is the eigenvector of $W^{-1}B$ corresponding to the largest eigenvalue.
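A short derivation may help explain why the maximizing vector is an eigenvector of $W^{-1}B$. It is a sketch that assumes $B$ and $W$ are symmetric and that $W$ is invertible; the symbols $J$, $L$ and $\lambda$ are introduced here for illustration only.

```latex
\begin{aligned}
J(a) &= \frac{a^T B a}{a^T W a}, \qquad J(ca) = J(a) \ \text{for any scalar } c \neq 0 \\
&\text{so fix } a^T W a = 1 \text{ and maximize } a^T B a: \\
L(a, \lambda) &= a^T B a - \lambda\,(a^T W a - 1) \\
\frac{\partial L}{\partial a} &= 2Ba - 2\lambda W a = 0
\;\Longrightarrow\; W^{-1} B a = \lambda a \\
J(a) &= \frac{a^T B a}{a^T W a} = \frac{\lambda\, a^T W a}{a^T W a} = \lambda
\end{aligned}
```

Every stationary point is therefore an eigenvector of $W^{-1}B$, and the ratio it attains equals its eigenvalue, so the maximum is reached at the eigenvector with the largest eigenvalue.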

For two groups


We find

$$a = C^{-1}(\bar{x}_1 - \bar{x}_2)$$

where $\bar{x}_1$ and $\bar{x}_2$ are the mean vectors of the groups and $C$ is the pooled covariance matrix of the groups. The mean vector of Group 1 is

$$\bar{x}_1 = (\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_p)^T$$

We find

$$B = \frac{n_1 n_2}{n}\, d\, d^T, \qquad d = \bar{x}_1 - \bar{x}_2$$

where $n_1$ and $n_2$ are the group sizes and $n = n_1 + n_2$.

$W^{-1}B$ has only one nonzero eigenvalue, which is

$$\mathrm{Tr}(W^{-1}B) = \frac{n_1 n_2}{n}\, d^T W^{-1} d$$
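The two-group result can be checked numerically. The sketch below, assuming NumPy is available, builds $W$ and $B$ for two small groups, takes the leading eigenvector of $W^{-1}B$, and compares it with the closed form $W^{-1}d$ (which is proportional to $C^{-1}(\bar{x}_1 - \bar{x}_2)$, since $C = W/(n-2)$). The function name `fisher_two_groups` and the data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def fisher_two_groups(X1, X2):
    """Fisher's discriminant direction for two groups.

    X1 and X2 are (n_i, p) arrays with one row per observation.
    Returns the leading eigenvector of W^{-1}B, its eigenvalue, W and d.
    """
    X1, X2 = np.asarray(X1, dtype=float), np.asarray(X2, dtype=float)
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    d = m1 - m2
    # Within-group sum of squares and cross-products matrix
    W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # Between-group matrix for two groups: (n1 n2 / n) d d^T
    B = (n1 * n2 / (n1 + n2)) * np.outer(d, d)
    # Eigen-decomposition of W^{-1}B; solve() avoids an explicit inverse
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
    k = int(np.argmax(eigvals.real))
    return eigvecs[:, k].real, float(eigvals[k].real), W, d

# Illustrative data only (not from the slides)
group1 = [[4, 2], [2, 4], [2, 3], [3, 6], [4, 4]]
group2 = [[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]]

a, lam, W, d = fisher_two_groups(group1, group2)
closed_form = np.linalg.solve(W, d)            # W^{-1} d
lam_closed = (5 * 5 / 10) * (d @ closed_form)  # (n1 n2 / n) d^T W^{-1} d
# |cosine| of the angle between the two directions; 1 means parallel
cosine = abs(a @ closed_form) / (np.linalg.norm(a) * np.linalg.norm(closed_form))
```

An eigenvector is only defined up to scale and sign, which is why the comparison uses the absolute cosine rather than element-wise equality.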

Cluster Analysis
Arranging objects into groups is a natural and necessary skill that we all share

Try to place these faces into groups?


You can measure variables


case  sex  glasses  moustache  smile  hat
1     m    y        n          y      n
2     f    n        n          y      n
3     m    y        n          n      n
4     m    n        n          n      n
5     m    n        n          y?     n
6     m    n        y          n      y
7     m    y        n          y      n
8     m    n        n          y      n
9     m    y        y          y      n
10    f    n        n          n      n
11    m    n        y          n      n
12    f    n        n          n      n

Data
- A set of n objects (observations) measured for p variables
- Variables can be binary, continuous, or a mixture of both

Rationale
- A set of tools for building groups (clusters) from multivariate data objects
- Groups should have homogeneous properties
- The clusters should be as homogeneous as possible and the differences among the various groups as large as possible

Homogeneity and Separation Principles
- Homogeneity: elements within a cluster are close to each other
- Separation: elements in different clusters are further apart from each other
- Clustering is not an easy task!

Given these points, a clustering algorithm might make two distinct clusters as follows.

Bad Clustering
This clustering violates both Homogeneity and Separation
- Close distances from points in separate clusters
- Far distances from points in the same cluster

Good Clustering
This clustering exhibits both good Homogeneity and Separation


Clustering Techniques
- Agglomerative: start with every element in its own cluster, and iteratively join clusters together
- Divisive: start with one cluster and iteratively divide it into smaller clusters
- Hierarchical: organize elements into a tree; leaves represent genes and the lengths of the branches represent the distances between genes. Similar genes lie within the same subtrees

Two fundamental steps


- Choice of a proximity measure: each pair of observations (objects) is checked for the similarity of their values. A similarity measure is defined to measure the closeness of the objects.
- Choice of a group-building algorithm: on the basis of the proximity measure, the objects are assigned to groups so that the differences between groups become large.

Proximity measure
The proximity is defined by a square matrix D containing measures of similarity (or dissimilarity) between each pair of objects (Similarity or distance matrix)


Similarity of binary variables


               Obj j
               1     0
Obj i    1     a1    a2
         0     a3    a4
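The four counts a1 to a4 can be turned into a similarity coefficient in several ways. The sketch below shows two common choices, the simple matching coefficient and the Jaccard coefficient; these two formulas are assumptions added for illustration, since the slide does not state which coefficient is used. The 0/1 encoding of the face data (m = 1, y = 1) is likewise an assumption.

```python
def binary_counts(u, v):
    """Return (a1, a2, a3, a4) for two equal-length 0/1 vectors.

    a1: both 1, a2: u is 1 and v is 0, a3: u is 0 and v is 1, a4: both 0.
    """
    a1 = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    a2 = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)
    a3 = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)
    a4 = sum(1 for x, y in zip(u, v) if x == 0 and y == 0)
    return a1, a2, a3, a4

def simple_matching(u, v):
    """Share of variables on which the two objects agree."""
    a1, a2, a3, a4 = binary_counts(u, v)
    return (a1 + a4) / (a1 + a2 + a3 + a4)

def jaccard(u, v):
    """Like simple matching, but joint absences (a4) are ignored."""
    a1, a2, a3, _ = binary_counts(u, v)
    return a1 / (a1 + a2 + a3) if (a1 + a2 + a3) else 0.0

# Cases 1, 2 and 7 from the face table: sex, glasses, moustache, smile, hat
case1 = [1, 1, 0, 1, 0]
case2 = [0, 0, 0, 1, 0]
case7 = [1, 1, 0, 1, 0]

sm_12 = simple_matching(case1, case2)
jac_12 = jaccard(case1, case2)
sm_17 = simple_matching(case1, case7)
```

The two coefficients differ in how they treat a4: counting shared absences as agreement makes sense when 0 and 1 are equally informative, and less so when 1 marks a rare trait.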

Similarity between continuous variables


- $L_r$-norm distances
- Commonly the $L_2$-norm (Euclidean distance) is used
- This assumes that the variables are measured on the same scale; if not, the variables should be standardized
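As an illustration of why standardization matters, the sketch below computes an $L_r$-norm distance, $(\sum_k |x_k - y_k|^r)^{1/r}$, before and after rescaling each variable to mean 0 and standard deviation 1. The function names and the sample values are hypothetical; using the sample standard deviation is one possible choice among several.

```python
from statistics import mean, stdev

def lr_distance(x, y, r=2):
    """L_r-norm distance between two equal-length vectors (r=2 is Euclidean)."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

def standardize(rows):
    """Rescale every column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*rows))
    centers = [mean(c) for c in columns]
    scales = [stdev(c) for c in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, centers, scales)]
        for row in rows
    ]

# Hypothetical objects measured on two very different scales
objects = [[1, 100], [2, 300], [3, 200]]
scaled = standardize(objects)

raw_d = lr_distance(objects[0], objects[1])    # dominated by the second variable
scaled_d = lr_distance(scaled[0], scaled[1])   # both variables contribute
```

On the raw values the second variable accounts for almost all of the distance; after standardization both variables are on a comparable footing.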

Contingency tables
The distance between rows is the chi-square distance.

Cluster algorithms: agglomerative

Rationale


Example: an 8 points problem


Distance matrix


And this gives


K-means clustering
- There is no need to calculate a distance matrix at first
- You decide on the number of clusters (K) you want to divide your objects into
- The computer randomly assigns each object to one of the K clusters
- The distance between each object and the center of each cluster is then calculated

K-means clustering
- If an object is closer to the center of another cluster than to the center of the one it is currently assigned to, it is reassigned to the closer cluster
- Recalculate the centroids
- Iterate this procedure until the clusters no longer change, at which point the algorithm stops

K-Means Clustering: Lloyd Algorithm


Lloyd Algorithm
  Arbitrarily assign the k cluster centers
  while the cluster centers keep changing
    Assign each data point to the cluster $C_i$ corresponding to the closest cluster representative (center) $x_i$ ($1 \le i \le k$)
    After the assignment of all n data points, compute new cluster representatives according to the center of gravity of each existing cluster, that is, the new cluster representative is $\frac{1}{|C|}\sum_{v \in C} v$ for every cluster $C$

*This may lead to merely a locally optimal clustering.
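A minimal sketch of the Lloyd iteration in plain Python. The function name `lloyd_kmeans`, the `max_iter` safeguard, and the sample points are assumptions for illustration. The initial centers are passed in by the caller rather than drawn at random, so the run is reproducible; a cluster that ends up empty simply keeps its previous center, which is one possible convention.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd_kmeans(points, centers, max_iter=100):
    """Alternate assignment and center-of-gravity updates until stable."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda j: squared_distance(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:   # centers stopped changing
            break
        centers = new_centers
    return centers, clusters

# Illustrative points and starting centers (not from the slides)
points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
final_centers, final_clusters = lloyd_kmeans(points, centers=[(1, 1), (5, 7)])
```

With these starting centers the point (3, 4) is first assigned to the lower-left cluster and then moves to the other one on the second pass. A different choice of starting centers can end in a different partition, which is the local-optimum caveat noted above.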

[Figure: four scatter plots of expression in condition 2 (vertical axis) against expression in condition 1 (horizontal axis, 0 to 5), each marking the three cluster centers k1, k2 and k3.]

The problem is
- You get what you asked for: the number of final clusters is the number you chose at the beginning
- One solution is to try different choices of the number of clusters
- Other techniques (e.g. PCA) can be used to get an idea of the number of major clusters

K-means vs hierarchical clustering

This method differs from hierarchical clustering in many ways. In particular,
- There is no hierarchy; the data are partitioned. You will be presented only with the final cluster membership for each case.
- There is no role for the dendrogram in k-means clustering.
- You must supply the number of clusters (k) into which the data are to be grouped.

Cluster analysis in microarray data

Inferring Gene Functionality


- Researchers want to know the functions of new genes
- Simply comparing the new gene sequences to known DNA sequences often does not reveal the actual function of the gene
- For 40% of sequenced genes, functionality cannot be ascertained by only comparing to sequences of other known genes
- Microarrays allow biologists to infer gene function even when there is not enough evidence to infer function based on sequence similarity alone

Microarray Analysis

- Microarrays measure the activity (expression level) of genes under varying conditions/time points
- Expression level is estimated by measuring the amount of mRNA for that particular gene
  - A gene is active if it is being transcribed
  - More mRNA usually indicates more gene activity

Microarray Experiments
- Analyze mRNA produced from cells in the tissue under the environmental conditions you are testing
- Produce cDNA from mRNA (DNA is more stable)
- Attach phosphor to cDNA to see when a particular gene is expressed
- Different color phosphors are available to compare many samples at once
- Hybridize cDNA over the microarray
- Scan the microarray with a phosphor-illuminating laser
- Illumination reveals transcribed genes
- Scan the microarray multiple times for the different color phosphors

Using Microarrays

- Track the sample over a period of time to see gene expression over time
- Track two different samples under the same conditions to see the difference in gene expression

Each box represents one gene's expression over time.

Using Microarrays (cont'd)

- Green: expressed only in the control
- Red: expressed only in the experimental cells
- Yellow: equally expressed in both samples
- Black: NOT expressed in either control or experimental cells

Microarray Data
- Microarray data are usually transformed into an intensity matrix (below)
- The intensity matrix allows biologists to find correlations between different genes (even if they are dissimilar) and to understand how gene functions might be related
- Clustering comes into play

Intensity (expression level) of gene at measured time:

Time:     Time X   Time Y   Time Z
Gene 1    10       8        10
Gene 2    10       0        9
Gene 3    4        8.6      3
Gene 4    7        8        3
Gene 5    1        2        3

Clustering of Microarray Data


- Plot each datum as a point in N-dimensional space
- Make a distance matrix for the distance between every two gene points in the N-dimensional space
- Genes with a small distance share the same expression characteristics and might be functionally related or similar
- Clustering reveals groups of functionally related genes

Hierarchical clustering
Step 1: Transform the genes * experiments matrix into a genes * genes distance matrix

          Exp 1   Exp 2   Exp 3   Exp 4
Gene A
Gene B
Gene C

          Gene A   Gene B   Gene C
Gene A    0        ?        ?
Gene B             0        ?
Gene C                      0

Step 2: Cluster genes based on the distance matrix and draw a dendrogram until a single node remains

Data and distance matrix


Genes / Patients     A      B      C      D      E
1                   90    190     90    200    150
2                  190    390    110    400    200

        A        B        C        D        E
A     0.0
B   223.6      0.0
C    80.0    297.3      0.0
D   237.1     14.1    310.2      0.0
E    60.8    194.2    108.2    206.2      0.0
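The distance matrix above can be reproduced from the two measurements per column. The sketch below assumes the distance is Euclidean, which the slide does not state explicitly but which matches every printed value to one decimal place; the variable and function names are illustrative.

```python
import math

# The two measurements for each of A to E, copied from the data table
expression = {
    "A": (90, 190),
    "B": (190, 390),
    "C": (90, 110),
    "D": (200, 400),
    "E": (150, 200),
}

def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

labels = sorted(expression)
# Full symmetric matrix stored as a dict keyed by label pairs, rounded to 1 decimal
dist = {
    (i, j): round(euclidean(expression[i], expression[j]), 1)
    for i in labels
    for j in labels
}

# Closest pair of distinct labels
closest = min((pair for pair in dist if pair[0] < pair[1]), key=dist.get)
```

Here `closest` comes out as the pair B and D, which would be the first two items merged by an agglomerative method.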

Hierarchical clustering (continued)

        G1    G2    G3    G4    G5
G1       0
G2       2     0
G3       6     5     0
G4      10     9     4     0
G5       9     8     5     3     0

         G(12)   G3    G4    G5
G(12)      0
G3         6      0
G4        10      4     0
G5         9      5     3     0

         G(12)   G3    G(45)
G(12)      0
G3         6      0
G(45)     10      5      0

Stage   Groups
P5      [1], [2], [3], [4], [5]
P4      [1 2], [3], [4], [5]
P3      [1 2], [3], [4 5]
P2      [1 2], [3 4 5]
P1      [1 2 3 4 5]
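The sequence of merges can be reproduced with a few lines of Python. The sketch below assumes complete linkage (the distance between two clusters is the largest pairwise distance between their members). The slide does not name the linkage rule, but the reduced matrices are consistent with this one, for example d(G(12), G3) = max(6, 5) = 6. The function name and data layout are illustrative.

```python
def complete_linkage(dist, n):
    """Agglomerate items 1..n; dist maps frozenset({i, j}) to a distance.

    Returns the list of partitions after each merge and the merge heights.
    """
    clusters = [(i,) for i in range(1, n + 1)]
    stages = [list(clusters)]
    heights = []

    def between(c1, c2):
        # Complete linkage: largest distance between any two members
        return max(dist[frozenset((i, j))] for i in c1 for j in c2)

    while len(clusters) > 1:
        # Find the closest pair of clusters under the linkage rule
        height, a, b = min(
            (between(a, b), a, b)
            for k, a in enumerate(clusters)
            for b in clusters[k + 1:]
        )
        merged = tuple(sorted(a + b))
        clusters = sorted([c for c in clusters if c not in (a, b)] + [merged])
        stages.append(list(clusters))
        heights.append(height)
    return stages, heights

# Pairwise distances between G1..G5, copied from the first matrix
pairwise = {
    (1, 2): 2, (1, 3): 6, (1, 4): 10, (1, 5): 9,
    (2, 3): 5, (2, 4): 9, (2, 5): 8,
    (3, 4): 4, (3, 5): 5,
    (4, 5): 3,
}
dist = {frozenset(pair): value for pair, value in pairwise.items()}
stages, heights = complete_linkage(dist, 5)
```

The entries of `stages` correspond to P5 down to P1, and `heights` gives the distance at which each merge happens, which is what the vertical axis of a dendrogram would show.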

Clustering of Microarray Data (cont'd)

Clusters


Hierarchical Clustering


Hierarchical Clustering: Example

