Ahmed Rebai DA-Cluster

Discriminant Analysis
Cluster Analysis
Classification and clustering

Classification
- known number of classes
- based on a training set
- used to classify future observations
- Classification is a form of supervised learning

Clustering
- unknown number of classes
- no prior knowledge
- used to understand (explore) data
- Clustering is a form of unsupervised learning
Ahmed Rebai
Bioinformatics and Comparative Genome Analysis March 2007
Linear Discriminant Analysis
Classification
In classification you do have a class label (o and x), each defined in terms of G1 and G2 values. You are trying to find a model that splits the data elements into their existing classes. You then assume that this model can be used to assign new data points x and y to the right class.
[Figure: supervised learning. Two scatter plots of labelled points on axes G1 and G2; the second plot adds the new points x? and y? to be classified.]
Linear Discriminant Analysis

Proposed by Fisher (1936) for classifying an observation into one of two possible groups based on many measurements $x_1, x_2, \ldots, x_p$. Seek a linear transformation of the variables, $Y = a_1 x_1 + a_2 x_2 + \cdots + a_p x_p$, such that the separation between the group means on the transformed scale is the best.
How to calculate the $a_i$?

The coefficients are those that maximize the ratio of the between-group sum of squares to the within-group sum of squares, that is
$a^T B a / a^T W a$
The vector $a$ is the eigenvector of $W^{-1}B$ corresponding to the largest eigenvalue.

For two groups

We find $a = C^{-1}(\bar{x}_1 - \bar{x}_2)$, where $\bar{x}_1$ and $\bar{x}_2$ are the mean vectors of the two groups and $C$ is the pooled covariance matrix of the groups.

We find
$B = \frac{n_1 n_2}{n} d d^T$, with $d = \bar{x}_1 - \bar{x}_2$ and $n = n_1 + n_2$
$W^{-1}B$ has only one nonzero eigenvalue, which is
$\mathrm{Tr}(W^{-1}B) = \frac{n_1 n_2}{n} d^T W^{-1} d$
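The two-group result can be checked numerically. The following is a minimal sketch, restricted to p = 2 variables so that the 2x2 inverse can be written by hand; the function names and the example observations are illustrative assumptions, not part of the slides. It takes the pooled covariance matrix to be C = W / (n1 + n2 - 2), the usual definition, which is also an assumption here.

```python
def mean_vector(rows):
    """Column means of a list of (x1, x2) observations."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter_matrix(rows, mean):
    """2x2 matrix of sums of squares and cross-products around `mean`."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        dev = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += dev[i] * dev[j]
    return s


def inverse_2x2(m):
    """Inverse of a non-singular 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def fisher_two_groups(group1, group2):
    """Return (a, eigenvalue) for two groups measured on two variables.

    a          = C^{-1} (mean1 - mean2), with C = W / (n1 + n2 - 2)
    eigenvalue = (n1 * n2 / n) * d^T W^{-1} d
    """
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean_vector(group1), mean_vector(group2)
    d = [m1[0] - m2[0], m1[1] - m2[1]]

    # Within-group matrix W: sum of the two group scatter matrices.
    s1, s2 = scatter_matrix(group1, m1), scatter_matrix(group2, m2)
    w = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]
    w_inv = inverse_2x2(w)

    w_inv_d = [w_inv[i][0] * d[0] + w_inv[i][1] * d[1] for i in range(2)]
    a = [(n1 + n2 - 2) * v for v in w_inv_d]  # C^{-1} d
    eigenvalue = (n1 * n2 / (n1 + n2)) * (d[0] * w_inv_d[0] + d[1] * w_inv_d[1])
    return a, eigenvalue


# Made-up observations for two groups.
group1 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
group2 = [(6.0, 5.0), (7.0, 8.0), (8.0, 6.0), (7.0, 7.0)]
a, eigenvalue = fisher_two_groups(group1, group2)
```

Because an eigenvector is only defined up to a scale factor, W^{-1} d and C^{-1} d point in the same direction; the sketch returns the latter to match the formula for a.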
Cluster Analysis
Arranging objects into groups is a natural and necessary skill that we all share
Try to place these faces into groups?
You can measure variables

case       1  2  3  4  5  6  7  8  9  10 11 12
sex        m  f  m  m  m  m  m  m  m  f  m  f
glasses    y  n  y  n  n  n  y  n  y  n  n  n
moustache  n  n  n  n  n  y  n  n  y  n  y  n
smile      y  y  n  n  y? n  y  y  y  n  n  n
hat        n  n  n  n  n  y  n  n  n  n  n  n
Data
A set of n objects (observations) measured for p variables. Variables can be binary, continuous, or a mixture of both.
Rationale
- A set of tools for building groups (clusters) from multivariate data objects.
- Groups should have homogeneous properties.
- The clusters should be as homogeneous as possible and the differences among the various groups as large as possible.
Homogeneity and Separation Principles
- Homogeneity: elements within a cluster are close to each other.
- Separation: elements in different clusters are further apart from each other.
Clustering is not an easy task!
Given these points a clustering algorithm might make two distinct clusters as follows
Bad Clustering
This clustering violates both Homogeneity and Separation
- Close distances between points in separate clusters
- Far distances between points in the same cluster
Good Clustering
This clustering exhibits both good Homogeneity and Separation
Clustering Techniques
- Agglomerative: start with every element in its own cluster, and iteratively join clusters together.
- Divisive: start with one cluster and iteratively divide it into smaller clusters.
- Hierarchical: organize elements into a tree; the leaves represent genes and the lengths of the branches represent the distances between genes. Similar genes lie within the same subtrees.
Two fundamental steps

- Choice of a proximity measure: each pair of observations (objects) is checked for the similarity of their values. A similarity measure is defined to measure the closeness of the objects.
- Choice of a group-building algorithm: on the basis of the proximity measure, the objects are assigned to groups so that the differences between groups become large.
Proximity measure
The proximity is defined by a square matrix D containing measures of similarity (or dissimilarity) between each pair of objects (Similarity or distance matrix)
Similarity of binary variables

             Ob_j = 1   Ob_j = 0
Ob_i = 1     a1         a2
Ob_i = 0     a3         a4
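As an illustration of how the four counts can be used, here is a minimal sketch that tallies a1 to a4 for two binary objects and derives two common coefficients from them. The simple matching and Jaccard coefficients are standard choices but are not named in the slides, so treat them as assumptions; the layout of the table (rows for object i, columns for object j) is also assumed. The example vectors code cases 1 and 9 of the faces table as 1 = y, 0 = n for glasses, moustache, smile, and hat.

```python
def binary_counts(obj_i, obj_j):
    """Return (a1, a2, a3, a4) for two equal-length 0/1 vectors.

    a1: both 1, a2: i=1 and j=0, a3: i=0 and j=1, a4: both 0.
    """
    a1 = a2 = a3 = a4 = 0
    for u, v in zip(obj_i, obj_j):
        if u == 1 and v == 1:
            a1 += 1
        elif u == 1 and v == 0:
            a2 += 1
        elif u == 0 and v == 1:
            a3 += 1
        else:
            a4 += 1
    return a1, a2, a3, a4


def simple_matching(obj_i, obj_j):
    """Share of variables on which the two objects agree (1-1 or 0-0)."""
    a1, a2, a3, a4 = binary_counts(obj_i, obj_j)
    return (a1 + a4) / (a1 + a2 + a3 + a4)


def jaccard(obj_i, obj_j):
    """Share of joint presences among variables present in at least one object."""
    a1, a2, a3, _ = binary_counts(obj_i, obj_j)
    denom = a1 + a2 + a3
    return a1 / denom if denom else 0.0


# glasses, moustache, smile, hat
case_1 = (1, 0, 1, 0)
case_9 = (1, 1, 1, 0)
print(binary_counts(case_1, case_9))    # (2, 0, 1, 1)
print(simple_matching(case_1, case_9))  # 0.75
```

The two coefficients differ in whether joint absences (a4) count as evidence of similarity, which matters when most variables are 0 for most objects.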
Similarity between continuous variables

Lr-norm distances
Commonly the L2-norm (Euclidean distance) is used.
This assumes that the variables are measured on the same scale; if not, the variables should be standardized.
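A minimal sketch of the Lr-norm distance, $d(x, y) = (\sum_k |x_k - y_k|^r)^{1/r}$, together with one common way to standardize variables (centre to mean 0, scale to unit sample standard deviation). The function names are illustrative, and the choice of the sample standard deviation is an assumption; the slides do not say which standardization is meant.

```python
import math


def lr_distance(x, y, r=2):
    """Lr-norm distance between two observations; r=2 is Euclidean, r=1 is city-block."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)


def standardize(rows):
    """Centre every variable (column) and divide by its sample standard deviation.

    Assumes at least two rows and no constant column (a zero standard
    deviation would cause a division by zero).
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(p)]
    return [[(r[j] - means[j]) / stds[j] for j in range(p)] for r in rows]


# Two variables on very different scales: the second dominates the raw distance.
raw = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = standardize(raw)
print(lr_distance(raw[0], raw[1]))        # dominated by the second variable
print(lr_distance(scaled[0], scaled[1]))  # both variables contribute equally
```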
Contingency tables
The distance between rows is the chi-square distance.
Cluster algorithms: agglomerative
Rationale
Example: an 8-point problem
Distance matrix
And this gives
K-means clustering
- There is no need to calculate a distance matrix first.
- You decide on the number of clusters, K, into which you want to divide your objects.
- The computer randomly assigns each object to one of the K clusters.
- The distance between each object and the center of each cluster is then calculated.
- If an object is closer to the center of another cluster than to the one it is currently assigned to, it is reassigned to the closer cluster.
- The centroids are recalculated.
- This procedure is iterated until the clusters no longer change, at which point the algorithm stops.
K-Means Clustering: Lloyd Algorithm

1. Lloyd Algorithm
2.   Arbitrarily assign the k cluster centers
3.   while the cluster centers keep changing
4.     Assign each data point to the cluster $C_i$ corresponding to the closest cluster representative (center) $x_i$ ($1 \le i \le k$)
5.     After the assignment of all n data points, compute new cluster representatives according to the center of gravity of each existing cluster; that is, the new cluster representative of a cluster $C$ is $\frac{1}{|C|} \sum_{v \in C} v$

*This may lead to merely a locally optimal clustering.
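The steps of the Lloyd algorithm can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the `initial_centers` and `seed` parameters, the iteration cap, and the rule that an empty cluster keeps its previous center are all choices made for this sketch.

```python
import math
import random


def closest_center(point, centers):
    """Index of the center nearest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def centroid(points):
    """Center of gravity (coordinate-wise mean) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))


def lloyd_kmeans(points, k, initial_centers=None, seed=0, max_iter=100):
    """Return (centers, labels) after running Lloyd's iterations.

    `points` is a list of equal-length tuples. If no initial centers are
    given, k distinct data points are drawn at random as the arbitrary start.
    """
    if initial_centers is None:
        centers = random.Random(seed).sample(points, k)
    else:
        centers = list(initial_centers)

    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        labels = [closest_center(p, centers) for p in points]
        # Update step: each center moves to the centroid of its members.
        new_centers = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            new_centers.append(centroid(members) if members else centers[i])
        if new_centers == centers:  # centers stopped changing
            break
        centers = new_centers

    labels = [closest_center(p, centers) for p in points]
    return centers, labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = lloyd_kmeans(points, k=2, initial_centers=[(0, 0), (10, 10)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Since the result can depend on the starting centers, a common precaution is to run the procedure from several random starts and keep the solution with the smallest total within-cluster distance.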
expression in condition 2
[Figure: four panels showing the cluster centers k1, k2, and k3 on axes running from 0 to 5.]
The problem is
- You get what you asked for: the number of final clusters is the number you chose at the beginning.
- One solution is to try different choices of the number of clusters.
- Other techniques (PCA) can be used to get an idea of the number of major clusters.
K-means vs hierarchical clustering
This method differs from hierarchical clustering in many ways. In particular:
- There is no hierarchy; the data are partitioned. You will be presented only with the final cluster membership for each case.
- There is no role for the dendrogram in k-means clustering.
- You must supply the number of clusters (k) into which the data are to be grouped.
Cluster analysis in microarray data
Inferring Gene Functionality

- Researchers want to know the functions of new genes.
- Simply comparing the new gene sequences to known DNA sequences often does not give away the actual function of a gene.
- For 40% of sequenced genes, function cannot be ascertained by comparison with the sequences of other known genes alone.
- Microarrays allow biologists to infer gene function even when there is not enough evidence to infer function based on similarity alone.
Microarray Analysis
- Microarrays measure the activity (expression level) of genes under varying conditions or time points.
- Expression level is estimated by measuring the amount of mRNA for that particular gene.
- A gene is active if it is being transcribed.
- More mRNA usually indicates more gene activity.
Microarray Experiments
- Analyze the mRNA produced by cells in the tissue under the environmental conditions you are testing.
- Produce cDNA from the mRNA (DNA is more stable).
- Attach a fluorescent dye to the cDNA to see when a particular gene is expressed.
- Dyes of different colors are available to compare many samples at once.
- Hybridize the cDNA over the microarray.
- Scan the microarray with a laser that excites the dye.
- Illumination reveals the transcribed genes.
- Scan the microarray multiple times, once for each dye color.
Using Microarrays
- Track the sample over a period of time to see gene expression over time.
- Track two different samples under the same conditions to see the difference in gene expression.
Each box represents one gene's expression over time.
Using Microarrays (contd)

- Green: expressed only in the control cells
- Red: expressed only in the experimental cells
- Yellow: equally expressed in both samples
- Black: not expressed in either the control or the experimental cells
Microarray Data
Microarray data are usually transformed into an intensity matrix (below). The intensity matrix allows biologists to find correlations between different genes (even if they are dissimilar) and to understand how gene functions might be related. Clustering comes into play.

Intensity (expression level) of each gene at the measured time:

Time     Time X  Time Y  Time Z
Gene 1   10      8       10
Gene 2   10      0       9
Gene 3   4       8.6     3
Gene 4   7       8       3
Gene 5   1       2       3
Clustering of Microarray Data

- Plot each datum as a point in N-dimensional space.
- Make a distance matrix for the distance between every two gene points in the N-dimensional space.
- Genes with a small distance share the same expression characteristics and might be functionally related or similar.
- Clustering reveals groups of functionally related genes.
Hierarchical clustering
Step 1: Transform the genes × experiments matrix into a genes × genes distance matrix

         Exp 1  Exp 2  Exp 3  Exp 4
Gene A
Gene B
Gene C

         Gene A  Gene B  Gene C
Gene A   0       ?       ?
Gene B           0       ?
Gene C                   0

Step 2: Cluster the genes based on the distance matrix and draw a dendrogram, merging until a single node remains
Data and distance matrix

Expression data (genes A to E, patients 1 and 2):

Patient  A    B    C    D    E
1        90   190  90   200  150
2        190  390  110  400  200

Distance matrix between genes:

     A      B      C      D      E
A    0.0
B    223.6  0.0
C    80.0   297.3  0.0
D    237.1  14.1   310.2  0.0
E    60.8   194.2  108.2  206.2  0.0
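The distance matrix can be reproduced from the expression data with a few lines of code. This is a minimal sketch that assumes the distances are Euclidean, which is consistent with the tabulated values (for example, the distance between A and B is sqrt(100² + 200²) ≈ 223.6); the variable and function names are illustrative.

```python
import math

# Expression of each gene in patients 1 and 2, taken from the data table.
expression = {
    "A": (90, 190),
    "B": (190, 390),
    "C": (90, 110),
    "D": (200, 400),
    "E": (150, 200),
}


def distance_matrix(profiles):
    """Euclidean distance between every ordered pair of named profiles."""
    return {(g, h): math.dist(profiles[g], profiles[h])
            for g in profiles for h in profiles}


dist = distance_matrix(expression)
print(round(dist[("A", "B")], 1))  # 223.6
print(round(dist[("B", "D")], 1))  # 14.1
```

Genes B and D are the closest pair, so an agglomerative procedure would merge them first.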
Hierarchical clustering (continued)
Initial distance matrix:

        G1  G2  G3  G4  G5
G1      0   2   6   10  9
G2          0   5   9   8
G3              0   4   5
G4                  0   3
G5                      0

After merging G1 and G2:

        G(12)  G3  G4  G5
G(12)   0      6   10  9
G3             0   4   5
G4                 0   3
G5                     0

After merging G4 and G5:

        G(12)  G3  G(45)
G(12)   0      6   10
G3             0   5
G(45)              0

Stage  Groups
P5     [1], [2], [3], [4], [5]
P4     [1 2], [3], [4], [5]
P3     [1 2], [3], [4 5]
P2     [1 2], [3 4 5]
P1     [1 2 3 4 5]
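The stages can be reproduced with a short agglomerative procedure. The slides do not name the linkage rule; the updated distances (for example, 6 = max(6, 5) between the merged G1-G2 cluster and G3) are consistent with complete linkage, so this minimal sketch assumes complete linkage. The function name and data layout are illustrative.

```python
def complete_linkage(labels, dist):
    """Agglomerative clustering with complete linkage.

    `dist` maps frozenset({a, b}) to the distance between objects a and b.
    Returns a list of (merge_height, clusters_after_merge), one per stage.
    """
    clusters = [(lab,) for lab in labels]
    history = []

    def cluster_distance(c1, c2):
        # Complete linkage: the largest distance between any two members.
        return max(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        height, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        history.append((height, sorted(clusters)))
    return history


# Upper triangle of the initial distance matrix for G1..G5.
upper = {
    (1, 2): 2, (1, 3): 6, (1, 4): 10, (1, 5): 9,
    (2, 3): 5, (2, 4): 9, (2, 5): 8,
    (3, 4): 4, (3, 5): 5,
    (4, 5): 3,
}
dist = {frozenset(pair): d for pair, d in upper.items()}

for height, groups in complete_linkage([1, 2, 3, 4, 5], dist):
    print(height, groups)
```

The merge heights (2, 3, 5, 10) are the values at which the corresponding nodes would be drawn in a dendrogram.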
Clustering of Microarray Data (contd)
Clusters
Hierarchical Clustering
Hierarchical Clustering: Example
