Cluster Analysis
Classification: a known number of classes, based on a training set, used to classify future observations. Classification is a form of supervised learning.
Clustering: an unknown number of classes, no prior knowledge, used to understand (explore) data. Clustering is a form of unsupervised learning.
Ahmed Rebai
Classification
In classification you have a class label for each data element (o and *), and each element is described by its G1 and G2 values. You look for a model that splits the data elements into their existing classes, and you then assume that this model can be used to assign new data points x and y to the right class.
[Figure "Supervised Learning": scatter plots on axes G1 and G2 showing data elements of classes * and o; the second plot adds the unlabelled points x? and y?.]
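The idea of learning a model from labelled points and then applying it to new points can be illustrated with a very simple rule: assign a new point to the class whose mean (centroid) is nearest. This is only a sketch. The slides do not say which classification model is meant, so the nearest-centroid rule, the function names and all coordinates below are assumptions made for illustration.

```python
import math

def centroid(points):
    """Mean of a list of (G1, G2) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def classify(point, training):
    """Return the label whose class centroid is closest to `point`."""
    centroids = {label: centroid(pts) for label, pts in training.items()}
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))

# Hypothetical training set: two labelled classes measured on G1 and G2.
training = {
    "o": [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5)],
    "*": [(5.0, 5.0), (5.5, 6.0), (6.0, 5.5)],
}

new_x = (5.2, 4.8)  # stands in for the unknown point "x?"
new_y = (1.8, 1.2)  # stands in for the unknown point "y?"

print("x is assigned to class", classify(new_x, training))
print("y is assigned to class", classify(new_y, training))
```

The training labels are what make this supervised: without them there would be no class centroids to compare against.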
We find
$$B = \frac{n_1 n_2}{n}\, d\, d^{T}, \qquad d = \bar{x}_1 - \bar{x}_2$$
Cluster Analysis
Arranging objects into groups is a natural and necessary skill that we all share
sex        m   f   m   m   m   m   m   m   m   f   m   f
glasses    y   n   y   n   n   n   y   n   y   n   n   n
moustache  n   n   n   n   n   y   n   n   y   n   y   n
smile      y   y   n   n   y?  n   y   y   y   n   n   n
hat        n   n   n   n   n   y   n   n   n   n   n   n
Data
A set of n objects (observations) measured for p variables.
Variables can be binary, continuous, or a mixture of both.
Rationale
A set of tools for building groups (clusters) from multivariate data objects.
Groups should have homogeneous properties: the clusters should be as homogeneous as possible, and the differences among the various groups as large as possible.
Homogeneity and Separation Principles
Homogeneity: elements within a cluster are close to each other.
Separation: elements in different clusters are farther apart from each other.
Clustering is not an easy task!
Given these points, a clustering algorithm might make two distinct clusters as follows.
Ahmed Rebai Bioinformatics and Comparative Genome Analysis March 2007
Bad Clustering
This clustering violates both Homogeneity and Separation
Points in separate clusters are close to each other, and points in the same cluster are far apart.
Good Clustering
This clustering exhibits both good Homogeneity and Separation
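The two principles can be turned into numbers for a quick check. The sketch below scores homogeneity as the largest distance between two points in the same cluster and separation as the smallest distance between two points in different clusters. These particular definitions, the function names and the example coordinates are assumptions chosen for illustration; other measures would serve equally well.

```python
import math
from itertools import combinations

def homogeneity(clusters):
    """Largest distance between two points of the same cluster (smaller is better)."""
    return max(
        math.dist(a, b)
        for pts in clusters
        for a, b in combinations(pts, 2)
    )

def separation(clusters):
    """Smallest distance between points of different clusters (larger is better)."""
    return min(
        math.dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    )

# Hypothetical points forming two natural groups.
left = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
right = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]

good = [left, right]
# A deliberately bad split that mixes the two natural groups.
bad = [
    [(0.0, 0.0), (5.0, 5.0), (1.0, 0.0)],
    [(0.0, 1.0), (5.0, 6.0), (6.0, 5.0)],
]

for name, clustering in (("good", good), ("bad", bad)):
    print(name, round(homogeneity(clustering), 2), round(separation(clustering), 2))
```

For the good clustering the within-cluster distances are smaller than the between-cluster distances; for the bad clustering the relationship is reversed.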
Clustering Techniques
Agglomerative: start with every element in its own cluster, and iteratively join clusters together.
Divisive: start with one cluster and iteratively divide it into smaller clusters.
Hierarchical: organize elements into a tree. Leaves represent genes, and the lengths of the branches represent the distances between genes. Similar genes lie within the same subtrees.
Proximity measure
The proximity is defined by a square matrix D containing a measure of similarity (or dissimilarity) between each pair of objects (a similarity or distance matrix).
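For binary variables, one common dissimilarity is the fraction of variables on which two objects disagree (the simple matching distance). The sketch below builds the square matrix D for four made-up objects. The choice of this particular measure, the object names and the 0/1 values are assumptions for illustration only.

```python
# Binary descriptions of four hypothetical objects (1 = yes, 0 = no)
# for the variables: glasses, moustache, smile, hat.
objects = {
    "A": [1, 0, 1, 0],
    "B": [0, 0, 1, 0],
    "C": [1, 0, 0, 0],
    "D": [0, 1, 0, 1],
}

def simple_matching_distance(u, v):
    """Fraction of variables on which two binary objects disagree."""
    return sum(a != b for a, b in zip(u, v)) / len(u)

names = list(objects)
# D[i][j] is the dissimilarity between object i and object j.
D = [
    [simple_matching_distance(objects[r], objects[c]) for c in names]
    for r in names
]

print("   " + "  ".join(f"{n:>4}" for n in names))
for name, row in zip(names, D):
    print(f"{name:>2} " + "  ".join(f"{d:4.2f}" for d in row))
```

The matrix is symmetric with zeros on the diagonal, so only one triangle needs to be stored in practice.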
This assumes that the variables are measured on the same scale; if not, the variables should be standardized.
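A small sketch of why standardization matters, assuming Euclidean distance and z-score standardization (subtract the column mean, divide by the column standard deviation). The data, the units and the function name are hypothetical, and the population standard deviation is an arbitrary choice here.

```python
import math
import statistics

def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1 (z-scores).

    Assumes no column is constant, otherwise its standard deviation is 0.
    """
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.pstdev(c) for c in columns]
    return [
        [(x - m) / s for x, m, s in zip(row, means, sds)]
        for row in rows
    ]

# Hypothetical objects: height in metres, weight in grams.
raw = [
    [1.60, 55000.0],
    [1.90, 56000.0],
    [1.62, 90000.0],
]
z = standardize(raw)

print("raw distances:         ",
      round(math.dist(raw[0], raw[1]), 2), round(math.dist(raw[0], raw[2]), 2))
print("standardized distances:",
      round(math.dist(z[0], z[1]), 2), round(math.dist(z[0], z[2]), 2))
```

Before standardization the weight column, with its much larger numeric range, dominates the distance; afterwards both variables contribute on a comparable footing.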
Contingency tables
The distance between rows is the chi-square distance.
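As a sketch, the chi-square distance between two rows compares their row profiles (each row divided by its own total) and weights each column by the inverse of its overall share of the table. The counts below are hypothetical and the function name is an assumption; the code also assumes that no row or column of the table is empty.

```python
import math

def chi_square_distance(table, i, k):
    """Chi-square distance between rows i and k of a contingency table."""
    total = sum(sum(row) for row in table)
    # Column masses: share of the grand total falling in each column.
    col_mass = [sum(col) / total for col in zip(*table)]
    # Row profiles: each row divided by its own total.
    prof_i = [x / sum(table[i]) for x in table[i]]
    prof_k = [x / sum(table[k]) for x in table[k]]
    return math.sqrt(
        sum((a - b) ** 2 / c for a, b, c in zip(prof_i, prof_k, col_mass))
    )

# Hypothetical counts: rows are groups, columns are categories.
table = [
    [20, 30, 50],
    [40, 60, 100],  # same profile as row 0, with twice the counts
    [70, 20, 10],
]

print(chi_square_distance(table, 0, 1))
print(round(chi_square_distance(table, 0, 2), 3))
```

Rows 0 and 1 have identical profiles, so their distance is zero even though their raw counts differ; this is the property that makes the measure suitable for contingency tables.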
Cluster algorithms: agglomerative
Rationale
Distance matrix
K-means clustering
There is no need to calculate a distance matrix at first.
You decide on the number of clusters (K) you want to divide your objects into.
The computer randomly assigns each object to one of the K clusters.
The distance between each object and the center of each cluster is then calculated.
K-means clustering
If an object is closer to the center of another cluster than to the center of the one it is currently assigned to, it is reassigned to the closer cluster.
The centroids are then recalculated.
This procedure is iterated until the clusters no longer change, at which point the algorithm stops.
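The procedure described above can be sketched in a few lines. This is a minimal sketch rather than a reference implementation: the function names, the fixed seed, the balanced random starting assignment and the handling of an empty cluster are assumptions added for illustration, and the points are made up.

```python
import math
import random

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    """Partition `points` into k clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Random initial assignment, arranged so every cluster starts non-empty.
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    for _ in range(max_iter):
        # Recalculate the centroid of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random point.
            centroids.append(mean_point(members) if members else rng.choice(points))
        # Reassign every object to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no object changed cluster: stop
            break
        labels = new_labels
    return labels, centroids

# Hypothetical expression values (condition 1, condition 2) for six genes.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Which cluster receives label 0 and which receives label 1 depends on the random start, so only the grouping itself is meaningful, not the label values.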
[Figure sequence: four k-means scatter plots of expression in condition 1 (x-axis) against expression in condition 2 (y-axis), each showing the three cluster centers k1, k2 and k3.]
The problem is
You get what you asked for: the number of final clusters is the number you chose at the beginning.
One solution is to try different choices for the number of clusters.
Other techniques (such as PCA) can be used to get an idea of the number of major clusters.
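Trying several values of k can be sketched as follows: run k-means for each k and record the within-cluster sum of squared distances, then look for the point where adding a cluster stops helping much. Everything here is an illustrative assumption: the data are made up, the function names are hypothetical, and the centers are initialized with a deterministic farthest-first rule (not the random assignment described earlier) so that the example is repeatable.

```python
import math

def farthest_first_centers(points, k):
    """Pick k starting centers: the first point, then repeatedly the point
    farthest from all centers chosen so far (an assumed initialization)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers

def k_means_wcss(points, k, max_iter=100):
    """Run k-means and return the within-cluster sum of squared distances."""
    centers = farthest_first_centers(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical data with three visible groups.
points = [
    (0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
    (10.0, 0.0), (11.0, 0.0), (10.0, 1.0),
    (5.0, 9.0), (6.0, 9.0), (5.0, 10.0),
]

wcss = {k: k_means_wcss(points, k) for k in range(1, 5)}
for k, value in wcss.items():
    print(k, round(value, 2))
```

On these points the sum of squares falls sharply up to k = 3 and only slightly afterwards, which is the kind of pattern one looks for when choosing the number of clusters.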
This method differs from hierarchical clustering in many ways. In particular:
- There is no hierarchy; the data are partitioned. You are presented only with the final cluster membership for each case.
- There is no role for the dendrogram in k-means clustering.
- You must supply the number of clusters (k) into which the data are to be grouped.
Microarrays allow biologists to infer gene function even when there is not enough evidence to infer function based on sequence similarity alone.
Microarray Analysis
Microarrays measure the activity (expression level) of genes under varying conditions or time points.
Expression level is estimated by measuring the amount of mRNA for a particular gene.
A gene is active if it is being transcribed.
More mRNA usually indicates more gene activity.
Microarray Experiments
Analyze the mRNA produced by cells in the tissue under the environmental conditions you are testing.
Produce cDNA from the mRNA (DNA is more stable).
Attach a phosphor (fluorescent label) to the cDNA to see when a particular gene is expressed.
Phosphors of different colors are available, so many samples can be compared at once.
Hybridize the cDNA over the microarray.
Scan the microarray with a phosphor-illuminating laser.
Illumination reveals transcribed genes.
Scan the microarray multiple times, once for each phosphor color.
Using Microarrays
Track the sample over a period of time to see gene expression over time.
Track two different samples under the same conditions to see the difference in gene expression.
Each box represents one gene's expression over time.
Microarray Data
Microarray data are usually transformed into an intensity matrix (below).
The intensity matrix allows biologists to find correlations between different genes (even if they are dissimilar) and to understand how gene functions might be related.
This is where clustering comes into play.

Intensity (expression level) of each gene at the measured time:

Time:     Time X   Time Y   Time Z
Gene 1    10       8        10
Gene 2    10       0        9
Gene 3    4        8.6      3
Gene 4    7        8        3
Gene 5    1        2        3
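One way to look for related genes in an intensity matrix is to correlate their expression profiles across time points. The sketch below uses the Pearson correlation, which is an assumed choice, and the function name is hypothetical. The row values assume that the flattened numbers on the slide read column by column (Time X, Time Y, Time Z) for Genes 2 to 5; that reading is an inference, not something the slide states.

```python
import math

# Intensity matrix: one row per gene, columns are Time X, Time Y, Time Z.
# Rows for Gene 2 to Gene 5 are an assumed reading of the flattened slide table.
intensity = {
    "Gene 1": [10, 8, 10],
    "Gene 2": [10, 0, 9],
    "Gene 3": [4, 8.6, 3],
    "Gene 4": [7, 8, 3],
    "Gene 5": [1, 2, 3],
}

def pearson(u, v):
    """Pearson correlation of two equally long profiles (neither constant)."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

genes = list(intensity)
for i, g in enumerate(genes):
    for h in genes[i + 1:]:
        print(g, h, round(pearson(intensity[g], intensity[h]), 2))
```

With only three time points these correlations are very noisy; the example is meant to show the mechanics, not to support any biological conclusion.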
Hierarchical clustering
Step 1: Transform the genes × experiments matrix into a genes × genes distance matrix

          Exp 1   Exp 2   Exp 3   Exp 4
Gene A
Gene B
Gene C

          Gene A   Gene B   Gene C
Gene A    0        ?        ?
Gene B             0        ?
Gene C                      0
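Step 1 can be sketched as follows, assuming Euclidean distance between expression profiles (the slide leaves the distance measure open). The expression values for Gene A, Gene B and Gene C are hypothetical and stand in for the empty cells of the table.

```python
import math

# Hypothetical genes x experiments matrix (Exp 1 to Exp 4).
expression = {
    "Gene A": [2.0, 4.0, 6.0, 8.0],
    "Gene B": [2.0, 4.0, 6.0, 5.0],
    "Gene C": [6.0, 7.0, 6.0, 8.0],
}

genes = list(expression)
# genes x genes distance matrix; D[i][j] is the distance between gene i and gene j.
D = [
    [math.dist(expression[g], expression[h]) for h in genes]
    for g in genes
]

# Print only the upper triangle, as in the table above.
print(" " * 8 + "".join(f"{g:>8}" for g in genes))
for i, g in enumerate(genes):
    cells = ["" if j < i else f"{D[i][j]:.2f}" for j in range(len(genes))]
    print(f"{g:<8}" + "".join(f"{c:>8}" for c in cells))
```

The question marks in the table correspond to the three off-diagonal entries; the diagonal is zero because every gene is at distance zero from itself.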
Step 2: Cluster the genes based on the distance matrix, merging until a single node remains, and draw the dendrogram
A B C D E
A 0.0
B 223.6 0.0
         G1   G2   G3   G4   G5
G1       0    2    6    10   9
G2            0    5    9    8
G3                 0    4    5
G4                      0    3
G5                           0

         G(12)  G3   G4   G5
G(12)    0      6    10   9
G3              0    4    5
G4                   0    3
G5                        0

         G(12)  G3   G(45)
G(12)    0      6    10
G3              0    5
G(45)                0
Stage   Clusters (groups)
P5      [1], [2], [3], [4], [5]
P4      [1 2], [3], [4], [5]
P3      [1 2], [3], [4 5]
P2      [1 2], [3 4 5]
P1      [1 2 3 4 5]
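The stages listed above can be reproduced with a short agglomerative routine run on the G1 to G5 distances from the example. The sketch assumes complete linkage (the distance between two clusters is the largest distance between their members). That choice is an inference: the merged distances shown for G(12), namely 6, 10 and 9, equal the larger of the two original distances, but the slides do not name the linkage rule. Function names are hypothetical.

```python
from itertools import combinations

# Pairwise distances between G1..G5, taken from the example matrix.
dist = {
    (1, 2): 2, (1, 3): 6, (1, 4): 10, (1, 5): 9,
    (2, 3): 5, (2, 4): 9, (2, 5): 8,
    (3, 4): 4, (3, 5): 5,
    (4, 5): 3,
}

def d(a, b):
    """Look up the distance between two objects regardless of order."""
    return dist[(min(a, b), max(a, b))]

def complete_linkage(c1, c2):
    """Cluster distance = largest distance between any two members."""
    return max(d(a, b) for a in c1 for b in c2)

def agglomerate(items):
    """Merge the two closest clusters repeatedly until one remains."""
    clusters = [(i,) for i in items]
    stages = [list(clusters)]
    heights = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: complete_linkage(*pair),
        )
        heights.append(complete_linkage(c1, c2))
        merged = tuple(sorted(c1 + c2))
        clusters = sorted([c for c in clusters if c not in (c1, c2)] + [merged])
        stages.append(list(clusters))
    return stages, heights

stages, heights = agglomerate([1, 2, 3, 4, 5])
for label, stage in zip(["P5", "P4", "P3", "P2", "P1"], stages):
    print(label, stage)
print("merge heights:", heights)
```

Under this assumption the merges happen at distances 2, 3, 5 and 10, and the successive partitions match stages P5 to P1. The merge heights are the values a dendrogram would show on its vertical axis.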
Hierarchical Clustering