Clustering methods for the analysis of DNA microarray data

Robert Tibshirani, Trevor Hastie, Mike Eisen, Doug Ross, David Botstein and Pat Brown

Department of Health Research and Policy, Department of Statistics, Department of Genetics and Department of Biochemistry, Stanford University

October 15, 1999
Abstract
It is now possible to simultaneously measure the expression of thousands of genes during cellular differentiation and response, through the use of DNA microarrays. A major statistical task is to understand the structure in the data that arise from this technology. In this paper we review various methods of clustering, and illustrate how they can be used to arrange both the genes and cell lines from a set of DNA microarray experiments. The methods discussed are global clustering techniques, including hierarchical, K-means, and block clustering, and tree-structured vector quantization. Finally, we propose a new method for identifying structure in subsets of both genes and cell lines that are potentially obscured by the global clustering approaches.

1 Introduction
DNA microarrays and other high-throughput methods for analyzing complex nucleic acid samples make it now possible to measure rapidly, efficiently and accurately the levels of virtually all genes expressed in a biological sample. The application of such methods in diverse experimental settings generates results rich in information. However, the process of transforming this information into meaningful biological insights is impeded by the complexity and vastness of the data. One way to overcome this obstacle is exemplified in recent analyses of genome-scale expression time series (Eisen, Spellman, Brown & Botstein (1998), Tamayo, Slonim, Mesirov, Zhu, Kitareewan & Dmitrovsky (1999), Iyer, Eisen, Ross, Schuler, Moore, Lee, Trent, Hudson, Boguski, Lashkari, Botstein & Brown (1999), Chu, Eisen, Mulholland, Botstein, Brown & Herskowitz (1998), Spellman, Sherlock, Iyer, Zhang, Anders, Eisen, Brown & Botstein (1998), Roth, Estep & Church (1998)), where statistical clustering methods were used to organize the data by identifying groups of genes with similar behavior across time. Such organizational frameworks greatly facilitate the process of exploring these complex sets of biological data (Botstein & Brown (1999)). In this paper we discuss the logical extension of these methods to expression data from collections of discrete samples, where it is useful to uncover relationships among samples as well as genes, and illustrate the properties of various methods using gene expression data from sixty human tumor cell lines [Ross, 1999, to be added]. We first describe the application of one-dimensional clustering methods to both the gene and sample dimensions. We then describe a new implementation of two-way clustering. Finally, we propose methods for identifying structure in subsets of both axes that are potentially obscured by global clustering approaches.

2 Clustering techniques
The data from a microarray experiment form a matrix, where the rows are different genes and the columns are different cell lines. In some experiments the samples are different cell lines from different people, and we assume that here. In other experiments the samples are a time series of measurements during different phases of cell development.

Recently some authors have explored the use of clustering methods to arrange the genes in some natural order, with similar genes placed close together. Good general references on clustering are Everitt (1980), Kaufman & Rousseeuw (1990) and Gordon (1999).

There are two major approaches to clustering: bottom-up and top-down. Hierarchical clustering (e.g. Sokal & Mitchener (1958)) is a bottom-up method that starts with each observation (gene) in its own cluster. It works by agglomerating the closest pair of clusters at each stage, successively combining clusters until all of the data are in one cluster. The clustering sequence is represented by a hierarchical tree, the "dendrogram", which can be cut at any level to yield a specified number of clusters. Eisen et al. (1998) apply this kind of clustering to DNA microarray data.

Top-down clustering starts with a specified number of clusters and initial positions for the cluster centers. The K-means (or Lloyd's) algorithm (Lloyd (1957), MacQueen (1967)) is used to reposition the cluster centers through the following steps: (a) observations are assigned to the closest cluster center to form a partition of the data; (b) the observations in each cluster are averaged to produce new values for the center vector of that cluster. Steps (a) and (b) are iterated, and the process converges to a local minimum of the total within-cluster variance. Typically the K-means procedure is repeated with a number of initial values for the cluster centers, and the best solution (in terms of total within-cluster variance) is chosen.

Tree-structured vector quantization (TSVQ) carries out K-means clustering in a top-down, binary manner (Gersho & Gray (1992), Perlmutter, Cosman, Olshen, Gray, Li & Bergin (1998)). It is commonly used in image and signal compression.

Principal components analysis (e.g. Mardia, Kent & Bibby (1979)), when applied to the genes, finds the linear combinations of genes having the highest variance. Similarly, when applied to cell lines, it finds the highest-variance linear combinations of the cell lines. The correlation of each gene with the leading principal component provides a way of sorting (clustering) the genes, and similarly for the cell lines.

The self-organizing map (SOM) (Kohonen (1989)) is similar to K-means clustering, with the additional constraint that the cluster centers are restricted to lie in a one- or two-dimensional manifold. An online procedure is used to readjust the positions of the centers. There is a similarity between SOMs, multi-dimensional scaling and nonlinear principal components. See Ripley (1995) and Cherkassky & Mulier (1998) for more details. This method was used successfully for DNA microarray data by Tamayo et al. (1999).
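The K-means iteration with multiple random restarts just described can be sketched in a few lines of Python/numpy; the function name, the toy data and the number of restarts below are our own choices for illustration, not part of the original analysis:

```python
import numpy as np

def kmeans(X, k, n_restarts=10, n_iter=100, seed=0):
    """Lloyd's algorithm with random restarts.
    X: (n_observations, n_features) array, e.g. genes in the rows.
    Returns (labels, centers) for the restart with the smallest
    total within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        # initialize: pick k distinct observations as starting centers
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # step (a): assign each observation to its closest center
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # step (b): recompute each center as the mean of its cluster
            new_centers = np.array([X[labels == j].mean(axis=0)
                                    if np.any(labels == j) else centers[j]
                                    for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        wss = ((X - centers[labels]) ** 2).sum()   # total within-cluster sum of squares
        if best is None or wss < best[0]:
            best = (wss, labels, centers)
    return best[1], best[2]

# toy usage on a random expression-like matrix (500 genes x 20 samples)
X = np.random.default_rng(1).normal(size=(500, 20))
labels, centers = kmeans(X, k=8)
```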

We have found that K-means clustering produces tighter clusters than hierarchical clustering, but the latter tends to produce a greater number of smaller clusters, which can be a valuable feature for discovery. Unlike K-means, hierarchical clustering also produces an ordering of the objects (see below), which can be informative for data display. SOMs allow interpretation of the clusters, but should be checked against K-means clustering to see whether the low-dimensional representation for the cluster centers is a reasonable assumption for the data.

All of these methods are one-way clustering techniques. In this paper we investigate the use of two-way clustering, to simultaneously cluster both genes and cell lines. One simple approach to this problem is to apply a one-way clustering method to the genes and cell lines separately, and we do this below. Block clustering, in contrast, uses both gene and cell line information to simultaneously cluster both. The two-way clustering procedures seek a global organization of genes and cell lines. We find that they are able to discover gross global structure but may not be effective for discovering finer detail. In response to this finding, we propose a new method called "gene shaving", which searches for sets of genes that optimally separate the cell lines.

3 Materials and methods
Data and preprocessing. Our data take the form of an m × n matrix of real-valued expression levels Y = (y_ij), where genes are the rows and samples are the columns.
Two-way clustering. We investigate five different methods for two-way clustering. The first three methods cluster and reorder the rows and columns of the data matrix separately from one another.

a) Two-way hierarchical clustering. We use average-linkage, Euclidean-distance-based hierarchical clustering on the rows and columns separately (see e.g. Hartigan (1973)). This also produces a (non-unique) ordering of the objects, one that ensures that the branches of the corresponding dendrogram do not cross. We reorder the rows and columns according to these orderings, and display the resulting data matrix.
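Method (a) can be sketched directly from this description; a minimal sketch, assuming the scipy hierarchical clustering utilities, with the helper name and synthetic input being ours for illustration only:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

def two_way_hierarchical_order(Y):
    """Order the rows and columns of Y by separate average-linkage,
    Euclidean-distance hierarchical clusterings, and return the
    reordered matrix together with the two leaf orderings."""
    row_order = leaves_list(linkage(Y, method="average", metric="euclidean"))
    col_order = leaves_list(linkage(Y.T, method="average", metric="euclidean"))
    return Y[np.ix_(row_order, col_order)], row_order, col_order

# toy usage: 200 "genes" by 30 "cell lines"
Y = np.random.default_rng(0).normal(size=(200, 30))
Y_reordered, rows, cols = two_way_hierarchical_order(Y)
```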

b) Two-way K-means clustering. As in (a), we cluster the rows and columns separately. We use 200 clusters for the genes and 20 for the cell lines, and then display the rows and columns ordered within each cluster by multi-dimensional scaling, and between clusters by multi-dimensional scaling of the cluster centers.

c) Two-way tree-structured vector quantization (TSVQ). This procedure is K-means clustering, performed in a top-down, binary tree fashion (Gersho & Gray (1992), Perlmutter et al. (1998)). Two-means clustering is performed at each tree node, and the best node is successively split until the specified number of clusters is obtained. An advantage over simple K-means is that an ordering of the objects can be obtained from the leaves of the tree.

d) Principal components / singular value decomposition. Here we compute a singular value decomposition of the data matrix. The leading left and right singular vectors are the first principal components of the genes and cell lines respectively. We then sort the genes from smallest to largest inner product with the first principal component of the genes, and similarly for the cell lines.

e) Block clustering. This is a top-down, row and column clustering of a data matrix. It reorders the rows and columns to produce a matrix with homogeneous blocks of the outcome (here gene expression). Block clustering also produces hierarchical clustering trees for the rows and columns. The basic algorithm for forward block splitting is due to Hartigan (1972); we have added a backward pruning procedure and devised a permutation-based method for deciding on the optimal number of blocks. Hartigan (1972) reviews earlier work on two-way clustering, citing Good (1965) and Tryon & Bailey (1970). Hartigan called his approach "direct clustering", but it has become known as block clustering (e.g. Duffy & Quiroz (1991)).

Here is an outline of the block clustering procedure:

1. Begin with the entire data in one block.

2. At each stage, find the row or column split of all existing blocks into two pieces, choosing the one that produces the largest reduction in the total within-block variance.

3. Allowable splits: if there are existing row splits that intersect the block, one of these must be used for the rows; this is called a "fixed split". The same is done for columns. Otherwise, all split points are tried.

4. The splitting is continued until a large number of blocks are obtained, and then some blocks are recombined until the optimal number of blocks is reached (see the discussion of this point below).

To find the best split into two groups, one can show that it is sufficient to sort the rows (or columns) by row (respectively column) mean, and then seek a split in that order. A drawback of block clustering when applied to median-centered data (the case here) is that at the start all row and column means are approximately zero, so the procedure has difficulty getting started.

Restricting the splits to fixed splits ensures that a) the overall partition can be displayed as a contiguous representation, with a common re-ordering for the rows and columns, and b) the partitions of each of the rows and columns can be described by hierarchical trees. Figure 1 shows a simple example for illustration. There are 5 genes and 3 cell lines, labelled 1-5 and 1-3 respectively. The first (vertical) split separates cell line 3 from 1 and 2. The second (horizontal) split separates genes 2 and 3 from 1, 4 and 5. Now consider splitting the rightmost box. The split that separates genes 1 and 2 from 3, 4 and 5 in the right box would not allow a single contiguous representation of the entire data matrix, and hence is not permitted. The split that separates gene 2 from 1, 3, 4 and 5 violates property (b) above and is also not permitted. The only permissible horizontal split of the rightmost box is the one that separates genes 2 and 3 from 1, 4 and 5, continuing the horizontal line segment in the left box all the way to the right. The contiguity property (a) is most important to preserve. It is, however, possible to relax (b), allowing splits such as 2 vs 3, 5, 4, 1 in the right box.
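The key computation in the forward splitting step is finding the best split of a single block. A minimal sketch in Python/numpy, with a function name of our choosing and a simplified criterion (each candidate sub-block is scored against its own mean, ignoring the fixed-split constraint and existing column splits):

```python
import numpy as np

def best_row_split(block):
    """Find the best split of a block's rows into two groups.
    Rows are sorted by row mean, and every split point in that order
    is scored by the reduction in within-block sum of squares.
    The best column split is found analogously on block.T."""
    order = np.argsort(block.mean(axis=1))
    B = block[order]
    total_ss = ((B - B.mean()) ** 2).sum()
    best_gain, best_split = -np.inf, None
    for i in range(1, len(B)):                 # split after the i-th sorted row
        top, bottom = B[:i], B[i:]
        within = ((top - top.mean()) ** 2).sum() + ((bottom - bottom.mean()) ** 2).sum()
        gain = total_ss - within
        if gain > best_gain:
            best_gain, best_split = gain, order[:i]
    return best_gain, best_split

# toy usage on a small block
block = np.random.default_rng(0).normal(size=(12, 6))
gain, top_rows = best_row_split(block)
```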
Stopping rule for splitting blocks. For all clustering techniques, estimation of the appropriate number of clusters is an important but difficult problem. Clustering algorithms will find clusters even when applied to independent (unclustered) data, so it is important to calibrate them. Milligan & Cooper (1985) compare many of the suggested approaches to this problem for one-way clustering. For block clustering, Duffy & Quiroz (1991) suggest the use of permutation tests to determine when a given block split is not significant.

Figure 1: Simple example to illustrate the block clustering rules. The first (vertical) split separates cell line 3 from 1 and 2. The second (horizontal) split separates genes 2 and 3 from 1, 4 and 5. If the rightmost box is split horizontally, it must be split between genes 2,3 and 1,4,5.

However, this can lead to early stopping of the splitting process, which can miss good block splits later. Instead, our strategy is to split into some large number of blocks M, and then apply weakest-link pruning (recombining) of the blocks to produce a series of partitions having different numbers of blocks (between 1 and M). Then we apply the algorithm to permuted versions of the data to estimate the best number of blocks k ≤ M. Here is a summary of what we call the "maximum gap" approach:

1. Let rss_k be the total within-block sum of squares when k blocks are used.

2. Create a new dataset by separately permuting the elements within each row of the data matrix, thereby forming a new data matrix. Apply block clustering to the permuted data, and let rss^0_k be the resulting within-block sum of squares. Do this for a number of permuted datasets (say 10) and compute the average ave(rss^0_k).

3. Compute the gap function

   gap(k) = ave(rss^0_k) - rss_k    (1)

and finally choose the value of k that maximizes gap(k).

The idea is that the optimal number of blocks is the value for which the drop in residual sum of squares, relative to what we expect from permuted data, is largest. The same idea can be used to estimate the optimal number of clusters in hierarchical or K-means clustering: in that case, if rows were being clustered, we would permute the elements within each row of the data matrix (and similarly for clustering columns).
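A minimal sketch of this "maximum gap" calculation in Python/numpy, written generically so that any clustering routine returning a within-sum-of-squares value can be plugged in; the function names are ours, and K-means on the rows is used only as a stand-in for a full block clustering implementation:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def kmeans_rss(Y, k):
    """Within-cluster sum of squares from a k-means fit on the rows of Y
    (a stand-in for the within-block sum of squares of block clustering)."""
    centers, labels = kmeans2(Y, k, minit="points")
    return ((Y - centers[labels]) ** 2).sum()

def maximum_gap(Y, ks, rss_fn, n_perm=10, seed=0):
    """Pick the k maximizing gap(k) = ave(rss^0_k) - rss_k, where rss^0_k
    is the criterion on data with each row's elements permuted independently."""
    rng = np.random.default_rng(seed)
    rss = np.array([rss_fn(Y, k) for k in ks])
    perm = np.zeros((n_perm, len(ks)))
    for b in range(n_perm):
        Yperm = np.apply_along_axis(rng.permutation, 1, Y)   # permute within each row
        perm[b] = [rss_fn(Yperm, k) for k in ks]
    gap = perm.mean(axis=0) - rss
    return ks[int(np.argmax(gap))], gap

# toy usage
Y = np.random.default_rng(1).normal(size=(200, 10))
k_best, gap = maximum_gap(Y, ks=range(2, 9), rss_fn=kmeans_rss)
```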

Gene shaving. The two-way clustering methods seek a single re-ordering of the cell lines for all genes. However, a more complex pattern may exist: one set of genes might cluster the cell lines in one fashion, while another set of genes might produce a very different clustering. Here we describe a method which first finds the linear combination of genes having maximal variation among the cell lines. We think of this linear combination as a "super gene". The genes having the lowest correlation with the super gene are then removed ("shaved") from the data, and the process is continued until the subset of genes contains only one gene. This process produces a sequence of gene blocks, each containing genes that are similar to one another and that display large variance across the cell lines. The details of the gene shaving procedure are as follows:

1. Start with all of the data. Find the first principal component of the genes.

2. For each gene i, compute the absolute value of its correlation with the first principal component.

3. Remove the fraction of genes having the smallest absolute correlation.

4. Repeat steps 2 and 3 until only one gene remains.

The proportion of genes shaved off at each stage is taken to be 10%. Denote the full set of genes by G. If B shaving steps are needed to leave a single gene, this procedure produces a sequence of nested gene groups G ⊃ G1 ⊃ G2 ⊃ ... ⊃ GB. In order to estimate the optimal shave size, we can compare the columnwise variance for each group to that obtained by applying the procedure to permuted data (the "maximum gap" method), as described above for block clustering, to obtain an optimal gene group Ĝ_b. For illustration here, we have instead chosen a constant shave size of 10 genes, which is fairly close to the optimal number found from the gap method. After isolating this optimal gene group, we compute its vector of column averages, and then for each gene we remove the component that is correlated with this average. With this modified data matrix we repeat the above procedure, obtaining a new gene shave. This is done repeatedly until no interesting gene shaves can be found.
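A minimal sketch of one round of this shaving procedure in Python/numpy, under our own simplifying assumptions: a fixed 10% shave fraction, the gene group chosen by the variance of its column means rather than by the gap method, and hypothetical function names throughout:

```python
import numpy as np

def shave_once(Y, genes, frac=0.10):
    """One shaving step: drop the fraction of genes whose expression is
    least correlated (in absolute value) with the first principal
    component ("super gene") of the current gene set."""
    X = Y[genes]
    pc = np.linalg.svd(X - X.mean(axis=1, keepdims=True), full_matrices=False)[2][0]
    corr = np.array([abs(np.corrcoef(row, pc)[0, 1]) for row in X])
    n_keep = max(1, int(np.floor(len(genes) * (1 - frac))))
    keep = corr.argsort()[::-1][:n_keep]
    return genes[np.sort(keep)]

def gene_shave(Y, frac=0.10):
    """Produce the nested sequence of gene groups G, G1, ..., GB and return
    the group whose column means have the largest variance (a simple
    stand-in for the gap-based choice described in the text)."""
    groups = [np.arange(Y.shape[0])]
    while len(groups[-1]) > 1:
        groups.append(shave_once(Y, groups[-1], frac))
    score = [np.var(Y[g].mean(axis=0)) for g in groups]
    return groups, groups[int(np.argmax(score))]

def orthogonalize(Y, group):
    """Remove from every gene the component correlated with the column
    average of the chosen group, before searching for the next shave."""
    avg = Y[group].mean(axis=0)
    avg = avg - avg.mean()
    coef = (Y - Y.mean(axis=1, keepdims=True)) @ avg / (avg @ avg)
    return Y - np.outer(coef, avg)

# toy usage: find a first gene block, then orthogonalize and search again
Y = np.random.default_rng(2).normal(size=(500, 30))
groups, block1 = gene_shave(Y)
Y2 = orthogonalize(Y, block1)
groups2, block2 = gene_shave(Y2)
```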

The Dataset.
The dataset used in our study has expression measurements on 6830 genes for a set of 64 human cancer tumors. A full description of these data appears in [Ross et al., 1999]. The row and column medians were set to zero by alternately subtracting off the median of each column and the median of each row. Finally, missing values were set to zero.
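A minimal sketch of this alternating median centering; the iteration cap and convergence tolerance are our own choices, and missing values are ignored when computing the medians and set to zero at the end:

```python
import numpy as np

def median_center(Y, max_iter=20, tol=1e-6):
    """Alternately subtract the median of each column and of each row
    (ignoring missing values) until both sets of medians are near zero,
    then set any remaining missing values to zero."""
    Y = Y.astype(float).copy()
    for _ in range(max_iter):
        Y -= np.nanmedian(Y, axis=0, keepdims=True)   # center each column
        Y -= np.nanmedian(Y, axis=1, keepdims=True)   # center each row
        if (np.nanmax(np.abs(np.nanmedian(Y, axis=0))) < tol and
                np.nanmax(np.abs(np.nanmedian(Y, axis=1))) < tol):
            break
    return np.where(np.isnan(Y), 0.0, Y)

# toy usage
Y = np.random.default_rng(3).normal(size=(100, 20))
Y[::17, ::5] = np.nan            # a few missing entries
Y_centered = median_center(Y)
```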

4 Results
Two-way clustering

Figures 2 to 7 show the clustering results for the human tumor data. K-means clustering performs poorly, probably because it does not give an ordering of the clustered objects. In the figure we have used multi-dimensional scaling to order the objects within each cluster, and to order the centroids of the clusters. TSVQ fixes this problem, and gives a similar picture to hierarchical clustering. Both TSVQ and hierarchical clustering have successfully organized the genes and cell lines to produce some visible structure. Block clustering probably does the best job of discovering contiguous blocks of gene expression.

Two of the cell lines have two replicates in the dataset, indicated by the suffix "repro". An effective clustering technique should place replicates near one another. Examining the figures, this occurs for hierarchical clustering, principal components, TSVQ, and partly for block clustering; K-means clustering fails in this regard.

Block clustering also gives a one-way clustering of the cell lines and a one-way clustering of the genes. In Figure 7, the cell lines are partitioned into 9 groups by the vertical lines in the diagram. This partition is hierarchical, meaning that for any two subpartitions the first is contained in the second, or vice versa. Examination of the genes in Figure 7 corresponding to the green block in the bottom left and the red block in the middle left reveals a number that are known to be characteristically up- or down-regulated in leukemia. Also included are unregulated ring 3 proteins and cytoskeletal proteins. The presence of a breast cell line clustered with the leukemias is somewhat surprising, and is also seen with some of the other clustering techniques. However, it is difficult to extract fine gene-cell line interactions from block clustering or from any of the other global clustering schemes.
Gene shaving

The first three blocks of genes from the gene shaving process are shown in Figures 8 to 10. The variance of the column means of gene expression is indicated in the heading of each figure. Some clear separation of the cell lines is visible.

Figure 2: Human tumor data, with genes in the rows and cell lines in the columns. The order of the rows and columns was randomly chosen.

Figure 3: Clustering for human tumor data. Shown is the result of reordering rows and columns, from hierarchical clustering applied separately to each.

Figure 4: Clustering for human tumor data. Shown is the result of K-means clustering, applied separately to rows and columns.

Figure 5: Clustering for human tumor data. Shown is the result of tree-structured vector quantization, applied separately to rows and columns.

Figure 6: Clustering for human tumor data. Here the rows and columns are ordered with respect to their inner product with the first principal component.

Figure 7: Clustering for human tumor data. Result from block clustering: rows and columns have been rearranged, and some contiguous blocks are visible.
Although the cancer classes were not used in the shaving process, the resulting orderings are quite successful at grouping together some of the classes. The gene names shown at the left of each rectangle are internal codes. Most of the genes are uncharacterized, illustrating the potential for this technique to discover new patterns of expression. The full gene names are:

Block 1:

1. "357775" "SIDW357775,HumannuclearorphanreceptorLXR-alphamRNA,completecds 5':W95560,3':W95433]" 2. "512287" "SID512287,Humanneuronalpentraxin1(NPTX1)mRNA,completecds 5':AA057692,3':AA057694]" 3. "359412" "SIDW359412,CyclinD2 5':AA011227,3':AA010487]" 4. 376178" "SIDW376178,Human5'-AMP-activatedproteinkinase,gamma-1subunitmRNA, completecd 5':AA040683,3':AA040600]" 5. "136798" "FN1Fibronectin1Chr.2 136798,(IEW),5':R36450,3':R36451]" 6."359396""SIDW359396,HumancGMP-stimulated3',5'-

23

-cyclicnucleotidephosphodiesterasePDE2A3(PDE2A)mRNA,completecd

5':AA010496,3'

7. "376052" "SIDW376052,Humannucleotide-bindingproteinmRNA,completecds 5':AA039305,3':AA039353]" 8. "151144" "FN1Fibronectin1Chr.2 151144,(EW),5':H03906,3':H03907]" 9. "324037"

"SIDW324037,Homosapiensclone24590mRNAsequence 5':W46518,3':W46450]

Block 2:

1. "50250" "ESTs, Chr.9 [50250, (R), 5':H17799, 3':H17800]"

2. "512355" "SID 512355, ESTs, Highly similar to SRC SUBSTRATE P80/85 PROTEINS [Gallus gallus] [5':AA059424, 3':AA057835]"

Block 3:

1. "241935" "SPP1 Secreted phosphoprotein 1 (osteopontin, bone sialoprotein I), Chr.4 [241935, (EW), 5':H93913, 3':H93048]"

2. "363981" "SPP1 Secreted phosphoprotein 1 (osteopontin, bone sialoprotein I), Chr.4 [363981, (EW), 5':AA021511, 3':AA021512]"


The first block of genes is related to stromal cells, and tends to separate the tissue tumors from the blood cancers. The second block of genes is uncharacterized. The third block consists of secreted phosphoprotein genes, and produces a different separation of the stromal cancers than the first block of genes. This illustrates the potential for this technique to discover new patterns of expression.

5 Discussion
We have investigated the use of two-way clustering methods for DNA microarray data. Some of the methods are successful at discovering contiguous areas of high or low gene expression, including hierarchical clustering, TSVQ, and block clustering. We have introduced the "maximum gap" diagnostic for protection against finding spurious structure.

There are close connections between block clustering and the classification and regression tree (CART) algorithm of Breiman, Friedman, Olshen & Stone (1984). Block clustering is very similar to CART with splits on two categorical predictors (genes and cell lines), and the pruning algorithm is the same as that in CART. What is different is the restriction to fixed splits and the use of permutations to estimate the optimal number of splits.

Figure 8: First gene block from gene shaving process (variance = 4.37).

Figure 9: Second gene block from gene shaving process (variance = 3.007).

Figure 10: Third gene block from gene shaving process (variance = 11.344).

By seeking a single global organization of the data, the two-way clustering procedures are limited in their ability to discover fine structure. The gene shaving method, introduced here, looks for blocks of genes that produce different separations of the cell lines, and the initial results look very promising.

There are many interesting modifications of this procedure. For example, any aspect of the data can be used to direct the shaving process. If class labels are available for the cell lines (tumor types in our example), the shaving can be supervised by these labels. The procedure then tries to find subsets of genes that separate the classes as well as possible. Details will be given in a forthcoming paper.

Acknowledgments: We would like to thank Andreas Buja for pointing us to the work of Hartigan, and of Duffy and Quiroz, on block clustering.

References
Botstein, D. & Brown, P. (1999), `Exploring the new world of the genome with dna microarrays', Nature Genetics (Supp.) 21, 33{7. Breiman, L., Friedman, J., Olshen, R. & Stone, C. (1984), Classi cation and
Regression Trees, Wadsworth.

29

Cherkassky, V. & Mulier, F. (1998), Learning from data, Wiley. Chu, S.and DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P. O. & Herskowitz, I. (1998), `The transcriptional program of sporulation in budding yeast', Science 282, 699{705. Du y, D. & Quiroz, A. (1991), `A permutation-based algorithm for block clustering', J. of Classi cation 8, 65{91. Eisen, M., Spellman, P., Brown, P. & Botstein, D. (1998), `Cluster analysis and display of genome-wide expression patterns', Proc. Nat. Acad. Sci

95, 14863{14868.
Everitt, B. (1980), Cluster analysis, Halstead, New York. Gersho, A. & Gray, R. M. (1992), VECTOR QUANTIZATION AND SIGNAL COMPRESSION, Kluwer Academic Publisher.

Good, I. (1965), Categorization of Classi cation Mathematics and Computer
Science in Biology and Medicine, Her Majesty's Stationary O ce, Lon-

don. Gordon, A. (1999), Classi cation (2nd edition), Chapman and Hall/CRC press, London.

30

Hartigan, J. (1972), `Direct clustering of a data matrix', J. Amer. Statis.
Assoc. 6, 123{129.

Hartigan, J. (1973), Clustering algorithms, Wiley, New York. Iyer, V. R., Eisen, M. B., Ross, D. R., Schuler, G., Moore, T., Lee, J. C. F., Trent, J. M., Hudson, J., Boguski, M., Lashkari, D.and Shalon, D., Botstein, D. & Brown, P. (1999), `The transcriptional program in the response of human broblasts to serum', Science 283, 83{87. Kaufman, L. & Rousseeuw, P. (1990), Finding groups in data: an introduction to cluster analysis, New York; Wiley.

Kohonen, T. (1989), Self-Organization and Associative Memory (3rd edition), Springer-Verlag, Berlin.

Lloyd, S. (1957), Least squares quantization in pcm., Technical report, Bell Laboratories. Published in 1982 in IEEE Trans. Inf. Theory, 28, 128-137. MacQueen, J. (1967), Some methods for classi cation and analysis of multivariate observations, in `Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, eds L.M. LeCam and J. Neyman', Univ. of Cal. Press, pp. 281{297.

31

Mardia, K., Kent, J. & Bibby, J. (1979), Multivariate Analysis, Academic Press. Milligan, G. W. & Cooper, M. C. (1985), `An examination of procedures for determining the number of clusters in a data set', Psychometrika

50, 159{179.
Perlmutter, S., Cosman, P.C.and Tseng, C.-W., Olshen, R., Gray, R., Li, K. & Bergin, C. (1998), `Medical image compression and vector quantization', Statistical Science 13, 30{53. Ripley, B. D. (1995), Pattern Recognition and Neural Networks|a Statistical
Approach, Cambridge University Press.

Roth, F.P.and Hughes, J., Estep, P., & Church, G. (1998), `Finding dna regulatory motifs within unaligned noncoding sequences clustered by whole genome mrna quantitation', Nat. Biotechnol. 16, 939{45. Sokal, R. & Mitchener, C. (1958), `A statistical method for evaluating systematic relationships', Univ. Kansas Sci. Bull.. 38, 1409{1438. Spellman, P. T., Sherlock, G., Iyer, V. R., Zhang, M., Anders, K., Eisen, M. B., Brown, P. O. & Botstein, D.and Futcher, B. (1998), `Comprehensive identi cation of cell cylce-reulated genes of the yeast saccharomyces by microarray hybridization', Mol. Cell. Biol. 9(12), 3273{975. 32

Tamayo, P., Slonim, T., Mesirov, J., Zhu, Q., Kitareewan, S. & Dmitrovsky, E. (1999), `Interpreting patterns of gene expression with self-organizing maps: Methods and applications to hematopoietic diferentation', Proc.
Nat. Acad. Sci 96, 2907{2912.

Tryon, R. & Bailey, D. (1970), Cluster Analysis, McGraw-Hill., New York.

33