1. Introduction
Hierarchical clustering is widely used in the bioinformatics community to analyze
data and to reveal hidden structures that do not become obvious with simple
visualization strategies. Its advantage is that the results can be visualized as a
binary tree diagram, or dendrogram, that groups similar instances together. As a
rule of thumb, the Euclidean distance or one minus a suitable correlation
coefficient is used as the distance measure that is
N. Novoselova, J. Wang & F. Klawonn
required for hierarchical clustering. The leaves of the dendrogram derived from
hierarchical clustering correspond to the instances or objects, e.g. genes, proteins or
patients. Visualizing the dendrogram requires a linear order of the leaves that is
consistent with the clustering result. The clustering itself yields only a binary tree
without specifying which of the two subtrees of a node should be placed on the left or
right. Switching the two subtrees of any node leads to a different dendrogram,
although it still reflects the same hierarchical clustering result. There exist $2^{n-1}$
possible orderings of a binary tree with $n$ leaf nodes, since at each of the $n-1$ non-leaf
nodes we have two choices. The chosen leaf order strongly influences the visualization
and therefore the interpretation of the data based on the corresponding
J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com
rearrange the leaf order of a dendrogram. However, all these methods are based
exclusively on the distance matrix and do not exploit the additional information
provided by class labels, as our approach does. Before we describe the principles of our
approach, we briefly review the strategies from the literature that do not incorporate
class labels, since we borrow some algorithmic ideas from them.
The approaches in Refs. 1–4 try to rearrange the linear ordering of the leaf nodes in the
dendrogram such that the sum of the distances between adjacent leaves is minimized. Eisen
et al. (Ref. 1) proposed a simple heuristic strategy that switches subtrees according to
weights that can be assigned to the leaves (for example the average gene expression
over all experiments or the position of a gene in the chromosome). An implementation
of this method can be found in Ref. 2. Subsequently, several authors proposed
computationally feasible algorithms to optimize the leaf ordering (Refs. 3 and 4). The
optimization of the ordering used by Bar-Joseph et al. (Ref. 3) is based on a recursive
process that estimates the optimal ordering of subtrees in a bottom-up way. It resembles a
dynamic programming approach and includes a backtracking phase to improve
orderings made in previous steps of the algorithm. The algorithm's complexity is
$O(n^4)$, where $n$ is the number of instances in the data set. The authors also proposed
an improvement of the algorithm that allows earlier termination of the search and
can dramatically decrease the average computation time. The algorithm in Ref. 4
is also based on dynamic programming, using a three-dimensional table that stores
the costs of the best linear ordering of the leaves in each subtree, starting from the
bottom. Its complexity is stated to be $O(n^3)$ when a multiplication
instead of a sum is used in computing the table elements. Other approaches to optimal
leaf ordering can be found in Refs. 5–8, but they also do not consider the class labels
in the optimization process. Buchta and Hahsler (Ref. 7) adopted the ordering concept
described in Ref. 3, where an optimal ordering is defined as one with minimal path
length. Hahsler et al. (Ref. 8) implemented several seriation/sequencing techniques to
reorder matrices, dissimilarity matrices and dendrograms. Sakai (Ref. 9) provides an R
package for sorting the dendrogram based on the average distance of subtrees.
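The criterion shared by the distance-based methods above, the sum of the distances between adjacent leaves, can be illustrated with a minimal Python sketch. The function name `adjacent_distance_sum` and the toy points are hypothetical and are not taken from any of the cited packages; the Euclidean distance is assumed.

```python
from math import dist

def adjacent_distance_sum(points, order):
    """Sum of Euclidean distances between leaves that are adjacent in `order`.

    `points` is a list of coordinate tuples and `order` a list of indices
    into `points` describing the left-to-right leaf order.
    """
    return sum(dist(points[a], points[b]) for a, b in zip(order, order[1:]))

# Hypothetical toy data: four instances on a line.
points = [(0.0,), (1.0,), (5.0,), (6.0,)]
print(adjacent_distance_sum(points, [0, 1, 2, 3]))  # 1 + 4 + 1
print(adjacent_distance_sum(points, [1, 0, 2, 3]))  # 1 + 5 + 1
```

Among the orderings that the dendrogram admits, a distance-based method would prefer the one with the smaller sum. Class labels play no role in this criterion.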
In many studies in the field of medicine and biology, the elements of the data set,
characterized by a number of features (e.g. gene expression levels in a microarray),
Optimized leaf ordering
are also tagged with nominal class labels, for example gene ontology (GO)
terms or disease subtypes. In such cases, it is often of interest to see whether the
clustering result is coherent with the class labels, i.e. whether instances with
the same label also cluster together. If this is the case, the distance measure can
be considered suitable for classifying unlabeled instances in nearest neighbor fashion.
Figure 1(a) shows a dendrogram for microarray data with three different tissue
types (Ref. 10), obtained by applying agglomerative hierarchical clustering to a leukemia
data set. The Euclidean distance was used to compute the distance matrix.
Branches with the same color correspond to the same tissue type. It can be seen that
most samples of the same tissue type cluster together, but some instances lie apart from their
proper tissue-type group. The ordering of the leaf nodes in Fig. 1(b), obtained by
simply flipping some of the inner nodes, results in a dendrogram that is even more
by PRINCETON UNIVERSITY on 08/12/15. For personal use only.
Fig. 1. Hierarchical clustering of three tissue types from acute leukemia patients (Ref. 10): 25 acute
myeloid leukemia (AML) samples, nine T-lineage acute lymphoblastic leukemia (ALL) samples, and 38
B-lineage ALL samples. (a) Dendrogram obtained from clustering. (b) Dendrogram after rearrangement of the leaf nodes.
In this section, we formalize the problem of optimal leaf ordering of instances
according to their class labels. Then we present our dynamic programming algorithm
for solving this problem, together with some modifications that help to reduce its time
complexity. After that we briefly describe our approach to optimizing the leaf ordering
with class labels using a genetic algorithm (GA).
where $p_i$ is the number of individual sequences of elements with the $i$th class label when $T$
is ordered according to the chosen leaf ordering, $K$ is the number of different class labels, and
$coef$, chosen between 1 and 2, is the parameter that reinforces the influence of longer sequences.
The value of $coef$ is equal to 1.5 in our experiments; other values of $coef$ between 1 and 2
give similar results. Our goal is to find an ordering for which $F_1(T)$ is
maximized; in other words, in the optimal ordering the sequences of identical class
labels should be as long as possible. Note that not all permutations of the leaves are
admitted, but only those that are consistent with the given dendrogram. Otherwise, one
could easily optimize the objective function (1) by simply enumerating the instances
by class.
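The effect of the parameter $coef$ can be made concrete with a short Python sketch. Equation (1) itself is not reproduced in this copy of the text, so the form used here, the sum over all runs of identical labels of the run length raised to the power $coef$, is an assumption inferred from the definitions above and from the update rule (2). The names `run_lengths` and `f1_score` are hypothetical.

```python
from itertools import groupby

def run_lengths(labels):
    """Lengths of the maximal runs of identical class labels, left to right."""
    return [len(list(group)) for _, group in groupby(labels)]

def f1_score(labels, coef=1.5):
    """Assumed form of the objective: sum of (run length) ** coef over all runs.

    With coef > 1 a single long run scores higher than several short runs
    of the same total length, which rewards keeping equal labels together.
    """
    return sum(length ** coef for length in run_lengths(labels))

# Two leaf orders of the same four instances (labels are illustrative).
scattered = [1, 1, 2, 1]   # runs of length 2, 1, 1
grouped = [1, 1, 1, 2]     # runs of length 3, 1
print(f1_score(scattered), f1_score(grouped))
```

Under this assumed form the grouped order receives the larger value, which matches the stated goal of making the sequences of identical class labels as long as possible.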
Fig. 2. Binary tree with a root node vr1 , where node v is the least common ancestor of elements u and w.
The computation of the optimized function value for the internal node $v$ is based
on the known optimized function values of its child nodes $v_l$ and $v_r$.
To compute the value of the cell $(u, w)$ of matrix $A$ for node $v$ we must check
all feasible leaves $i_1$, $j_1$, where $i_1$ is the rightmost leaf of $v_l$ and $j_1$ is the leftmost leaf
of $v_r$. Then the optimized function value can be defined as
$$
A[u, w] =
\begin{cases}
\max\limits_{i_1 \in v_{l,r},\; j_1 \in v_{r,l}} \left( A_{v_l}[u, i_1] + A_{v_r}[j_1, w] \right), & \text{if } class(i_1) \neq class(j_1), \\[1ex]
\max\limits_{i_1 \in v_{l,r},\; j_1 \in v_{r,l}} \left( A_{v_l}[u, i_1] + A_{v_r}[j_1, w] - R[u, i_1]^{coef} - R[w, j_1]^{coef} + \left( R[u, i_1] + R[w, j_1] \right)^{coef} \right), & \text{if } class(i_1) = class(j_1).
\end{cases}
\tag{2}
$$
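The inner expression of rule (2) for one fixed pair $(i_1, j_1)$ can be sketched in Python as follows. The sketch assumes that the two $R[\cdot]^{coef}$ terms are subtracted and the combined term is added; the minus signs are not legible in this copy and are inferred from the requirement that two boundary runs of the same class merge into one longer run. The function name `merged_value` and its argument names are hypothetical.

```python
def merged_value(a_left, a_right, r_left, r_right, same_class, coef=1.5):
    """Value of joining two ordered subtrees for one fixed pair (i1, j1).

    a_left, a_right : optimized values A_vl[u, i1] and A_vr[j1, w]
    r_left, r_right : boundary run lengths R[u, i1] and R[w, j1]
    same_class      : True if class(i1) == class(j1)
    """
    if not same_class:
        # The two boundary runs stay separate, so the values simply add up.
        return a_left + a_right
    # The two boundary runs fuse: remove their separate contributions
    # (assumed sign) and add the contribution of the fused run.
    return (a_left + a_right
            - r_left ** coef - r_right ** coef
            + (r_left + r_right) ** coef)

# Left subtree ordered 1 1 2 2 (two runs of length 2), right subtree ordered 2 1.
a_left = 2 ** 1.5 + 2 ** 1.5
a_right = 1.0 + 1.0
print(merged_value(a_left, a_right, r_left=2, r_right=1, same_class=True))
```

For this toy case the joined order is 1 1 2 2 2 1, with runs of length 2, 3 and 1, and the returned value equals the sum of these run lengths each raised to the power 1.5.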
[Fig. 3: subtree $v$ with leaves ordered from $u$ (leftmost) to $w$ (rightmost); the leaf class labels read 1 1 1 2 2 1 1.]
objective function for the internal nodes and to store the lengths of sequences of
leaves with the same class label at the left and right ends of each subtree $v$. For
example, in Fig. 3 for the subtree $v$ with leaves having class labels 1 or 2,
$R[u, w] = R[j_1, w] = 2$ and $R[w, u] = R[i_1, u] = 3$ hold.
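The boundary run lengths stored in $R$ can be illustrated with a small Python sketch. It assumes that the leaves of the subtree in Fig. 3 carry the labels 1 1 1 2 2 1 1 from $u$ to $w$, as the figure residue suggests, and that the label list is non-empty. The helper name `end_runs` is hypothetical.

```python
def end_runs(labels):
    """Return (left_run, right_run) for a non-empty list of class labels.

    left_run  : length of the run of equal labels starting at the leftmost leaf
    right_run : length of the run of equal labels ending at the rightmost leaf
    """
    left = 1
    while left < len(labels) and labels[left] == labels[0]:
        left += 1
    right = 1
    while right < len(labels) and labels[-1 - right] == labels[-1]:
        right += 1
    return left, right

# Assumed leaf labels of the subtree in Fig. 3, ordered from u to w.
labels = [1, 1, 1, 2, 2, 1, 1]
left_run, right_run = end_runs(labels)
print(left_run, right_run)  # run at the u end, run at the w end
```

In the notation of the text, the run at the $w$ end corresponds to $R[u, w]$ and the run at the $u$ end to $R[w, u]$.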
As a result of the algorithm we obtain the matrix $A$, whose elements $(u, w)$ hold
the function values of the optimized leaf ordering from the leftmost
element $u$ to the rightmost element $w$ of the corresponding internal node $v$. The
maximal element $(u_{opt}, w_{opt})$ of the matrix $A$ corresponds to the optimized ordering
of the tree $T$.
Using the matrices maxI and maxJ we can recover the sequence of elements in the
optimized leaf ordering of the whole tree $T$.
5:
6:
7: return(
8: else
9: ( =OrderingClass ( )
10: ( = OrderingClass ( )
11: for all leaves
12: if // vremI is the number of intermediate leaves of the left
13: = // subtree to search for the optimal value of
14: else
15: =
16: end if
17: for all leaves
18: if
19: = //vremJ is the number of intermediate leaves of the
20: else //right subtree to search for the optimal
21: =
22: end if
23: max=-1
24: for all leaves
25: for all leaves
26:
27:
28:
29: +
30: end if
31: if >max
32: max=
33:
34:
35: end if
36: end if
37: end if
38: =max
39:
40: =
41: =
42: if((all class( ) are equal) and
43: (class( )=class( )
44:
45: else
46:
47: end if
48: if((all class( ) are equal) and
Algorithm 1. (Continued )
7: (leaf, =SubTree( , )
8: if(leaf!=NULL)
9: len=size( //number of elements of subtree
10: leaf // subtree is a singleton
11: , =
12: , = len
13: end if
14: end if
15: if is not a singleton
16: (leaf, =SubTree( , )
17: if(leaf!=NULL)
18: len=size( //number of elements of subtree
19: leaf // subtree is a singleton
20: , =
21: , =len
22: end if
23: end if
24: return( , leaf=NULL, )
25: end
26:
Algorithm 2
where $F_2(T)$ is the partition entropy value for elements in $c$, $K$ is the number of class
labels, $n_i$ is the number of elements with the $i$th class label, $p_i$ is the number of
contiguous sequences of elements of class $i$, and $n_{i,j}$ is the number of elements in the $j$th
sequence with the $i$th class label. To find the optimal leaf ordering, the
function $F_2(T)$ must be minimized.
Fig. 4. Merging of subtrees $v_{l,l}$ and $v_{r,l}$ having the same class labels.
A binary coding of the solutions is an obvious choice in this case and suits
the concept of a GA very well. Since the permutation of the leaf nodes is induced by
either switching or not switching the subtrees of each of the $n-1$ inner nodes, a
solution can be coded as a binary vector of length $n-1$. Such a binary vector induces
a unique permutation of the leaf nodes: 0 stands for not switching the subtrees of the
corresponding node, whereas 1 stands for switching them.
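The decoding of such a binary vector into a leaf order can be sketched in Python. This is a minimal illustration, not the implementation used in the paper: the nested-tuple tree representation, the pre-order numbering of the inner nodes and the function name `decode` are all assumptions made for the example.

```python
from itertools import product

def decode(tree, bits):
    """Return the leaf order of `tree` after applying the switch vector `bits`.

    `tree` is either a leaf (any non-tuple value) or a pair (left, right).
    Inner nodes are numbered in pre-order of the ORIGINAL tree (an assumed
    convention), so bit k always refers to the same inner node.
    """
    bit_iter = iter(bits)

    def walk(node):
        if not isinstance(node, tuple):
            return [node]
        switch = next(bit_iter)     # bit belonging to this inner node
        left = walk(node[0])
        right = walk(node[1])
        return right + left if switch else left + right

    return walk(tree)

# Hypothetical dendrogram with n = 5 leaves and n - 1 = 4 inner nodes.
tree = (("a", "b"), ("c", ("d", "e")))
print(decode(tree, [0, 0, 0, 0]))   # no subtree switched
print(decode(tree, [1, 0, 0, 1]))   # root and the (d, e) node switched

# Every binary vector of length n - 1 yields a different admissible order.
orders = {tuple(decode(tree, bits)) for bits in product([0, 1], repeat=4)}
print(len(orders))
```

For a tree with distinct leaves the set `orders` has $2^{n-1}$ elements, so exhaustive enumeration is feasible only for very small trees. This is the reason a search heuristic such as a GA, or a dynamic programming scheme, is needed in practice.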
Figure 5 shows an example of the correspondence between an individual of the
GA and the resulting tree.
The value of the fitness function for an individual of the GA is computed
according to (3) for the decoded tree. The fitness function indicates the quality of
the solution encoded in the GA individual with respect to the optimization task. The initial
population is generated at random. To realize the GA optimization
scheme, the standard crossover and mutation operations are applied to the population
of GA individuals. The optimization scheme consists of the repeated execution of
several steps: decoding of individuals, calculation of the fitness function, selection of
individuals according to the function values, recombination, and elitism operations.
At each generation a new population of the GA is formed. The iteration process
[Fig. 5: hierarchical trees and the GA individuals (binary vectors) that encode them.]
3. Experiments
We have performed several experiments to test our approach of reordering the leaves
possible. The resulting ordered collection of data cases (e.g. genes in microarray
analysis) is displayed in the form of a heatmap where rows represent cases and
columns represent single features (e.g. experiments in microarray analysis). Each
data point corresponds to a colored box, where the color represents a feature value.
The optional vertical and horizontal side bars can be used to annotate the rows and
columns of the heatmap. The hierarchy of the tree is also drawn on the left side of the
heatmap.
3.1. Methods
We have used two artificial data sets and one biological data set to test the optimal
leaf ordering algorithm and its effect on the visualization of the results of hierarchical
clustering. The first simulated data set consists of 90 cases, characterized by the
values of 90 features. For each successive subset of 15 cases we set 40 consecutive
features to 1 (with a shift of 10 features between subsets), and then flip them to 0 or 1
with probability $p = 0.01$. The rest of the features in the data set were generated at random. As a
result, six distinct clusters are formed. The second artificial data set is more
complex and consists of 400 data cases, each with 100 feature values. A similar
generation scheme as for the first data set is applied, and eight distinct clusters are
formed. For both data sets we have manually defined the vector of class labels
$c = (c_1, c_2, \ldots, c_n)$, $c_i \in \{1, 2, 3\}$, which does not conform to the generated clusters.
For the first data set each successive subset of 15 cases was labeled differently, and
the labels of the first and the second half of the data set were the same. For the second
data set the first 125 cases were assigned to class 1, the following 150 cases to class 2
and the last 125 cases to class 3.
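The generation scheme of the first simulated data set can be sketched in Python as follows. Several details are assumptions, since the description leaves them open: the background features are drawn as fair random 0/1 values, a block feature is set to 1 and flipped to 0 with probability $p$, and the labels of the six subsets follow the pattern 1, 2, 3, 1, 2, 3. The function name `simulate` and its parameters are hypothetical.

```python
import random

def simulate(n_groups=6, group_size=15, n_features=90, block=40, shift=10,
             p_flip=0.01, seed=0):
    """Generate a binary data set with one block of ones per group of cases."""
    rng = random.Random(seed)
    data = []
    for g in range(n_groups):
        start = g * shift  # the block moves by `shift` features per group
        for _ in range(group_size):
            # Background: random 0/1 values (assumed to be uniform).
            row = [rng.randint(0, 1) for _ in range(n_features)]
            for f in range(start, start + block):
                # Block feature: 1, flipped to 0 with probability p_flip
                # (one possible reading of the description).
                row[f] = 0 if rng.random() < p_flip else 1
            data.append(row)
    # Assumed labeling: 1, 2, 3, 1, 2, 3 over the six groups, so that both
    # halves of the data set carry the same label sequence.
    labels = [(g % 3) + 1 for g in range(n_groups) for _ in range(group_size)]
    return data, labels

data, labels = simulate()
print(len(data), len(data[0]), sorted(set(labels)))
```

With the default parameters the last block ends exactly at feature 90, so six shifted blocks fit into the 90 features of the first data set.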
The real biological data set contains cell cycle data (Ref. 15). The cell cycle, or cell-division
cycle, is the series of events that take place in a cell leading to its division and
duplication (replication). The data set presents the results of microarray analysis and
consists of expression profiles of 800 genes that are cell cycle regulated in
Saccharomyces cerevisiae (Ref. 3). These genes were assigned to five groups termed G1, S,
S/G2, G2/M, and M/G1, which represent distinct phases of the cell cycle. The data set
3.2. Results
The results of applying the optimal leaf ordering and the hierarchical clustering
algorithm with heuristic ordering (Ref. 1) to the first simulated data set are shown in
Fig. 6. The additional color bar at the side of the heatmap shows the class label
annotations.
For the first simulated data set the optimal leaf orderings with class labels obtained with
the dynamic programming algorithm and the GA yielded similar results (Fig. 6), with
less dispersed class labels than in hierarchical clustering alone. The evaluation
function for the optimal leaf ordering has higher values than for the unordered
hierarchical clustering (Table 1). For the second simulated data set, the optimal leaf
orderings are presented in Fig. 7. As the defined class labels do not conform to the
generated clusters, they are more dispersed for pure hierarchical clustering. The
optimal leaf orderings better preserve the similarity of data profiles (Fig. 7). The value
of the evaluation function is slightly better for the optimal leaf ordering algorithm
with dynamic programming (Table 1).
We compared the results of the optimal leaf ordering and heuristic ordering for
the biological data set, considering the cell cycle groups as the class labels. In Fig. 8,
we can see that the optimal ordering with classes succeeds in placing similar class
Fig. 6. Comparison of leaf orderings using hierarchical clustering and the two proposed approaches for the first
simulated data set: (a) hierarchical clustering (Ref. 1), (b) optimal leaf ordering (dynamic programming), and (c)
optimal leaf ordering (GA).
Fig. 7. Comparison of leaf orderings using hierarchical clustering and the two proposed approaches for the
second simulated data set: (a) hierarchical clustering (Ref. 1), (b) optimal leaf ordering (dynamic programming),
Fig. 8. Comparison of leaf orderings using hierarchical clustering and the two proposed approaches for
the biological data set.
labels close to each other. Compared with hierarchical clustering, the cell cycle order is better
preserved. Although the optimization function of the optimal leaf ordering algorithm
does not take into account the distance similarities between gene expression profiles,
they are better arranged according to similarity than with hierarchical clustering with
heuristic ordering. The peaks of gene expression for the optimal leaf ordering change
continuously from the bottom to the top of the heatmap plot (Fig. 8). Therefore, the
optimal ordering with classes correctly identified the order of the groups in the real
cell cycle process. The results of the GA approach are slightly worse, with a smaller
value of the objective function (Table 1).
We have compared the computation time of the different approaches, including
the dynamic programming algorithm for leaf ordering with and without using the
C++ function, and the approach using the GA (Table 1). The results were obtained
on a BENQ laptop with the Microsoft Windows 7 (Service Pack 1) operating system,
an Intel(R) Core(TM)2 Duo T8100 CPU @ 2.10 GHz and 2.00 GB of RAM.
As can be seen, the computation time of the dynamic programming approach with
modifications is the shortest.
Table 1. Time complexity and values of the optimization function for optimal leaf ordering algorithms subject to class labels.
Simulated data 2     400   7.34 s     1.07 s   7.34 min   3527.0   3527.0   3406.0   2900.0
Spellman data set    800   60.0 min   28.0 s   30.0 min   3245.0   3245.0   2960.8   2355.0
3.3. Discussion
Our optimized leaf ordering algorithm can work with any binary tree to order the
leaves according to the class labels while preserving the clustering structure. We have
made improvements to the original version of our algorithm that decrease the
computation time, especially for large data sets.
For all analyzed data sets the orderings obtained with the proposed dynamic
programming algorithm and the GA-based approach were almost the same, with only
small variations in the resulting heatmaps. As the GA approach is as a rule more time
consuming (see Table 1), its use for large data sets is not recommended. The proposed
dynamic programming algorithm with modifications is the most efficient among all
labels and the clustering structure of a data set, making it possible to relate the classes to the
specified clusters and to determine the relationship between different classes. By
applying the optimized ordering algorithm to a biological data set we have obtained not only
a closer grouping of similar class labels, but also a better ordering of genes
according to the similarity of their expression profiles, one that corresponds to the time
order of the real cell cycle process.
4. Conclusion
In this paper, we have studied the problem of ordering the leaves of a tree representing
a hierarchical clustering of data cases with class labels. We have presented an efficient
algorithm that finds a leaf ordering in which the data cases with the same
class labels are located as close together as possible in the reordered hierarchical tree. We
have reformulated the standard problem of optimal leaf ordering, which optimizes
a function of the similarities of data elements. In our case, we optimize a
function that is computed on the basis of the class labels of individual data elements in
order to place the same class labels together. Several modifications to the original
version of the algorithm have been proposed in order to make it more time efficient.
Apart from the dynamic programming approach we have also developed an
approach to optimal leaf ordering with classes using a GA. We have carried out several
experiments in order to compare the proposed approaches with standard hierarchical
clustering with heuristic ordering (Ref. 1). According to our results the dynamic
programming leaf ordering algorithm with modifications has both the best value of the
objective function and the smallest computation time. Therefore, it can be
recommended for practical applications. For the biological data set the dynamic
programming algorithm led to a better ordering according to similarities of gene
profiles. A continuous drift of the peak expression values of genes in the heatmap can
be clearly seen (Fig. 8). Currently we are investigating the possibility of taking into
account data cases with missing class labels for the leaf ordering. We expect that the
location of such data cases in the resulting tree can facilitate the prediction of their
class label.
The implementation of our algorithms with all the necessary functions is available
as an R package at http://cran.r-project.org/web/packages/ReorderCluster.
Acknowledgments
We thank the colleagues from the Cellular Proteomics Research Group at the
Helmholtz Centre for Infection Research for many useful comments on this problem.
Natalia Novoselova acknowledges the support by a research grant A/13/00004 from
the German Academic Exchange Service (DAAD). This study was co-financed by
the European Union (European Regional Development Fund) under the Regional
References
1. Eisen MB, Spellman PT, Brown PO, Botstein D, Cluster analysis and display of genome-
wide expression patterns, Proc Natl Acad Sci USA 95(25):14863–14868, 1998.
2. Eisen MB, Cluster v. 2.11 and TreeView v. 1.5, 2000, Available at http://rana.lbl.gov/
EisenSoftware.html.
3. Bar-Joseph Z, Gifford DK, Jaakkola TS, Fast optimal leaf ordering for hierarchical
clustering, Bioinformatics 17:22–29, 2001.
4. Biedl T, Brejova B, Demaine ED, Hamel AM, Vinar T, Optimal arrangement of leaves in
the tree representing hierarchical clustering of gene expression data, Technical Report
CS-2001-14, Department of Computer Science, University of Waterloo, 2001.
5. Chae M, Chen JJ, Reordering hierarchical tree based on bilateral symmetric distance,
PLoS One 6(8):e22546, 2011, doi: 10.1371/journal.pone.0022546.
6. Brandes U, Optimal leaf ordering of complete binary trees, J Discrete Algorithms
5(3):546–552, 2007.
7. Buchta C, Hahsler M, cba: Clustering for business analytics, R package version 0.2-14,
2014, Available at http://CRAN.R-project.org/package=cba.
8. Hahsler M, Buchta C, Hornik K, Infrastructure for seriation, R package version 1.0-14,
2014, Available at http://CRAN.R-project.org/package=seriation.
9. Sakai R, Dendsort: Modular leaf ordering methods for dendrogram nodes, R package
version 0.3.2, 2014, Available at http://CRAN.R-project.org/package=dendsort.
10. Handl J, Knowles J, Kell DB, Computational cluster validation in post-genomic data
analysis, Bioinformatics 21:3201–3212, 2005, doi: 10.1093/bioinformatics/bti517.
11. R Core Team, R: A language and environment for statistical computing, R Foundation
for Statistical Computing, Vienna, Austria, 2014, Available at URL http://www.
R-project.org/.
12. Eddelbuettel D, Francois R, Rcpp: Seamless R and C++ integration, J Stat Soft
40(8):1–18, 2011.
13. Goldberg DE, Genetic Algorithms in Search, Optimization and Machine Learning,
Addison-Wesley Publishing Company, Inc., USA, 1989.
14. Willighagen E, genalg: R based genetic algorithm, R package version 0.1.1.1, 2014,
Available at http://CRAN.R-project.org/package=genalg.
15. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO,
Botstein D, Futcher B, Comprehensive identification of cell cycle-regulated genes of the
yeast Saccharomyces cerevisiae by microarray hybridization, Mol Biol Cell 9:3273–3297,
1998.
Braunschweig (Germany). His main research interests focus on techniques for
intelligent data analysis, especially clustering, classification and robust statistical