You are on page 1of 19

Journal of Bioinformatics and Computational Biology

Vol. 13, No. 4 (2015) 1550012 (19 pages)


.c Imperial College Press
#
DOI: 10.1142/S0219720015500122

Optimized leaf ordering with class labels for hierarchical clustering

Natalia Novoselova*,§, Junxi Wang†,¶ and Frank Klawonn†,‡,||


*Department of Bioinformatics, United Institute of Informatics Problems
J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com

Surganova Str. 6, Minsk 220012, Belarus



Biostatistics, Helmholtz Centre for Infection Research
by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

Inho®enstraße 7, 38124, Braunschweig, Germany



Department of Computer Science, Ostfalia University of Applied Sciences
Salzdahlumer Straße 46/48, 38302, Wolfenbuttel,
€ Germany
§
novosel@newman.bas-net.by

hzwangjunxi@gmail.com
||f.klawonn@ostfalia.de, Frank.Klawonn@helmholtz-hzi.de

Received 17 February 2015


Accepted 22 February 2015
Published 2 April 2015

Hierarchical clustering is extensively used in the bioinformatics community to analyze biomed-


ical data. These data are often tagged with class labels, as e.g. disease subtypes or gene ontology
(GO) terms. Heatmaps in connection with dendrograms are the common standard to visualize
results of hierarchical clustering. The heatmap can be enriched by an additional color bar at the
side, indicating for each instance in the data set to which class it belongs. In the ideal case, when
the clustering matches perfectly with the classes, one would expect that instances from the same
class cluster together and the color bar consists of well-separated color blocks without frequent
alteration of colors (classes). But even in the case when instances from the same class cluster
perfectly together, the dendrogram might not re°ect this important aspect due to the fact that its
representation is not unique. In this paper, we propose a leaf ordering algorithm for the den-
drogram that preserving the hierarchical clustering result tries to group instances from the same
class together. It is based on the concept of dynamic programming which can e±ciently compute
the optimal or nearly optimal order, consistent with the structure of the tree.

Keywords: Hierarchical clustering; dendrogram; leaf ordering; dynamic programming; bio-


medical data.

1. Introduction
Hierarchical clustering is widely used in the bioinformatics community to analyze
data and to reveal hidden structures that do not become obvious with simple visu-
alization strategies. The advantage of hierarchical clustering consists in making it
possible to visualize the results in the form of a binary tree diagram or dendrogram
that groups similar instances together. As a rule of thumb, the Euclidean distance or
one minus a suitable correlation coe±cient are used as the distance measures that is

1550012-1
N. Novoselova, J. Wang & F. Klawonn

required for hierarchical clustering. The leaves of the dendrogram derived from hi-
erarchical clustering correspond to the instances or objects, e.g. genes, proteins or
patients. For the visualization of the dendrogram a linear order of the leaves is
required that is consistent with the clustering result. The clustering itself yields only
a binary tree without specifying which of the subtrees should be placed on the left or
right. Switching the two subtrees of any node will lead to a di®erent dendrogram
although it still re°ects the same hierarchical clustering result. There exist the 2 n1
possible orderings of a binary tree with n leaf nodes, since in any of the n  1 non-leaf
nodes we have two choices. The chosen leaf order will highly in°uence the visuali-
zation and therefore the interpretation of the data based on the corresponding
J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com

dendrogram and heatmap.


This observation is not new at all and various approaches have been proposed to
by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

rearrange the leaf order of a dendrogram. However, all these methods are based
exclusively on the distance matrix, but do not exploit additional information pro-
vided by class labels as in our approach. Before we describe the principles of our
approach, we brie°y review the strategies from the literature that do not incorporate
class labels, since we borrow some algorithmic ideas from them.
The approaches1–4 try to rearrange the linear ordering of the leaf nodes in the
dendrogram such that the sum of the distances of adjacent leaves is minimized. Eisen
et al.1 proposed a simple heuristic strategy that switches subtrees according to
weights that can be assigned to the leaves (for example the average gene expression
over all experiments or the position of a gene in the chromosome). An implementation
of this method can be found in Ref. 2. After that several authors proposed di®erent
computationally feasible algorithms to optimize the leaf ordering.3,4 The optimization
of the ordering used by Bar-Joseph et al.3 is based on a recursive process, which
estimates the optimal ordering of subtrees in a bottom-up way. It resembles a
dynamic programming approach and includes a backtracking phase to improve
orderings made in previous steps of the algorithm. The algorithm's complexity is
Oðn 4 Þ, where n is the number of instances in the data set. The authors proposed
also an improvement of the algorithm that allows the earlier search termination and
can dramatically decrease the computation time in average. The algorithm in Ref. 4
is also based on dynamic programming using a three-dimensional table that stores
the costs of the best linear ordering of the leaves in the subtree, starting from the
bottom. The algorithm complexity is proclaimed to be Oðn 3 Þ when a multiplication
instead of sum is used in computing the table elements. Other approaches to optimal
leaf ordering can be found in Refs. 5–8. But they also do not consider the class labels
in the optimization process. Buchta and Hahsler7 adopted the ordering concept,
described in Ref. 3, where the minimal path length is de¯ned to be an optimal
ordering. Hahsler et al.8 implemented several seriation/sequencing techniques to
reorder matrices, dissimilarity matrices and dendrograms. Sakai9 proposes the R
package for sorting the dendrogram based on the average distance of subtrees.
In many studies in the ¯eld of medicine and biology, the elements of the data set,
characterized by the number of features (e.g. gene expression levels in microarray),

1550012-2
Optimized leaf ordering

are also tagged by some nominal class labels, as for example gene ontology (GO)
terms or disease subtypes. In such cases, it is often of interest to see whether the
clustering result is more or less coherent with the class labels, i.e. instances with
the same label also cluster together. If this is the case, the distance measure can
be considered suitable to classify unlabeled instances in nearest neighbor fashion.
Figure 1(a) shows a dendrogram for microarray data with three di®erent tissue
types,10 obtained by applying agglomerative hierarchical clustering to leukemia data
set. The Euclidean distance measure was used to compute the distance matrix. The
branches with the same color correspond to the same tissue type. It can be seen that
most of the tissue types cluster together, but some instances are apart from their
J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com

proper group of tissue type. The ordering of the leaf nodes in Fig. 1(b) obtained by
simply °ipping some of the inner nodes results in dendrogram that is even more
by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

consistent with class labels (tissue types).


With such an optimized dendrogram it is easier to interpret the interrelation
between the available class labels and clusters. Moreover, it is possible to see whether
one can predict the class for unlabeled instances based on the distance that was used
for clustering.
In our study, we optimized the leaf ordering according to the class labels. To our
knowledge it is the ¯rst time to provide a leaf ordering algorithm based on the class

(a)

(b)

Fig. 1. Hierarchical clustering of three tissue types: 25 acute myeloid leukemia (AML) samples; nine
T-lineage acute lymphoblastic leukemia (ALL) samples; and 38 B-lineage ALL from acute leukemia
patients.10 (a) Dendrogram obtained from clustering. (b) Dendrogram after rearrangement of the leaf nodes.

1550012-3
N. Novoselova, J. Wang & F. Klawonn

label information. We reformulated the standard problem of optimal leaf ordering


and developed a dynamic programming algorithm to ¯nd the optimal or a near
optimal solution. We implemented our dynamic programming algorithm together
with an alternative approach using a genetic algorithm (GA) to rearrange the leaves
in a dendrogram. We performed several experiments on simulated and real data sets
and compared the results to the heuristic approach.1 Our algorithm together with all
the necessary functions is available as a R package at http://cran.r-project.org/web/
packages/ReorderCluster.

2. Leaf Ordering with Class Labels


J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com

In this section, we will formalize the problem of optimal leaf ordering of instances
by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

according to their class labels. Then we provide our dynamic programming algorithm
to solve this problem and some modi¯cations which help to reduce its time com-
plexity. After that we will brie°y describe our approach to optimize the leaf ordering
with class labels using a GA.

2.1. Leaf ordering with dynamic programming


2.1.1. Problem de¯nition
In our study, we assume that a distance measure between the genes is given and
dendrogram, i.e. the binary tree is constructed by the method in Ref. 1 or by another
technique where the leaves in the tree correspond to the instances in the data set. Let
T be the resulting tree with n leaves. x1; x2 ; . . . ; xn denote the individual leaves of T
and v1; v2 ; . . . ; vn1 the inner nodes. Let c1; c2 ; . . . ; cn denote the class labels of the
corresponding leaves. According to Ref. 3, the linear ordering which preserves the
structure of T is de¯ned to be an ordering of the leaves of T generated by °ipping
internal nodes or changing the order between the two subtrees rooted at vi 2 T . Since
T has n  1 inner nodes, there are 2 n1 possible linear orderings that do not change
the structure of T . The task is to ¯nd an ordering of leaves of the hierarchical tree
such that the instances (leaf nodes) with the same class labels will be as close together
as possible. To formalize this task we need to introduce an objective function to judge
the quality of a particular linear ordering of instances, taking into account the class
labels and the constraints given by the binary tree. A particular leaf ordering  2 
where  is the set of all possible permutations of f1; . . . ; ng induces a permuted
vector c ¼ ðcð1Þ;...; cðnÞ Þ of class labels corresponding to the permuted leaves. Let ni;j
be the length of the jth sequence of elements ðcðkÞ;...; cðkþni;j Þ Þ from c in which only
the ith class label occurs without being interrupted by the elements of other class
labels. For  2 , we de¯ne the following function
X
K X
pi
F 1 ðT Þ ¼ ðni;j Þ coef ; ð1Þ
i¼1 j¼1

1550012-4
Optimized leaf ordering

where pi is number of individual sequences of elements with the ith class label when T
is ordered according t o , K is the number of di®erent class labels, coef 2 1; 2 is the
parameter to be chosen that reinforces the in°uence of longer sequences.
The value of coef is equal to 1.5 in our experiments. The coef values in the interval
1; 2 give the similar results. Our goal is to ¯nd such an ordering  that the F 1 ðT Þ is
maximized, in other words, in the optimal ordering the sequences of identical class
labels should be as long as possible. Note that not all permutations  2  are ad-
mitted but only those that are consistent with the given dendrogram. Otherwise, one
could easily optimize the objective function (1) by simply enumerating the instances
by the classes.
J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com

2.1.2. Algorithm description


by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

We ¯nd an optimized ordering that conforms to the given dendrogram using


dynamic programming approach.3 The algorithm is recursive and proceeds in a
bottom up fashion from the leaf nodes of the tree T to its root. The optimized
ordering of each subtree v is determined only after computing the optimized leaf
orderings of its child nodes vl and vr where vl and vr are the left and the right
subtrees of v.
To save the optimized ordering and the corresponding value of the function F 1 ðvÞ
for each internal node v of T we de¯ne the matrix A of size n  n. We also de¯ne
three auxiliary matrices maxI; maxJ, and R of size n  n. Each cell ðu; wÞ; u ¼
1; . . . ; n; w ¼ 1; . . . ; n in these matrices corresponds to a pair of leaf nodes. The ele-
ment ðu; wÞ of matrix A stores the function value corresponding to the optimized leaf
ordering of node v, when u is the leftmost element of v and w is the rightmost element
of v. For each pair of leaves ðu; wÞ there exists only one subtree v for which the value
of A½u; w is computed. This is the subtree rooted at the least common ancestor of u
and w. In Fig. 2, node v is the least common ancestor of u and w.
For any other subtree vr1 (see Fig. 2) that is not the least common ancestor of u
and w both leaves then belong to its left v or right vr2 subtree and therefore do not
take part in the computing of the optimized leaf ordering for the node vr1 . Therefore,
we do not need to use an additional dimension for matrix A in order to store the node
for which the optimized leaf ordering with the elements ðu; wÞ is computed.

1 2
, , , ,

u 1 1 w

Fig. 2. Binary tree with a root node vr1 , where node v is the least common ancestor of elements u and w.

1550012-5
N. Novoselova, J. Wang & F. Klawonn

The computation of the optimized function value for the internal node v is based
on the known optimized function values of child nodes vl and vr .
To compute the value of the cell ðu; wÞ of matrix A for node v we must check
all feasible leaves i1 ; j1 , where i1 is the rightmost leaf of vl and j1 is the leftmost leaf
of vr . Then the optimized function value can be de¯ned as
8
>
> max ðA ½u; i1  þ Avr ½j1 ; wÞ; if classði1 Þ! ¼ classðj1 Þ
>i1 2vl;r ;j1 2vr;l vl
>
>
>
< max
A½u; w ¼ i1 2vl;r ;j1 2vr;l !
>
>
>
> Avl ½u; i1  þ Avr ½j1 ; w  R½u; i1  coef 
>
> ; if classði1 Þ ¼ classðj1 Þ;
: R½w; j1  coef þ ðR½u; i1  þ R½w; j1 Þ coef
J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com

ð2Þ
by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

where classðiÞ is the class label of the ith element.


The cell ðu; wÞ of matrix R stores the length of the sequences of leaves of the
internal node v (starting from the rightmost element w to the leftmost element uÞ
which have the same class labels and are not interrupted by elements of other class
labels. The cell ðu; wÞ of matrices maxI, maxJ stores the leaves i1 2 vl;r and j1 2 vr;l ,
respectively, which give the optimal value for A½u; w. After computing the function
value and the corresponding optimized ordering for all pairs ðu; wÞ, u 2 vl , w 2 vr for
node v, we can de¯ne the optimal ordering for node v as the pair ðuopt ; wopt Þ that
yields the maximal function value F 1 ðvÞ ¼ A½uopt ; wopt  ¼ maxu2vl ;w2vr A½u; w. At
each stage of this bottom up recursive process of computing the optimized function
value for each inner node v we update the matrices maxI; maxJ in order to store the
intermediate leaves that provide the optimized value for each pair of ðu; wÞ elements
of this node. At the end of the recursive process we obtain the optimized function
value for the whole tree T as the best value F 1 ðvroot Þ of the root node vroot . After
that, using the matrices maxI; maxJ, the sequence of leaves which gives the best
function value and corresponds to the optimized leaf ordering can be recovered using
a backtracking scheme.
The main scheme of this approach is sketched in Algorithm 1.
In lines 11–53 of the algorithm the optimized leaf ordering is computed for all
possible leftmost elements of the left subtree vl of node v. In lines 17–53, of the
algorithm the optimized leaf ordering is computed for all possible rightmost elements
of the right subtree vr of node v with ¯xed u 2 vl . In lines 25–37, the maximal value of
the optimization function and the corresponding leaf ordering is computed for the
¯xed elements u 2 vl and w 2 vr searching through all feasible intermediate leaves
i1 2 vl and j1 2 vr .
The matrix element ðu; wÞ is calculated in lines 27–30. Since the matrix A is
symmetric, in lines 38 and 39 the maximal value of optimized function is stored as
A½u; w ¼ A½w; u. In lines 33 and 34, the intermediate leaves i1 , j1 that yield the
maximum value of A½u; w are updated in the matrices maxI; maxJ. The elements
of the matrix R are calculated in lines 42–51. They are used for evaluating the

1550012-6
Optimized leaf ordering

u w
1 1 1 2 2 1 1
[ , ] [ , ]

Fig. 3. The elements of the matrix R for subtree v.


J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com

objective function for the internal nodes and to store the lengths of sequences of
by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

leaves with the same class label from the left and right side of each subtree v. For
example, in Fig. 3 for the subtree v with leaves having class labels 1 or 2, R½u; w ¼
R½j1 ; w ¼ 2 and R½w; u ¼ R½i1 ; u ¼ 3 holds.
As a result of the algorithm we obtain the matrix A with elements ðu; wÞ corre-
sponding to the function values for the optimized leaf ordering from the leftmost
element u to the rightmost element w of the corresponding internal node v. The
maximal element ðuopt ; wopt Þ of the matrix A corresponds to the optimized ordering
of the tree T .
Using the matrices maxI; maxJ we can recover the sequence of elements in the
optimized leaf ordering of the whole tree T .

2.1.3. Improvements and variation of the original algorithm


As the time complexity of our algorithm is Oðn 4 Þ because it is based on similar
principles as the one in Ref. 3, where n is the number of leaves in the tree, we propose
some modi¯cation to decrease the computation time. The ¯rst modi¯cation consists
in merging the elements of the node v when all of them have the same class label.
Flipping the subtrees of such a node does not change the value of the objective
function. Therefore, we have developed Algorithm 2 that rearranges the original tree
and merges the subtrees with the same class label into one element. Algorithm 1 is
then applied to the simpli¯ed tree. The simpli¯ed tree preserve the sequence lengths
of the original subtrees and also the initial function values for the merged subtrees
are taken into account.
Figure 4 shows the result of applying Algorithm 2 to the subtree v.
The proposed Algorithms 1 and 2 are implemented in the R programming
language.11 Pure R code is relatively slow compared to C or Cþþ. To reduce
the computation time the second modi¯cation consists in converting Algorithm 1
to Cþþ and call it then from the R environment. For this purpose, we used the
Rcpp R package,12 that makes it simple to connect Cþþ to R. The experiments
with di®erent data sets have shown a considerable speed up of the algorithm (see
Table 1).

1550012-7
N. Novoselova, J. Wang & F. Klawonn

Inputs: is the root node of the tree


, the matrix to store the values of the optimized function
, the matrices to store the intermediate leaves
, the matrix to store the lengths of sequences with the same class labels for internal
subtrees
, the vector of class labels for each leaf node
Outputs: matrices
1: OrderingClass( )
2: if size( )=1
3:
4:
J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com

5:
by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

6:
7: return(
8: else
9: ( =OrderingClass ( )
10: ( = OrderingClass ( )
11: for all leaves
12: if // vremI is the number of intermediate leaves of the left
13: = // subtree to search for the optimal value of
14: else
15: =
16: end if
17: for all leaves
18: if
19: = //vremJ is the number of intermediate leaves of the
20: else //right subtree to search for the optimal
21: =
22: end if
23: max=-1
24: for all leaves
25: for all leaves
26:
27:
28:
29: +
30: end if
31: if >max
32: max=
33:
34:
35: end if
36: end if

Algorithm 1. Optimized leaf ordering with class labels.

1550012-8
Optimized leaf ordering

37: end if
38: =max
39:
40: =
41: =
42: if((all class( ) are equal) and
43: (class( )=class( )
44:
45: else
46:
J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com

47: end if
48: if((all class( ) are equal) and
by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

49: (class( )=class( )


50:
51: else
52:
53: end if
54: end if
55: end if
56: end if
return(
end

Algorithm 1. (Continued )

2.2. Leaf ordering with GA


We also developed an approach to optimize the leaf ordering with respect to the
class labels based on a di®erent objective function. Dynamic programming is not
suitable for this objective function. Instead, we used a GA. GA is a powerful sto-
chastic optimization technique, inspired by natural evolutionary processes to evolve
a population of individual task solutions to reach the global optimum of the objective
function or at least a good solution. GA is especially appropriate to solve optimi-
zation problems with complex solution spaces and having many parameters.13
The optimization task in this case of searching the optimal leaf ordering with class
labels can be formulated as follows.
Let  2 , where jj ¼ 2 n1 be the set of all possible leaf orderings of the pre-
viously constructed hierarchical tree T with n leaves. Each ordering is the result
of leaf permutations or °ipping the subtrees of the internal nodes of the tree. Each
leaf i corresponds to the particular class label ci , therefore the resulting ordering
can be considered as the vector of class labels of the permuted sequence of leaves,
c ¼ ðcð1Þ ; cð2Þ ; . . . ; cðnÞ Þ, where  is a permutation of f1; . . . ; ng. It is required to ¯nd
such an ordering  of the leaves of the tree that is consistent with the tree and which

1550012-9
N. Novoselova, J. Wang & F. Klawonn

Input: is the root node of the tree


, the matrix to store the values of the optimized function
, the matrix to store the lengths of sequences with the same class labels for internal
subtrees.
Output: the simplified tree .
1: SubTree( )
2: {
3: if all the leaves of have the same class label
4: return(the leftmost element leaf of )
5: end if
6: if is not a singleton
J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com

7: (leaf, =SubTree( , )
by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

8: if(leaf!=NULL)
9: len=size( //number of elements of subtree
10: leaf // subtree is a singleton
11: , =
12: , = len
13: end if
14: end if
15: if is not a singleton
16: (leaf, =SubTree( , )
17: if(leaf!=NULL)
18: len=size( //number of elements of subtree
19: leaf // subtree is a singleton
20: , =
21: , =len
22: end if
23: end if
24: return( , leaf=NULL, )
25: end
26:

Algorithm 2

minimizes the objective function


X
K X
pi
F 2 ðT Þ ¼  ni =n ni;j =ni  lnðni;j =ni Þ; ð3Þ
i¼1 j¼1

where F 2 ðT Þ is the partition entropy value for elements in c, K is the number of class
labels, ni is the number of elements of the ith class label, pi is the number of con-
tinuous sequences of elements of class i, ni;j is the number of elements in the jth
sequence with the ith class label. In order to ¯nd the optimal leaf ordering the
function F 2 ðT Þ must be minimized.

1550012-10
Optimized leaf ordering

, ,

1 1 1 2 2 2 1 1 1 2 2 1
, = , =

Fig. 4. Merging of subtrees vl;l and vr;l having the same class labels.
J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com

A binary coding of the solutions is an obvious choice in this case and suits very
by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

well to the concept of GA. Since the permutation of the leaf nodes is induced by
either switching or not switching the subtrees of any of the n  1 inner nodes, a
solution can be coded as a binary vector of length n  1. Such a binary vector induces
a unique permutation of the leaf nodes. 0 stands for not switching the subtrees of the
corresponding node, whereas 1 stands for switching them.
In Fig. 5, we can see an example of the correspondence between the individual of
GA and the resulting tree.
The value of the ¯tness function for an individual of the GA is computed
according to (3) for the decoded tree. The ¯tness function indicates the e±ciency of
the solution for the optimization task, encoded in the GA individual. The initial
population is generated at random. For the realization of the GA optimization
scheme the standard crossover and mutation operations are applied to the popula-
tion of GA individuals. The optimization scheme consists of repeated execution of
several steps: decoding of individuals, calculation of the ¯tness function, selection of
individuals according to the function values, recombination, and elitism operations.
At each generation a new population of the GA is formed. The iteration process

Hierarchical trees

1 1 2 1 1 2 1 2 1 1 2 1

0 0 0 0 0 1 0 0 1 1
GA individual

Fig. 5. The encoding of the GA individuals.

1550012-11
N. Novoselova, J. Wang & F. Klawonn

continues until the GA stopping condition will be satis¯ed. As a stopping condition


either a pre-de¯ned, maximal number of generations can be used or the algorithm is
terminated when no improvement of the best solutions is achieved anymore. In our
study, we used the standard binary GA, implemented in the R package genalg.14 The
package genalg includes R-based GAs for binary and °oating point chromosomes.
We used a population size of 200 individuals and a maximum number of iterations
equal to 100.

3. Experiments
We have performed several experiments to test our approach of reordering the leaves
J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com

in dendrogram in order to make the same class labels to be located as close as


by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

possible. The resulting ordered collection of data cases (e.g. genes in microarray
analysis) is displayed in the form of a heatmap where rows represent cases and
columns represent single features (e.g. experiments in microarray analysis). Each
data point corresponds to a colored box, where the color represents a feature value.
The optional vertical and horizontal side bars can be used to annotate the rows and
columns of the heatmap. The hierarchy of the tree is also drawn on the left side of the
heatmap.

3.1. Methods
We have used two arti¯cial data sets and one biological data set to test the optimal
leaf ordering algorithm and its e®ect on the visualization of the results of hierarchical
clustering. The ¯rst simulated data set consists of 90 cases, characterized by the
values of 90 features. For each subsequent subset of 15 cases we set 40 consecutive
features to 1 (with 10 features shift), and then °ip them to 0 or 1 with probability
p ¼ 0:01. The rest of the features in the data set were generated at random. As a
result the six distinctive clusters are formed. The second arti¯cial data set is more
complex and consists of 400 data cases, each with 100 feature values. The similar
generation scheme as for the ¯rst data set is applied and eight distinctive clusters are
formed. For both data sets we have manually de¯ned the vector of class labels
c ¼ ðc1 ; c2 ; . . . ; cn Þ; ci 2 f1; 2; 3g, which does not conform to the generated clusters.
For the ¯rst data set each subsequent subset of 15 cases was di®erently labeled and
the labels of the ¯rst and the second half of data set were the same. For the second
data set the ¯rst 125 cases was attributed to class 1, the following 150 cases to class 2
and the last 125 cases to class 3.
The real biological data set presents cell cycle data.15 The cell cycle or cell-division
cycle is the series of events that take place in a cell leading to its division and
duplication (replication). The data set presents the results of microarray analysis and
consists of expression pro¯les of 800 genes, which are cell cycle regulated in Sac-
charomyces cerevisiae.3 These genes were assigned to ¯ve groups termed G1, S,
S/G2, G2/M, and M/G1, which present distinct phases of the cell cycle. The data set

1550012-12
Optimized leaf ordering

is a combination of three di®erent experiments (cdc15, cdc28, and  factor) that


together constitute 59 measurements of gene expression. In our experiment, we have
used the 24 cdc15 measurements.
We have applied hierarchical clustering with heuristic leaf ordering1 to all data
sets and then used the proposed optimal leaf ordering algorithm and the GA-based
approach to improve the orderings with respect to the class labels. In GA experi-
ments, the results were slightly better for the value of objective function in (1) when
the objective in (3) was optimized. All the algorithms were implemented in R en-
vironment with the R language. In order to compare the results for each leaf ordering
we have computed the value of the objective function in (1).
J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com

3.2. Results
by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

The results of applying the optimal leaf ordering and the hierarchical clustering
algorithm with heuristic ordering1 to the ¯rst simulated data set are shown in
Fig. 6. The additional color bar at the side of the heatmap presents the class label
annotations.
For the ¯rst simulated data set the optimal leaf orderings with class labels using
the dynamic programming algorithm and the GA yielded similar results (Fig. 6) with
less dispersed cluster labels than in hierarchical clustering alone. The evaluation
function for the optimal leaf ordering has higher values than for the unordered
hierarchical clustering (Table 1). For the second simulated data set, the optimal leaf
orderings are presented in Fig. 7. As the de¯ned class labels do not conform to the
generated clusters they are more dispersed for pure hierarchical clustering. The op-
timal leaf orderings better preserve the similarity of data pro¯les (Fig. 7). The value
of the evaluation function is slightly better for the optimal leaf ordering algorithm
with dynamic programming (Table 1).
We compared the results of the optimal leaf ordering and heuristic ordering for
the biological data set, considering the cell cycle groups as the class labels. In Fig. 8,
we can see that the optimal ordering with classes can perfectly locate similar class

(a) (b) (c)

Fig. 6. Comparison of leaf orderings using hierarchical clustering and two proposed approaches for the ¯rst
simulated dataset: (a) hierarchical clustering,1 (b) optimal leaf ordering (dynamic programming), and (c)
optimal leaf ordering (GA).

1550012-13
N. Novoselova, J. Wang & F. Klawonn

(a) (b) (c)

Fig. 7. Comparison of leaf orderings using hierarchical clustering and two proposed approaches for the
second simulated dataset: (a) hierarchical clustering,1 (b) optimal leaf ordering (dynamic programming),
J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com

and (c) optimal leaf ordering (GA).


by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

Fig. 8. Comparison of leaf orderings using hierarchical clustering and two proposed approaches for
biological data set.

labels close to each other. Unlike hierarchical clustering, the cell cycle order is more
preserved. Despite the optimization function of the optimal leaf ordering algorithm
does not take into account the distance similarities between gene expression pro¯les
they are better arranged according to similarity then the hierarchical clustering with
heuristic ordering. The peaks of gene expression for optimal leaf ordering change
continuously from the bottom to the top of the heatmap plot (Fig. 8). Therefore, the
optimal ordering with classes correctly identi¯ed the order of the groups in the real
cell cycle process. The results of the GA approach are slightly worse with a smaller
value of the objective function (Table 1).
We have compared the computation time of the di®erent approaches, including
the dynamic programming algorithm for leaf ordering with and without using the
Cþþ function, and the approach using the GA (Table 1). The results were obtained
on a BENQ laptop with a Microsoft Windows 7 (Service Pack 1) operating system
and an Intel(R) Core(TM)2 Duo CPU T8100 @2.10 GHz (2.00 GB RAM) processor.
As can be seen the computation time of the dynamic programming approach with
modi¯cations is the best.

1550012-14
J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com
by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

Table 1. Time complexity and values of the optimization function for optimal leaf ordering algorithms subject to class labels.

Time complexity Evaluation function (to be maximized in (1))


Dynamic Dynamic Dynamic Dynamic
Number programming programming algorithm Approach programming programming algorithm Approach Hierarchical
Data set of cases algorithm (original) (with modi¯cations) with GA algorithm (original) (with modi¯cations) with GA clustering
Simulated 90 0.85 s 0.014 s 1.27 min 444.82 444.82 444.82 396.69
data 1

1550012-15
Simulated 400 7.34 s 1.07 s 7.34 min 3527.0 3527.0 3406.0 2900.0
data 2
Spellman 800 60.0 min 28.0 s 30.0 min 3245.0 3245.0 2960.8 2355.0
data set
Optimized leaf ordering
N. Novoselova, J. Wang & F. Klawonn

3.3. Discussion
Our optimized leaf ordering algorithm can work with any binary tree to order the
leaves according to the class labels, preserving the clustering structure. We have
made improvements to the original version of our algorithm that decreases the
computational time especially for large data sets.
For all analyzed data sets the resulting orderings with the proposed dynamic
programming algorithm and GA-based approach were almost the same with only
small variations in the resulting heatmaps. As the GA approach is as a rule more time
consuming (see Table 1) its use for large data sets is not recommended. The proposed
dynamic programming algorithm with modi¯cations is the most e±cient among all
J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com

the examined approaches and is recommended for practical applications. Optimized


ordering with classes helps to determine the relationship between the known class
by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

labels and the clustering structure of a data set, allowing to relate the classes to the
speci¯ed clusters and to determine the relationship between di®erent classes. By
applying the optimized ordering algorithm to a biological data set we have not only
obtained a closer grouping of similar class labels, but also a better ordering of genes
according to the similarity of their expression pro¯les, that corresponds to the time
order of the real cell cycle process.

4. Conclusion
In this paper, we have studied the problem of ordering leaves of a tree representing
hierarchical clustering of data cases with class labels. We have presented an e±cient
algorithm that allows to ¯nd such a leaf ordering that the data cases with the same
class labels are located as close as possible in the reordered hierarchical tree. We
have reformulated the standard problem of optimal leaf ordering, which optimizes
the function of similarities of data elements. In our case, we try to optimize the
function which is computed on the basis of class labels of individual data elements in
order to place the same class labels together. Several modi¯cations to the original
version of the algorithm have been proposed in order to make it more time e±cient.
Apart from the dynamic programming approach we have also developed an ap-
proach to optimal leaf ordering with classes using a GA. We have made several
experiments in order to compare the proposed approaches and standard hierarchical
clustering with heuristic ordering.1 According to our results the dynamic pro-
gramming leaf ordering algorithm with modi¯cations has both the best value of the
objective function and the smallest computation time. Therefore, it can be recom-
mended for practical applications. For the biological data set the dynamic pro-
gramming algorithm lead to a better ordering according to similarities of gene
pro¯les. Continuous drift of the peak expression values of genes in the heatmap can
be clearly seen (Fig. 8). Currently we are investigating the possibility to take into
account data cases with missing class labels for the leaf ordering. We expect that the

1550012-16
Optimized leaf ordering

location of such data cases in the resulting tree can facilitate the prediction of their
class label.
The implementation of our algorithms with all the necessary functions is available
as a R package at http://cran.r-project.org/web/packages/ReorderCluster.

Acknowledgments
We thank the colleagues from the Cellular Proteomics Research Group at the
Helmholtz Centre for Infection Research for many useful comments on this problem.
Natalia Novoselova acknowledges the support by a research grant A/13/00004 from
J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com

the German Academic Exchange Service (DAAD). This study was co-¯nanced by
the European Union (European Regional Development Fund) under the Regional
by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

Competitiveness and Employment objective and within the framework of the


Bi2SON Project Einsatz von Informations-und Kommunikationstechnologien zur
Optimierung der biomedizinischen Forschung in Südost-Niedersachsen.

References
1. Eisen MB, Spellman PT, Brown PO, Botstein D, Cluster analysis and display of genome-
wide expression patterns, Proc Natl Acad Sci USA 95(25):14863–14868, 1998.
2. Eisen MB, Cluster v. 2.11 and TreeView v. 1.5, 2000, Available at http://rana.lbl.gov/
EisenSoftware.html.
3. Bar-Joseph Z, Gi®ord DK, Jaakkola TS, Fast optimal leaf ordering for hierarchical
clustering, Bioinformatics 17:22–29, 2001.
4. Biedl T, Brejova B, Demaine ED, Hamel AM, Vinar T, Optimal arrangement of leaves in
the tree representing hierarchical clustering of gene expression data, Technical Report
CS-2001-14, Department of Computer Science, University of Waterloo, 2001.
5. Chae M, Chen JJ, Reordering hierarchical tree based on bilateral symmetric distance,
PLoS One 6(8):e22546, 2011, doi: 10.1371/journal.pone.0022546.
6. Brandes U, Optimal leaf ordering of complete binary trees, J Discrete Algorithms
5(3):546–552, 2007.
7. Buchta C, Hahsler M, cba: Clustering for business analytics, R package version 0.2-14,
2014, Available at http://CRAN.R-project.org/package¼cba.
8. Hahsler M, Buchta C, Hornik K, Infrastructure for seriation, R package version 1.0-14,
2014, Available at http://CRAN.R-project.org/package¼seriation.
9. Sakai R, Dendsort: Modular leaf ordering methods for dendrogram nodes, R package
version 0.3.2, 2014, Available at http://CRAN.R-project.org/package¼dendsort.
10. Handl J, Knowles J, Kell DB, Computational cluster validation in post-genomic data
analysis, Bioinformatics 21:3201–3212, 2005, doi: 10.1093/bioinformatics/bti517.
11. R Core Team, R: A language and environment for statistical computing, R Foundation
for Statistical Computing, Vienna, Austria, 2014, Available at URL http://www.
R-project.org/.
12. Eddelbuettel D, Francois R, Rcpp: Seamless R and Cþþ integration, J Stat Soft 40
(8):1–18, 2011.
13. Goldberg DE, Genetic Algorithms in Search, Optimization and Machine Learning,
Addison-Wesley Publishing Company, Inc., USA, 1989.

1550012-17
N. Novoselova, J. Wang & F. Klawonn

14. Willighagen E, genalg: R based genetic algorithm, R package version 0.1.1.1, 2014,
Available at http://CRAN.R-project.org/package¼genalg.
15. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO,
Botstein D, Futcher B, Comprehensive identi¯cation of cell cycle-regulated genes of the
yeast Saccharomyces cerevisiae by microarray hybridization, Mol Biol Cell 9:3273–3297,
1998.

Natalia Novoselova received the Ph.D. degree in computer


sciences from United Institute of Informatics Problems (UIIP),
J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com

National Academy of Sciences of Belarus (NASB) in 2008. She was


by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

a scienti¯c researcher from 1987 to 2000 in the Laboratory of Er-


gonomics, Institute of Engineering Cybernetics NASB, and has
been a senior scienti¯c researcher since 2000 in the Laboratory of
Bioinformatics, United Institute of Informatics Problems NASB in
Minsk, Belarus. Her research interests include data mining meth-
ods, notably the neural network, genetic algorithms and fuzzy systems and their
application to analysis of medical and biological data. She is an author of more than
20 research publications in these areas. In years 2008 and 2010, she was a recipient of
the two-month scienti¯c grants from German Academic Exchange Service, to con-
duct research in the ¯eld of bioinformatics at the Ostfalia University of Applied
Sciences, Wolfenbuettel, Gemany.

Junxi Wang received B.Sc. degree in Biotechnology/Bioinfor-


matics from the University of Emden/Leer (Germany) and the
Zhejiang University of Science and Technology (China) in 2014.
She works as a research assistant at the Helmholtz Centre for
Infection Research and is also pursuing her M.Sc. degree at Goethe
University in Frankfurt.

1550012-18
Optimized leaf ordering

Frank Klawonn received his M.Sc. and Ph.D. in mathematics


and computer science from the University of Braunschweig in 1988
and 1992, respectively. Before he became the head of the Lab for
Data Analysis and Pattern Recognition (DAPaR) at Ostfalia
University of Applied Sciences in Wolfenbuettel (Germany), he
was a Visiting Professor at Johannes Kepler University in Linz,
Austria (1996), at Rhodes University in Grahamstown, South
Africa (1997), a research fellow with British Telecom (2001) and
an Associate Professor at Emden/Leer University (until 2002). He is now also the
head of the Biostatistics Group at the Helmholtz Centre for Infection Research in
J. Bioinform. Comput. Biol. 2015.13. Downloaded from www.worldscientific.com

Braunschweig (Germany). His main research interests focus on techniques for in-
telligent data analysis, especially clustering, classi¯cation and robust statistical
by PRINCETON UNIVERSITY on 08/12/15. For personal use only.

methods. He is a member of the editorial board of international journals like Data


Mining, Modelling & Management (IJDMMM), Intelligent Data Analysis (IDA),
Evolving Systems (ES) and Fuzzy Sets and Systems (FSS).

1550012-19

You might also like