Rank Entropy Based Decision Trees for Monotonic Classification

Qinghua Hu, Member, IEEE, Xunjian Che, Lei Zhang, Member, IEEE, David Zhang, Fellow, IEEE, Maozu Guo, and Daren Yu

Abstract—In many decision making tasks, the values of features and decisions are ordinal. Moreover, there is a monotonic constraint that objects with better feature values should not be assigned to a worse decision class. Such problems are called ordinal classification with monotonicity constraints. Some learning algorithms have been developed to handle this kind of task in recent years. However, experiments show that these algorithms are sensitive to noisy samples and do not work well in real-world applications. In this work, we introduce a new measure of feature quality, called rank mutual information, which combines the robustness of Shannon's entropy with the ability of dominance rough sets to extract ordinal structures from monotonic datasets. We then design a decision tree algorithm (REMT) based on rank mutual information. Theoretical and experimental analysis shows that the proposed algorithm can obtain monotonically consistent decision trees if the training samples are monotonically consistent, and that its performance remains good when the data are contaminated with noise.

Index Terms—monotonic classification, rank entropy, rank mutual information, decision tree.

• Q. H. Hu, X. J. Che, M. Z. Guo and D. R. Yu are with Harbin Institute of Technology, Harbin, China. E-mail: huqinghua@hit.edu.cn.
• L. Zhang and D. Zhang are with the Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China. E-mail: cslzhang@comp.polyu.edu.hk.

This work is supported by the National Natural Science Foundation of China under Grants 60703013 and 10978011, the Key Program of the National Natural Science Foundation of China under Grant 60932008, the National Science Fund for Distinguished Young Scholars under Grant 50925625, and the China Postdoctoral Science Foundation. Dr. Hu is supported by The Hong Kong Polytechnic University (G-YX3B).

1 INTRODUCTION

ORDINAL classification is an important kind of task in management decision, evaluation and assessment, where the classes of objects are discretely ordinal instead of numerical or nominal. In these tasks, if there exists a constraint that the dependent variable should be a monotone function of the independent variables [5], [19], namely, given two objects x and x′, if x ≤ x′ then f(x) ≤ f(x′), we call this kind of task monotonic classification, also called ordinal classification with monotonicity constraints [2], [3], [5], or multi-criteria decision making [1], [4].

Monotonic classification tasks widely exist in real-world life and work. We select commodities in a market according to their price and quality; employers select their employees based on their education and experience; investors select stocks or bonds in terms of their probability of appreciation or risk; universities select scholarship recipients according to students' performance; editors make a decision on a manuscript according to its quality. If we collect a set of samples of a monotonic classification task, we can extract decision rules from the data for understanding the decisions and building an automatic decision model.

Compared with general classification problems, much less attention has been paid to monotonic classification in recent years [6], [7]. In some literature, a monotonic classification task was transformed from a k-class ordinal problem into k − 1 binary class problems [9], and a learning algorithm for general classification tasks was then employed on the derived data. In fact, this technique does not consider the monotonicity constraints in modeling.

Rule extraction from monotonic data attracts some attention from the domains of machine learning and decision analysis. Decision tree induction is an efficient, effective and understandable technique for rule learning and classification modeling [10], [11], [12], where a function is required for evaluating and selecting features to partition samples into finer subsets in each node. Several measures, such as Gini, chi-square and gain ratio, were introduced into decision tree construction algorithms [10], [12]. It was concluded that Shannon's entropy outperforms other evaluation functions in most tasks [13]. Information entropy based decision tree algorithms have been widely discussed and used in machine learning and data mining. Unfortunately, Shannon's information entropy cannot reflect the ordinal structure in monotonic classification. Even given a monotone dataset, the learned rules might not be monotonic. This does not agree with the underlying assumption of these tasks and limits the application of these algorithms to monotonic classification.

In order to deal with monotonic classification, entropy based algorithms were extended in [14], where both the error and the monotonicity were taken into account while building decision trees; a non-monotonicity index was introduced and used in selecting attributes. In [15], an order-preserving tree-generation algorithm was developed and a technique for repairing non-monotonic decision trees was provided for multi-attribute classification problems with several linearly ordered classes. In 2003, a collection of pruning techniques were proposed to make a non-monotone tree monotone by pruning [33]. Also in 2003, Cao-Van and De Baets constructed a tree based algorithm to avoid violating the monotonicity of the data [16], [41]. In [32], van de Kamp, Feelders and Barile proposed a new algorithm for learning isotonic classification trees (ICT). In 2008, Xia et al. extended the Gini impurity used in CART to ordinal classification and called it ranking impurity [17]. Although the algorithms mentioned above enhance the capability of extracting ordinal information, they cannot ensure that a monotone tree is built even if the training data are monotone.

In 2009, Kotlowski and Slowinski proposed an algorithm for rule learning with monotonicity constraints [18]. They first monotonized datasets using a nonparametric classification procedure and then generated a rule ensemble consistent with the monotonized datasets. In 2010, Feelders also discussed the effectiveness of relabeling in monotonic classification [40]. If there is a small quantity of inconsistent samples, this technique may be effective. However, if a lot of inconsistent samples exist, these techniques have to revise the labels of the inconsistent samples, as shown in [18], which may lead to information loss. In 2010, Jimenez et al. introduced a new algorithm, POTMiner, to identify frequent patterns in partially-ordered trees [19]. This algorithm does not consider the monotonicity constraints.

In addition, by replacing equivalence relations with dominance relations, Greco et al. generalized the classical rough sets to dominance rough sets for analyzing multi-criteria decision making problems in [20]. The model of dominance rough sets can be used to extract rules for monotonic classification [36], [39]. Since then, a collection of studies have been reported to generalize or employ this model in monotonic classification [20], [21], [22], [23], [24], [35]. It was reported that dominance rough sets produce large classification boundaries on some real-world tasks [28], which makes the algorithm ineffective as no or few consistent rules can be extracted from the data.

Noise has a great influence on modeling monotonic classification tasks [25]. Intuitively, algorithms considering monotonicity constraints should offer a significant benefit over nonmonotonic ones because the monotonic algorithms extract the natural properties of the classification tasks. Correspondingly, the learned models should be simpler and more effective. However, numerical experiments showed that the ordinal classifiers were statistically indistinguishable from their non-ordinal counterparts [25]. Sometimes ordinal classifiers performed even worse than nonordinal ones. What is the problem with these ordinal algorithms? The authors attributed this unexpected phenomenon to the high levels of nonmonotonic noise in the datasets. In fact, both the experimental setup and noisy samples lead to the incapability of these ordinal algorithms. In order to determine whether applying monotonicity constraints improves the performance of an algorithm, one should make a monotone version of the algorithm that resembles the original algorithm as much as possible and then see whether performance improves.

More than 15 years ago, robustness was already considered an important topic in decision analysis [26]. In monotonic classification, decisions are usually given by different decision-makers, whose attitudes may vary from time to time. Therefore, a lot of inconsistent samples exist in gathered datasets. If the measures used to evaluate the quality of attributes in monotonic classification are sensitive to noisy samples, the performance of the trained models will degrade. An effective and robust measure of feature quality is required in this domain [38].

The objective of this work is to design a robust and understandable algorithm for monotonic classification tasks, where there are usually many inconsistent samples in the noisy context. We introduce a robust measure of feature quality, called rank entropy, to compute the uncertainty in monotonic classification. As we know, Shannon's information entropy is robust in evaluating features for decision tree induction [10], [13], [34]. Rank entropy inherits this robustness [27]. Moreover, this measure is able to reflect the ordinal structures in monotonic classification. We then design a decision tree algorithm based on the measure. Numerical tests are conducted on artificial and real-world datasets, and the results show the effectiveness of the algorithm. The contributions of this work are threefold. First, we discuss the properties of rank entropy and rank mutual information. Second, we propose a decision tree algorithm based on rank mutual information. Finally, a systematic experimental analysis is presented to show the performance of the related algorithms.

The rest of the paper is organized as follows. Section 2 introduces the preliminaries on monotonic classification. Section 3 gives the measure of ordinal entropy and discusses its properties. The new decision tree algorithm is introduced in Section 4. Numerical experiments are shown in Section 5. Finally, conclusions and future work are given in Section 6.

2 PRELIMINARIES ON MONOTONIC CLASSIFICATION AND DOMINANCE ROUGH SETS

Let U = {x1, · · · , xn} be a set of objects and A be a set of attributes to describe the objects; D is a finite ordinal set of decisions. The value of xi on attribute a ∈ A or on D is denoted by v(xi, a) or v(xi, D), respectively. The ordinal relation between objects in terms of attribute a or D is denoted by ≤.
We say xj is no worse than xi in terms of a or D if v(xi, a) ≤ v(xj, a) or v(xi, D) ≤ v(xj, D), denoted by xi ≤a xj and xi ≤D xj, respectively. Correspondingly, we can also define xi ≥a xj and xi ≥D xj. Given B ⊆ A, we say xi ≤B xj if for ∀a ∈ B we have v(xi, a) ≤ v(xj, a). Given B ⊆ A, we associate an ordinal relation on the universe defined as

$$R_B^{\le} = \{(x_i, x_j) \in U \times U \mid x_i \le_B x_j\}.$$

A predicting rule is a function f : U → D, which assigns a decision in D to each object in U. A monotonically ordinal classification function should satisfy the following constraint:

$$x_i \le x_j \Rightarrow f(x_i) \le f(x_j), \quad \forall x_i, x_j \in U.$$

As to classical classification tasks, we know that the constraint is

$$x_i = x_j \Rightarrow f(x_i) = f(x_j), \quad \forall x_i, x_j \in U.$$

Definition 1: Let DT = <U, A, D> be a decision table, B ⊆ A. We say DT is B-consistent if ∀a ∈ B and xi, xj ∈ U, v(xi, a) = v(xj, a) ⇒ v(xi, D) = v(xj, D).

Definition 2: Let DT = <U, A, D> be a decision table, B ⊆ A. We say DT is B-monotonically consistent if ∀xi, xj ∈ U with xi ≤B xj, we have xi ≤D xj.

It is easy to show that a decision table is B-consistent if it is B-monotonically consistent [5].

In real-world applications, datasets are usually neither consistent nor monotonically consistent due to noise and uncertainty. We have to extract useful decision rules from these contaminated datasets. Dominance rough sets were introduced to extract decision rules from such datasets. We present some notations to be used throughout this paper.

Definition 3: Given DT = <U, A, D>, B ⊆ A, we define the following sets:
1) [xi]≤_B = {xj ∈ U | xi ≤B xj};
2) [xi]≤_D = {xj ∈ U | xi ≤D xj}.

We can easily obtain the following properties:
1) If C ⊆ B ⊆ A, we have [xi]≤_C ⊇ [xi]≤_B;
2) If xi ≤B xj, we have xj ∈ [xi]≤_B and [xj]≤_B ⊆ [xi]≤_B;
3) [xi]≤_B = ∪{[xj]≤_B | xj ∈ [xi]≤_B};
4) ∪{[xi]≤_B | xi ∈ U} = U.

The minimal and maximal elements of decision D are denoted by dmin and dmax, respectively, and d≤_i = {xj ∈ U | di ≤ v(xj, D)}. Then d≤_min = U and d≤_max = dmax.

Definition 4: Given DT = <U, A, D>, B ⊆ A, and di a decision value of D. As to monotonic classification, the upward lower and upper approximations of d≤_i are defined as

$$\underline{R}_B^{\le} d_i^{\le} = \{x_j \in U \mid [x_j]_B^{\le} \subseteq d_i^{\le}\},$$
$$\overline{R}_B^{\le} d_i^{\le} = \{x_j \in U \mid [x_j]_B^{\ge} \cap d_i^{\le} \neq \emptyset\}.$$

The model defined above is called dominance rough sets, introduced by Greco, Matarazzo and Slowinski [20]. This model has been widely discussed and applied in recent years [21], [23], [24].

We know that d≤_i is the subset of objects whose decisions are equal to or better than di. $\underline{R}_B^{\le} d_i^{\le}$ is the subset of objects that consistently belong to $d_i^{\le}$: every object with attribute values no worse than theirs also takes a decision no worse than di, whereas $\overline{R}_B^{\le} d_i^{\le}$ is the subset of objects whose decisions might be equal to or better than di. Thus $\underline{R}_B^{\le} d_i^{\le}$ is the pattern consistently equal to or better than di.

It is easy to show that $\underline{R}_B^{\le} d_i^{\le} \subseteq d_i^{\le} \subseteq \overline{R}_B^{\le} d_i^{\le}$, and the subset of objects $BND_B(d_i^{\le}) = \overline{R}_B^{\le} d_i^{\le} - \underline{R}_B^{\le} d_i^{\le}$ is called the upward boundary region of $d_i^{\le}$ in terms of attribute set B.

Analogically, we can also give the definitions of the downward lower and upper approximations and the corresponding boundary region: $\underline{R}_B^{\ge} d_i^{\ge}$, $\overline{R}_B^{\ge} d_i^{\ge}$, $BND_B(d_i^{\ge})$. We can obtain that $BND_B(d_i^{\le}) = BND_B(d_{i-1}^{\ge})$ for ∀i > 1. Finally, the monotonic dependency of D on B is computed as

$$\gamma_B(D) = \frac{\left| U - \bigcup_{i=1}^{N} BND_B(d_i) \right|}{|U|}.$$

If the decision table DT is monotonically consistent in terms of B, we have $BND_B(d_i) = \emptyset$ for every di. In addition, as d≤_min = U and d≥_max = U, we have $\underline{R}_B^{\le} d_{\min}^{\le} = U$ and $\underline{R}_B^{\ge} d_{\max}^{\ge} = U$.

Dominance rough sets give us a formal framework for analyzing consistency in monotonic classification tasks. However, it has been pointed out that the decision boundary regions in real-world tasks are so large that we cannot build an effective decision model based on dominance rough sets, because too many inconsistent samples exist in the datasets according to the above definition [28]. Dominance rough sets are heavily sensitive to noisy samples: several mislabeled samples might completely change the trained decision models.
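To make these constructions concrete, here is a small illustrative sketch (in Python, with helper names of our own choosing, not from the paper) that computes the upward approximations of Definition 4 and the dependency γ_B(D) for a decision table such as the toy data in Table 1 below.

```python
def dominating_set(i, values):
    """[x_i]^<= : indices j whose attribute values are all >= those of x_i."""
    m = len(values[i])
    return {j for j in range(len(values))
            if all(values[j][k] >= values[i][k] for k in range(m))}

def dominated_set(i, values):
    """[x_i]^>= : indices j whose attribute values are all <= those of x_i."""
    m = len(values[i])
    return {j for j in range(len(values))
            if all(values[j][k] <= values[i][k] for k in range(m))}

def upward_union(c, labels):
    """d_c^<= : samples whose decision is equal to or better than class c."""
    return {j for j, y in enumerate(labels) if y >= c}

def upward_lower(c, values, labels):
    d = upward_union(c, labels)
    return {j for j in range(len(labels)) if dominating_set(j, values) <= d}

def upward_upper(c, values, labels):
    d = upward_union(c, labels)
    return {j for j in range(len(labels)) if dominated_set(j, values) & d}

def dependency(values, labels):
    """gamma_B(D): fraction of samples outside every boundary region."""
    boundary = set()
    for c in sorted(set(labels)):
        boundary |= upward_upper(c, values, labels) - upward_lower(c, values, labels)
    return 1 - len(boundary) / len(labels)
```

Run on the clean labels of Table 1 (each sample described by the single attribute A), dependency returns 1.0; after mislabeling x1 and x10 as in Example 1, it drops to 0, which reproduces the sensitivity discussed next.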
Here we give a toy example to show the influence of noisy data.

TABLE 1
A toy monotonic classification task

Sample  x1    x2    x3    x4    x5    x6    x7    x8    x9    x10
A       2.12  3.44  5.32  4.53  7.33  6.51  7.56  8.63  9.11  9.56
D       1     1     1     1     2     2     2     3     3     3

Example 1: Table 1 gives a toy example of monotonic classification. There are ten samples described by an attribute A. These samples are divided into three grades according to decision D. We can see that the decisions of these samples are consistent, for the samples with larger attribute values are assigned the same or better decisions. In this case the dependency γ_A(D) = 1.

Now we assume that samples x1 and x10 are mislabeled: x1 belongs to d3, while x10 belongs to d1. As d≤_1 = {x1, · · · , x10}, d≤_2 = {x1, x5, x6, x7, x8, x9} and d≤_3 = {x1, x8, x9}, we obtain $\underline{R}_A^{\le} d_1^{\le} = \{x_1, \dots, x_{10}\}$, $\overline{R}_A^{\le} d_1^{\le} = \{x_1, \dots, x_{10}\}$; $\underline{R}_A^{\le} d_2^{\le} = \emptyset$, $\overline{R}_A^{\le} d_2^{\le} = \{x_1, \dots, x_{10}\}$; $\underline{R}_A^{\le} d_3^{\le} = \emptyset$, $\overline{R}_A^{\le} d_3^{\le} = \{x_1, \dots, x_{10}\}$. So $\overline{R}_A^{\le} d_2^{\le} = \overline{R}_A^{\le} d_3^{\le}$, and we derive γ_A(D) = 0 in this context. From this example we can see that dominance rough sets are sensitive to noisy data.

The above analysis shows that although the model of dominance rough sets provides a formal theoretical framework for studying monotonic classification, it may not be effective in practice due to noise. We should introduce a robust measure.

3 RANK ENTROPY AND RANK MUTUAL INFORMATION

Information entropy performs well in constructing decision trees. However, it cannot reflect the ordinal structure in monotonic classification. The model of dominance rough sets provides a formal framework for studying monotonic classification; however, it is not robust enough in dealing with real-world tasks. In this section, we introduce a measure, called rank mutual information (RMI) [27], to evaluate the ordinal consistency between random variables.

Definition 5: Given DT = <U, A, D>, B ⊆ A. The ascending and descending rank entropies of the system with respect to B are defined as

$$RH_B^{\le}(U) = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_B^{\le}|}{n}, \qquad (1)$$

$$RH_B^{\ge}(U) = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_B^{\ge}|}{n}. \qquad (2)$$

Example 2: Continuing Example 1,

$$RH_A^{\le}(U) = -\frac{1}{10}\sum_{i=1}^{10}\log\frac{|[x_i]_A^{\le}|}{10} = -\frac{1}{10}\Big(\log\tfrac{10}{10}+\log\tfrac{9}{10}+\log\tfrac{7}{10}+\log\tfrac{8}{10}+\log\tfrac{5}{10}+\log\tfrac{6}{10}+\log\tfrac{4}{10}+\log\tfrac{3}{10}+\log\tfrac{2}{10}+\log\tfrac{1}{10}\Big) = 0.7921,$$

$$RH_D^{\le}(U) = -\frac{1}{10}\sum_{i=1}^{10}\log\frac{|[x_i]_D^{\le}|}{10} = -\frac{4}{10}\log\frac{10}{10}-\frac{3}{10}\log\frac{6}{10}-\frac{3}{10}\log\frac{3}{10} = 0.5144.$$

Since 1 ≥ |[xi]≥_B|/n > 0, we have RH≥_B(U) ≥ 0 and RH≤_B(U) ≥ 0. RH≥_B(U) = 0 (RH≤_B(U) = 0) if and only if ∀xi ∈ U, [xi]≥_B = U ([xi]≤_B = U). Assume C ⊆ B ⊆ A. Then ∀xi ∈ U we have [xi]≥_B ⊆ [xi]≥_C ([xi]≤_B ⊆ [xi]≤_C). Accordingly, RH≥_B(U) ≥ RH≥_C(U) (RH≤_B(U) ≥ RH≤_C(U)).

Definition 6: Given <U, A, D>, B ⊆ A, C ⊆ A. The ascending rank joint entropy of the set U with respect to B and C is defined as

$$RH_{B\cup C}^{\le}(U) = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_B^{\le}\cap[x_i]_C^{\le}|}{n}, \qquad (3)$$

and the descending rank joint entropy of the set U with respect to B and C is defined as

$$RH_{B\cup C}^{\ge}(U) = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_B^{\ge}\cap[x_i]_C^{\ge}|}{n}. \qquad (4)$$

Given DT = <U, A, D> and B ⊆ A, C ⊆ A, we have RH≤_{B∪C}(U) ≥ RH≤_B(U); RH≤_{B∪C}(U) ≥ RH≤_C(U); RH≥_{B∪C}(U) ≥ RH≥_B(U); RH≥_{B∪C}(U) ≥ RH≥_C(U).

Given DT = <U, A, D> and C ⊆ B ⊆ A, we have RH≥_{B∪C}(U) = RH≥_B(U) and RH≤_{B∪C}(U) = RH≤_B(U).

Definition 7: Given DT = <U, A, D>, B ⊆ A, C ⊆ A. If C is known, the ascending rank conditional entropy of the set U with respect to B is defined as

$$RH_{B|C}^{\le}(U) = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_B^{\le}\cap[x_i]_C^{\le}|}{|[x_i]_C^{\le}|}, \qquad (5)$$

and the descending rank conditional entropy of the set U with respect to B is defined as

$$RH_{B|C}^{\ge}(U) = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_B^{\ge}\cap[x_i]_C^{\ge}|}{|[x_i]_C^{\ge}|}. \qquad (6)$$

Given DT = <U, A, D>, B ⊆ A, C ⊆ A, we have that RH≤_{B|C}(U) = RH≤_{B∪C}(U) − RH≤_C(U) and RH≥_{B|C}(U) = RH≥_{B∪C}(U) − RH≥_C(U).

Given DT = <U, A, D>, B ⊆ C ⊆ A, we have that RH≤_{B|C}(U) = 0 and RH≥_{B|C}(U) = 0.

Given DT = <U, A, D>, B ⊆ A, C ⊆ A, it is easy to obtain the following conclusions:
1) RH≤_{B∪C}(U) ≤ RH≤_B(U) + RH≤_C(U), RH≥_{B∪C}(U) ≤ RH≥_B(U) + RH≥_C(U);
2) RH≤_{B|C}(U) ≤ RH≤_B(U), RH≤_{C|B}(U) ≤ RH≤_C(U), RH≥_{B|C}(U) ≤ RH≥_B(U), RH≥_{C|B}(U) ≤ RH≥_C(U).

Definition 8: Given DT = <U, A, D>, B ⊆ A, C ⊆ A. The ascending rank mutual information (ARMI) of the set U between B and C is defined as

$$RMI^{\le}(B,C) = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_B^{\le}|\times|[x_i]_C^{\le}|}{n\times|[x_i]_B^{\le}\cap[x_i]_C^{\le}|}, \qquad (7)$$

and the descending rank mutual information (DRMI) of the set U regarding B and C is defined as

$$RMI^{\ge}(B,C) = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_B^{\ge}|\times|[x_i]_C^{\ge}|}{n\times|[x_i]_B^{\ge}\cap[x_i]_C^{\ge}|}. \qquad (8)$$
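As an illustration of Definitions 5 and 8, the following sketch (our own code, natural logarithms assumed) evaluates the ascending rank entropy and rank mutual information on the toy data of Table 1 and reproduces the values in Example 2.

```python
import math

A = [2.12, 3.44, 5.32, 4.53, 7.33, 6.51, 7.56, 8.63, 9.11, 9.56]
D = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]

def asc_sets(values):
    """[x_i]^<= for a single attribute: samples whose value is >= that of x_i."""
    return [{j for j, v in enumerate(values) if v >= values[i]}
            for i in range(len(values))]

def rank_entropy(values):
    """Ascending rank entropy, Eq. (1)."""
    n = len(values)
    return -sum(math.log(len(s) / n) for s in asc_sets(values)) / n

def rank_mutual_information(values_b, values_c):
    """Ascending rank mutual information, Eq. (7)."""
    n = len(values_b)
    sb, sc = asc_sets(values_b), asc_sets(values_c)
    return -sum(math.log(len(sb[i]) * len(sc[i]) / (n * len(sb[i] & sc[i])))
                for i in range(n)) / n

print(round(rank_entropy(A), 4))                # 0.7921
print(round(rank_entropy(D), 4))                # 0.5144
print(round(rank_mutual_information(A, D), 4))  # 0.5144: equals RH^<=_D(U) because the toy data are monotonically consistent
```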
In essence, rank mutual information can be considered as the degree of monotonicity between features B and C, and this monotonicity should be kept in monotonic classification. Rank mutual information RMI≤(B, D) or RMI≥(B, D) can therefore be used to reflect the monotonicity relevance between features B and decision D, so it is useful for ordinal feature evaluation in monotonic decision tree construction.

Given DT = <U, A, D>, B ⊆ A, C ⊆ A, we have that
1) RMI≤(B, C) = RH≤_B(U) − RH≤_{B|C}(U) = RH≤_C(U) − RH≤_{C|B}(U);
2) RMI≥(B, C) = RH≥_B(U) − RH≥_{B|C}(U) = RH≥_C(U) − RH≥_{C|B}(U).

Given DT = <U, A, D>, B ⊆ C ⊆ A, we have that
1) RMI≥(B, C) = RH≥_B(U);
2) RMI≤(B, C) = RH≤_B(U).

Given DT = <U, A, D>, B ⊆ A, we have RMI≤(B, D) ≤ RH≤_D(U). If ∀xi ∈ U we have [xi]≤_B ⊆ [xi]≤_D, then we say that the decision D is B-ascending consistent, and RMI≤(B, D) = RH≤_D(U). If ∀xi ∈ U we have [xi]≥_B ⊆ [xi]≥_D, then we say that the decision D is B-descending consistent, and RMI≥(B, D) = RH≥_D(U). In addition, if D is B-ascending consistent, then D is also B-descending consistent.

The above analysis tells us that the maximum of the rank mutual information between features and decision equals the rank entropy of the decision, and the rank mutual information reaches this maximum if the classification task is monotonically consistent with respect to these features. In this case the addition of any new feature will not increase the rank mutual information. In constructing decision trees, we add features one by one to partition the samples in a node until the mutual information does not increase by adding any new feature [10], [11], and the algorithm stops there.

Shannon's entropy is widely employed and performs well in learning decision trees. Rank entropy and rank mutual information not only inherit the advantage of Shannon's entropy, but also measure the degree of monotonicity between features. We have the following conclusions.

Given DT = <U, A, D>, B ⊆ A and C ⊆ A. If we replace the ordinal subsets [xi]≤_B with equivalence classes [xi]_B, where [xi]_B is the subset of samples taking the same feature values as xi in terms of feature set B, then we have
1) RH≤_B(U) = H_B(U), RH≥_B(U) = H_B(U);
2) RH≤_{B|C}(U) = H_{B|C}(U), RH≥_{B|C}(U) = H_{B|C}(U);
3) RMI≤(B, C) = MI(B, C), RMI≥(B, C) = MI(B, C).

The above properties show that rank entropy, rank conditional entropy and rank mutual information degenerate to Shannon's entropy, conditional entropy and mutual information if we replace [xi]≤_B with [xi]_B. Thus we can consider rank entropy a natural generalization of Shannon's entropy. As we know, Shannon's entropy is robust in measuring the relevance between features and decision for classification problems, so we expect that rank entropy and rank mutual information are also robust enough in dealing with noisy monotonic tasks.

Fig. 1 shows six scatter plots of 500 samples in different 2-D feature spaces, where the relation between the first three feature pairs is linear and some features are contaminated by noise; the fourth feature pair is completely irrelevant; the final two are nonlinearly monotonous. We compute the rank mutual information of these feature pairs. We can see that the first and last pairs return the maximal values of rank mutual information among these feature pairs, as these pairs are nearly linear or nonlinearly monotonous, while the feature pair in Subplot 4 gets a very small value as these two features are irrelevant. This example shows that rank mutual information is effective in measuring ordinal consistency.

Fig. 1. Scatter plots in different feature spaces and the corresponding rank mutual information of the feature pairs.

4 CONSTRUCTING DECISION TREES BASED ON RANK ENTROPY

The decision tree is an effective and efficient tool for extracting rules and building classification models from a set of samples [10], [11], [12], [29], [30], [31]. There are two basic issues in developing a greedy algorithm for learning decision trees [37]: feature evaluation and pruning strategies, where feature evaluation plays the central role in constructing decision trees; it determines which attribute should be selected to partition the samples if the samples in a node do not belong to the same class. As to classical classification tasks, Shannon's entropy is very effective. In this section, we substitute rank entropy for Shannon's entropy to build monotonic decision trees.

We only consider univariate binary trees in this work. Namely, in each node we just use one attribute to split the samples, and the samples are divided into two subsets. Assume Ui is the subset of samples in the current node and ai is selected for splitting Ui in this node. Then the samples are divided into Ui1 and Ui2, where Ui1 = {x ∈ Ui | v(ai, x) ≤ c} and Ui2 = {x ∈ Ui | v(ai, x) > c}.

Before introducing the algorithm, the following rules should be considered in advance [15]:
1) splitting rule S: how to generate the partition in each node;
2) stopping criterion H: when to stop growing the tree and generate leaf nodes;
3) labeling rule L: how to assign class labels to leaf nodes.

As to splitting rule S, we here use rank mutual information to evaluate the quality of features.
Given attributes A = {a1, a2, · · · , aN} and a subset of samples Ui in the current node, we select the attribute a and the cut point c satisfying the following condition:

$$a = \arg\max_{a_j} RMI^{\le}(a_j, c, D) = \arg\max_{a_j}\Big(-\frac{1}{|U_i|}\sum_{x\in U_i}\log\frac{|[x]_{a_j}^{\le}|\times|[x]_D^{\le}|}{|U_i|\times|[x]_{a_j}^{\le}\cap[x]_D^{\le}|}\Big), \qquad (9)$$

where c is a number splitting the value domain of aj such that the split yields the largest rank mutual information between aj and D, computed with the samples in the current node. As binary trees are used, we just require one number to split the samples in each node.

Now we consider the stopping criterion. Obviously, if all the samples in a node belong to the same class, the algorithm should stop growing the tree at this node. Moreover, in order to avoid overfitting the data, we also stop growing the tree if the rank mutual information produced by the best attribute is smaller than a threshold ε. Some other pre-pruning techniques can also be considered here [33].

Regarding labeling rule L, if all the samples in a leaf node come from the same class, this class label is assigned to the leaf node. However, if the samples belong to different classes, we assign the median class of the samples to this leaf. Furthermore, if two classes have the same number of samples and the current node is a left branch of its parent node, we assign the worse class to this node; otherwise, we assign the better class to it.

The monotonic decision tree algorithm based on rank mutual information is formulated in Table 2.

TABLE 2
Algorithm Description

REMT: Rank Entropy Based Decision Tree
input:  criteria: attributes of samples.
        decision: decision of samples.
        ε: stopping criterion. If the maximal rank mutual information is less than ε, the branch stops growing.
output: monotonic decision tree T.
begin:
1 generate the root node.
2 if the number of samples is 1 or all the samples come from the same class, the branch stops growing;
3 otherwise,
    for each attribute ai,
        for each cj ∈ ai,
            divide the samples into two subsets according to cj:
                if v(ai, x) ≤ cj, then v(ai, x) = 1; else v(ai, x) = 2.
            compute RMIcj = RMI(ai, D).
        end j.
        c*j = arg max_j RMIcj.
    end i.
4 select the best feature a and the corresponding splitting point: c* = arg max_i max_j RMI(ai, cj).
5 if RMI(a, D) < ε, then stop.
6 build a new node and split the samples with a and c*.
7 recursively produce new splits according to the above procedure until the stopping criterion is satisfied.
8 end
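The REMT procedure of Table 2 can be sketched as follows. This is our reading of the pseudocode rather than the authors' implementation: natural logarithms are assumed, candidate cuts cj are taken at midpoints of adjacent attribute values, and the labeling rule is simplified to the median class of the samples in a leaf.

```python
import math
from statistics import median_low

def rmi_for_cut(col, labels, cut):
    """Ascending RMI between the binarized attribute (<= cut vs > cut) and the decision."""
    n = len(col)
    b = [1 if v <= cut else 2 for v in col]
    total = 0.0
    for i in range(n):
        sb = {j for j in range(n) if b[j] >= b[i]}
        sd = {j for j in range(n) if labels[j] >= labels[i]}
        total += math.log(len(sb) * len(sd) / (n * len(sb & sd)))
    return -total / n

def grow_remt(X, y, eps=0.01):
    """Recursively grow a univariate binary rank-entropy tree; returns nested dicts."""
    if len(y) == 1 or len(set(y)) == 1:
        return {"label": y[0]}
    best = None                                         # (score, attribute index, cut)
    for a in range(len(X[0])):
        col = [row[a] for row in X]
        vals = sorted(set(col))
        for lo, hi in zip(vals, vals[1:]):
            cut = (lo + hi) / 2.0                       # candidate cut point c_j
            score = rmi_for_cut(col, y, cut)
            if best is None or score > best[0]:
                best = (score, a, cut)
    if best is None or best[0] < eps:                   # stopping criterion H
        return {"label": median_low(y)}                 # simplified labeling rule L
    _, a, cut = best
    left = [i for i, row in enumerate(X) if row[a] <= cut]
    right = [i for i in range(len(X)) if i not in set(left)]
    return {"attr": a, "cut": cut,
            "left": grow_remt([X[i] for i in left], [y[i] for i in left], eps),
            "right": grow_remt([X[i] for i in right], [y[i] for i in right], eps)}
```

The split search follows Eq. (9): each candidate cut is scored by the rank mutual information between the binarized attribute and the decision computed on the samples of the current node, and the attribute and cut pair with the largest score is kept.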

Now we study the properties of the monotonic decision trees generated with the above procedure.

Definition 9: Given an ordinal decision tree T, the rule from the root node to a leaf l is denoted by Rl. If two rules Rl and Rm are generated from the same attributes, we say Rl and Rm are comparable; otherwise they are not comparable. In addition, we denote Rl < Rm if the feature values of Rl are less than those of Rm; Rl and Rm are also called the left node and the right node, respectively. As to a set of rules R, we call it monotonically consistent if for any comparable pair of rules Rl and Rm with Rl < Rm we have D(Rl) < D(Rm), where D(Rl) and D(Rm) are the decisions of these rules; otherwise, we say R is not monotonically consistent.

Property 1: Given <U, A, D>, if U is monotonically consistent, then the rules derived from the ordinal decision trees are also monotonically consistent.

Proof: Let Rl and Rm be a pair of comparable rules with Rl < Rm. We prove that D(Rl) < D(Rm). If U is monotonically consistent, then for any x, y ∈ U with v(D, x) ≠ v(D, y) we can find an attribute ai ∈ A such that v(ai, x) ≠ v(ai, y). That is, ∃ai ∈ A that can separate x and y. So the samples in the node of x belong to the same decision, while the samples in the node of y belong to another decision. Thus, to derive D(Rl) < D(Rm), we just need to show that if ∀x, y ∈ Ui ∪ Uj with v(D, x) < v(D, y), we have x ∈ Ui and y ∈ Uj. As U is monotonically consistent, for ∀x, y ∈ Ui ∪ Uj with v(D, x) < v(D, y), ∃ai ∈ A such that v(ai, x) < v(ai, y). Then the parent node of Rl and Rm can get the maximal RMI(a, D). We have v(a, x) < v(a, y), namely, x ∈ Ui and y ∈ Uj. This completes the proof.

The above analysis shows us that with algorithm REMT we can derive a monotonic decision tree from a monotonically consistent dataset.

We give a toy example to show this property. We generate a dataset of 30 samples described by two features, as shown in Fig. 2. These samples are divided into 5 classes. We then employ CART and REMT to train decision trees, respectively; Fig. 3 and Fig. 4 give the trained models.

CART returns a tree with eleven leaf nodes, while REMT generates a tree with nine leaf nodes. Most importantly, the tree trained with CART is not monotonically consistent, for there are two pairs of conflicting rules. From left to right, the fourth rule is not monotonically consistent with the fifth rule; besides, the sixth rule is not monotonically consistent with the seventh rule, where objects with better features get a worse decision. REMT, however, obtains a consistent decision tree.

Fig. 2. Artificial data, where 30 samples are divided into 5 classes.

Fig. 3. Non-monotonic decision tree trained with CART, where two pairs of rules are not monotonically consistent.

Fig. 4. Monotonic decision tree trained with REMT, where all rules are monotonically consistent.

5 EXPERIMENT ANALYSIS

There are several techniques to learn decision models for monotonic classification. In order to show the effectiveness of the proposed algorithm, we conduct numerical experiments with artificial and real-world datasets and compare the proposed algorithm with other methods on real-world classification tasks.

First, we introduce the following function to generate monotone datasets:

$$f(x_1, x_2) = 1 + x_1 + \tfrac{1}{2}x_2 - x_1^2, \qquad (10)$$

where x1 and x2 are two random variables independently drawn from the uniform distribution over the unit interval.

In order to generate ordered class labels, the resulting numeric values were discretized into k intervals [0, 1/k], (1/k, 2/k], . . . , ((k−1)/k, 1], so that each interval contains approximately the same number of samples. The samples belonging to one interval share the same rank label. We thus form a k-class monotonic classification task. In this experiment, we try k = 2, 4, 6 and 8, respectively. The datasets are shown in Fig. 5.

We use the mean absolute error to evaluate the performance of the decision algorithms, computed as

$$MAE = \frac{1}{N}\sum_{i=1}^{N}|\hat{y}_i - y_i|, \qquad (11)$$

where N is the number of samples in the test set, ŷi is the output of the algorithm and yi is the real output of the i-th sample.

We first study the performance of the algorithms for different numbers of classes. We generate a set of artificial datasets with 1000 samples, with the class number varying from 2 to 30. We then employ CART, Rank Tree [17], OLM [6], OSDL [41] and REMT to learn and predict the decisions. OLM is an ordinal learning model introduced by Ben-David et al., while OSDL is the ordinal stochastic dominance learner based on the associated cumulative distribution [41].

TABLE 3
MAE on artificial data

Classes  CART    Rank Tree  REMT    OLM     OSDL
2        0.0370  0.8000     0.4200  0.0570  0.0640
4        0.0741  0.1451     0.0561  0.1470  0.0910
6        0.1110  0.2959     0.0770  0.2180  0.1200
8        0.1479  0.3307     0.1276  0.2600  0.1620
10       0.1870  0.4560     0.1310  0.3270  0.2010
20       0.3579  1.0029     0.2569  0.5060  0.4130
30       0.5146  1.4862     0.3735  0.7010  0.5840

Based on the 5-fold cross validation technique, the average performance is computed and given in Table 3. REMT yields the smallest errors in all cases except the one where two classes are considered.
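The data-generation and evaluation protocol used above can be sketched as follows. It is an illustration under assumptions: the generating function follows our reading of Eq. (10), its outputs are rescaled to [0, 1] before the equal-width discretization (an assumption needed to match the stated intervals), and MAE is computed as in Eq. (11).

```python
import random

def generate_monotone_data(n=1000, k=4, seed=0):
    """Draw (x1, x2) uniformly on the unit square and label by discretizing f into k ordered classes."""
    rng = random.Random(seed)
    X = [(rng.random(), rng.random()) for _ in range(n)]
    f = [1 + x1 + 0.5 * x2 - x1 ** 2 for x1, x2 in X]   # our reading of Eq. (10)
    lo, hi = min(f), max(f)
    scaled = [(v - lo) / (hi - lo) for v in f]          # assumption: rescale outputs to [0, 1]
    y = [min(k, int(v * k) + 1) for v in scaled]        # intervals [0, 1/k], (1/k, 2/k], ..., ((k-1)/k, 1]
    return X, y

def mae(y_pred, y_true):
    """Mean absolute error of the predicted ranks, Eq. (11)."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)

X, y = generate_monotone_data(n=1000, k=4)
print(mae(y, y))   # 0.0 for a perfect predictor
```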
In addition, we also consider the influence of the number of samples on the performance of the trained models. We first generate artificial data of 1000 samples and 4 classes, and then randomly draw training samples from this set. The size of the training set ranges from 4 to 36. In this process, we guarantee that there is at least one representative sample from each class. The remaining samples are used for testing to estimate the performance of the trained models. The curves of loss varying with the number of training samples are shown in Fig. 6. We can see that REMT is more precise than CART and Rank Tree, no matter how many training samples are used. In addition, we can also see that the difference between Rank Tree and REMT gets smaller and smaller as the number of training samples increases.
Fig. 5. Synthetic monotone datasets, where the samples are divided into 2, 4, 6 and 8 classes, respectively.

Fig. 6. Loss curves of CART, Rank Tree and REMT (mean absolute rank loss versus size of the training set), where the size of the training set gradually increases.

In order to test how our approach behaves in real-world applications, we collected 12 datasets. Four datasets come from the UCI repository: Car, CPU performance, Ljubljana breast cancer and Boston housing pricing. Bankruptcy comes from the experience of a Greek industrial development bank financing industrial and commercial firms, and the other datasets were obtained from the Weka homepage (http://www.cs.waikato.ac.nz/ml/weka/).

Before training decision trees, we have to preprocess the datasets. Because we use ascending rank mutual information as the splitting rule, we assume that larger rank values should come from larger feature values; we call this positive monotonicity. In practice, we may confront the case that a worse feature value should get a better rank. This is called negative monotonicity. If we use REMT, we should transform the problem of negative monotonicity into a positive monotonicity task. There are several solutions to this objective; we compute the reciprocal of the feature values if negative monotonicity happens.

For each dataset, we randomly drew n × Ni samples as a training set each time, where Ni is the number of classes and n = 1, 2, 3, and so on. At least one sample from each class was drawn in each round when generating the training set. The remaining samples were used as the test set. We compute the mean absolute loss as the performance of the trained model. The experiment was repeated 100 times, and the average over all 100 results is reported as the performance of the models. The results are given in Fig. 7, where ε = 0.01.

Regarding the curves in Fig. 7, we see that REMT is much better than CART and Rank Tree in most cases, except on the Breast and CHNUnvRank datasets. Moreover, we can also see that if the size of the training set is very small, REMT is much better than Rank Tree and CART. As the number of training samples increases, this superiority becomes less and less pronounced. This trend shows that REMT is more effective than CART and Rank Tree when the size of the training set is small.
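The preprocessing and sampling protocol can be sketched as follows. The reciprocal transform for negatively monotone features follows the text, while the per-class sampling helper reflects one reading of the protocol (n samples per class, which gives n × Ni in total and guarantees at least one sample per class), and its names are ours.

```python
import random

def to_positive_monotone(column, negatively_monotone):
    """Flip a negatively monotone feature by taking reciprocals, as described in the text."""
    if not negatively_monotone:
        return list(column)
    return [1.0 / v if v != 0 else float("inf") for v in column]

def draw_training_indices(labels, n_per_class, seed=0):
    """Randomly draw n_per_class samples from every class; the rest form the test set."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train = []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        train.extend(idxs[:max(1, n_per_class)])
    return sorted(train)
```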

Fig. 7. Average performance on the real-world tasks before monotonization, where the x-coordinate is the number of training samples and the y-coordinate is the mean absolute error.

In order to compare the performance when the datasets are monotonic, we now relabel the samples so as to generate monotonic training sets. Before training the decision trees, we first applied a monotonization algorithm to revise the labels of some samples and generated monotone training datasets [18]. We then learned decision trees and predicted the labels of the test samples. The variation of the loss with the number of training samples is given in Fig. 8, and the same conclusion can be drawn.
Fig. 8. Average performance curves on the real-world tasks after monotonization, where the x-coordinate is the number of training samples and the y-coordinate is the mean absolute error.

Finally, we test these algorithms on the datasets based on the 5-fold cross validation technique. Table 4 presents the mean absolute loss yielded by the different learning algorithms, including REMT, CART, Rank Tree, OLM and OSDL.
TABLE 4
Mean absolute loss (before monotonization)

Dataset        REMT    CART    Rank Tree  OLM     OSDL
balance        0.5171  0.5830  0.7678     0.6176  0.8688
bankruptcy     0.1250  0.1250  0.4679     0.6464  0.3071
breast         0.0916  1.3519  0.1172     0.3320  0.0859
car            0.1696  0.2436  0.3970     0.4676  0.3102
CHNUnv         0.1900  0.1800  0.1900     1.0900  0.5200
CPU            0.2641  0.2692  0.3731     0.4159  0.3835
empl. rej/acc  1.2347  1.3800  1.3998     1.2920  1.2830
empl. sel.     0.3808  0.5057  0.6389     0.5062  0.3421
lect. eva.     0.3471  0.3726  0.5015     0.6050  0.4020
workers        0.4451  1.0017  0.5671     0.5630  0.4270
housing        0.4464  1.4406  0.4130     0.4091  1.0770
squash         0.4430  1.5275  0.4325     0.3236  0.5964
Average        0.3879  0.5817  0.5222     0.6057  0.5503

Among the 12 tasks, REMT obtains the best performance on 6, while CART, Rank Tree, OLM and OSDL produce 2, 0, 2 and 3 best results, respectively. REMT outperforms CART on ten tasks, and it produces better performance than Rank Tree, OLM and OSDL on 9, 10 and 9 tasks, respectively. As a whole, REMT produces the best average performance over the 12 tasks. The experimental results show that REMT is better than CART, Rank Tree, OLM and OSDL in most cases.

We also calculate the mean absolute loss of these algorithms on the monotonized datasets. The generalization performance based on cross validation is given in Table 5.

TABLE 5
Mean absolute loss (after monotonization)

Dataset        REMT    CART    Rank Tree  OLM     OSDL
balance        0.3408  0.4768  0.5728     0.2304  0.1504
bankruptcy     0.1500  0.1786  0.4821     0.6679  0.3571
breast         0.0744  1.3462  0.0686     0.3319  0.0858
car eva.       0.0294  0.0462  0.2287     0.0382  0.0266
CHNUnv         0.1900  0.1800  0.1900     1.0900  0.5200
CPU            0.2641  0.2736  0.3253     0.4159  0.3835
empl. rej/acc  0.5603  0.5685  0.6765     0.5500  0.5740
empl. sel.     0.1269  1.0145  0.4726     0.2215  0.1311
lect. eva.     0.0929  0.1054  0.2516     0.1380  0.1120
workers        0.1721  1.0580  0.4671     0.1920  0.1860
housing        0.3595  0.3991  0.3497     0.6541  0.4783
squash         0.4430  1.5275  0.5257     0.3473  0.5745
Average        0.2336  0.5123  0.3842     0.4064  0.2983

Comparing the results in Table 4 and Table 5, we can see that all the mean absolute errors derived from the different algorithms decrease if the data are monotonized. Furthermore, REMT outperforms CART, Rank Tree, OLM and OSDL on 11, 9, 9 and 10 tasks, respectively. The results show that REMT is also more effective than the other techniques when the training sets are monotonized.

6 CONCLUSIONS AND FUTURE WORK

Monotonic classification is an important kind of task in decision making. There is a constraint in these tasks that the features and the decision are monotonically consistent; that is, objects with better feature values should not get worse decisions. Classical learning algorithms cannot extract this monotonic structure from datasets, so they are not applicable to these tasks. Some monotonic decision algorithms have been developed in recent years by integrating monotonicity with indexes of separability, such as Gini, mutual information and dependency. However, comparative experiments showed that noisy samples have a great impact on these algorithms.

In this work, we combine the robustness of Shannon's entropy with the ability of dominance rough sets to extract ordinal structures from monotonic datasets by introducing rank entropy. We improve the classical decision tree algorithm with rank mutual information and design a new decision tree technique for monotonic classification. From the theoretical and numerical experiments, the following conclusions are derived.

1) Monotonic classification is very sensitive to noise; several noisy samples may completely change the evaluation of feature quality. A robust measure of feature quality is desirable.

2) Mutual information is a robust index of feature quality in classification learning; however, it cannot reflect the ordinal structure in monotonic classification. Rank mutual information combines the advantages of information entropy and dominance rough sets. This new measure can not only measure the monotonic consistency in monotonic classification, but is also robust to noisy samples.

3) Rank entropy based decision trees can produce monotonically consistent decision trees if the given training sets are monotonically consistent. When the training data are nonmonotonic, our approach produces a nonmonotonic classifier, but its performance is still good.

It is remarkable that in real-world tasks sometimes just a fraction of the features satisfies the monotonicity constraint with the decision, while the other features do not. In this case, decision rules should be of the form: if Feature 1 is equal to a1 and Feature 2 is better than a2, then decision D should be no worse than dk. This problem is called partial monotonicity. We require measures that evaluate the quality of the two types of features at the same time when building decision trees from this kind of dataset. We will work on this problem in the future.
ACKNOWLEDGEMENT

The authors would like to express their gratitude to the anonymous reviewers for their constructive comments, which were helpful for improving the manuscript.

REFERENCES

[1] J. Wallenius, J. S. Dyer, P. C. Fishburn, R. E. Steuer, S. Zionts, K. Deb. Multiple criteria decision making, multiattribute utility theory: recent accomplishments and what lies ahead. Management Science, vol. 54, no. 7, pp. 1336-1349, 2008.
[2] B. Zhao, F. Wang, C. S. Zhang. Block-quantized support vector ordinal regression. IEEE Transactions on Neural Networks, vol. 20, no. 5, pp. 882-890, 2009.
[3] B. Y. Sun, J. Y. Li, D. D. Wu, et al. Kernel discriminant learning for ordinal regression. IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 6, pp. 906-910, 2010.
[4] C. Zopounidis, M. Doumpos. Multicriteria classification and sorting methods: A literature review. European Journal of Operational Research, vol. 138, pp. 229-246, 2002.
[5] R. Potharst, A. J. Feelders. Classification trees for problems with monotonicity constraints. ACM SIGKDD Explorations Newsletter, vol. 4, no. 1, pp. 1-10, 2002.
[6] A. Ben-David, L. Sterling, Y. H. Pao. Learning and classification of monotonic ordinal concepts. Computational Intelligence, vol. 5, pp. 45-49, 1989.
[7] A. Ben-David. Automatic generation of symbolic multiattribute ordinal knowledge-based DSSs: Methodology and applications. Decision Sciences, vol. 23, pp. 1357-1372, 1992.
[8] E. Frank, M. Hall. A simple approach to ordinal classification. ECML 2001, LNAI 2167, pp. 145-156, 2001.
[9] J. P. Costa, J. S. Cardoso. Classification of ordinal data using neural networks. J. Gama et al. (Eds.): ECML 2005, LNAI 3720, pp. 690-697, 2005.
[10] J. R. Quinlan. Induction of decision trees. Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.
[11] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
[12] L. Breiman, et al. Classification and Regression Trees. Chapman & Hall, Boca Raton, 1993.
[13] J. Mingers. An empirical comparison of selection measures for decision-tree induction. Machine Learning, vol. 3, no. 4, pp. 319-342, 1989.
[14] A. Ben-David. Monotonicity maintenance in information-theoretic machine learning algorithms. Machine Learning, vol. 19, pp. 29-43, 1995.
[15] R. Potharst, J. C. Bioch. Decision trees for ordinal classification. Intelligent Data Analysis, vol. 4, pp. 97-111, 2000.
[16] K. Cao-Van, B. De Baets. Growing decision trees in an ordinal setting. International Journal of Intelligent Systems, vol. 18, pp. 733-750, 2003.
[17] F. Xia, W. S. Zhang, F. X. Li, Y. W. Yang. Ranking with decision tree. Knowledge and Information Systems, vol. 17, pp. 381-395, 2008.
[18] W. Kotlowski, R. Slowinski. Rule learning with monotonicity constraints. In Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, pp. 537-544, 2009.
[19] A. Jimenez, F. Berzal, J.-C. Cubero. POTMiner: mining ordered, unordered, and partially-ordered trees. Knowledge and Information Systems, vol. 23, no. 5, pp. 199-224, 2010.
[20] S. Greco, B. Matarazzo, R. Slowinski. Rough approximation of a preference relation by dominance relations. ICS Research Report 16/96. European Journal of Operational Research, vol. 117, pp. 63-83, 1999.
[21] S. Greco, B. Matarazzo, R. Slowinski. Rough sets methodology for sorting problems in presence of multiple attributes and criteria. European Journal of Operational Research, vol. 138, pp. 247-259, 2002.
[22] S. Greco, B. Matarazzo, R. Slowinski. Rough approximation by dominance relations. International Journal of Intelligent Systems, vol. 17, pp. 153-171, 2002.
[23] J. W. T. Lee, D. S. Yeung, E. C. C. Tsang. Rough sets and ordinal reducts. Soft Computing, vol. 10, pp. 27-33, 2006.
[24] Q. H. Hu, D. R. Yu, M. Z. Guo. Fuzzy preference based rough sets. Information Sciences, vol. 180, no. 10, pp. 2003-2022, 2010.
[25] A. Ben-David, L. Sterling, T. Tran. Adding monotonicity to learning algorithms may impair their accuracy. Expert Systems with Applications, vol. 36, pp. 6627-6634, 2009.
[26] J. S. Dyer, P. C. Fishburn, R. E. Steuer, J. Wallenius, S. Zionts. Multiple criteria decision making, multiattribute utility theory: The next ten years. Management Science, vol. 38, pp. 645-654, 1992.
[27] Q. H. Hu, M. Z. Guo, D. R. Yu, J. F. Liu. Information entropy for ordinal classification. Science in China Series F: Information Sciences, vol. 53, no. 6, pp. 1188-1200, 2010.
[28] S. Greco, B. Matarazzo, R. Slowinski, J. Stefanowski. Variable consistency model of dominance-based rough sets approach. W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 170-181, 2001.
[29] B. Chandra, P. P. Varghese. Fuzzy SLIQ decision tree algorithm. IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, vol. 38, no. 5, pp. 1294-1301, 2008.
[30] H. W. Hu, Y. L. Chen, K. Tang. Dynamic discretization approach for constructing decision trees with a continuous label. IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 11, pp. 1505-1514, 2009.
[31] D. Hush, R. Porter. Algorithms for optimal dyadic decision trees. Machine Learning, vol. 80, no. 1, pp. 85-107, 2010.
[32] R. van de Kamp, A. J. Feelders, N. Barile. Isotonic classification trees. In N. Adams (Ed.), Proceedings of IDA 2009, pp. 405-416. Springer, 2009.
[33] A. J. Feelders, M. Pardoel. Pruning for monotone classification trees. Lecture Notes in Computer Science, vol. 2811, pp. 1-12, 2003.
[34] D. Cai. An information-theoretic foundation for the measurement of discrimination information. IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 9, pp. 1262-1273, 2010.
[35] Y. H. Qian, C. Y. Dang, J. Y. Liang, D. W. Tang. Set-valued ordered information systems. Information Sciences, vol. 179, no. 16, pp. 2809-2832, 2009.
[36] J. C. Bioch, V. Popova. Rough sets and ordinal classification. In: A. v. d. Bosch, H. Weigand (Eds.), Proceedings of the 12th Belgian-Dutch Artificial Intelligence Conference (BNAIC'2000), De Efteling, pp. 85-92, 2000.
[37] J. C. Bioch, V. Popova. Labelling and splitting criteria for monotone decision trees. In: M. Wiering (Ed.), Proceedings of the 12th Belgian-Dutch Conference on Machine Learning (BENELEARN 2002), Utrecht, pp. 3-10, 2002.
[38] J. C. Bioch, V. Popova. Monotone decision trees and noisy data. In: Proceedings of the 14th Belgian-Dutch Conference on Artificial Intelligence (BNAIC 2002), Leuven, pp. 19-26, 2002.
[39] V. Popova. Knowledge Discovery and Monotonicity. PhD thesis, Erasmus University Rotterdam, 2004.
[40] A. Feelders. Monotone relabeling in ordinal classification. 2010 IEEE International Conference on Data Mining, pp. 803-808, 2010.
[41] K. Cao-Van. Supervised Ranking: From Semantics to Algorithms. PhD thesis, Ghent University, Belgium, 2003.

Qinghua Hu received his B.E., M.E. and PhD degrees from Harbin Institute of Technology, Harbin, China, in 1999, 2002 and 2008, respectively. He is now an associate professor with Harbin Institute of Technology and a postdoctoral fellow with The Hong Kong Polytechnic University. His research interests are focused on intelligent modeling, data mining, and knowledge discovery for classification and regression. He is a PC co-chair of RSCTC 2010 and serves as a referee for a great number of journals and conferences. He has published more than 70 journal and conference papers in the areas of pattern recognition and fault diagnosis.
Xunjian Che received his B.E. from Harbin Institute of Technology in 2009. He is now a master student with Harbin Institute of Technology. His research interests are focused on large-margin learning theory, preference learning and monotonic classification.

Lei Zhang received the B.S. degree in 1995 from the Shenyang Institute of Aeronautical Engineering, Shenyang, China, and the M.S. and Ph.D. degrees in Electrical Engineering from Northwestern Polytechnical University, Xi'an, China, in 1998 and 2001, respectively. From 2001 to 2002, he was a research associate with the Department of Computing, The Hong Kong Polytechnic University. From January 2003 to January 2006, he was a Postdoctoral Fellow in the Department of Electrical and Computer Engineering, McMaster University, Canada. Since January 2006, he has been an Assistant Professor in the Department of Computing, The Hong Kong Polytechnic University. His research interests include image and video processing, biometrics, pattern recognition, multisensor data fusion, machine learning and optimal estimation theory.

David Zhang graduated in Computer Science from Peking University. He received his MSc in Computer Science in 1982 and his PhD in 1985 from the Harbin Institute of Technology (HIT). From 1986 to 1988 he was a Postdoctoral Fellow at Tsinghua University and then an Associate Professor at the Academia Sinica, Beijing. In 1994 he received his second PhD in Electrical and Computer Engineering from the University of Waterloo, Ontario, Canada. Currently, he is Head of the Department of Computing and a Chair Professor at The Hong Kong Polytechnic University, where he is the Founding Director of the Biometrics Technology Centre (UGC/CRC) supported by the Hong Kong SAR Government in 1998. He also serves as Visiting Chair Professor at Tsinghua University and Adjunct Professor at Shanghai Jiao Tong University, Peking University, Harbin Institute of Technology, and the University of Waterloo. He is the Founder and Editor-in-Chief of the International Journal of Image and Graphics (IJIG); Book Editor of the Springer International Series on Biometrics (KISB); Organizer of the first International Conference on Biometrics Authentication (ICBA); Associate Editor of more than ten international journals including IEEE Transactions and Pattern Recognition; Technical Committee Chair of IEEE CIS; and the author of more than 10 books and 200 journal papers. Professor Zhang is a Croucher Senior Research Fellow, Distinguished Speaker of the IEEE Computer Society, and a Fellow of both IEEE and IAPR.

Maozu Guo received his Bachelor and Master degrees from the Department of Computer Science, Harbin Engineering University, in 1988 and 1991, respectively, and his PhD from the Department of Computer Science, Harbin Institute of Technology, in 1997. He is now director of the Natural Computation Division, Harbin Institute of Technology; Program Examining Expert of the Information Science Division of NSFC; Senior Member of the China Computer Federation (CCF) and Member of the CCF Artificial Intelligence and Pattern Recognition Society; Member of the Chinese Association for Artificial Intelligence (CAAI) and Standing Committee Member of the Machine Learning Society of CAAI. His research fields include machine learning and data mining; computational biology and bioinformatics; advanced computational models; and image processing and computer vision. Prof. Guo has carried out several projects funded by the Natural Science Foundation of China (NSFC), National 863 Hi-tech Projects, the Science Fund for Distinguished Young Scholars of Heilongjiang Province, and international cooperative projects. He has won one Second Prize of the Province Science and Technology Progress Award and one Third Prize of the Province Natural Science Award. He has published over 100 papers in journals and conferences.

Daren Yu received his M.E. and PhD degrees from Harbin Institute of Technology, Harbin, China, in 1988 and 1996, respectively. Since 1988, he has been working at the School of Energy Science and Engineering, Harbin Institute of Technology. His main research interests are in modeling, simulation, and control of power systems. He has published more than one hundred conference and journal papers on power control and fault diagnosis.
