3 authors, including Carlos Soares (University of Porto).
All content following this page was uploaded by Pedro Strecht on 23 January 2015.
1 Introduction
Combining decision tree models is a topic that has been studied with different
approaches. Gorbunov and Lyubetsky [2] address the issue from a mathematical
point of view, by formulating the problem of constructing a decision tree that is
closest on average to a set of trees. Kargupta and Park [3] present an approach
in which decision trees are converted to the frequency domain using the Fourier
transform. The merging process consists of summing the spectra of the models
and then transforming the result back into the decision tree domain.
Concerning data mining approaches, Provost and Hennessy [4,5] present an
algorithm that evaluates each model with data from the other models to be merged.
The merged model is constructed from satisfactory rules, i.e., rules that are
generic enough to be evaluated in the other models. A more common approach
is the combination of rules derived from decision trees. The idea is to convert
the decision trees of two models into decision rules, combine those rules into
new rules, reduce their number and finally grow a decision tree for the merged
model. Parts of this process are presented in the doctoral thesis of Williams [6]
and other researchers have contributed by proposing different ways of carrying
out intermediate tasks, such as Andrzejak et al. [7], Hall et al. [8,9] or Bursteinas
and Long [10]. While each approach is different, we identified a set of phases
they have in common, described next.
In the first phase, a decision tree is transformed to a set of rules. Each path
from the root to the leaves creates a rule with a set of possible values for variables
and a class. These have been called “rules set” [8], “hypercubes” [10] or “sets of
iso-parallel boxes” [7]. These designations arise from the fact that a variable can
be considered as a dimension axis in a multi-dimensional space. The set of values
(nominal or numerical) is the domain for each dimension and each rule defines
a region. The representation of the regions is required for the next phase. It is
worth noting that all regions are disjoint from each other and together cover the
entire space.
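The path-to-rule extraction of this first phase can be sketched as follows. The nested-dict tree representation, the variable names and the toy tree are our own illustration, not taken from any of the cited works:

```python
# Hypothetical sketch: extracting decision regions ("rules") from a tree.
# Each root-to-leaf path becomes one rule: a list of conditions plus a class.

def tree_to_rules(node, path=None, rules=None):
    """Walk every root-to-leaf path, collecting one rule per leaf."""
    if path is None:
        path = []
    if rules is None:
        rules = []
    if "class" in node:                     # leaf: emit (conditions, class)
        rules.append((list(path), node["class"]))
        return rules
    var, thr = node["var"], node["threshold"]
    tree_to_rules(node["left"],  path + [(var, "<=", thr)], rules)
    tree_to_rules(node["right"], path + [(var, ">",  thr)], rules)
    return rules

# Toy tree: pass if grade > 9.5, otherwise the outcome depends on attendance.
tree = {
    "var": "grade", "threshold": 9.5,
    "left": {
        "var": "attendance", "threshold": 0.75,
        "left": {"class": "fail"},
        "right": {"class": "pass"},
    },
    "right": {"class": "pass"},
}

rules = tree_to_rules(tree)
```

Note that, as stated above, the extracted regions are pairwise disjoint and jointly cover the whole variable space, since sibling branches carry complementary conditions.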
In the second phase, the regions of both models are combined using a specific
method. Andrzejak et al. [7] call this “unification” and propose a line sweep
algorithm to avoid comparing every region of one model with every region of
the other. It sorts the limits of each region and then analyses where merging
can be done; however, this method only applies to numerical variables. Hall et al. [8]
compare all regions with each other. Bursteinas and Long [10] have a similar
method but separate disjoint from overlapping regions. One potential problem
is that combining regions can lead to a class conflict if overlapping regions have
different classes. Andrzejak et al. [7] propose three strategies to assign a class
to the merged region. The first assigns the class with the greatest confidence;
the second, the class with the greatest probability; the third, and most complex,
requires additional passes over the data. Hall et al. [8] explore the
issue in greater detail and propose further strategies, e.g., comparing distances
to the boundaries of the variables. However, this approach seems suitable only
for numerical variables. Bursteinas and Long [10] use a different strategy,
retraining the model with examples from the conflicting region. If no conflict
arises, the resulting class is assigned; otherwise, the region is removed from
the merged model.
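A minimal sketch of this second phase, assuming regions represented as numeric interval boxes and the confidence-based conflict strategy of Andrzejak et al. [7]; the representation and all names are our own:

```python
# Illustrative sketch: intersecting two numeric regions and resolving a
# class conflict by keeping the class with the greater confidence.

def intersect(region_a, region_b):
    """Intersect two regions given as {var: (low, high)} interval dicts.
    Returns None when the regions are disjoint on some variable."""
    out = {}
    for var in region_a:
        lo = max(region_a[var][0], region_b[var][0])
        hi = min(region_a[var][1], region_b[var][1])
        if lo >= hi:
            return None                    # disjoint: empty intersection
        out[var] = (lo, hi)
    return out

def resolve_class(class_a, conf_a, class_b, conf_b):
    """Class conflict resolution: the highest-confidence class wins."""
    if class_a == class_b:
        return class_a
    return class_a if conf_a >= conf_b else class_b

a = {"grade": (0.0, 9.5), "age": (18.0, 25.0)}
b = {"grade": (8.0, 20.0), "age": (18.0, 30.0)}
overlap = intersect(a, b)
label = resolve_class("fail", 0.9, "pass", 0.6)
```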
The third phase attempts to reduce the number of regions and is commonly
referred to as “pruning”. This is carried out to avoid models with very
complex rules. The most direct approach is to identify adjacent regions, i.e.,
regions sharing the same class and values of all variables except for one. If that
variable is nominal, the values of both regions are included, otherwise the join is
only possible if the limits overlap. To further reduce the rules set, Andrzejak et al.
[7] developed a ranking system retaining only the regions with the highest relative
volume and number of training examples. Hall et al. [8] only carry out this
phase to eliminate redundant rules created during the removal of class conflicts.
Bursteinas and Long [10] mention the phase but do not provide details on how
it is performed.
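The adjacency join just described can be sketched for numeric intervals as follows; the region representation and function names are our own illustration, not any of the cited authors' actual code:

```python
# Sketch of the pruning idea: two regions can be joined when they share the
# same class and the same intervals on every variable except one, where
# their intervals touch or overlap.

def try_join(r1, r2):
    """Return the joined region, or None if r1 and r2 are not adjacent.
    Regions are ({var: (low, high)}, class) pairs with numeric intervals."""
    (box1, c1), (box2, c2) = r1, r2
    if c1 != c2 or box1.keys() != box2.keys():
        return None
    differing = [v for v in box1 if box1[v] != box2[v]]
    if len(differing) != 1:                # must differ on exactly one variable
        return None
    v = differing[0]
    (lo1, hi1), (lo2, hi2) = box1[v], box2[v]
    if hi1 < lo2 or hi2 < lo1:             # intervals do not touch
        return None
    joined = dict(box1)
    joined[v] = (min(lo1, lo2), max(hi1, hi2))
    return (joined, c1)

r1 = ({"grade": (0.0, 9.5), "age": (18.0, 25.0)}, "fail")
r2 = ({"grade": (9.5, 12.0), "age": (18.0, 25.0)}, "fail")
merged = try_join(r1, r2)
```

For nominal variables, the interval-touching test would instead be replaced by a union of the two value sets, as the text above describes.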
The fourth phase consists of growing a decision tree from the decision regions
representation. Andrzejak et al. [7] attempt to mimic the C5.0 algorithm using
the values in the regions as examples. One problem with this method is that a
region sometimes must be divided in two to perform a split, increasing the number
of regions and thus the complexity of the model. Hall et al. [8] do not perform
this phase and the merged model is represented as the set of regions. Bursteinas
and Long [10] claim to grow a tree but do not describe the method.
3 Methodology
To carry out the experiments, a system with five processes was developed with
the architecture presented in Fig. 1.
The first process creates the data sets (one for each course in the university)
from the academic database. These contain enrollment data (Section 3.1). The
second process creates decision tree models for each course, analyses them in
order to determine the most important variables and evaluates them to assess
their quality (Section 3.2). The third process groups the models according to
different criteria (Section 3.3). The models in each group are then merged by
the fourth process (Section 3.4). Finally the fifth process evaluates the merged
models from a performance improvement point of view (Section 3.5).
This process has two sub-processes: (1) the models for each course data set
are trained and analysed in order to find out the most important variables for
prediction; (2) the prediction quality of each model is evaluated.
Model training and analysis. The models are decision tree classifiers generated
by the C5.0 algorithm [11]. Decision trees require neither prior domain knowledge
nor heavy parameter tuning, which makes them appropriate for both prediction and
exploratory data analysis, and their models are human-interpretable. In this
study, students are classified as having passed or failed a course. The C5.0
algorithm measures the importance Iv of a variable by determining the percentage
of examples tested on that variable in a node, relative to all examples (eq. 1).
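Equation 1 itself is not reproduced in this excerpt; from the verbal description above, it presumably takes the form (our reconstruction):

```latex
% Presumed form of eq. 1: the importance of variable v is the percentage of
% examples tested on v, over all nodes that split on v, relative to the
% total number of examples.
I_v = \frac{\lvert \{\text{examples tested on } v\} \rvert}
           {\lvert \{\text{all examples}\} \rvert} \times 100\%
```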
All models in a group set are merged according to the experimental set-up
presented in Fig. 4. For the process of merging models, each
model must be represented as a set of decision regions. This can take the form
of a decision table, in which each row is a decision region. Therefore, the first
and second models are converted to decision tables and merged, yielding the
model a1 , also in decision table form. Then the third model is also converted to
a decision table and is merged with model a1 yielding model a2 .
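The sequential merging just described amounts to folding a pairwise merge over the models of a group. A hedged sketch follows, in which to_decision_table and merge_tables are placeholders for the tree-to-regions conversion and the intersection/filtering/reduction pipeline detailed below, not the paper's actual code:

```python
# Sketch of the sequential merging loop: a1 = merge(m1, m2), a2 = merge(a1, m3),
# and so on until the last merged model a_{n-1}.

def merge_group(models, to_decision_table, merge_tables):
    """Merge a group of tree models into one decision table, pairwise."""
    merged = to_decision_table(models[0])
    for model in models[1:]:
        merged = merge_tables(merged, to_decision_table(model))
    return merged

# Toy stand-ins: a "model" is already a list of regions; merging concatenates.
result = merge_group([["r1"], ["r2"], ["r3"]],
                     to_decision_table=lambda m: list(m),
                     merge_tables=lambda a, b: a + b)
```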
Table 2. Group sets of models
The last merged model an−1 is converted to the decision tree form and is
evaluated against data of all courses in the group. Each one of these sub-processes
and its tasks are detailed in the following sub-sections.
Merge models. The process of merging two models encompasses three sequen-
tial sub-processes: intersection, filtering and reduction as presented in Fig. 6.
Fig. 8. Filtering a merged decision table
Fig. 9. Reducing a merged decision table
The weight of the resulting region is the sum of the weights of the regions
that are joined. After all regions have been subjected to reduction, they are again
examined and those that have zero weight are removed. Another consequence
of the reduction is that there may exist variables with the same value in all
decision regions. The columns for these variables are removed from the table.
Fig. 9 shows the result of reducing the decision table of Fig. 8. The reduction
sub-process results in the last successfully merged model of the group so far.
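The reduction step on a decision table can be sketched as follows; the table layout (one dict per region, with a class and a weight column) is our own illustration:

```python
# Sketch of reduction: drop regions with zero weight, then drop variable
# columns that hold the same value in every remaining region.

def reduce_table(rows):
    """rows: list of dicts with variable columns plus 'class' and 'weight'."""
    rows = [r for r in rows if r["weight"] > 0]
    variables = [k for k in rows[0] if k not in ("class", "weight")]
    constant = [v for v in variables
                if len({r[v] for r in rows}) == 1]   # same value everywhere
    return [{k: r[k] for k in r if k not in constant} for r in rows]

table = [
    {"grade": "low",  "sex": "F", "class": "fail", "weight": 3},
    {"grade": "high", "sex": "F", "class": "pass", "weight": 0},
    {"grade": "high", "sex": "F", "class": "pass", "weight": 5},
]
reduced = reduce_table(table)
# The zero-weight row disappears and the constant 'sex' column is dropped.
```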
Conversion to decision tree. The last merged model of the group is converted
to the decision tree representation, yielding the group model. For this purpose,
examples are generated randomly, bounded by the limits of each variable from
the decision table and submitted to the C5.0 algorithm to train a model. Each
decision region provides examples corresponding to a combination of the
set of values of each variable with the set of values of other variables and the
assigned class to the region. The set of values of numerical variables are bounded
by two limits. If the upper limit is missing (+∞), then the maximum observed
value from all courses data sets is used. When the lower limit is zero, the lowest
observed value across all courses data sets is used (e.g., age). These values are
collected as part of the final task of the data extraction process (Section 3.1).
The generation of examples can be controlled with the examples for numerical
variables parameter. Setting it to limits generates only two examples (one for
each limit), while samples generates examples between the limits with a step
of 5 (our initial approach used all values but proved infeasible due to memory
limitations). The weight of a region can also influence the number of examples
generated, being controlled by a weight examples parameter. If active, the num-
ber of generated examples by each region is multiplied by the weight of that
region. Generating some examples more frequently than others is a way to pre-
serve the importance of a region over others with less weight in the resulting
decision tree.
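Under the limits setting, example generation can be sketched as below; the parameter names follow the text, but the code and region representation are our own illustration:

```python
# Sketch of example generation from a decision region: each numeric variable
# contributes its two limit values, and when weight_examples is active the
# region's examples are replicated by its weight, preserving its importance.
from itertools import product

def generate_examples(region, weight=1, weight_examples=False):
    """region: ({var: (low, high)}, class); returns (features, class) pairs
    built from every combination of the variables' limit values."""
    box, label = region
    names = sorted(box)
    combos = [(dict(zip(names, values)), label)
              for values in product(*(box[n] for n in names))]
    return combos * weight if weight_examples else combos

examples = generate_examples(({"grade": (0.0, 9.5)}, "fail"),
                             weight=2, weight_examples=True)
# Two limit values for one variable, replicated twice -> four examples.
```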
This process evaluates a group model, using the experimental set-up presented
in Fig. 10. After this process, each model has two measures of performance: one
(F1) from the model obtained from its own data and another (F1g) from the group
model. Hence, ∆F1 (eq. 2) allows us to measure changes in predictive power:

∆F1 = F1g − F1 (2)

If ∆F1 is greater than zero, there is an improvement in predictive performance
from using the group model; if equal to zero, there is no improvement; if lower
than zero, there is a loss of predictive performance relative to the original
model.
4 Results
A set of 24 experiments was run to merge models, with the results presented in
Table 3. Each experiment is a combination of values of the parameters weight
attribution (wa), conflict class resolution (ccr), examples for numerical vari-
ables (efnv) and weight examples (we). This allows us to compare the impact of
each parameter on the average merging score (M), improvement in prediction
(∆F1) and across each group set: scientific areas (∆F1SA), number of variables
(∆F1#v), importance of variables (∆F1Iv), and baseline (∆F1B).
Merging score. The average merging score is 76%, which hardly changes
throughout experiments. This implies that none of the parameters has a sig-
nificant role in the ability to merge models. Fig. 11 shows the average merging
score of all experiments across groups in all group sets. The idea behind
creating groups of models is to bring together the models that are most similar,
i.e., those whose merging is less likely to result in disjoint regions. We observe that
different group sets affect the merging score. This is particularly noticeable in
grouping by the scientific areas in which group #4 (Humanities) has the highest
merging score (92%) while group #7 (Medicine) has the lowest (52%). Grouping
by number of variables shows that it is not possible to merge models with no
variables (group #1); however, from 1 variable onward, it is always possible to
merge all models into a single group model. Grouping by variable importance
always allows full merging, except in the last group (probably because those
models are less similar). The baseline group set has the average value of the
(76%). Results show that, from a merging ability perspective, merging by sci-
entific areas is not necessarily the best way to group models while number of
variables and importance of variables seem to be more suitable approaches.
References