
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 14, NO. 8, JULY 2018

arXiv:1709.02925v2 [cs.LG] 29 Sep 2018

Less Is More: A Comprehensive Framework for the Number of Components of Ensemble Classifiers

Hamed Bonab and Fazli Can

Abstract—The number of component classifiers chosen for an ensemble greatly impacts the prediction ability. In this paper, we use a geometric framework for a priori determining the ensemble size, which is applicable to most of the existing batch and online ensemble classifiers. There are only a limited number of studies on the ensemble size examining Majority Voting (MV) and Weighted Majority Voting (WMV). Almost all of them are designed for batch mode, hardly addressing online environments. Big data dimensions and resource limitations, in terms of time and memory, make the determination of the ensemble size crucial, especially for online environments. For the MV aggregation rule, our framework proves that the more strong components we add to the ensemble, the more accurate predictions we can achieve. For the WMV aggregation rule, our framework proves the existence of an ideal number of components, which is equal to the number of class labels, with the premise that the components are completely independent of each other and strong enough. While giving the exact definition of a strong and independent classifier in the context of an ensemble is a challenging task, our proposed geometric framework provides a theoretical explanation of diversity and its impact on the accuracy of predictions. We conduct a series of experimental evaluations to show the practical value of our theorems and the existing challenges.

Index Terms—Supervised learning, ensemble size, ensemble cardinality, voting framework, big data, data stream, law of diminishing returns

I. INTRODUCTION

Over the last few years, the design of learning systems for mining the data generated from real-world problems has encountered new challenges, such as the high dimensionality of big data, as well as growth in volume, variety, velocity, and veracity (the four V's of big data¹). In the context of data dimensions, volume is the amount of data, variety is the number of types of data, velocity is the speed of data, and veracity is the uncertainty of data generated in real-world applications and processed by the learning algorithm. The dynamic information processing and incremental adaptation of learning systems to temporal changes have been among the most demanding tasks in the literature for a long time [2].

Ensemble classifiers are among the most successful and well-known solutions to supervised learning problems, particularly for online environments [3], [4], [5]. The main idea is to construct a collection of individual classifiers, even with weak learners, and combine their votes. The aim is to build a stronger classifier, compared to each individual component classifier [6]. The training mechanism of the components and the vote aggregation method mostly characterize an ensemble classifier [7].

There are two main categories of vote aggregation methods for combining the votes of component classifiers: weighting methods and meta-learning methods [7], [8]. Weighting methods assign a combining weight to each component and aggregate their votes based on these weights (e.g. Majority Voting, Performance Weighting, Bayesian Combination). They are useful when the individual classifiers perform the same task and have comparable success. Meta-learning methods refer to learning from the classifiers and from the classifications of these classifiers on training data (e.g. Stacking, Arbiter Trees, Grading). They are best suited for situations where certain classifiers consistently mis-classify or correctly classify certain instances [7]. In this paper we study ensembles with the weighting combination rule; meta-learning methods are out of the scope of this paper.

An important aspect of ensemble methods is to determine how many component classifiers should be included in the final ensemble, known as the ensemble size or ensemble cardinality [1], [7], [9], [10], [11], [12], [13]. The impact of the ensemble size on efficiency, in terms of time and memory, and on predictive performance makes its determination an important problem [14], [15]. Efficiency is especially important for online environments. In this paper, we extend our geometric framework [1] for pre-determining the ensemble size, applicable to both batch and online ensembles.

Furthermore, diversity among component classifiers is an influential factor for having an accurate ensemble [7], [16], [17], [18]. Liu et al. [19] empirically studied the effect of ensemble size on diversity. Hu [10] explained that component diversity leads to uncorrelated votes, which in turn improves predictive performance. However, to the best of our knowledge, there is no explanatory theory revealing how and why diversity among components contributes toward overall ensemble accuracy [20]. Our proposed geometric framework introduces a theoretical explanation for the understanding of diversity in the context of ensemble classifiers.

The main contributions of this study are the following. We
• Present a brief comprehensive review of existing approaches for determining the number of component classifiers of ensembles,
• Provide a spatial modeling for ensembles and use the linear least squares (LSQ) solution [21] for optimizing the weights of the components of an ensemble classifier, applicable to both online and batch ensembles,
• Exploit the geometric framework, for the first time in the literature, for a priori determining the number of component classifiers of an ensemble,
• Explain the impact of diversity among the component classifiers of an ensemble on the predictive performance, from a theoretical perspective and for the first time in the literature,
• Conduct a series of experimental evaluations on more than 16 different real-world and synthetic data streams and show the practical value of our theorems and the existing challenges.

This is an extended version of the work presented as a short paper at the Conference on Information and Knowledge Management (CIKM), 2016 [1]. Manuscript submitted July 2018.
¹ http://www.ibmbigdatahub.com/infographic/four-vs-big-data
TABLE I
SYMBOL NOTATIONS OF THE GEOMETRIC FRAMEWORK

Notation                               | Definition
I = {I_1, I_2, ..., I_n}               | Instance window, I_i; (1 ≤ i ≤ n)
ξ = {CS_1, CS_2, ..., CS_m}            | Ensemble of m component classifiers, CS_j; (1 ≤ j ≤ m and 2 ≤ m)
C = {C_1, C_2, ..., C_p}               | Classification problem with p class labels, C_k; (1 ≤ k ≤ p and 2 ≤ p)
s_ij = <S_ij^1, S_ij^2, ..., S_ij^p>   | Score-vector for I_i and CS_j
o_i = <O_i^1, O_i^2, ..., O_i^p>       | Ideal-point for I_i, O_i^k; (1 ≤ k ≤ p)
a_i = <A_i^1, A_i^2, ..., A_i^p>       | Centroid-point for I_i, A_i^k; (1 ≤ k ≤ p)
w = <W_1, W_2, ..., W_m>               | Weight vector for ξ, W_j; (1 ≤ j ≤ m)
b_i = <B_i^1, B_i^2, ..., B_i^p>       | Weighted-centroid-point for I_i, B_i^k; (1 ≤ k ≤ p)

II. RELATED WORKS
The importance of the ensemble size is discussed in several studies. There are two categories of approaches in the literature for determining the ensemble size. Several ensembles a priori determine the ensemble size with a fixed value (like bagging), while others try to determine the best ensemble size dynamically, by checking the impact of adding new components to the ensemble [7]. Zhou et al. [22] analyzed the relationship between an ensemble and its components, and concluded that aggregating many of the components may be the better approach. Through an empirical study, Liu et al. [19] showed that a subset of the components of a larger ensemble can perform comparably to the full ensemble, in terms of accuracy and diversity. Ulaş et al. [23] discussed approaches for incrementally constructing a batch-mode ensemble using different criteria, including accuracy, significant improvement, diversity, correlation, and the role of search direction.

This led to the idea in ensemble construction that it is sometimes useful to let the ensemble grow without limit and then prune it in order to get a more effective ensemble [7], [24], [25], [26], [27], [28]. Ensemble selection methods were developed as pruning strategies for ensembles. However, with today's data dimensions and resource constraints, this idea seems impractical. Since the number of data instances grows exponentially, especially in online environments, there is a potential problem of approaching an infinite number of components for an ensemble. As a result, determining an upper bound for the number of components with a reasonable resource consumption is essential. As mentioned in [29], the errors cannot be arbitrarily reduced by increasing the ensemble size indefinitely.

There are a limited number of studies for batch-mode ensembles. Latinne et al. [9] proposed a simple empirical procedure for limiting the number of classifiers, based on the McNemar non-parametric test of significance. Similar approaches [30], [31] suggested a range of 10 to 20 base classifiers for bagging, depending on the base classifier and the dataset. Oshiro et al. [11] cast the idea that there is an ideal number of component classifiers within an ensemble. They defined the ideal number as the ensemble size beyond which exploiting more base classifiers brings no significant performance gain and only increases computational costs. They showed this by using the weighted average area under the ROC curve (AUC) and some dataset density metrics. Fumera et al. [31], [32] applied an existing analytical framework for linearly combined classifiers to the analysis of bagging, using misclassification probability. Hernández-Lobato et al. [12] suggested a statistical algorithm for determining the size of an ensemble, by estimating the number of classifiers required for obtaining stable aggregated predictions, using majority voting.

Pietruczuk et al. [33], [34] recently studied the automatic adjustment of the ensemble size for online environments. Their approach determines whether a new component should be added to the ensemble by using a probability framework and defining a confidence level. However, the diversity impact of the component classifiers is not taken into account, and there is a possibility of approaching an infinite number of components without reaching the confidence level. The assumption that the error distribution is i.i.d. cannot be guaranteed, especially with a higher ensemble size; this reduces the improvement due to each extra classifier [29].

III. A GEOMETRIC FRAMEWORK

In this section we propose a geometric framework for studying the theoretical side of ensemble classifiers, based on [1]. We mainly focus on online ensembles since they are more specific models compared to batch-mode ensembles. The main difference is that online ensembles are trained and tested over the course of incoming data, while batch-mode ensembles are trained and tested once. As a result, batch-mode ensembles are also applicable to our framework, with a simpler declaration. We use this geometric framework in [35] for aggregating votes and proposing a novel online ensemble for evolving data streams, called GOOWE.

Suppose we have an ensemble of m component classifiers, ξ = {CS_1, CS_2, ..., CS_m}. Due to resource limitations, we are only able to keep the n latest instances of an incoming data stream as an instance window, I = {I_1, I_2, ..., I_n}, where I_n is the latest instance and all the true-class labels are available. We assume that our supervised learning problem has p class labels, C = {C_1, C_2, ..., C_p}. For batch-mode ensembles, I can be considered as the whole set of training data. Table I presents the symbol notations of our geometric framework.

Our framework uses a p-dimensional Euclidean space for modeling the components' votes and the true class labels. For a given instance I_i (1 ≤ i ≤ n), each component classifier CS_j (1 ≤ j ≤ m) returns a score-vector
s_ij = <S_ij^1, S_ij^2, ..., S_ij^p>, where Σ_{k=1}^{p} S_ij^k = 1. Considering all the score-vectors in our p-dimensional space, the framework builds a polytope of votes, which we call the score-polytope of I_i. For the true-class label of I_i we have o_i = <O_i^1, O_i^2, ..., O_i^p> as the ideal-point. Note that o_i is a one-hot vector in this study. However, there are other supervised problems where this assumption is not true, e.g. multi-label classification. Studying other variations of the supervised problem is out of the scope of this study. A general schema of our geometric framework is presented in Fig. 1.

Fig. 1. Schema of the geometric framework (obtained from [1]). I_t with class label y_t = C1 is fed to the ensemble. Each component classifier, CS_j, generates a score-vector, s_tj. These score-vectors construct a surface in the Euclidean space, called the score-polytope.

Example. Assume we have a supervised problem with 3 class labels, C = {C1, C2, C3}. For a given instance I_t, the true-class label is C2. The ideal-point would be o_t = <0, 1, 0>.

One could presumably define many different algebraic rules for vote aggregation [36], [37]: minimum, maximum, sum, mean, product, median, etc. While these vote aggregation rules can be expressed using our geometric framework, we study the Majority Voting (MV) and Weighted Majority Voting (WMV) aggregation rules in this paper. In addition, individual vote scores can be aggregated based on two different voting schemes [38]: (a) Hard Voting, where the score-vector of a component classifier is first transformed into a one-hot vector, possibly using a hard-max function, and then combined; (b) Soft Voting, where the score-vector itself is used for vote aggregation. We use soft voting for our framework.
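To make the two voting schemes concrete, the following sketch (our illustration, not code from the paper) aggregates the score-vectors of a three-component ensemble with both schemes; note how the two schemes can disagree on the predicted label.

```python
import numpy as np

def hard_vote(scores: np.ndarray) -> np.ndarray:
    """Hard voting: turn each score-vector into a one-hot vector, then average."""
    m, p = scores.shape
    one_hot = np.zeros_like(scores)
    one_hot[np.arange(m), scores.argmax(axis=1)] = 1.0  # hard-max per component
    return one_hot.mean(axis=0)

def soft_vote(scores: np.ndarray) -> np.ndarray:
    """Soft voting: average the raw score-vectors (the MV centroid-point)."""
    return scores.mean(axis=0)

# Three components (m = 3), three class labels (p = 3); each row sums to 1.
scores = np.array([[0.5, 0.3, 0.2],
                   [0.4, 0.4, 0.2],
                   [0.1, 0.6, 0.3]])
print(hard_vote(scores))  # [0.667, 0.333, 0.0]: hard voting predicts C1
print(soft_vote(scores))  # [0.333, 0.433, 0.233]: soft voting predicts C2
```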
The Euclidean norm is used as the loss function, loss(·, ·), for optimization purposes [21]. The Euclidean distance between any score-vector and the ideal-point expresses the effectiveness of that component for the given instance. Using aggregation rules, we aim to define a mapping function from a score-polytope into a single vector, and measure the effectiveness of our ensemble. Wu and Crestani [39] applied a similar geometric framework for the data fusion of information retrieval systems. In this study, some of our theorems are obtained and adapted to ensemble learning from their framework.

A. Majority Voting (MV)

The mapping of a given score-polytope into its centroid can be expressed as the MV aggregation (plurality voting or averaging). For a given instance, I_t, we have the following mapping to the centroid-point, a_t = <A_t^1, A_t^2, ..., A_t^p>:

$$A_t^k = \frac{1}{m}\sum_{j=1}^{m} S_{tj}^k \quad (1 \le k \le p) \tag{1}$$

Theorem 1. For I_t, the loss between the centroid-point a_t and the ideal-point o_t is not greater than the average loss between the m score-vectors and o_t, that is to say,

$$loss(a_t, o_t) \le \frac{1}{m}\sum_{j=1}^{m} loss(s_{tj}, o_t) \tag{2}$$

Proof. Based on Minkowski's inequality for sums [40],

$$\sqrt{\sum_{k=1}^{p}\Big(\sum_{j=1}^{m}\theta_j^k\Big)^{2}} \;\le\; \sum_{j=1}^{m}\sqrt{\sum_{k=1}^{p}\big(\theta_j^k\big)^{2}}$$

Letting $\theta_j^k = S_{tj}^k - O_t^k$ and substituting results in

$$\sqrt{\sum_{k=1}^{p}\Big(m\Big(\frac{1}{m}\sum_{j=1}^{m}\big(S_{tj}^k - O_t^k\big)\Big)\Big)^{2}} \;\le\; \sum_{j=1}^{m}\sqrt{\sum_{k=1}^{p}\big(S_{tj}^k - O_t^k\big)^{2}}$$

Since m > 0, we have the following:

$$\sqrt{\sum_{k=1}^{p}\Big(\frac{1}{m}\sum_{j=1}^{m}S_{tj}^k - O_t^k\Big)^{2}} \;\le\; \frac{1}{m}\sum_{j=1}^{m}\sqrt{\sum_{k=1}^{p}\big(S_{tj}^k - O_t^k\big)^{2}}$$

Using Eq. 1 and the loss definition, Eq. 2 is achieved. ∎

Discussion. Theorem 1 shows that the performance of an ensemble with the MV aggregation rule is at least equal to the average performance of all its individual components.

Theorem 2. For I_t, let ξ_l = ξ − {CS_l} (1 ≤ l ≤ m) be a subset of ensemble ξ without CS_l. Each ξ_l has a_tl as its centroid-point. We have

$$loss(a_t, o_t) \le \frac{1}{m}\sum_{l=1}^{m} loss(a_{tl}, o_t) \tag{3}$$

Proof. a_t is the centroid-point of all a_tl points, according to the definition. Assume each ξ_l is an individual classifier with score-vector a_tl. Applying Theorem 1 to these m classifiers yields Eq. 3 directly. ∎

Discussion. Theorem 2 can be generalized for any subset definition with (m − f) components, 1 ≤ f ≤ (m − 2). This shows that an ensemble with m components performs better than (or at least equal to) the average performance of ensembles with m − f components. It can be concluded that better performance can be achieved if we aggregate more component classifiers. However, if we keep adding poor components to the ensemble, it can diminish the overall prediction accuracy by increasing the upper bound in Eq. 2. This is in agreement with the result of the Bayes error reduction analysis [41], [42]. Setting a threshold, as expressed in [29], [33], [34], [42], can give us the ideal number of components for a specific problem.
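As a quick numerical sanity check of Theorem 1 (a sketch added here, not from the original text), the snippet below draws random score-vectors and verifies that the centroid loss never exceeds the average component loss:

```python
import numpy as np

rng = np.random.default_rng(42)
m, p = 5, 3                                   # 5 components, 3 class labels
scores = rng.dirichlet(np.ones(p), size=m)    # rows sum to 1, like score-vectors
ideal = np.zeros(p); ideal[1] = 1.0           # one-hot ideal-point for class C2

centroid = scores.mean(axis=0)                # Eq. 1: MV centroid-point
lhs = np.linalg.norm(centroid - ideal)                # loss(a_t, o_t)
rhs = np.linalg.norm(scores - ideal, axis=1).mean()   # average component loss
assert lhs <= rhs + 1e-12                     # Theorem 1 (Minkowski inequality)
print(f"centroid loss {lhs:.4f} <= average component loss {rhs:.4f}")
```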
B. Weighted Majority Voting (WMV)

For this aggregation rule, a weight vector w = <W_1, W_2, ..., W_m> for the components of the ensemble is defined, with W_j ≥ 0 and Σ_{j=1}^{m} W_j = 1. For a given instance, I_t, we have the following mapping to the weighted-centroid-point, b_t = <B_t^1, B_t^2, ..., B_t^p>:

$$B_t^k = \sum_{j=1}^{m} W_j S_{tj}^k \quad (1 \le k \le p) \tag{4}$$

Note that giving equal weights to all the components results in the MV aggregation rule. WMV presents a flexible aggregation rule: no matter how poor a component classifier is, with a proper weight vector we can cancel its effect on the aggregated results. However, as discussed earlier, this is not true for the MV rule. In the following, we give the formal definition of the optimum weight vector, which we aim to find.

Definition 1. Optimum Weight Vector. For an ensemble, ξ, and a given instance, I_t, the weight vector w_o with the weighted-centroid-point b_o is the optimum weight vector when, for any w_x with weighted-centroid-point b_x, the following is true: loss(b_o, o_t) ≤ loss(b_x, o_t).

Theorem 3. For a given instance, I_t, let w_o be the optimum weight vector with the weighted-centroid-point b_t. The following must hold:

$$loss(b_t, o_t) \le \min\{loss(s_{t1}, o_t), \ldots, loss(s_{tm}, o_t)\} \tag{5}$$

Proof. Assume that the least loss belongs to component j, among the m score-vectors. We have the following two cases.
(a) Maintaining the performance. Simply giving a weight of 1 to component j and 0 to the remaining components results in the equality case: loss(b_t, o_t) = loss(s_tj, o_t).
(b) Improving the performance. Using a linear combination of j and the other components with proper weights results in a weighted-centroid-point closer to the ideal-point. We can always find such a weight vector in the Euclidean space if the other components are not exactly the same as j. ∎

Using the squared Euclidean norm as the measure of closeness, the linear least squares problem (LSQ) [21] results in

$$\min_{w}\; \lVert o - wS \rVert_2^2 \tag{6}$$

where, for each instance I_i in the instance window, S ∈ R^{m×p} is the matrix whose row j is the score-vector s_ij of component classifier j, w ∈ R^m is the vector of weights to be determined, and o ∈ R^p is the vector of the ideal-point. We use the following function for our optimization solution:

$$f(W_1, W_2, \ldots, W_m) = \sum_{k=1}^{p}\Big(\sum_{j=1}^{m}\big(W_j S_{ij}^k\big) - O_i^k\Big)^{2} \tag{7}$$

Taking the partial derivative over W_q (1 ≤ q ≤ m), setting the gradient to zero, ∇f = 0, and finding the optimum points gives us the optimum weight vector. Letting the following summations be λ_qj and γ_q,

$$\lambda_{qj} = \sum_{k=1}^{p} S_{iq}^k S_{ij}^k, \quad (1 \le q, j \le m) \tag{8}$$

$$\gamma_q = \sum_{k=1}^{p} O_i^k S_{iq}^k, \quad (1 \le q \le m) \tag{9}$$

leads to m linear equations in m variables (the weights). Briefly, wΛ = γ, where Λ ∈ R^{m×m} is the coefficient matrix and γ ∈ R^m is the remainder vector, obtained using Eq. 8 and Eq. 9, respectively. The proper weights in this matrix equation are our intended optimum weight vector. For a more detailed explanation see [35].

Discussion. According to Eq. 8, Λ is a symmetric square matrix. If Λ has full rank, our problem has a unique solution. On the other hand, in the sense of a least squares solution [21], it is probable that Λ is rank-deficient, and we may not have a unique solution. Studying the properties of this matrix leads us to the following theorem.

Theorem 4. If the number of component classifiers is not equal to the number of class labels, m ≠ p, then the coefficient matrix is rank-deficient, det Λ = 0.

Proof. Since we have p dimensions in our Euclidean space, p independent score-vectors are needed for the basis spanning set. Any number of vectors m greater than p is dependent on the basis spanning set, and any number of vectors m less than p is insufficient for constructing the basis spanning set. ∎

Discussion. The above theorem excludes some cases in which we cannot find optimum weights for aggregating votes. There are several numerical solutions for solving rank-deficient least squares problems (e.g. QR factorization, Singular Value Decomposition (SVD)); however, the resulting solution is relatively expensive, may not be unique, and optimality is not guaranteed. Theorem 4's outcome is that the number of independent components of an ensemble is crucial for providing a full-rank coefficient matrix, in the aim of an optimal weight vector solution.
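The sketch below (our illustration, not the authors' implementation) builds Λ and γ for a single instance following Eq. 8 and Eq. 9 and solves wΛ = γ. Since Λ is symmetric, solving Λw = γ is equivalent, and np.linalg.lstsq also copes with the rank-deficient case that Theorem 4 predicts for m > p; in [35] the analogous system is built over the whole instance window.

```python
import numpy as np

def optimum_weights(scores: np.ndarray, ideal: np.ndarray) -> np.ndarray:
    """Optimum weight vector for one instance via w Lambda = gamma.

    scores: (m, p) matrix whose row j is the score-vector s_ij of component j.
    ideal:  (p,) one-hot ideal-point o_i.
    """
    lam = scores @ scores.T    # Lambda[q, j] = sum_k S_iq^k S_ij^k  (Eq. 8)
    gamma = scores @ ideal     # gamma[q]     = sum_k O_i^k  S_iq^k  (Eq. 9)
    # lstsq returns a least squares solution even when Lambda is
    # rank-deficient, e.g. for m > p, where rank(Lambda) <= p < m (Theorem 4).
    w, *_ = np.linalg.lstsq(lam, gamma, rcond=None)
    return w

scores = np.array([[0.7, 0.2, 0.1],    # m = p = 3: Lambda is full rank when
                   [0.2, 0.5, 0.3],    # the score-vectors are independent
                   [0.3, 0.3, 0.4]])
ideal = np.array([0.0, 1.0, 0.0])
w = optimum_weights(scores, ideal)
print(w, np.linalg.matrix_rank(scores @ scores.T))
```

With m = p and independent score-vectors, the printed rank equals m and the weight vector is unique; appending a fourth row to `scores` makes Λ singular, in line with Theorem 4.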
C. Diversity among components

Theorem 4 shows that for weight vector optimality, m = p should hold. However, the reverse cannot be guaranteed in general. Assuming m = p and letting det Λ = 0 for the parametric coefficient matrix results in some conditions where we have vote agreement and cannot find a unique optimum weight vector. As an example, suppose we have two component classifiers for a binary classification task, m = p = 2. Letting det Λ = 0 results in the following equations: S_11^1 + S_12^2 = 1 or S_11^2 + S_12^1 = 1, meaning the agreement of the component classifiers, i.e. the exact same vote vectors. More specifically, this suggests another condition for weight vector optimality: the importance of diversity among the component classifiers.

Fig. 2 presents four mainly different score-vector possibilities for an ensemble of size three. The true class label of the examined instance is C1 for a binary classification problem. All score-vectors are normalized and placed on the main diagonal of the spatial environment. The counter-diagonal line divides the decision boundary for the class label determination based on the probability values. If the component's score-vector is in the lower triangular part, it is classified as C1 and, similarly, if it is in the upper triangular part, it is classified as C2. Fig. 2 (a) and (b) show the mis-classification and correct classification situations, respectively. Fig. 2 (c) and (d) show disagreement among the components of the ensemble.

Fig. 2. Four score-vector possibilities of an ensemble of size three, panels (a) through (d). The true class label of the instance is C1 for a binary classification problem. If two of these score-vectors exactly match each other for several data instances, we cannot consider them to be independent and diverse enough components.
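To spell out the two-classifier case referenced above, the short derivation below is added here for completeness (it is ours, not from the original text); the notation follows Table I with the window reduced to a single instance I_1:

$$\Lambda = \begin{pmatrix} \lambda_{11} & \lambda_{12}\\ \lambda_{21} & \lambda_{22} \end{pmatrix}, \qquad \lambda_{qj} = \sum_{k=1}^{2} S_{1q}^k S_{1j}^k .$$

Writing $s_{11} = (a,\,1-a)$ and $s_{12} = (b,\,1-b)$, since each score-vector sums to one, Lagrange's identity gives

$$\det \Lambda = \lambda_{11}\lambda_{22} - \lambda_{12}^2 = \bigl(a(1-b) - (1-a)b\bigr)^2 = (a-b)^2 .$$

Hence det Λ = 0 exactly when a = b, i.e. when the two components cast identical votes; this is equivalent to the condition S_11^1 + S_12^2 = 1 stated above, and it is the geometric meaning of insufficient diversity.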

If, for several instances in a sequence of data, the score-vectors of two components are equal (or act predictably similar), they are considered dependent components. There are several measurements for quantifying this dependency for ensemble classifiers (e.g. the Q-statistic) [18]. However, most of the measurements in practice use the oracle output of the components (i.e. only the predicted class labels) [18]. Our geometric framework shows a potential importance of using score-vectors for diversity measurements. It is out of the scope of this study to propose a diversity measurement using score-vectors, and we leave it as future work.

To the best of our knowledge, there is no explanatory theory in the literature revealing why and how diversity among components contributes toward overall ensemble accuracy [20], [16]. Our geometric modeling of the ensemble's score-vectors and the optimum weight vector solution provide a theoretical insight for the commonly agreed upon idea that "the classifiers should be different from each other, otherwise the overall decision will not be better than the individual decisions" [18]. Optimum weights can be reached when we have the same number of independent and diverse component classifiers as class labels. Diversity has a great impact on the coefficient matrix, which consequently impacts the accuracy of the predictions of an ensemble. For the case of majority voting, adding more dependent classifiers will dominate the decision of the other components.

Discussion. Our geometric framework supports the idea that there is an ideal number of component classifiers for an ensemble, with which we can reach the most accurate results. Increasing or decreasing the number of classifiers from this ideal point may deteriorate predictions, or bring no gain to the overall performance of the ensemble. Having more components than the ideal number of classifiers can mislead the aggregation rule, especially for majority voting. On the other hand, having fewer is insufficient for constructing an ensemble which is stronger than the single classifier. We refer to this situation as "the law of diminishing returns in ensemble construction."

Our framework suggests the number of class labels of a dataset as the ideal number of component classifiers, with the premise that they generate independent scores and are aggregated with optimum weights. However, real-world datasets and existing ensemble classifiers do not guarantee this premise most of the time. Determining the exact value of this ideal point for a given ensemble classifier, over real-world data, is still a challenging problem due to the different complexities of datasets.

IV. EXPERIMENTAL EVALUATION

The experiments conducted in [1] showed that for ensembles trained on a specific dataset, there is an ideal number of components beyond which having more will deteriorate, or at least provide no benefit to, our prediction ability. Our extensive experiments in [35] show the practical value of this geometric framework for aggregating votes.

Here, through a series of experiments, we first investigate the impact of the number of class labels and the number of component classifiers for MV and WMV using a synthetic dataset generator; then, we study the impact of miscellaneous data streams using several real-world and synthetic datasets; finally, we explore the outcome of our theorems on the diversity of component classifiers and the practical value of our study. All the experiments are implemented using the MOA framework [43], and Interleaved-Test-Then-Train is used for accuracy measurements. An instance window with a length of 500 instances is used for keeping the latest instances.
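For readers unfamiliar with this evaluation scheme, the loop below sketches Interleaved-Test-Then-Train (prequential) evaluation with a sliding instance window. It is a generic illustration in Python rather than the MOA implementation, and `ensemble.predict`/`ensemble.train` are assumed interfaces:

```python
from collections import deque

def interleaved_test_then_train(ensemble, stream, window_size=500):
    """Prequential evaluation: test on each incoming instance, then train on it."""
    window = deque(maxlen=window_size)   # keeps the n latest labeled instances
    correct = total = 0
    for x, y in stream:                  # stream yields (features, true_label)
        correct += int(ensemble.predict(x) == y)   # test first
        ensemble.train(x, y)                       # then train on the same instance
        window.append((x, y))            # the window feeds the weight optimization
        total += 1
    return correct / total               # prequential accuracy
```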
A. Impact of Number of Class Labels

Setup. To investigate the sole impact of the number of class labels of the dataset, i.e. the p value, on the accuracy of an ensemble, we use the GOOWE [35] method. It uses our optimum weight vector calculation in a WMV procedure for vote aggregation. The Hoeffding Tree (HT) [44] is used as the component classifier, due to its high adaptivity to data stream classification. For a fair comparison, we modify GOOWE to use the MV aggregation rule by simply providing equal weights to the components. These two variations of GOOWE, i.e. WMV and MV, are used for this experiment. Each of these variations is trained and tested using different ensemble size values, starting from only two
components and doubling at each step; i.e. our investigated ensemble sizes, m, are 2, 4, ..., 128.

Fig. 3. Prediction behavior of WMV and MV aggregation rules, in terms of accuracy, for the RBF-C datasets, with increasing both the number of component classifiers, m, and the number of class labels, p. Panels: (a) RBF-C2, (b) RBF-C4, (c) RBF-C8, (d) RBF-C16, (e) RBF-C32, (f) RBF-C64. The equality case, m = p, is shown on each plot using a vertical dashed green line.
Dataset. Since existing real-world datasets are not consistent in terms of classification complexity, we are only able to use synthetic data for this experiment in order to have reasonable comparisons. We choose the popular Random RBF generator, since it is capable of generating data streams with an arbitrary number of features and class labels [45]. Using this generator, implemented in the MOA framework [43], we prepare six datasets, each containing one million instances with 20 attributes, using the default parameter settings of the RBF generator. The only difference among the datasets is the number of class labels, which is 2, 4, 8, 16, 32, and 64. We reflect this in the dataset naming as RBF-C2, RBF-C4, ..., respectively.

Results. Fig. 3 presents the prediction accuracy for WMV and MV with increasing component counts, m, on each dataset. To mark the equality of m and p, we use a vertical dashed green line. We can make the following interesting observations: (i) A weighted aggregation rule becomes more vital with an increasing number of component classifiers. (ii) WMV performs more resiliently in multi-class problems than in binary classification problems, when compared to MV; the gap between WMV and MV seems to increase with greater numbers of class labels. (iii) There is a peak point in the accuracy value, and it is dependent on the number of class labels. This can be seen by comparing the RBF-C2, RBF-C4, and RBF-C8 plots (the first row in Fig. 3) with the RBF-C16, RBF-C32, and RBF-C64 plots (the second row in Fig. 3). In the former set we see that after a peak point the accuracy starts to drop; however, in the latter set we see that the peak points are at m = 128. (iv) The theoretical vertical line, i.e. the equality case m = p, seems to precede the peak point on each plot. We suspect that this might be due to Theorem 4's premise conditions: generating independent scores and aggregating with optimum weights.

B. Impact of Data Streams

Setup. There are many factors to consider regarding the complexity of classification problems: concept drift, the number of features, etc. To this end, we investigate the number of component classifiers for WMV and MV on a wide range of datasets. We use an experimental setup similar to the previous experiments on different synthetic and real-world datasets. We aim to investigate some general patterns in more realistic problems.

Dataset. We select eight synthetic and eight real-world benchmark datasets used for stream classification problems in the literature. A summary of our datasets is given in Table II. For this selection, we aim to have a mixture of different concept drift types, numbers of features, numbers of class labels, and noise percentages. The synthetic datasets are similar to the ones used for the GOOWE evaluation [35]. For the real-world datasets, the Sensor, PowerSupply, and HyperPlane datasets are taken from²; the remainder of the real-world datasets are taken from³. See [1], [35] for detailed explanations of the datasets.

² Access URL: http://www.cse.fau.edu/~xqzhu/stream.html
³ Access URL: https://www.openml.org/
Fig. 4. Prediction behavior of WMV and MV aggregation rules, in terms of accuracy, for miscellaneous synthetic datasets, with increasing both the number of component classifiers, m, and class labels, p. Panels: (a) HYP-F, (b) HYP-S, (c) SEA-F, (d) SEA-S, (e) TREE-F, (f) TREE-S, (g) RBF-N, (h) RBF-G. The equality case, m = p, is marked on each plot using a vertical dashed green line.

TABLE II
SUMMARY OF DATASET CHARACTERISTICS

Dataset         | #Instance | #Att | #CL | %N | Drift Spec.
HYP-F           | 1 × 10^6  | 10   | 2   | 5  | Incrm., DS=0.1
HYP-S           | 1 × 10^6  | 10   | 2   | 5  | Incrm., DS=0.001
SEA-F           | 1 × 10^6  | 3    | 2   | 10 | Abrupt, #D=9
SEA-S           | 1 × 10^6  | 3    | 2   | 10 | Abrupt, #D=3
TREE-F          | 1 × 10^6  | 10   | 6   | 0  | Reoc., #D=15
TREE-S          | 1 × 10^6  | 10   | 4   | 0  | Reoc., #D=4
RBF-G           | 1 × 10^6  | 20   | 10  | 0  | Gr., DS=0.01
RBF-N           | 1 × 10^6  | 20   | 2   | 0  | No Drift
Airlines        | 539,383   | 7    | 2   | -  | Unknown
ClickPrediction | 399,482   | 11   | 2   | -  | Unknown
Electricity     | 45,311    | 8    | 2   | -  | Unknown
HyperPlane      | 1 × 10^5  | 10   | 5   | -  | Unknown
CoverType       | 581,012   | 54   | 7   | -  | Unknown
PokerHand       | 1 × 10^7  | 10   | 10  | -  | Unknown
PowerSupply     | 29,925    | 2    | 24  | -  | Unknown
Sensor          | 2,219,802 | 5    | 58  | -  | Unknown

Note: #CL: No. of Class Labels, %N: Percentage of Noise, DS: Drift Speed, #D: No. of Drifts, Gr.: Gradual.

Results. Fig. 4 and Fig. 5 present the prediction accuracy of WMV and MV for increasing component classifier counts, m, on each dataset. For marking the equality of m and p, we use a vertical dashed green line, similar to the previous experiments. As we can see, given broader types of datasets, each with completely different complexities, it is difficult to conclude strict patterns. We have the following interesting observations: (i) For almost all the datasets, WMV, with optimum weights, outperforms MV. (ii) We can see the same results as in the previous experiments: there is a peak point in the accuracy value, and it is dependent on the number of class labels. (iii) The theoretical vertical line, i.e. the equality case m = p, seems to precede the peak point on each plot. (iv) Optimum weighting seems to be more resilient in evolving environments, i.e. data streams with concept drift, regardless of the type of concept drift.

The observations we have with the real-world data streams provide strong evidence supporting our claim that the number of class labels has an important influence on the ideal number of component classifiers and the prediction performance. In Fig. 5, we observe that the peak performances, with one exception, are not observed with the maximum ensemble size. In other words, as we increase the number of component classifiers, move away from the green line, and employ an ensemble of size 128, in all cases the prediction performance becomes lower than that of a smaller-size ensemble. The only exception is observed with ClickPrediction; even with that one, no noticeable improvement is provided by the largest ensemble size. Furthermore, in all data streams except ClickPrediction, the peak performances are closer to the green line rather than being closer to the largest ensemble size.

C. Impact of Diversity

Setup. In order to study the impact of diversity in ensemble classifiers and show the practical value of our theorems, we design two different scenarios for the binary classification problem. We select binary classification for this experiment since the difference between WMV and MV is almost always insignificant for binary classification, compared to multi-class problems. In addition, multi-class problems can potentially be modeled as several binary classification problems [46].

To this end, we recruit a well-known and state-of-the-art online ensemble classifier, called leverage bagging (LevBag) [47], as the base ensemble method for our comparisons. It is based on the OzaBagging ensemble [48], [49] and is proven to react well in online environments. It exploits re-sampling with
Fig. 5. Prediction behavior of WMV and MV aggregation rules, in terms of accuracy, for miscellaneous real-world datasets, with increasing both the number of component classifiers, m, and class labels, p. Panels: (a) Airlines, (b) ClickPrediction, (c) Electricity, (d) HyperPlane, (e) CoverType, (f) PokerHand, (g) PowerSupply, (h) Sensor. The equality case, m = p, is marked on each plot using a vertical dashed green line.

replacement (i.e. input randomization), using a Poisson(λ) distribution to train diversified component classifiers.

We use LevBag in our experiments since it initializes a fixed number of component classifiers; i.e., unlike GOOWE, where component classifiers are dynamically added and removed during training over the course of the incoming stream data [35], for LevBag the number of component classifiers is fixed from initialization and the ensemble does not alter them. In addition, LevBag uses error-correcting output codes for handling multi-class problems, transforming them into several binary classification problems [47]. Majority voting is used for vote aggregation, as the baseline of our experiments.

Design. For our analysis, we train different LevBag ensembles with 2, 4, ..., 64, and 128 component classifiers, named LevBag-2, LevBag-4, ..., respectively. The Hoeffding Tree (HT) [44] is used as the component classifier.

We design two experimental scenarios and compare them with the LevBag ensembles as baselines. Each scenario is designed to show the practical value of our theorems from a different perspective. Here is a brief description; a code sketch of the diversity measurement used in Scenario 1 is given after the list.

• Scenario 1. We select the two most diverse components out of a LevBag-10 ensemble's pool of component classifiers, called the Sel2Div ensemble, and aggregate their votes. For pairwise diversity measurement among the components of the ensemble, Yule's Q-statistic [18] is used. Minku et al. [50] used it for pairwise diversity measurements in online ensemble learning. The Q-statistic is measured between all the pairs, and the most diverse pair is chosen. For two classifiers CS_r and CS_s, the Q-statistic is defined as below, where N^{ab} is the number of instances in the instance window for which CS_r predicts a and CS_s predicts b:

$$Q_{r,s} = \frac{N^{11} N^{00} - N^{01} N^{10}}{N^{11} N^{00} + N^{01} N^{10}}$$

• Scenario 2. We train a hybrid of two different algorithms as component classifiers, a potentially diverse ensemble. For this, one instance of the Hoeffding Tree (HT) and one of the Naive Bayes (NB) [45] algorithms are exploited; both are trained on the same instances of the data stream, without input randomization. We call this the Hyb-HTNB ensemble.

The ensemble size for both of these scenarios is two. For each instance, vote aggregation in both scenarios is done using our geometric weighting framework. An instance window of the 100 latest incoming instances is kept, and the weights are calculated using Eq. 8 and Eq. 9, i.e., wΛ = γ.
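The following helper (our sketch; the oracle-output encoding is an assumption consistent with [18]) computes Yule's Q-statistic from the correct/incorrect records of two classifiers over the instance window, and picks the most diverse pair. Values of Q near 1 indicate dependent classifiers, while values near 0 or negative indicate diversity:

```python
import numpy as np
from itertools import combinations

def q_statistic(correct_r: np.ndarray, correct_s: np.ndarray) -> float:
    """Yule's Q-statistic from oracle outputs (1 = correct, 0 = wrong)."""
    n11 = np.sum((correct_r == 1) & (correct_s == 1))
    n00 = np.sum((correct_r == 0) & (correct_s == 0))
    n10 = np.sum((correct_r == 1) & (correct_s == 0))
    n01 = np.sum((correct_r == 0) & (correct_s == 1))
    denom = n11 * n00 + n01 * n10
    return (n11 * n00 - n01 * n10) / denom if denom else 0.0

def most_diverse_pair(oracle: np.ndarray):
    """oracle: (m, n) 0/1 matrix of per-instance correctness for m classifiers.
    Returns the index pair with the lowest Q, i.e. the most diverse pair."""
    return min(combinations(range(len(oracle)), 2),
               key=lambda rs: q_statistic(oracle[rs[0]], oracle[rs[1]]))
```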
Dataset. We examine our experiments using three real-world and three synthetic data streams, all with two class labels. For the real-world datasets, we use exactly the same real-world datasets with two class labels as in the previous experiments; see Table II. For the synthetic datasets, we generate 500,000 instances with each of the RBF, SEA, and HYP stream generators from the MOA framework [43]. All the settings of these generators are set to the defaults, except for the number of class labels, which is two.

Results. Table III shows the prediction accuracy of the different ensemble sizes and experimental scenarios for the examined datasets. The highest accuracy for each dataset is in bold. We can see that the ensemble size and the component selection have a crucial impact on the accuracy of prediction.
TABLE III
CLASSIFICATION ACCURACY IN PERCENTAGE (%). THE HIGHEST ACCURACY FOR EACH DATASET IS IN BOLD.

Dataset         | LevBag-2 | LevBag-4 | LevBag-8 | LevBag-16 | LevBag-32 | LevBag-64 | LevBag-128 | Sel2Div (2) | Hyb-HTNB (2)
Airlines        | 85.955   | 86.747   | 87.455   | 88.136    | 88.527    | 89.516    | 90.638     | 88.430      | 86.392
ClickPrediction | 95.395   | 95.516   | 95.515   | 95.531    | 95.532    | 95.524    | 95.525     | 95.520      | 95.524
Electricity     | 83.481   | 84.299   | 84.172   | 84.842    | 84.332    | 84.797    | 84.906     | 85.406      | 82.538
RBF             | 78.391   | 79.210   | 79.750   | 79.789    | 80.065    | 80.321    | 80.339     | 78.575      | 77.483
SEA             | 86.011   | 86.354   | 86.070   | 86.246    | 85.387    | 84.976    | 84.872     | 86.166      | 84.650
HYP             | 87.620   | 87.957   | 88.839   | 88.388    | 88.292    | 88.176    | 88.313     | 87.957      | 88.283

To differentiate the significance of the differences in accuracy values, we exploited the non-parametric Friedman statistical test, with α = 0.05 and F(8, 40). The null hypothesis for this statistical test claims that there is no statistically significant difference among all the examined ensembles, in terms of accuracy. The resulting two-tailed probability value, P = 0.002, rejects the null hypothesis and shows that the differences are significant.
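A Friedman test of this kind can be reproduced with standard tooling. The sketch below (ours, not from the paper) uses scipy.stats.friedmanchisquare, which ranks the configurations within each dataset and tests whether the mean ranks differ; for brevity it feeds in only four of the nine columns of Table III, whereas the paper's test uses all nine:

```python
from scipy.stats import friedmanchisquare

# Accuracy of each ensemble configuration across the six datasets (Table III).
levbag_2   = [85.955, 95.395, 83.481, 78.391, 86.011, 87.620]
levbag_128 = [90.638, 95.525, 84.906, 80.339, 84.872, 88.313]
sel2div    = [88.430, 95.520, 85.406, 78.575, 86.166, 87.957]
hyb_htnb   = [86.392, 95.524, 82.538, 77.483, 84.650, 88.283]

stat, p_value = friedmanchisquare(levbag_2, levbag_128, sel2div, hyb_htnb)
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4f}")
# A small p rejects the null hypothesis of equal mean ranks; a post-hoc
# multiple-comparison test (as in Table IV) then locates the differences.
```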
The Friedman multiple pairwise comparisons are conducted and presented in Table IV. It can be seen that there is no significant difference among the LevBag-8, LevBag-16, LevBag-32, LevBag-64, LevBag-128, and Sel2Div ensembles. Given that all are trained using the same component classifier, the impact of this result is important: only 2 base classifiers can be comparably good with 128 of them, when they are trained in a diverse enough fashion and weighted optimally.

TABLE IV
THE MULTIPLE COMPARISONS FOR THE FRIEDMAN STATISTICAL TEST RESULTS. THE MINIMUM REQUIRED DIFFERENCE OF MEAN RANK IS 2.635. A HIGHER MEAN RANK MEANS BETTER PERFORMANCE.

Idx | Ensemble   | Mean Rank | Different From (P < 0.05)
(1) | LevBag-2   | 2.000     | (3)(4)(5)(6)(7)(8)
(2) | LevBag-4   | 4.250     | (4)(7)
(3) | LevBag-8   | 4.833     | (1)
(4) | LevBag-16  | 7.000     | (1)(2)(9)
(5) | LevBag-32  | 6.333     | (1)(9)
(6) | LevBag-64  | 5.750     | (1)(9)
(7) | LevBag-128 | 7.000     | (1)(2)(9)
(8) | Sel2Div    | 5.250     | (1)(9)
(9) | Hyb-HTNB   | 2.583     | (4)(5)(6)(7)(8)

On the other hand, the Hyb-HTNB ensemble performs equivalently as well as LevBag-2, LevBag-4, and LevBag-8, according to the statistical significance tests. Hyb-HTNB is a naturally diverse ensemble; we included it in our experiment to show the impact of diversity on prediction accuracy. Since NB is a weak classifier compared to HT, it is reasonable that Hyb-HTNB does not perform as well as the Sel2Div ensemble.

V. CONCLUSION

In this paper, we studied the impact of ensemble size using a geometric framework. The entire decision making process through voting is adapted to a spatial environment, and weighting combination rules, including majority voting, are considered for providing better insight. The focus of the study is on online ensembles; however, nothing prevents us from using the proposed model on batch ensembles.

The ensemble size is crucial for online environments, due to the dimensionality growth of data. We discussed the effect of ensemble size with the majority voting and optimally weighted voting aggregation rules. The highly important outcome is that we do not need to train a near-infinite number of components to have a good ensemble.

We delivered a framework which heightens the understanding of diversity, and explains why diversity contributes to the accuracy of predictions.

Our experimental evaluations showed the practical value of our theorems, and highlighted the existing challenges. Practical imperfections across different algorithms and different learning complexities on our various datasets prevent us from clearly showing that m = p and diversity are the core decisions to be used in ensemble design. However, the experimental results show that the number of class labels has an important effect on the ensemble size. For example, in 7 out of 8 real-world datasets, the peak performances are closer to the ideal m = p point rather than being closer to the largest ensemble size.

As future work, we aim to define some diversity measures based on this framework, while also studying the coefficient matrix specifications.

REFERENCES

[1] H. Bonab and F. Can, "A theoretical framework on the ideal number of classifiers for online ensembles in data streams," in Proc. of Conference on Information and Knowledge Management (CIKM). ACM, 2016, pp. 2053–2056.
[2] F. Can, "Incremental clustering for dynamic information processing," ACM Transactions on Information Systems (TOIS), vol. 11, no. 2, pp. 143–164, 1993.
[3] T. G. Dietterich, "Ensemble methods in machine learning," in Multiple Classifier Systems (MCS). Springer, 2000, pp. 1–15.
[4] J. Gama, Knowledge Discovery from Data Streams. CRC Press, 2010.
[5] S. Wang, L. L. Minku, and X. Yao, "Resampling-based ensemble methods for online class imbalance learning," IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 27, no. 5, pp. 1356–1368, 2015.
[6] B. Krawczyk, L. L. Minku, J. Gama, J. Stefanowski, and M. Woźniak, "Ensemble learning for data stream analysis: A survey," Information Fusion, vol. 37, pp. 132–156, 2017.
[7] L. Rokach, "Ensemble-based classifiers," Artificial Intelligence Review, vol. 33, no. 1, pp. 1–39, 2010.
[8] L.-W. Chan, "Weighted least square ensemble networks," in International Joint Conference on Neural Networks (IJCNN), vol. 2. IEEE, 1999, pp. 1393–1396.
[9] P. Latinne, O. Debeir, and C. Decaestecker, "Limiting the number of trees in random forests," in Multiple Classifier Systems (MCS). Springer, 2001, pp. 178–187.
[10] X. Hu, "Using rough sets theory and database operations to construct a good ensemble of classifiers for data mining applications," in Proc. of International Conference on Data Mining (ICDM). IEEE, 2001, pp. 233–240.
[11] T. M. Oshiro, P. S. Perez, and J. A. Baranauskas, "How many trees in a random forest?" in Machine Learning and Data Mining in Pattern Recognition (MLDM). Springer, 2012, pp. 154–168.
[12] D. Hernández-Lobato, G. Martínez-Muñoz, and A. Suárez, "How large should ensembles of classifiers be?" Pattern Recognition, vol. 46, no. 5, pp. 1323–1336, 2013.
[13] H. M. Gomes, J. P. Barddal, F. Enembreck, and A. Bifet, "A survey on ensemble learning for data stream classification," ACM Computing Surveys, vol. 50, no. 2, pp. 23:1–23:36, Mar. 2017.
[14] G. Tsoumakas, I. Partalas, and I. Vlahavas, "A taxonomy and short review of ensemble selection," in Supervised and Unsupervised Ensemble Methods and Their Applications, 2008, pp. 41–46.
[15] J. B. Gomes, M. M. Gaber, P. A. Sousa, and E. Menasalvas, "Mining recurring concepts in a dynamic feature space," IEEE Transactions on Neural Networks and Learning Systems (TNNLS), vol. 25, no. 1, pp. 95–110, 2014.
[16] K. Jackowski, "New diversity measure for data stream classification ensembles," Engineering Applications of Artificial Intelligence, vol. 74, pp. 23–34, 2018.
[17] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms. John Wiley & Sons, 2004.
[18] L. I. Kuncheva and C. J. Whitaker, "Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy," Machine Learning, vol. 51, no. 2, pp. 181–207, 2003.
[19] H. Liu, A. Mandvikar, and J. Mody, "An empirical study of building compact ensembles," in Web-Age Information Management (WAIM). Springer, 2004, pp. 622–627.
[20] G. Brown, J. Wyatt, R. Harris, and X. Yao, "Diversity creation methods: A survey and categorisation," Information Fusion, vol. 6, no. 1, pp. 5–20, 2005.
[21] P. C. Hansen, V. Pereyra, and G. Scherer, Least Squares Data Fitting with Applications. Johns Hopkins University Press, 2013.
[22] Z.-H. Zhou, J. Wu, and W. Tang, "Ensembling neural networks: Many could be better than all," Artificial Intelligence, vol. 137, no. 1-2, pp. 239–263, 2002.
[23] A. Ulaş, M. Semerci, O. T. Yıldız, and E. Alpaydın, "Incremental construction of classifier and discriminant ensembles," Information Sciences, vol. 179, no. 9, pp. 1298–1318, 2009.
[24] L. Rokach, "Collective-agreement-based pruning of ensembles," Computational Statistics & Data Analysis, vol. 53, no. 4, pp. 1015–1026, 2009.
[25] D. D. Margineantu and T. G. Dietterich, "Pruning adaptive boosting," in International Conference on Machine Learning (ICML), vol. 97, 1997, pp. 211–218.
[26] C. Toraman and F. Can, "Squeezing the ensemble pruning: Faster and more accurate categorization for news portals," in European Conference on Information Retrieval (ECIR). Springer, 2012, pp. 508–511.
[27] R. Elwell and R. Polikar, "Incremental learning of concept drift in nonstationary environments," IEEE Transactions on Neural Networks and Learning Systems (TNNLS), vol. 22, no. 10, pp. 1517–1531, 2011.
[28] T. Windeatt and C. Zor, "Ensemble pruning using spectral coefficients," IEEE Transactions on Neural Networks and Learning Systems (TNNLS), vol. 24, no. 4, pp. 673–678, 2013.
[29] K. Tumer and J. Ghosh, "Analysis of decision boundaries in linearly combined neural classifiers," Pattern Recognition, vol. 29, no. 2, pp. 341–348, 1996.
[30] E. Bauer and R. Kohavi, "An empirical comparison of voting classification algorithms: Bagging, boosting, and variants," Machine Learning, vol. 36, no. 1-2, pp. 105–139, 1999.
[31] G. Fumera, F. Roli, and A. Serrau, "A theoretical analysis of bagging as a linear combination of classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 30, no. 7, pp. 1293–1299, 2008.
[32] G. Fumera and F. Roli, "A theoretical and experimental analysis of linear combiners for multiple classifier systems," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 27, no. 6, pp. 942–956, 2005.
[33] L. Pietruczuk, L. Rutkowski, M. Jaworski, and P. Duda, "A method for automatic adjustment of ensemble size in stream data mining," in International Joint Conference on Neural Networks (IJCNN). IEEE, 2016, pp. 9–15.
[34] L. Pietruczuk, L. Rutkowski, M. Jaworski, and P. Duda, "How to adjust an ensemble size in stream data mining?" Information Sciences, vol. 381, pp. 46–54, 2017.
[35] H. Bonab and F. Can, "GOOWE: Geometrically optimum and online-weighted ensemble classifier for evolving data streams," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 12, no. 2, pp. 25:1–25:33, Mar. 2018.
[36] J. Z. Kolter and M. A. Maloof, "Dynamic weighted majority: An ensemble method for drifting concepts," Journal of Machine Learning Research (JMLR), vol. 8, pp. 2755–2790, 2007.
[37] J. Z. Kolter and M. A. Maloof, "Dynamic weighted majority: A new ensemble method for tracking concept drift," in Proc. of International Conference on Data Mining (ICDM). IEEE, 2003, pp. 123–130.
[38] D. J. Miller and L. Yan, "Critic-driven ensemble classification," IEEE Transactions on Signal Processing, vol. 47, no. 10, pp. 2833–2844, 1999.
[39] S. Wu and F. Crestani, "A geometric framework for data fusion in information retrieval," Information Systems, vol. 50, pp. 20–35, 2015.
[40] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Courier Corporation, 1964, vol. 55.
[41] K. Tumer and J. Ghosh, "Error correlation and error reduction in ensemble classifiers," Connection Science, vol. 8, no. 3-4, pp. 385–404, 1996.
[42] H. Wang, W. Fan, P. S. Yu, and J. Han, "Mining concept-drifting data streams using ensemble classifiers," in Proc. of International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 2003, pp. 226–235.
[43] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, "MOA: Massive online analysis," Journal of Machine Learning Research (JMLR), vol. 11, pp. 1601–1604, 2010.
[44] P. M. Domingos and G. Hulten, "Mining high-speed data streams," in Proc. of International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 2000, pp. 71–80.
[45] A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and R. Gavaldà, "New ensemble methods for evolving data streams," in Proc. of International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2009, pp. 139–148.
[46] R. Rifkin and A. Klautau, "In defense of one-vs-all classification," Journal of Machine Learning Research (JMLR), vol. 5, pp. 101–141, 2004.
[47] A. Bifet, G. Holmes, and B. Pfahringer, "Leveraging bagging for evolving data streams," in Proc. of European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD), 2010, pp. 135–150.
[48] N. C. Oza, "Online ensemble learning," Ph.D. dissertation, Computer Science Division, Univ. California, Berkeley, CA, USA, Sep. 2001.
[49] N. C. Oza and S. Russell, "Experimental comparisons of online and batch versions of bagging and boosting," in Proc. of International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 2001, pp. 359–364.
[50] L. L. Minku, A. P. White, and X. Yao, "The impact of diversity on online ensemble learning in the presence of concept drift," IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 22, no. 5, pp. 730–742, 2010.

Hamed Bonab received BS and MS degrees in computer engineering from the Iran University of Science and Technology (IUST), Tehran, Iran, and Bilkent University, Ankara, Turkey, respectively. Currently, he is a PhD student in the College of Information and Computer Sciences at the University of Massachusetts Amherst, MA. His research interests broadly include stream processing, data mining, machine learning, and information retrieval.

Fazli Can received a PhD degree (1985) in computer engineering from the Middle East Technical University (METU), Ankara, Turkey. His BS and MS degrees, respectively on electrical & electronics and computer engineering, are also from METU. He conducted his PhD research under the supervision of Prof. Esen Ozkarahan, at Arizona State University (ASU) and Intel, AZ, as a part of the RAP database machine project. He is a faculty member at Bilkent University, Ankara, Turkey. Before joining Bilkent he was a tenured full professor at Miami University, Oxford, OH. He co-edited ACM SIGIR Forum (1995-2002) and is a co-founder of the Bilkent Information Retrieval Group. His research interests are information retrieval and data mining. His interest in dynamic information processing dates back to his 1993 incremental clustering paper in ACM TOIS.
