arXiv:1709.02925v2 [cs.LG] 29 Sep 2018

Abstract—The number of component classifiers chosen for an ensemble greatly impacts its prediction ability. In this paper, we use a geometric framework for a priori determining the ensemble size, which is applicable to most existing batch and online ensemble classifiers. There are only a limited number of studies on ensemble size examining Majority Voting (MV) and Weighted Majority Voting (WMV). Almost all of them are designed for batch mode, hardly addressing online environments. Big data dimensions and resource limitations, in terms of time and memory, make the determination of ensemble size crucial, especially for online environments. For the MV aggregation rule, our framework proves that the more strong components we add to the ensemble, the more accurate predictions we can achieve. For the WMV aggregation rule, our framework proves the existence of an ideal number of components, which is equal to the number of class labels, with the premise that the components are completely independent of each other and strong enough. While giving an exact definition of a strong and independent classifier in the context of an ensemble is a challenging task, our proposed geometric framework provides a theoretical explanation of diversity and its impact on the accuracy of predictions. We conduct a series of experimental evaluations to show the practical value of our theorems and the existing challenges.

Index Terms—Supervised learning, ensemble size, ensemble cardinality, voting framework, big data, data stream, law of diminishing returns

I. INTRODUCTION

Over the last few years, the design of learning systems for mining the data generated from real-world problems has encountered new challenges, such as the high dimensionality of big data, as well as growth in volume, variety, velocity, and veracity—the four V's of big data1. In the context of data dimensions, volume is the amount of data, variety is the number of types of data, velocity is the speed of data, and veracity is the uncertainty of the data generated in real-world applications and processed by the learning algorithm. The dynamic information processing and incremental adaptation of learning systems to temporal changes have been among the most demanding tasks in the literature for a long time [2].

Ensemble classifiers are among the most successful and well-known solutions to supervised learning problems, particularly for online environments [3], [4], [5]. The main idea is to construct a collection of individual classifiers, even with weak learners, and combine their votes. The aim is to build a stronger classifier, compared to each individual component classifier [6]. The training mechanism of the components and the vote aggregation method mostly characterize an ensemble classifier [7].

There are two main categories of vote aggregation methods for combining the votes of component classifiers: weighting methods and meta-learning methods [7], [8]. Weighting methods assign a combining weight to each component and aggregate their votes based on these weights (e.g. Majority Voting, Performance Weighting, Bayesian Combination). They are useful when the individual classifiers perform the same task and have comparable success. Meta-learning methods refer to learning from the classifiers and from the classifications of these classifiers on training data (e.g. Stacking, Arbiter Trees, Grading). They are best suited for situations where certain classifiers consistently misclassify or correctly classify certain instances [7]. In this paper we study ensembles with the weighting combination rule. Meta-learning methods are out of the scope of this paper.

An important aspect of ensemble methods is to determine how many component classifiers should be included in the final ensemble, known as the ensemble size or ensemble cardinality [1], [7], [9], [10], [11], [12], [13]. The impact of ensemble size on efficiency, in terms of time and memory, and on predictive performance makes its determination an important problem [14], [15]. Efficiency is especially important for online environments. In this paper, we extend our geometric framework [1] for pre-determining the ensemble size, applicable to both batch and online ensembles.

Furthermore, diversity among component classifiers is an influential factor for having an accurate ensemble [7], [16], [17], [18]. Liu et al. [19] empirically studied the impact of ensemble size on diversity. Hu [19] explained that component diversity leads to uncorrelated votes, which in turn improves predictive performance. However, to the best of our knowledge, there is no explanatory theory revealing how and why diversity among components contributes toward overall ensemble accuracy [20]. Our proposed geometric framework introduces a theoretical explanation for the understanding of diversity in the context of ensemble classifiers.

The main contributions of this study are the following. We
• Present a brief, comprehensive review of existing approaches for determining the number of component classifiers of ensembles,
• Provide a spatial modeling for ensembles and use the linear least squares (LSQ) solution [21] for optimizing the weights of the components of an ensemble classifier, applicable to both online and batch ensembles,
• Exploit the geometric framework, for the first time in the literature, for a priori determining the number of component classifiers of an ensemble,

This is an extended version of the work presented as a short paper at the Conference on Information and Knowledge Management (CIKM), 2016 [1]. Manuscript submitted July 2018.
1 http://www.ibmbigdatahub.com/infographic/four-vs-big-data
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 14, NO. 8, JULY 2018
[Fig. 1: score-polytope and ideal-point for class labels C = {C1, C2, C3}. For a given instance It, the true-class label is C2. The ideal-point would be o_t = <0, 1, 0>.]

One could presumably define many different algebraic rules for vote aggregation [36], [37]—minimum, maximum, sum, mean, product, median, etc. While these vote aggregation rules can be expressed using our geometric framework, we study the Majority Voting (MV) and Weighted Majority Voting (WMV) aggregation rules in this paper. In addition, individual vote scores can be aggregated based on two different voting schemes [38]: (a) Hard Voting, where the score-vector of a component classifier is first transformed into a one-hot vector, possibly using a hard-max function, and then combined; (b) Soft Voting, where the score-vector is used directly for vote aggregation. We use soft voting in our framework.

The Euclidean norm is used as the loss function, loss(·, ·), for optimization purposes [21]. The Euclidean distance between any score-vector and the ideal-point expresses the effectiveness of that component for the given instance. Using aggregation rules, we aim to define a mapping function from a score-polytope into a single vector, and measure the effectiveness of our ensemble. Wu and Crestani [39] applied a similar geometric framework to data fusion in information retrieval systems. In this study, some of our theorems are obtained and adapted to ensemble learning from their framework.

A. Majority Voting (MV)

The mapping of a given score-polytope into its centroid can be expressed as the MV aggregation—plurality voting or averaging. For a given instance, It, we have the following mapping to the centroid-point, a_t = <A_t^1, A_t^2, ..., A_t^p>.

A_t^k = \frac{1}{m} \sum_{j=1}^{m} S_{tj}^k, \quad (1 \le k \le p)   (1)

Using Eq. 1 and the loss definition, Eq. 2 can be obtained.

Discussion. Theorem 1 shows that the performance of an ensemble with the MV aggregation rule is at least equal to the average performance of all individual components.

Theorem 2. For It, let ξ_l = ξ − {CS_l} (1 ≤ l ≤ m) be a subset of ensemble ξ without CS_l. Each ξ_l has a_{tl} as its centroid-point. We have

loss(a_t, o_t) \le \frac{1}{m} \sum_{l=1}^{m} loss(a_{tl}, o_t)   (3)

Proof. a_t is the centroid-point of all a_{tl} points, according to the definition. Assume each ξ_l is an individual classifier with score-vector a_{tl}. Applying Theorem 1 to every ξ_l yields Eq. 3 directly.

Discussion. Theorem 2 can be generalized for any subset definition with (m − f) components, 1 ≤ f ≤ (m − 2). This shows that an ensemble with m components performs better than (or at least equal to) the average performance of ensembles with m − f components. It can be concluded that better performance can be achieved if we aggregate more component classifiers. However, if we keep adding poor components to the ensemble, overall prediction accuracy can diminish by increasing the upper bound in Eq. 2. This is in agreement with the result of the Bayes error reduction analysis [41], [42]. Setting a threshold, as expressed in [29], [33], [34], [42], can give us the ideal number of components for a specific problem.
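The centroid mapping of Eq. 1 and the averaging bound of Theorem 1 can be checked numerically. The following is a minimal numpy sketch; the score matrix S_t holds invented values for illustration, not data from our experiments:

```python
import numpy as np

# Hypothetical score-polytope for one instance: m = 3 components, p = 3
# class labels; each row is a component's normalized score-vector.
S_t = np.array([[0.6, 0.3, 0.1],
                [0.2, 0.7, 0.1],
                [0.1, 0.6, 0.3]])
o_t = np.array([0.0, 1.0, 0.0])        # ideal-point (true class is C2)

a_t = S_t.mean(axis=0)                 # centroid-point, Eq. 1 (soft MV)

ens_loss = np.linalg.norm(a_t - o_t)   # Euclidean loss of the ensemble
avg_loss = np.linalg.norm(S_t - o_t, axis=1).mean()

# Theorem 1 (a consequence of the triangle inequality on the norm):
# the centroid's loss never exceeds the average loss of the components.
assert ens_loss <= avg_loss + 1e-12
```

The inequality holds for any score matrix, since the Euclidean norm of an average of vectors is bounded by the average of their norms.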
B. Weighted Majority Voting (WMV)

For this aggregation rule, a weight vector w = <W_1, W_2, ..., W_m> for the components of the ensemble is defined, with W_j ≥ 0 and Σ_j W_j = 1 for 1 ≤ j ≤ m. For a given instance, It, we have the following mapping to the weighted-centroid-point, b_t = <B_t^1, B_t^2, ..., B_t^p>.

B_t^k = \sum_{j=1}^{m} W_j S_{tj}^k, \quad (1 \le k \le p)   (4)

Note that giving equal weights to all the components results in the MV aggregation rule. WMV presents a flexible aggregation rule. No matter how poor a component classifier is, with a proper weight vector we can cancel its effect on the aggregated results. However, as discussed earlier, this is not true for the MV rule. In the following, we give the formal definition of the optimum weight vector, which we aim to find.

Definition 1. Optimum Weight Vector. For an ensemble, ξ, and a given instance, It, the weight vector w_o with weighted-centroid-point b_o is the optimum weight vector if, for any w_x with weighted-centroid-point b_x, the following holds: loss(b_o, o_t) ≤ loss(b_x, o_t).

Theorem 3. For a given instance, It, let w_o be the optimum weight vector with weighted-centroid-point b_t. The following must hold:

loss(b_t, o_t) \le \min\{loss(s_{t1}, o_t), \ldots, loss(s_{tm}, o_t)\}   (5)

Proof. Assume that the least loss belongs to component j among the m score-vectors. We have the following two cases. (a) Maintaining the performance: simply giving a weight of 1 to component j and 0 to the remaining components results in the equality case, loss(b_t, o_t) = loss(s_{tj}, o_t). (b) Improving the performance: using a linear combination of j and the other components with proper weights results in a weighted-centroid-point closer to the ideal-point. We can always find such a weight vector in the Euclidean space if the other components are not exactly the same as j.

Using the squared Euclidean norm as the measure of closeness for the linear least squares problem (LSQ) [21] results in

\min_{w} \|o - wS\|_2^2   (6)

where, for each instance Ii in the instance window, S ∈ R^{m×p} is the matrix whose rows are the score-vectors s_{ij} of the component classifiers j, w ∈ R^m is the vector of weights to be determined, and o ∈ R^p is the vector of the ideal-point. We use the following function for our optimization solution.

f(W_1, W_2, \ldots, W_m) = \sum_{k=1}^{p} \Big( \sum_{j=1}^{m} (W_j S_{ij}^k) - O_i^k \Big)^2   (7)

Taking the partial derivative with respect to each W_q (1 ≤ q ≤ m), setting the gradient to zero, ∇f = 0, and finding the optimum points gives us the optimum weight vector. Denoting the following summations by λ_{qj} and γ_q,

λ_{qj} = \sum_{k=1}^{p} S_{iq}^k S_{ij}^k, \quad (1 \le q, j \le m)   (8)

γ_q = \sum_{k=1}^{p} O_i^k S_{iq}^k, \quad (1 \le q \le m)   (9)

leads to m linear equations in m variables (the weights). Briefly, wΛ = γ, where Λ ∈ R^{m×m} is the coefficient matrix and γ ∈ R^m is the remainder vector—using Eq. 8 and Eq. 9, respectively. The proper weights in this matrix equation are our intended optimum weight vector. For a more detailed explanation see [35].

Discussion. According to Eq. 8, Λ is a symmetric square matrix. If Λ has full rank, our problem has a unique solution. On the other hand, in the sense of a least squares solution [21], it is probable that Λ is rank-deficient, and we may not have a unique solution. Studying the properties of this matrix leads us to the following theorem.

Theorem 4. If the number of component classifiers is not equal to the number of class labels, m ≠ p, then the coefficient matrix is rank-deficient, det Λ = 0.

Proof. Since we have p dimensions in our Euclidean space, p independent score-vectors are needed for the basis spanning set. Any number of vectors, m, greater than p is dependent on the basis spanning set, and any number of vectors, m, less than p is insufficient for constructing the basis spanning set.

Discussion. The above theorem excludes some cases in which we cannot find optimum weights for aggregating votes. There are several numerical solutions for solving rank-deficient least-squares problems (e.g. QR factorization, Singular Value Decomposition (SVD)); however, the resulting solution is relatively expensive, may not be unique, and its optimality is not guaranteed. Theorem 4's outcome is that the number of independent components of an ensemble is crucial for providing a full-rank coefficient matrix, with the aim of an optimal weight vector solution.

C. Diversity among components

Theorem 4 shows that for weight vector optimality, m = p should be true. However, the reverse cannot be guaranteed in general. Assuming m = p and letting det Λ = 0 for the parametric coefficient matrix results in some conditions where we have vote agreement, and cannot find a unique optimum weight vector. As an example, suppose we have two component classifiers for a binary classification task, m = p = 2. Letting det Λ = 0 results in the following equations: S_{11}^1 + S_{12}^2 = 1 or S_{11}^2 + S_{12}^1 = 1, meaning the agreement of the component classifiers—i.e. the exact same vote vectors. More specifically, this suggests another condition for weight vector optimality: the importance of diversity among the component classifiers.

Fig. 2 presents four mainly different score-vector possibilities for an ensemble of size three. The true class label of the examined instance is C1 for a binary classification problem. All score-vectors are normalized and placed on the main diagonal of the spatial environment. The counter-diagonal line divides the decision boundary for class label determination based on the probability values. If a component's score-vector is in the lower triangular part, it is classified as C1; similarly, if it is in the upper-left triangular part, it is classified as C2. Fig. 2 (a) and (b) show the misclassification and true classification situations, respectively. Fig. 2 (c) and (d) show disagreement among components of the ensemble.

[Fig. 2: four score-vector possibilities, panels (a)–(d), for an ensemble of size three on a binary classification problem.]

If, for several instances in a sequence of data, the score-vectors of two components are equal (or act predictably similarly), they are considered dependent components. There are several measurements for quantifying this dependency for ensemble classifiers (e.g. the Q-statistic) [18]. However, most of the measurements used in practice rely on the oracle output of the components (i.e. only the predicted class labels) [18]. Our geometric framework shows a potential importance of using score-vectors for diversity measurements. It is out of the scope of this study to propose a diversity measurement using score-vectors, and we leave it as future work.

To the best of our knowledge, there is no explanatory theory in the literature revealing why and how diversity among components contributes toward overall ensemble accuracy [20], [16]. Our geometric modeling of the ensemble's score-vectors and the optimum weight vector solution provide theoretical insight for the commonly agreed upon idea that "the classifiers should be different from each other, otherwise the overall decision will not be better than the individual decisions" [18]. Optimum weights can be reached when we have the same number of independent and diverse component classifiers as class labels. Diversity has a great impact on the coefficient matrix, which consequently impacts the accuracy of an ensemble's predictions. For the case of majority voting, adding more dependent classifiers will dominate the decisions of the other components.

Discussion. Our geometric framework supports the idea that there is an ideal number of component classifiers for an ensemble, with which we can reach the most accurate results. Increasing or decreasing the number of classifiers from this ideal point may deteriorate predictions, or bring no gain to the overall performance of the ensemble. Having more components than the ideal number can mislead the aggregation rule, especially for majority voting. On the other hand, having fewer is insufficient for constructing an ensemble which is stronger than the single classifier. We refer to this situation as "the law of diminishing returns in ensemble construction."

Our framework suggests the number of class labels of a dataset as the ideal number of component classifiers, with the premise that they generate independent scores and are aggregated with optimum weights. However, real-world datasets and existing ensemble classifiers do not guarantee this premise most of the time. Determining the exact value of this ideal point for a given ensemble classifier, over real-world data, is still a challenging problem due to the different complexities of datasets.

IV. EXPERIMENTAL EVALUATION

The experiments conducted in [1] showed that for ensembles trained on a specific dataset, there is an ideal number of components beyond which having more will deteriorate, or at least provide no benefit to, our prediction ability. Our extensive experiments in [35] show the practical value of this geometric framework for aggregating votes.

Here, through a series of experiments, we first investigate the impact of the number of class labels and the number of component classifiers for MV and WMV using a synthetic dataset generator; then, we study the impact of miscellaneous data streams using several real-world and synthetic datasets; finally, we explore the outcome of our theorems on the diversity of component classifiers and the practical value of our study. All the experiments are implemented using the MOA framework [43], and Interleaved-Test-Then-Train is used for accuracy measurements. An instance window with a length of 500 instances is used for keeping the latest instances.

A. Impact of Number of Class Labels.

Setup. To investigate the sole impact of the number of class labels of the dataset, i.e. the p value, on the accuracy of an ensemble, we use the GOOWE [35] method. It uses our optimum weight vector calculation in a WMV procedure for vote aggregation. The Hoeffding Tree (HT) [44] is used as the component classifier, due to its high adaptivity to data stream classification. For a fair comparison, we modify GOOWE to obtain the MV aggregation rule by simply providing equal weights to the components. These two variations of GOOWE, i.e. WMV and MV, are used for this experiment. Each of these variations is trained and tested using different ensemble size values, starting from only two
components and doubling at each step—i.e. our investigated ensemble sizes, m, are 2, 4, ..., 128.

[Fig. 3: prediction accuracy of WMV and MV on the RBF-C2, RBF-C4, and RBF-C8 datasets (first row) and the RBF-C16, RBF-C32, and RBF-C64 datasets (second row), with increasing number of component classifiers, m. The equality case, m = p, is marked on each plot with a vertical dashed green line.]

Dataset. Since existing real-world datasets are not consistent in terms of classification complexity, we are only able to use synthetic data for this experiment in order to have reasonable comparisons. We choose the popular Random RBF generator, since it is capable of generating data streams with an arbitrary number of features and class labels [45]. Using this generator, implemented in the MOA framework [43], we prepare six datasets, each containing one million instances with 20 attributes, using the default parameter settings of the RBF generator. The only difference among the datasets is the number of class labels: 2, 4, 8, 16, 32, and 64. We reflect this in the dataset naming as RBF-C2, RBF-C4, ..., respectively.

Results. Fig. 3 presents the prediction accuracy of WMV and MV with increasing component counts, m, on each dataset. To mark the equality of m and p, we use a vertical dashed green line. We can make the following interesting observations: (i) A weighted aggregation rule becomes more vital with an increasing number of component classifiers, (ii) WMV performs more resiliently in multi-class problems than in binary classification problems, when compared to MV. The gap between WMV and MV seems to increase with greater numbers of class labels, (iii) There is a peak point in the accuracy value, and it is dependent on the number of class labels. This can be seen by comparing the RBF-C2, RBF-C4, and RBF-C8 plots (the first row in Fig. 3) with the RBF-C16, RBF-C32, and RBF-C64 plots (the second row in Fig. 3). In the former set we see that after a peak point, the accuracy starts to drop. However, in the latter set we see that the peak points are at m = 128, and (iv) The theoretical vertical line, i.e. the equality case m = p, seems to precede the peak point on each plot. We suspect that this might be due to Theorem 4's premise conditions: generating independent scores and aggregating with optimum weights.

B. Impact of Data Streams.

Setup. There are many factors to consider regarding the complexity of classification problems—concept drift, the number of features, etc. To this end, we investigate the number of component classifiers for WMV and MV on a wide range of datasets. We use an experimental setup similar to the previous experiments on different synthetic and real-world datasets. We aim to investigate some general patterns in more realistic problems.

Dataset. We select eight synthetic and eight real-world benchmark datasets used for stream classification problems in the literature. A summary of our datasets is given in Table II. For this selection, we aim to have a mixture of different concept drift types, numbers of features, numbers of class labels, and noise percentages. The synthetic datasets are similar to the ones used for the GOOWE evaluation [35]. For the real-world datasets, the Sensor, PowerSupply, and HyperPlane datasets are taken from2. The remaining real-world datasets are taken from3. See [1], [35] for detailed explanations of the datasets.

2 Access URL: http://www.cse.fau.edu/∼xqzhu/stream.html
3 Access URL: https://www.openml.org/
Fig. 4. Prediction behavior of the WMV and MV aggregation rules, in terms of accuracy, for miscellaneous synthetic datasets, as both the number of component classifiers, m, and the number of class labels, p, increase. The equality case, m = p, is marked on each plot using a vertical dashed green line.
Results. Fig. 4 and 5 present the prediction accuracy difference between WMV and MV for increasing component classifier counts, m, on each dataset. For marking the equality of m and p, we use a vertical dashed green line, similar to the previous experiments. As we can see, given broader types of datasets, each with completely different complexities, it is difficult to conclude strict patterns. We have the following interesting observations: (i) For almost all the datasets, WMV, with optimum weights, outperforms MV, (ii) We see the same results as in the previous experiments: there is a peak point in the accuracy value, and it is dependent on the number of class labels, (iii) The theoretical vertical line, i.e. the equality case m = p, seems to precede the peak point on each plot, and (iv) Optimum weighting seems to be more resilient in the

C. Impact of Diversity.

Setup. In order to study the impact of diversity in ensemble classifiers and show the practical value of our theorems, we design two different scenarios for the binary classification problem. We select binary classification for this experiment since the difference between WMV and MV is almost always insignificant for binary classification, compared to multi-class problems. In addition, multi-class problems can potentially be modeled as several binary classification problems [46].

To this end, we recruit a well-known, state-of-the-art online ensemble classifier, called leveraging bagging (LevBag) [47], as the base ensemble method for our comparisons. It is based on the OzaBagging ensemble [48], [49] and is proven to react well in online environments. It exploits re-sampling with
Fig. 5. Prediction behavior of the WMV and MV aggregation rules, in terms of accuracy, for miscellaneous real-world datasets, as both the number of component classifiers, m, and the number of class labels, p, increase. The equality case, m = p, is marked on each plot using a vertical dashed green line.
TABLE III
CLASSIFICATION ACCURACY IN PERCENTAGE (%)—THE HIGHEST ACCURACY FOR EACH DATASET IS BOLD

                 LevBag (ensemble size m)                               Select 2 most   Hybrid of HT
Dataset          m=2     m=4     m=8     m=16    m=32    m=64    m=128  diverse (m=2)   and NB (m=2)
Airlines         85.955  86.747  87.455  88.136  88.527  89.516  90.638    88.430          86.392
ClickPrediction  95.395  95.516  95.515  95.531  95.532  95.524  95.525    95.520          95.524
Electricity      83.481  84.299  84.172  84.842  84.332  84.797  84.906    85.406          82.538
RBF              78.391  79.210  79.750  79.789  80.065  80.321  80.339    78.575          77.483
SEA              86.011  86.354  86.070  86.246  85.387  84.976  84.872    86.166          84.650
HYP              87.620  87.957  88.839  88.388  88.292  88.176  88.313    87.957          88.283
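For concreteness, the optimum-weight computation of Section III (Eqs. 4 and 6–9), which underlies the WMV variants evaluated above, can be sketched with numpy. The score values below are invented for illustration, and the sketch solves the unconstrained normal equations Λw = γ; the non-negativity and sum-to-one constraints on w are not enforced here:

```python
import numpy as np

# Hypothetical score matrix for one instance: row j holds component j's
# normalized score-vector over p classes (values invented for illustration).
S = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])          # m = p = 3
o = np.array([0.0, 1.0, 0.0])            # ideal-point: true class is C2

Lam = S @ S.T     # Eq. 8: Lam[q, j] = sum_k S_q^k * S_j^k (symmetric)
gamma = S @ o     # Eq. 9: gamma[q] = sum_k O^k * S_q^k

# Full-rank case (m = p, independent score-vectors): unique optimum weights.
w = np.linalg.solve(Lam, gamma)
b = w @ S         # weighted-centroid-point (Eq. 4)

# Theorem 3: the weighted centroid is at least as close to the ideal-point
# as the best single component.
best_single = np.linalg.norm(S - o, axis=1).min()
assert np.linalg.norm(b - o) <= best_single + 1e-9

# Theorem 4: with m != p (here m = 4 > p = 3) the coefficient matrix
# is rank-deficient, det(Lam4) = 0.
S4 = np.vstack([S, [0.4, 0.4, 0.2]])
Lam4 = S4 @ S4.T
assert np.linalg.matrix_rank(Lam4) < 4
```

In this example, since m = p and the score-vectors are linearly independent, Λ is invertible and the weighted centroid reaches the ideal-point exactly; stacking a fourth component makes Λ singular, matching Theorem 4. A rank-deficient system would instead call for a least-squares solver such as `numpy.linalg.lstsq`.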
[13] H. M. Gomes, J. P. Barddal, F. Enembreck, and A. Bifet, "A survey on ensemble learning for data stream classification," ACM Computing Surveys, vol. 50, no. 2, pp. 23:1–23:36, Mar. 2017.
[14] G. Tsoumakas, I. Partalas, and I. Vlahavas, "A taxonomy and short review of ensemble selection," in Supervised and Unsupervised Ensemble Methods and Their Applications, 2008, pp. 41–46.
[15] J. B. Gomes, M. M. Gaber, P. A. Sousa, and E. Menasalvas, "Mining recurring concepts in a dynamic feature space," IEEE Transactions on Neural Networks and Learning Systems (TNNLS), vol. 25, no. 1, pp. 95–110, 2014.
[16] K. Jackowski, "New diversity measure for data stream classification ensembles," Engineering Applications of Artificial Intelligence, vol. 74, pp. 23–34, 2018.
[17] L. I. Kuncheva, Combining pattern classifiers: Methods and algorithms. John Wiley & Sons, 2004.
[18] L. I. Kuncheva and C. J. Whitaker, "Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy," Machine Learning, vol. 51, no. 2, pp. 181–207, 2003.
[19] H. Liu, A. Mandvikar, and J. Mody, "An empirical study of building compact ensembles," in Web-Age Information Management (WAIM). Springer, 2004, pp. 622–627.
[20] G. Brown, J. Wyatt, R. Harris, and X. Yao, "Diversity creation methods: A survey and categorisation," Information Fusion, vol. 6, no. 1, pp. 5–20, 2005.
[21] P. C. Hansen, V. Pereyra, and G. Scherer, Least squares data fitting with applications. Johns Hopkins University Press, 2013.
[22] Z.-H. Zhou, J. Wu, and W. Tang, "Ensembling neural networks: Many could be better than all," Artificial Intelligence, vol. 137, no. 1-2, pp. 239–263, 2002.
[23] A. Ulaş, M. Semerci, O. T. Yıldız, and E. Alpaydın, "Incremental construction of classifier and discriminant ensembles," Information Sciences, vol. 179, no. 9, pp. 1298–1318, 2009.
[24] L. Rokach, "Collective-agreement-based pruning of ensembles," Computational Statistics & Data Analysis, vol. 53, no. 4, pp. 1015–1026, 2009.
[25] D. D. Margineantu and T. G. Dietterich, "Pruning adaptive boosting," in International Conference on Machine Learning (ICML), vol. 97, 1997, pp. 211–218.
[26] C. Toraman and F. Can, "Squeezing the ensemble pruning: Faster and more accurate categorization for news portals," in European Conference on Information Retrieval (ECIR). Springer, 2012, pp. 508–511.
[27] R. Elwell and R. Polikar, "Incremental learning of concept drift in nonstationary environments," IEEE Transactions on Neural Networks and Learning Systems (TNNLS), vol. 22, no. 10, pp. 1517–1531, 2011.
[28] T. Windeatt and C. Zor, "Ensemble pruning using spectral coefficients," IEEE Transactions on Neural Networks and Learning Systems (TNNLS), vol. 24, no. 4, pp. 673–678, 2013.
[29] K. Tumer and J. Ghosh, "Analysis of decision boundaries in linearly combined neural classifiers," Pattern Recognition, vol. 29, no. 2, pp. 341–348, 1996.
[30] E. Bauer and R. Kohavi, "An empirical comparison of voting classification algorithms: Bagging, boosting, and variants," Machine Learning, vol. 36, no. 1-2, pp. 105–139, 1999.
[31] G. Fumera, F. Roli, and A. Serrau, "A theoretical analysis of bagging as a linear combination of classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 30, no. 7, pp. 1293–1299, 2008.
[32] G. Fumera and F. Roli, "A theoretical and experimental analysis of linear combiners for multiple classifier systems," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 27, no. 6, pp. 942–956, 2005.
[33] L. Pietruczuk, L. Rutkowski, M. Jaworski, and P. Duda, "A method for automatic adjustment of ensemble size in stream data mining," in International Joint Conference on Neural Networks (IJCNN). IEEE, 2016, pp. 9–15.
[34] ——, "How to adjust an ensemble size in stream data mining?" Information Sciences, vol. 381, pp. 46–54, 2017.
[35] H. Bonab and F. Can, "GOOWE: Geometrically optimum and online-weighted ensemble classifier for evolving data streams," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 12, no. 2, pp. 25:1–25:33, Mar. 2018.
[36] J. Z. Kolter and M. A. Maloof, "Dynamic weighted majority: An ensemble method for drifting concepts," Journal of Machine Learning Research (JMLR), vol. 8, pp. 2755–2790, 2007.
[37] ——, "Dynamic weighted majority: A new ensemble method for tracking concept drift," in Proc. of International Conference on Data Mining (ICDM). IEEE, 2003, pp. 123–130.
[38] D. J. Miller and L. Yan, "Critic-driven ensemble classification," IEEE Transactions on Signal Processing, vol. 47, no. 10, pp. 2833–2844, 1999.
[39] S. Wu and F. Crestani, "A geometric framework for data fusion in information retrieval," Information Systems, vol. 50, pp. 20–35, 2015.
[40] M. Abramowitz and I. A. Stegun, Handbook of mathematical functions with formulas, graphs, and mathematical tables. Courier Corporation, 1964, vol. 55.
[41] K. Tumer and J. Ghosh, "Error correlation and error reduction in ensemble classifiers," Connection Science, vol. 8, no. 3-4, pp. 385–404, 1996.
[42] H. Wang, W. Fan, P. S. Yu, and J. Han, "Mining concept-drifting data streams using ensemble classifiers," in Proc. of International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 2003, pp. 226–235.
[43] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, "MOA: massive online analysis," Journal of Machine Learning Research (JMLR), vol. 11, pp. 1601–1604, 2010.
[44] P. M. Domingos and G. Hulten, "Mining high-speed data streams," in Proc. of International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 2000, pp. 71–80.
[45] A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and R. Gavaldà, "New ensemble methods for evolving data streams," in Proc. of International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2009, pp. 139–148.
[46] R. Rifkin and A. Klautau, "In defense of one-vs-all classification," Journal of Machine Learning Research (JMLR), vol. 5, pp. 101–141, 2004.
[47] A. Bifet, G. Holmes, and B. Pfahringer, "Leveraging bagging for evolving data streams," in Proc. of International Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD), 2010, pp. 135–150.
[48] N. C. Oza, "Online ensemble learning," Ph.D. dissertation, Computer Science Division, Univ. California, Berkeley, CA, USA, Sep. 2001.
[49] N. C. Oza and S. Russell, "Experimental comparisons of online and batch versions of bagging and boosting," in Proc. of International Conference on Knowledge Discovery and Data Mining (SIGKDD). ACM, 2001, pp. 359–364.
[50] L. L. Minku, A. P. White, and X. Yao, "The impact of diversity on online ensemble learning in the presence of concept drift," IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 22, no. 5, pp. 730–742, 2010.

Hamed Bonab received BS and MS degrees in computer engineering from the Iran University of Science and Technology (IUST), Tehran, Iran, and Bilkent University, Ankara, Turkey, respectively. Currently, he is a PhD student in the College of Information and Computer Sciences at the University of Massachusetts Amherst, MA. His research interests broadly include stream processing, data mining, machine learning, and information retrieval.

Fazli Can received the PhD degree (1985) in computer engineering from the Middle East Technical University (METU), Ankara, Turkey. His BS and MS degrees, respectively, in electrical & electronics and computer engineering are also from METU. He conducted his PhD research under the supervision of Prof. Esen Ozkarahan, at Arizona State University (ASU) and Intel, AZ, as a part of the RAP database machine project. He is a faculty member at Bilkent University, Ankara, Turkey. Before joining Bilkent he was a tenured full professor at Miami University, Oxford, OH. He co-edited ACM SIGIR Forum (1995–2002) and is a co-founder of the Bilkent Information Retrieval Group. His research interests are information retrieval and data mining. His interest in dynamic information processing dates back to his 1993 incremental clustering paper in ACM TOIS.