You are on page 1of 21

Knowledge-Based Systems 219 (2021) 106897

Contents lists available at ScienceDirect

Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys

The ensemble of density-sensitive SVDD classifier based on maximum


soft margin for imbalanced datasets

Xinmin Tao , Wei Chen, Xiangke Li, Xiaohan Zhang, Yetong Li, Jie Guo
College of Engineering and Technology, Northeast Forestry University, Harbin, Heilongjiang 150040, China

article info a b s t r a c t

Article history: Imbalanced problems have recently attracted much attention due to their prevalence in numerous
Received 22 December 2020 domains of great importance to the data mining community. However, conventional bi-class clas-
Received in revised form 15 February 2021 sification approaches, e.g., Support vector machine (SVM), generally perform poorly on imbalanced
Accepted 23 February 2021
datasets as they are originally designed to generalize from the training data, and pay little attention
Available online 26 February 2021
to the minority class. In the paper, we extend traditional support vector domain description (SVDD)
Keywords: and propose a novel density-sensitive SVDD classifier based on maximum soft margin (DSMSM-SVDD)
Imbalanced datasets for imbalanced datasets. In the proposed approach, the relative density-based penalty weights are
Support vector machine incorporated into the optimization objective function to represent the importance of the data samples.
Support vector domain description Through optimizing the objective function with the relative density-based penalty weights, the training
Relative density majority samples with high relative densities are more likely to lie inside the hypersphere, thus
Maximum soft margin eliminating noise effects on traditional SVDD. In addition, to make full use of the minority class samples
to refine the boundary in training, the maximum soft margin regularization term is also introduced
in the proposed technique inspired by the idea of maximizing soft margin of traditional SVM. This
method allows the optimal domain description boundary to more skew toward the minority class than
traditional SVDD and thus improves the classification accuracy. Eventually, AdaBoost ensemble version
of DSMSM-SVDD is developed so as to further improve the generalization performance and stability in
dealing with imbalanced datasets. The extensive experimental results on various datasets demonstrate
that the proposed approach significantly outperforms other existing algorithms when dealing with the
imbalanced classification problems in terms of G-Mean, F-Measure and AUC performance measures.
© 2021 Elsevier B.V. All rights reserved.

1. Introduction nonlinear discrimination capability and better generalization per-


formance than those of other existing bi-class classification al-
Classification is a fundamental task of data mining and knowl- gorithms [14–16]. Like other bi-class classification approaches,
traditional SVM algorithm usually assumes that the training two
edge discovery in databases, which has been extensively applied
class data used for learning are relatively balanced. However,
in machine learning [1–3], pattern recognition [4–6], and data
in many practical applications, available training instances are
mining [7–9], etc. The construction of classification model pri-
commonly imbalanced in quantity or even quality, which is char-
marily includes two categories: bi-class classification and one-
acterized as having much more instances of certain classes than
class classification. Bi-class classification involves identifying a
others. Particularly for a bi-class application, imbalanced prob-
model that best fits the given two class training data, and hence lems are referred to as the ones in which there are a large percent
can reveal the relationship between the feature set and class of instances in one class, usually called majority class, while there
label. As a representative of bi-class classification, support vector are only a few in the other class, often called minority class. Due
machine (SVM) [10–13], which is based on the structural risk to the difficulties of data collection process limited by certain rea-
minimization of the statistical learning theory, has been widely sons, imbalanced problems often occur in a variety of real-world
regarded as one of the most favorable algorithms for its higher bi-class applications including fraud detection [17,18], medical
diagnosis [19,20], intrusion detection [21,22], and so on. When
encountered by imbalanced problems, the performance of tradi-
∗ Corresponding author.
tional SVM tends to significantly deteriorate. Concretely, SVM is
E-mail addresses: taoxinmin@nefu.edu.cn (X.M. Tao),
chenwei2311@nefu.edu.cn (W. Chen), 481027179@qq.com (X.K. Li),
more biased toward the majority class affected by imbalanced
15612731190@163.com (X.H. Zhang), 67420102@qq.com (Y.T. Li), training data, since the rules that predict the higher number of
670482386@qq.com (J. Guo). instances as the majority class during the learning process is

https://doi.org/10.1016/j.knosys.2021.106897
0950-7051/© 2021 Elsevier B.V. All rights reserved.
X.M. Tao, W. Chen, X.K. Li et al. Knowledge-Based Systems 219 (2021) 106897

However, traditional SVDD does not consider the distribution


Nomenclature information of the training data on the learned hypersphere,
SVM Support Vector Machine which may cause its failure to accurately reflect the intrinsic
CSVM Cost-sensitive SVM data description of majority class affected by noises and thus
produce poor identification performance. In addition, in some
SVDD Support Vector Data Description
real-world cases, there do exist, although very few, minority
CSVDD cost-sensitive SVDD
class instances. These valuable minority class instances are vi-
RU Random Under-sampling tally important for SVDD to generate the final hypersphere and
RO Random Over-sampling therefore, how to utilize them to guide SVDD learning is another
SMOTE Synthetic Minority Over-sampling tech- major concern in improving SVDD performance in addressing
nique imbalanced classification problems.
BSMOTE Borderline-SMOTE Borderline To solve these problems, we propose a novel density-sensitive
samples–based Synthetic Minority SVDD classifier based on maximum soft margin (DSMSM-SVDD)
Over-sampling technique in this study. On one hand, we assign different penalty weights
NN Neural network based on relative densities [26] to the training data in the pro-
CSB Cost-Sensitive Boosting Series posed approach. The relative densities are calculated by the expo-
Bagging Bootstrap sampling ensemble nentially weighted Parzen-window density estimation technique,
AdaBoost Adaptive Boosting ensemble which can effectively reflect the relative density distribution of
every instance in the feature space. Through optimizing the ob-
AdaC Cost-sensitive Adaptive boosting en-
jective function incorporated by the different penalty weights,
semble with a cost item inside the
the training majority instances with high relative densities are
exponential function
more likely to lie inside the hypersphere than those with low rel-
UCI The UC Irvine Machine Learning Repos-
ative densities, thus eliminating noise effects. On the other hand,
itory
inspired by the idea of maximizing soft margin of traditional
FN False Negative
SVM, we incorporate the maximum soft margin regularization
FP False Positive term into the optimization objective function. The incorporation
TN True Negative can enable the proposed approach to make full use of the few
TP True Positive minority class data to refine the boundary around the majority
ROSVM Random over-sampling + SVM class in training so that the optimal domain description bound-
SMOTESVM SMOTE over-sampling + SVM ary obtained by the proposed approach tends to skew toward
BSMOTESVM BSMOTE over-sampling + SVM the minority class and improves the generalization performance.
WKSMOTESVM Weighted Kernel SMOTE+SVM Eventually, Adaboost ensemble scheme using DSMSM-SVDD as a
RUSVM Random under-sampling + SVM base classifier is developed in this study so as to further improve
the generalization performance and stability.
CSO Cost-sensitive Oversampling
The main contributions of this work are summarized as fol-
SDAENN Stacked Denoising Autoencoder Neural
lows:
Network
We propose a novel ensemble of density-sensitive SVDD clas-
CSO-SDAENN Cost-sensitive oversampling+ SDAENN sifier based on maximum soft margin (DSMSM-SVDD) for imbal-
AdaSVM AdaBoost scheme with SVM as a base anced classification problems.
classifier To avoid the effect of noises or outliers on the trained hy-
AdaCSVM AdaCost scheme with SVM as a base persphere, DSMSM-SVDD algorithm first introduces the relative
classifier density-based penalty weights in SVDD model to reflect the im-
AdaBoostICSVM AdaBoost-SVM with weights adjust- portance of different training instances. The introduction of rela-
ment based on Instance Categorization tive density-based penalty weights enables the training instances
with high relative densities more likely to fall into the hyper-
sphere than those with low relative densities, which is beneficial
to produce the more appropriate classification boundary and thus
favorable to the overall classification accuracy metric adopted improves the generalization capability.
by SVM [23]. Compared to bi-class classification, which requires In order to utilize fully rare valuable minority class instances
two class instances available for training to obtain two-class sep- to refine the classification boundary, the maximum soft margin
aration interface, one-class classification only involves learning regularization term is incorporated into the optimization ob-
a description of the characteristics with respect to the majority jective function in the proposed DSMSM-SVDD approach. This
class, and hence only requires the majority instances available in incorporation of the maximum soft margin regularization term
the training data. For one unknown instance, bi-class classifiers makes optimal domain description boundary obtained by the
need to classify it by the trained two-class separation inter-
proposed approach skew toward the minority class and thus
face while one-class classifiers identify whether it belong to the
enhances the generalization performance.
majority class using the trained hypersphere. When addressing
In addition, to further improve the generalization performance
imbalanced classification issues, one-class classification algorithm
usually only depends on the majority instances easy to be ob- and enhance the robustness, we developed an ensemble algo-
tained and requires no minority instances available for training rithm based on AdaBoost scheme using DSMSM-SVDD as base
the classifier. As one of the most widely used one-class classi- classifiers. Theoretical and empirical analyses demonstrate that
fication approaches, support vector domain description method the proposed DSMSM-SVDD ensemble approach significantly out-
(SVDD) [24,25] tries to construct a hypersphere surrounding most performs the other existing algorithms when dealing with the im-
of majority class data in the feature space. A test sample is labeled balanced classification problems in terms of G-Mean, F-Measure
as major class if it is enclosed by this hypersphere and otherwise and AUC performance evaluation measures.
minority class. This character makes it suitable for the imbal- The remainder of the paper is organized as follows. In Sec-
anced classification scenario without minority instances available. tion 2, some related previous works are presented. In Section 3,
2
X.M. Tao, W. Chen, X.K. Li et al. Knowledge-Based Systems 219 (2021) 106897

we provide a brief introduction of cost-sensitive SVDD (CSVDD). instance, Zhang et al. [48] proposed a stacked denoising autoen-
The relative density-based penalty weights and maximum soft coder neural network (CSO-SDAENN) algorithm based on cost-
margin regularization term are defined in Section 4, which also sensitive oversampling by combining the cost-sensitive learning
includes the detailed description of the proposed DSMSM-SVDD with denoising autoencoder neural network (DAENN). Although
ensemble approach. In Section 5, the experimental results of some successful results are reported in the literatures, it is very
the proposed method compared to other algorithms on vari- difficult to learn and determine the value of cost in the practical
ous datasets selected from UCI repository (the UC Irvine Ma- application, such that cost-sensitive learnings are application-
chine Learning Repository) are presented. Finally, we draw some dependent classification algorithm and are only applied in a cer-
conclusions in Section 6. tain context [49]. Ensemble schemes can effectively handle the
imbalanced dataset problems by the final weighted voting from
2. Related work all classifiers learned by the incorporation of different resam-
pling strategies such as bagging [50], boosting [51], voting [52],
In order to solve the imbalanced classification issues, various and stacking [53]. Among them, boosting ensemble strategy has
solutions have been reported in the literature [27–29], which been indicated by some studies to be an effective technique
can roughly be categorized as data level and algorithm level. for improving the generalization performance of existing learn-
At the data level, the solution objective is to re-balance the ing algorithms [23]. Although various learning algorithms can
be integrated into boosting scheme, ensembles using support
data distribution by re-sampling in the data space and then use
vector machines (SVM) as a base classifier have been reported
the rebalanced datasets to train conventional bi-classifier, which
to achieve good classification performance [23]. The simplest
mainly includes under-sampling the instances of the majority
scheme of SVM ensembles is to embed SVMs into a standard
class [30,31] and over-sampling the instances of the minority
Adaboost framework which is the most popular boosting method
class [32–34], and sometimes involves the combination of the
proposed by Freund and Schapire [54]. In order to boost more
above two techniques. When combined with bi-class classifica-
weights on minority class instances, researchers attempted to
tion algorithms, especially SVM, the under-sampling techniques
make the boosting framework cost-sensitive by adjusting weights
often tend to cause the classification boundary skewing toward
of instances according to not only their classification outputs of
the majority class. This is due to the fact that under-sampling
previous classifier but also their class labels, such as AdaCost [55],
techniques only extract the subset of the majority class instances
CSB1 and CSB2 [56], and a series of AdaC [57]. Although cost-
to train the classifier, and consequently neglect the whole struc-
sensitive Adaboost schemes were reported to perform relatively
tural distribution information of the majority class. On the other
well and stably, they still require users to pre-specify misclassi-
hand, traditional over-sampling techniques re-balance training
fication costs and belong to application-dependent algorithms. In
instances only by randomly replicating the original minority class
addition, in order to avoid the bias of SVM toward the majority
instances. However, these methods do not add any new useful
class due to accuracy-oriented in addressing imbalanced datasets
information for minority class during learning, which can lead
issues, Lee et al. [58] applied cost-sensitive SVMs as weak learners
to model over-fitting. As an extended variant of traditional over-
of standard Adaboost scheme. However, such approach which
sampling technique, Synthetic Minority Over-sampling Technique only considers replacing original accuracy-oriented SVMs with
(SMOTE) [35,36] and its improved versions [37–41] re-balance cost-oriented SVMs in boosting scheme is inconsistent with the
the data distribution by creating synthetic minority class in- Adaboost strategies based on exponential loss function and thus
stances among randomly selected minority class instances. Al- achieves no significant performance improvement [23].
though it was reported that SMOTE-like versions show better From the above analysis, we can find that both data-level
performance in some imbalanced classification cases, the scope and algorithm level approaches usually seek to serve or improve
of classifier decision domain would be reduced influenced by bi-class classification algorithms. In addition to bi-class classifi-
the increase of the synthetic minority class instances, which cation, one-class classification algorithm has been attempted to
thus results in the failure of avoiding model over-fitting [42]. deal with imbalanced problems by some researchers, especially
Besides, the generation of synthetic instances is also likely to for highly skewed imbalanced datasets [59,60] since it usually
produce additional noisy overlapping samples [42], thus reducing only depends on the majority instances easy to be obtained
classification accuracy. and requires no minority instances available for training the
At the algorithm level, the solutions try to adjust the pa- classifier. Due to its special characters, one-class classification
rameters of existing classifier learning algorithms or modify the has been extensively applied in numerous real-world domains
existing classifier learning framework so as to bias toward mi- including machine fault detection [61], medical diagnosis [62],
nority class, such as the cost-sensitive learnings and ensemble image retrieval [63,64], etc. One of the most widely used one-
schemes. Cost-sensitive learnings [43–45] can reduce the over- class classification approaches, support vector domain description
all misclassification rate by assigning the large misclassification method (SVDD) [24,25] tries to construct a hypersphere sur-
cost to minority instances while the low one to majority in- rounding most of majority class data in the feature space. A
stances during classifier learning. For instance, Zhou et al. [46] test sample is labeled as major class if it is enclosed by this
proposed a cost-sensitive neural network (NN) by means of sam- hypersphere, and minority class otherwise. When dealing with
pling and threshold-moving. By combining two above techniques imbalanced problems, as the extended variant of SVM for one-
via hard or soft voting schemes, the hard-ensemble and soft- classification, SVDD does not require any minority class data
ensemble methods were also developed. In order to classify non- available since the construction of its hypersphere is only de-
stationary and imbalanced data streams, online cost-sensitive pendent on majority class data, which enables it applicable for
neural network classifiers are presented using one-layer NNs [47]. imbalanced problems especially for the extreme case without
In the proposed classifiers, two separate cost-sensitive strategies: no minority instances available [65]. However, in some real-
a fixed and an adaptive misclassification cost matrix are used world cases, there do exist, although very few, minority class
to handle class imbalance. In addition, autoencoder neural net- instances. In order to utilize these rare minority class instances
work based on deep learning, which has gained a huge success to refine the boundary of SVDD, Lazzaretti et al. [66] further put
in machine leaning domain has also been attempted by some forward a cost-sensitive SVDD (CSVDD), which incorporates the
researchers to address imbalanced classification problems. For total misclassification cost into optimization objective function,
3
X.M. Tao, W. Chen, X.K. Li et al. Knowledge-Based Systems 219 (2021) 106897

instance xj , respectively. ϕ (xi ), ϕ xj represent the image of each


( )
thus improving the classification performance of SVDD. Despite
some advantages, CSVDD still suffers from some limitations: (1) majority training instance xi and each minority training instance
It does not consider the influence of the different distribution xi in high-dimensional feature space, respectively. C = [C1 , C2 ]
of the training data on the resulting hypersphere, which can are the nonnegative penalty constant factors which control the
cause its boundary to pass through high-density region and fails trade-off between the volume of the hypersphere S and the
to reflect the intrinsic data description of majority class, thus training errors. It can be seen that if the nonnegative penalty
producing poor identification performance. (2) Despite that the constant factors are large, the misclassification rate of the training
different nonnegative penalty constant factors are employed to instances decreases correspondingly, otherwise, and some ‘negli-
adjust the sum of total misclassification cost of the two class gible’ instances tend to be omitted during training, thus enlarging
instances in CSVDD, the improvement of its generalization per- the radius of the hypersphere S. In order to deal with imbalanced
formance would not be expected especially when only a few datasets and avoid the difficulty of setting the optimal misclassifi-
minority instances are available, which is the often case in highly cation cost values, the nonnegative penalty constant for minority
skewed imbalanced datasets. This is because that the sum of total class is usually set to be C2 = IR × C1 , IR = N denotes imbalance
N
misclassification cost tends to be zero in such cases and thus
different employed nonnegative penalty constant factors would ratio such that only C1 is left to be pre-specified. By introducing
have no effect on the performance of CSVDD. In addition, CSVDD Lagrange multipliers to solve the above constrained optimization
is generally characterized by no good stability due to the little problems, the following dual formulation is obtained as:
influence of minority class on the trained model [67].
⎧ ⎫
⎨N∑
+N N +N
∑ )⎬
αi yi k (xi , xi ) − αi αj yi yj k xi , xj
(
argmax (3)
3. Cost-sensitive SVDD α
i,j=1
⎩ ⎭
i=1

In one-class classification, the objective of traditional SVDD is s.t.


to construct the best domain description of the given training N +N
target class (majority class), which always covers all or most of

αi yi = 1 (4)
target class instances, and identifies any other instances divergent
i=1
from this obtained description as outliers. Traditional SVDD uses
the one-class (majority class) training data to find the optimal 0 ≤ αi ≤ C1 , ∀i, yi = 1 (5)
hypersphere. However, in some real-world cases, there do exist, 0 ≤ αj ≤ C2 , ∀j, yj = −1 (6)
although very few, minority class instances. Although these mi-
nority class instances are too few to be used to construct a bi-class where αi , αj ≥ 0 denote the Lagrange multipliers of the major-
classifier, they can be incorporated into learning process of SVDD ity instance xi and the minority instance xj , respectively. Note
to refine the boundary around the majority class instances. There- that the training instances xi with 0 < αi < C1 and xj with
fore, as the extended variant of traditional SVDD, cost-sensitive 0 < αj < C2 are referred to support vectors (SVs) of majority
SVDD (CSVDD) improves the domain description through incor- class and minority class, respectively, which are only required
porating the minority class instances in the training procedure. training
( instances for expressing the boundary of the hypersphere
S. k xi , xj denotes the kernel function. One of well-known kernel
)
Without loss of generality, in this paper, we will focus on the 2
functions is Gaussian kernel: k xi , xj = exp(− xi − xj  /σ 2 ), σ
( ) 
kernel-based CSVDD algorithm, which has been proved more
flexible and more successful{ recently. The detailed description is is Gaussian kernel width.
provided below. Let X = (xi, yi ), i = 1, 2, . . . , N + N be a given
}
set of training instances, where xi ∈ RD represents D-dimensional 4. Density-sensitive SVDD classifier based on maximum soft
feature of the ith instance, and yi ∈ {1, −1} denotes its class margin
label. If xi belongs to the majority class, yi = 1 and otherwise
yi = −1. N , N are the number of the majority class training 4.1. Relative density-based penalty weights
instances and the minority class training instances, respectively.
Let ϕ (·) represent the map from X to F and k: X × X → R
Although CSVDD can produce a more flexible domain descrip-
be a positive definitive kernel function, and then the input data
tion boundary of the target class than traditional SVDD in dealing
X can be implicitly mapped to a feature space F usually with
with imbalanced problems, it still suffers from some drawbacks.
very high dimensionality by kernel trick k(u, v) = ⟨ϕ (u) , ϕ (v)⟩,
One is that, CSVDD does not consider the distribution of the
∀u, v ∈ X . This strategy commonly allows us to construct simple
training instances in the optimization objective function, and
models (such as linear) in the feature space, which can still
consequently would cause the resulting hypersphere incapable
exhibit complicated behaviors in the input space. The constrained
optimization objective function with the minority class instances of adequately reflecting the distribution of the target class af-
is expressed as follows: fected by the outliers. Specifically, both the nonnegative penalty
⎧ ⎫ constant factors C1 , C2 are invariable during the training proce-
⎨ ∑ ∑ ⎬ dure in CSVDD, which indicates all misclassified instances are
argmin R 2 + C1 ξi + C2 ξj (1) equally treated. This would make CSVDD sensitive to the outliers
a,R,ξi ,ξj
i,yi =1 j,yj =−1 in the training set during the learning process and thus leads
⎩ ⎭
to ‘‘over-fitting’’. In addition, like traditional SVM, the optimal
s.t.
domain description boundary of CSVDD is only determined by the
∥ϕ (xi ) − a∥2 ≤ R2 + ξi , ξi ≥ 0, ∀i = 1, 2, . . . , N small portion of training instances called the support vectors and
ϕ xj − a2 ≥ R2 − ξj , ξj ≥ 0, ∀j = 1, 2, . . . , N
 ( )  has no association with the remaining other nonsupport vectors.
(2)
However, those nonsupport vectors with higher densities usually
where a ∈ F and R > 0 are the center and the radius of the has strong impact on the determination of the optimal domain
hypersphere S, respectively; ∥·∥ denotes the Euclidean distance description during training, and therefore the assignment of high
in the kernel-defined feature space F . ξi , ξj are the slack variables penalty weight to those instances with higher densities seem to
for each majority training instance xi and each minority training be more favorable to accurately identify the optimal hypersphere.
4
X.M. Tao, W. Chen, X.K. Li et al. Knowledge-Based Systems 219 (2021) 106897

Motivated by the above ideas, we incorporate relative density- Inspired by the idea of traditional SVM, we introduce max-
based misclassification penalty weights into the optimization ob- imum soft margin regularization term to the optimization ob-
jective function of CSVDD, which can effectively reflect the im- jective function of CSVDD, which enables it to make fully use
portance of different instances in the learning hypersphere. Con- of the rare minority data to refine the classification boundary,
cretely, the proposed approach pays more attention to the in- thus improving the generalization performance when dealing
stances with higher densities by assigning high penalty weights, with imbalanced problems. Combining the above two improve-
so that those instances tend to be included in the optimal hyper- ment strategies, we present a novel SVDD which incorporates
sphere as much as possible. In contrast, the proposed approach the relative density-based penalty weights and maximum soft
assigns low penalty weights to the instances with lower densities, margin regularization term into the optimization objective func-
which makes those instances more likely to be excluded from the tion to improve the generalization performance when dealing
optimal hypersphere, thus eliminating the effect of the outliers. In with imbalanced problems, especially with the rare minority data
this study, we utilize the exponentially weighted Parzen-window available. The proposed approach is described in detail below.
Given a set of training instances (xi , yi , ρi ) , i = 1, 2, . . . ,
{
density estimation technique to calculate the relative density for
each training instance. Here, assume that X = [x1 , x2 , . . . , xN ] is N + N , where ρi is the estimated relative density of xi , which
}
a given training majority set, where N is the number of majority is calculated by exponentially weighted Parzen-window density
training instances. The relative density ρi ≥ 1 for xi is expressed estimation technique described in Section 4.1. To obtain a more
as follows: flexible description of the majority class, we first transform the
Par (xi )
{ }
training samples into a high dimensional feature space F using a
ρi = exp ω × , ∀i = 1, 2, . . . , N (7) nonlinear mapping function ϕ (·), and then compute the smallest
ς
enclosing hypersphere S which is characterized by its center a ∈
N (
F and radius R > 0. The above idea can be formulated into the
√ ) (
∑ 2 )
Par (xi ) = (1/N ) 1/ (2π ) s exp − (1/2s) xi − xj 
D

following optimization problem:
j=1 ⎧ ⎫
N N
(8) ⎨ ∑ ∑ ⎬
∑N argmin R2 − Md2 + C1 ρi ξi + C2 ρj ξj (9)
where ς = (1/N) i=1 Par (xi ); D is the feature dimension of a,R,d,ξi ,ξj ⎩
i,yi =1 j,yj =−1

input data; ω is the weight factor and s is the smoothing param-
yi R2 − ⟨ϕ (xi ) − a, ϕ (xi ) − a⟩ ≥ d2 − ξi
( )
eter of the Parzen-window density estimation technique. Note (10)
that the higher ρi for data xi means the more compact the region ξi ≥ 0, ∀i = 1, 2, . . . , N + N (11)
where xi is located among its corresponding class. From the above
definition, we can find that the density for each majority instance where d is the distance between the hypersphere S and the
is only dependent on majority training instances rather than all closest majority or minority class training instances identified
training instances. This is because that in this study we just used correctly; M ≥ 1 is the regularization parameter where controls
to denote the relative densities to each other among majority the trade-off between the volume of the hypersphere S and the
class, and therefore is called relative density here. In addition, it is margin between the majority class and the minority class.
worth pointing out that in dealing with imbalanced problems, the By introducing Lagrange multipliers, the above constrained
minority class instances are rare, which makes the corresponding optimization problems can be formulated into:
relative densities difficult to be accurately estimated. Moreover, N
∑ N

all minority instances are regarded to be more valuable than L a, R, d, ξi , ξj , α, β = R2 − Md2 + C1 ρi ξ i + C 2 ρj ξj
( )
majority instances in imbalance scenarios. Therefore, considering i,yi =1 j,yj =−1
that the penalty constants C2 for minority class is significantly
N +N
larger than the penalty constants C1 for majority class, we usually ∑ )] N∑
+N
αi d2 − ξi − yi R2 − ⟨ϕ (xi ) − a, ϕ (xi ) − a⟩ − βi ξi
[ (
need to set the relative densities of all minority class instances to +
uniformly be 1. i=1 i=1
(12)
4.2. The introduction of maximum soft margin
where) α = α1 , α2 , . . . , αN +N , αi ≥ 0 and β = β1 , β2 , . . . ,
( ) (

Recall that CSVDD can utilize rare minority class instances to βN +N , βi ≥ 0 are the Lagrange multiplier vectors. According to
improve the classification performance by incorporating the dif- the Karush–Kuhn–Tucker (KKT) condition, we can obtain the dual
ferent misclassification costs into optimization objective function. formulations of the above optimization problems: (the detailed
However, through the definition of the optimization objective process of derivation is described in Appendix).
function, we can find that the improvement of classification per-
⎧ ⎫
⎨N∑
+N N +N
)⎬
formance is implicitly based on an assumption that the total

αi yi k (xi , xi ) − αi αj yi yj k xi , xj
(
argmax (13)
misclassification cost of all training minority instances are not α
i,j=1
⎩ ⎭
i=1
zero. In some cases, especially highly skewed imbalanced datasets
with only a few minority instances available, due to the fact that s.t.
the total misclassification cost of all training minority instances N +N
is often zero, the different misclassification costs fail to adjust

αi yi = 1 (14)
the classification boundary as expected. However, in numerous
i=1
real-world imbalanced problems, there do exist, although rare,
N +N
minority class instances. For example, in machine fault detection, ∑
in addition to a large number of measurements under normal αi = M (15)
working conditions, there may be also some valuable measure- i=1

ments under faulty situations. Although they are not sufficient to αi ≥ 0, ∀i = 1, 2, . . . , N + N (16)
be used to construct bi-classifier, they can be incorporated into
0 ≤ αi ≤ C1 ρi , ∀i, yi = 1, i = 1, 2, . . . , N (17)
training process of SVDD to refine the hypersphere enclosing the
majority data. 0 ≤ αj ≤ C2 ρj , ∀j, yj = −1, j = 1, 2, . . . , N (18)
5
X.M. Tao, W. Chen, X.K. Li et al. Knowledge-Based Systems 219 (2021) 106897

As shown in Eq. (13), the optimization problem is a con- The minority training instances that lie outside the minority class
strained quadratic convex optimization problem. α can be calcu- boundary satisfy:
lated by the convex quadratic programming methods.
αj = C2 ρj ⇒∥ ϕ xj − a ∥2 < R2 + d2 , ξj > 0
( )
(22)
Let γi∗ = αi yi , ∀i = 1, 2, . . . , N + N, and we obtain the center
a and radius R of hypersphere S as follows: We can draw the following conclusions from Eq. (22): (1) For
a given majority class training instance xi , only when it is located
N +N
∑ on or outside of the majority class boundary, its corresponding αi
a= γi∗ ϕ (xi ) (19) is nonzero. (2) For a given minority class training instance xj , only
i=1 when it is located on or outside of the minority class boundary,
its corresponding αj is nonzero. (3) For the other remaining
⎛ ⎞
[ Nn N +N N +N
1 1 ∑
instances, their corresponding αi , αj are zero and have no effect
∑ ∑
R2 = γi∗ k (xl , xi ) + γi∗ γj∗ k xi , xj ⎠
( )
⎝1 − 2
2 Nn on the final optimal solution.
l=1 i=1 i,j=1
⎛ ⎞⎤
Na N +N N +N
1 ∑ ∑ ∑ 4.3. Range determination of the regularization parameter M
γi k (xm , xi ) +

γi γj k xi , xj ⎠⎦
∗ ∗
( )
+ ⎝1 − 2
Na
m=1 i=1 i,j=1 For a given majority class instance xi , if it lies outside of the
(20) majority class boundary, i.e., ∥ ϕ (xi ) − a ∥2 > R2 − d2 , its corre-
sponding slack variable ξi > 0 (refer to Eq. (22)). According to the
where Nn represents the number of the support vectors of the KKT conditions (more details can be found in Appendix)−βi ξi =
majority class; Na represents the number of the support vectors 0, thus βi = 0, and then αi = C1 ρi (refer to Eq. (A.5) in Appendix).
of the minority class; xl is the support vector of the majority In the same manner, for a given minority class( instance xj , if it lies
class; xm is the support vector of the minority class; k(xi , xj ) is outside the minority class boundary, i.e., ∥ ϕ xj − a ∥2 < R2 + d2 ,
)
the Gaussian kernel function. its corresponding ξj > 0 (refer to Eq. (22)). According to the KKT
To determine whether a test instance xnew is within the hyper- conditions −βj ξj = 0, thus βj = 0, and then αj = C2 ρj (refer to
sphere S, we firstly need to calculate the distance of xnew to the Eq. (A.6) in Appendix). Subsequently, according to the equation
center of the hypersphere S. The concrete formula of the distance
∑N +N
constraints: i=1 αi = M and αi ≥ 0, ∀i = 1, 2, . . . , N + N, we
can be expressed as: can obtain:
 2 Pm+ Pm−
 N +N  ∑ ∑
M > C1 ρi + C 2 ρj

ϕ ( ) γ ϕ ( )
2
 
dnew = xne w − ∗
xi
 (23)
 i 
 i=1  i=1 j=1

where Pm+ is the number of the majority class instances which lie
N +N
∑ N +N
∑ outside of the majority class boundary. Pm− is the number of the
γi∗ γj∗ k xi , xj − 2 γi∗ k (xnew , xi ) minority samples which lie outside the minority class boundary.
( )
=1+ (21)
i,j=1 i=1
Furthermore, as 0 ≤ αi ≤ C1 ρi , ∀i, yi = 1, i = 1, 2, . . . , N,
0 ≤ αj ≤ C2 ρj , ∀j, yj = −1, j = 1, 2, . . . , N, and for a given
A test instance xnew is accepted as the target class (majority majority class training instance xi , only when it is located on
class) if its distance to the center of the hypersphere S is smaller or outside of the majority class boundary, its corresponding αi
than the radius R of the hypersphere S, that is, dnew 2 ≤ R2 , and is nonzero; for a given minority class training instance xj , only
otherwise rejected as outliers (minority class). when it is located on or outside of the minority class boundary,
For the convenience of analysis, we firstly give the following its corresponding αj is nonzero We can also conclude:
definitions: Pm+ +Qm+ Pm− +Qm−
Decision boundary: ∥ ϕ (xi ) − a ∥2 = R2 or ∥ ϕ xj − a ∥2 = R2 .
( )
∑ ∑
Majority class boundary: ∥ ϕ ((xi )) − a ∥2 = R2 − d2 . M < C1 ρi + C2 ρj (24)
Minority class boundary: ∥ ϕ xj − a ∥2 = R2 + d2 i=1 j=1

According to the KKT optimality conditions, we can obtain: where Qm+ is the number of the majority class instances which
The majority training instances that lie inside of the majority class lie on the majority class boundary. Qm− is the number of the
boundary satisfy: minority class samples which lie on the minority class boundary.
In other words, Pm+ +Qm+ is the total of the majority class support
αi = 0 ⇒∥ ϕ (xi ) − a ∥2 < R2 − d2 , ξi = 0 vectors and Pm− +Qm− is the number of the minority class support
The minority training instances that lie inside of the minority vectors.
class boundary satisfy: Combining the above two inequality constraints, we can ob-
tain the following expression:
αj = 0 ⇒∥ ϕ xj − a ∥2 > R2 + d2 , ξj = 0
( )
Pm+ Pm− Pm+ +Qm+ Pm− +Qm−
∑ ∑ ∑ ∑
The majority training instances that lie on the majority class C1 ρi + C 2 ρj < M < C 1 ρi + C 2 ρj (25)
boundary satisfy: i=1 j=1 i=1 j=1

0 < αi < C1 ρi ⇒∥ ϕ (xi ) − a ∥2 = R2 − d2 , ξi = 0 4.4. The necessity of introducing relative density-based penalty
weights
The minority training instances that lie on the minority class
boundary satisfy:
The inequality (24) constrains the upper bound of regulariza-
0 < αj < C2 ρj ⇒∥ ϕ xj − a ∥2 = R2 + d2 , ξj = 0 tion coefficient M. As Pm+ + Qm+ < N and Pm− + Qm− < N,
( )
inequality (24) can further be simplified as:
The majority training instances that lie outside of the majority
Pm+ +Qm+ Pm− +Qm− N N
class boundary satisfy: ∑ ∑ ∑ ∑
M < C1 ρi + C2 ρj < C 1 ρi + C 2 ρj (26)
αi = C1 ρi ⇒∥ ϕ (xi ) − a ∥2 > R2 − d2 , ξi > 0 i=1 j=1 i=1 j=1

6
X.M. Tao, W. Chen, X.K. Li et al. Knowledge-Based Systems 219 (2021) 106897

Assume that DSMSM-SVDD does not consider the relative width σ , the regularization coefficient M and the number of
density of each instance, let ρi = 1, ρj = 1, i = 1, 2, . . . , N , j = maximum iterations T .
1, 2, . . . N, and inequality (26) can be formulated into an alterna-
The procedure:
tive expression:
1: Initialize the weight vectors: t = 1, Dt (i) = m 1
,i =
M < C1 N + C2 N (27)
1, 2, . . . , m.
The inequality constraint (27) indicates that the upper bound 2: while t ≤ T do
of M entirely depends on the value of nonnegative penalty con- 3: Use Dt (i) to randomly select m training instances with
stant factors C = [C1 , C2 ]. As mentioned in Section 3, for handling replacement from the original training datasets, and put
the imbalanced classification problems, the ratio of the value of the selected ones in the new current training dataset S tr .
nonnegative penalty constant factors for minority class to one for 4: Train the tth DSMSM-SVDD classifier, ht → Y using S tr
majority class is empirically set to C2 = N C1 , such that M < 2C1 N. and the given parameters.
N
When C1 is set to be small for good generalization performance, 5: for i = 1 to m
M value is relatively small constrained by its upper bound and 6: Predict the class label of xi by the above-trained base
thus the effect of the maximum soft margin regularization term classifier ht , and let the predicted class label be ŷi,t .
on the improvement of the generalization performance is not 7: end for ∑m t
significant. Accordingly, ρi ≥ 1, ρj ≥ 1 would increase the 8: ϵt = i=1 D (i) × I [ŷi[ ,t ̸ = yi ], where
] I [·] is an indicator
upper bound of M and thus enable the value of M more flexi- function,
[ if
] ŷ i, t ̸ = y i , I ŷ i, t ̸ = y i = 1 and otherwise
ble, which provides theoretical evidence that introducing relative I ŷi,t ̸ = yi = 0.
density-based penalty weights is necessary. To facilitate reading, 9: if ϵt ≥ 1/2 then
the notation and symbols used in this paper are summarized in 10: T = t − 1 and abort loop
Table 1. 11: else if ϵt = 0 then αt = 10, where ϵt = 0 noted that all
predicted
4.5. The ensemble scheme of DSMSM-SVDD xi belong to the same class, and the maximum value of
αt is set to be 10.
Recall that ensemble of classifiers has been proven to be an 12: else if 0 < ϵt < 1/2 then
effective strategy for improving generalization performance by
combining each decision of individual classifier into a final voting 1 1 − ϵt
αt = min(10, ln( ))
result. When dealing with imbalanced datasets, standard bi-class 2 ϵt
learning methods pay less attention to the minority instances 13: end if
since they are designed with the aim of maximizing overall classi- 14: Update and normalize instance weight vectors:
fication accuracy. Such an aim tends to result in poor performance
D(t ) (i) exp(αt (I ŷi,t ̸ = yi − I ŷi,t = yi ))
[ ] [ ]
on the minority class due to the introduction of bias error [23]. D(t +1) (i) = ,
This introduced bias error can be successfully alleviated in Ad- Zt
aBoost algorithm by focusing more on misclassified instances. i = 1, 2, . . . , m
Specifically, given an imbalanced dataset, the instances in the
D(t +1) (i) is a normalization factor.
∑m
where Zt =
minority class are often misclassified by standard classification i=1

algorithms due to the effect of overall accuracy-oriented opti- 15: end while
mization objective. In such case, when Adaboost technique is 16: Return
applied, those often misclassified minority class instances can Output: ht and αt for all built DSMSM-SVDD classifiers.
be assigned more weights to increase the chance to be selected
into the next training dataset. Hence, the AdaBoost technique When the ensemble scheme stops, a number of built DSMSM-
seems to have great potentials in improving the classification SVDD classifiers are obtained. We can use them to determine to
performance on the minority class. In fact, the weighting strategy which a test instance xp belongs. Suppose that DSMSM-SVDD1 ,
of AdaBoost can be seemed as a resampling technique combining DSMSM-SVDD2 , · · ·, DSMSM-SVDDL are L base classifiers returned
both over-sampling and under-sampling. Therefore, in essence, it by the above procedure and their corresponding predicted re-
also belongs to data-level technique, which enables it applicable sults are ŷp,t , t = 1, 2, . . . , L. The final class label of xp is then
for most classification methods without any modification about determined by
them. The above-mentioned advantages of AdaBoost make it an L

attractive technique in dealing with the imbalanced datasets. H (x) = argmax( αt I(ŷp,t = y)) (28)
In addition, one-class classifier usually has poor stability since y∈{1,−1}
t =1
it only depends on the majority instances to train the model
and rarely utilizes the minority instances to refine the classifica- 4.6. The computation complexity of the proposed method
tion boundary. Therefore, in this study, we develop an ensemble
scheme using DSMSM-SVDD as basic classifier so as to further Recall that the proposed DSMSM-SVDD method consists of
improve its generalization performance and stability in deal- training process and prediction process. In the training phase of
ing with imbalanced datasets. The Pseudocode for the proposed the modal, the proposed approach primarily involves the calcula-
DSMSM-SVDD ensemble scheme is shown in the following. tion of relative densities for all training majority instances and
SVDD modal training using all training instances. The compu-
Algorithm (Ensemble Scheme Using DSMSM-SVDD as Basic Classi- tation complexity of calculating relative densities for all train-
fier). ing majority instances includes calculating Euclidean distances
Input: the labeled training instances {(x1 , y1 , ρ1 ) , (x2 , y2 , ρ2 ) , among all training majority instances, and all relative densities,
. . . , (xm , ym , ρm )}, where xi ∈ RD is an instance with D-tuple of which are respectively O(N 2 ) and O(N), where N denotes the
attribute values and yi ∈ Y = {−1, 1} is a label, m = N + N number of all training majority instances. The computational
is the size of the whole training datasets, as well as the optimal complexity of training DSMSM-SVDD modal is O(D(N + N)2 ),
nonnegative majority penalty constant factor C1 , Gaussian kernel where D is the feature dimension of input data, and N represents
7
X.M. Tao, W. Chen, X.K. Li et al. Knowledge-Based Systems 219 (2021) 106897

Table 1
Notation and symbols used in this paper.
Symbols Description
(xi , yi ) The ith training instance and its corresponding class label
D The feature dimension of input data
N, N The numbers of training majority(minority) instances
S The obtained hypersphere
a The center of the obtained hypersphere
R The radius of the hypersphere
ξi The slack variable for the ith majority instance
ξi The slack variable for the jth minority instance
ϕ (·) The map function for original space to kernel ones
C1 The penalty constant for majority class
C2 The penalty constant for minority class
IR The imbalance ratio
k (·) The kernel function
σ Gaussian kernel width
αi Lagrange multiplier of the ith majority instance
αj Lagrange multiplier of the jth minority instance
ρi The relative density of the ith majority instance
ρj The relative density of the jth minority instance
ω The weight factor
s The smoothing parameter
d The distance
M The regularization parameter
xl , xm xl is SV of majority class; xm is SV of minority class
Na , Nn Na , Nn are the numbers of majority and minority class SVs
Pm+ The number of the majority instances which lie outside of the majority class boundary
Pm− The number of the minority instances which lie outside the minority class boundary
Qm+ The number of the majority instances which lie on the majority class boundary
Qm− The number of the minority samples which lie on the minority class boundary

the number of all training minority instances. Note that the Table 2
calculation of Euclidean distances and relative densities for all Confusion matrix.
majority instances belongs to preprocessing so that they can be Predict positive class Predict negative class
pre-implemented before training. Therefore, the overall compu- True positive class TP FN
tational complexity of the proposed approach in training phase True negative class FP TN

is O(D(N + N)2 ), which has the same order of magnitude as the


original SVDD algorithm. In the prediction phase of the DSMSM-
SVDD modal, its computational complexity is O(DNS ), where NS = To quantitatively evaluate the classification performance in
Nn + Na , and Nn represents the number of the support vectors imbalanced problems, several alternative classification evaluation
of the majority class; Na represents the number of the support measures have been defined. Most of those evaluation measures
vectors of the minority class. Therefore, the computational com- are dependent on the following confusion matrix as illustrated in
plexity of the ensemble scheme using DSMSM-SVDD classifiers in Table 2, where the columns are the predicted class and the rows
execution phase is O (TDNS ). are the true class. In the confusion matrix, TN (True Negatives) is
the number of majority class instances (negative class) correctly
5. Experimental results and analyses classified, TP (True Positives) is the number of minority class
instances (positive class) correctly classified, FN (False Negatives)
To evaluate the performance of the proposed method in deal- is the number positive instances incorrectly classified as negative,
ing with imbalanced problems, we conducted several classifica- and FP (False Positives) is the number of negative instances
tion experiments on the artificial datasets and several benchmark
incorrectly classified as positive. If only the performance of the
datasets selected from UCI Repository. Experimental environment
minority (positive) class is considered, two measures are impor-
settings include Windows 7, CPU: Intel i7, 3.4G processor, and
tant: Sensitivity and Precision. Sensitivity, which is also called
simulation software: Matlab2010b.
positive class accuracy, is defined as the ratio of True Positives
(TP) to the number of all positive instances. Precision is defined
5.1. Performance evaluation
as the ratio of True Positives (TP) to the number of all instances
Evaluation measures play a key role in assessing the clas- predicted as positive.
sification performance of the model. Traditional accuracy-based F-Measure is suggested to integrate these two measures into
evaluation measure, which is the most commonly used one, is an average. In principle, F-Measure denotes a harmonic mean
no longer an appropriate measure for classification performance between Sensitivity and Precision. The harmonic mean of two
when dealing with imbalanced datasets since the minority class numbers tends to be closer to the smaller of the two. Therefore, a
has very little effect on the accuracy as compared to the majority high F-Measure metric ensures that both Sensitivity and Precision
class. However, in imbalanced classification problems, the correct are reasonably high. When the performance of both classes is
classification of instances in the minority class is more important concerned, an alternative measure called Specificity is required
than the contrary case. For example, in a disease diagnostic prob- to be defined, which represents the ratio of True Negatives (TN)
lem where the disease cases are usually quite rare as compared to the number of all negative instances. G-Mean is suggested
with normal populations, the recognition goal is to detect people as the balanced performance between these two classes, which
with disease. Hence, a favorable classification model should be is intrinsically defined as the geometric mean of sensitivity and
one that provides a higher identification rate on the disease specificity. If the G-Mean value is high, both Sensitivity and
category. Specificity are expected to be high simultaneously.
8
X.M. Tao, W. Chen, X.K. Li et al. Knowledge-Based Systems 219 (2021) 106897

The five performance evaluation measures are defined as fol- denote the target class training instances and five additional
lows: noise samples are labeled in two-dimensional feature space. The
Sensitivity of positive class sample (Sensitivity): points labeled by the black circle denote support vectors, and
the black solid lines represent the classification boundaries of
TPR = Sensitivity = TP/(TP + FN)
the resulting hypersphere optimized after CSVDD and DSMSM-
Precision of positive class sample (Precision): SVDD techniques, which are demonstrated in Fig. 1(a) and (b),
respectively.
Precision = TP/(TP + FP) As has been shown in Fig. 1(a), due to its sensitivity to the
Specificity of negative class sample (Specificity): noises and the effect of the low nonnegative penalty constant
factor for generalization performance, several target class training
TNR = Specificity = TN/(TN + FP) instances are also rejected as outliers, eventually producing an
Geometric mean accuracy (G-M): unreasonable classification boundary. On the other hand, as has
√ been shown in Fig. 1(b), DSMSM-SVDD tends to cover most of
G−M= Sensitivity · Specificity the target class training instances with relatively high densities
and simultaneously reject the noisy samples with low densities.
F-Measure metric of positive class sample (F-M):
Through comparing the results shown in Fig. 1(a) and (b), we can
2 × Sensitivity × Precision conclude that CSVDD does not take into account the different
F−M=
Sensitivity + Precision distributions of the training data in its optimization objective
function, which results in the misclassification of several target
Area under Receiving Operator Characteristic (AUC) is another
class training instances located in edge region and thus under-
widely used evaluation metric for the performance of classifiers
fitting. On the contrary, in the proposed DSMSM-SVDD approach,
especially in imbalanced datasets scenarios. It is referred to as
the relative density-based penalty weights are introduced to the
the area under ROC graph and is not sensitive to the distribution
optimization objective function, which enables the resulting hy-
of two classes. The ROC graph can be obtained by plotting the
persphere to include the region with relatively high densities and
True Positive Rate (Sensitivity) over the False Positive Rate (1-
simultaneously reject the noises with low densities, thus avoiding
Specificity). In order to facilitate plotting ROC curve, we adopt
under-fitting and thus improving the generalization capacity of
positive class membership probabilities output of the proposed
DSMSM-SVDD.
algorithm as scores instead of hard out obtained by sign function
as probabilistic SVM suggested by the reference [68]. The positive
5.2.2. Influence of the maximum soft margin regularization term
class membership probability for xi for the proposed algorithm is
Subsequently, to intuitively illustrate the necessity of adding
calculated by the sigmoid function:
the maximum soft margin regularization term in optimization
1 objective function of the proposed approach, we conducted the
scorei = (29)
1 + exp(−(dnew 2 − R2 )) following comparative experiments with CSVDD on the two-
dimensional artificial datasets including two class imbalanced
where dnew 2 is the output of the proposed algorithm, and R is data. The majority class is the same as the previously used
the radius of the obtained hypersphere S by the proposed algo- dataset, which is also regarded as target class, while the minority
rithm. Similarly, the output of the proposed DSMSM-SVDD en-
class contains only 20 training instances generated from the
semble method is transformed into the positive class membership
Gaussian distribution with mean [1, 1] and variance [0.5, 0.5]. The
probability as follows:
imbalance ratio is about equal to 10:1. For CSVDD and the pro-
∑L
t =1 αt I(ŷp,t = positive class) posed method, we conducted 5-fold stratified cross-validation to
scorei = ∑L (30) determine the best parameter configurations: the preliminary ex-
t =1 αt periments proved that the proposed method reach promising per-
In this study, F-Measure, G-Mean, and AUC are used as the formance when the value of C1 and σ were chosen from a specific
performance measures to compare different methods. scope. Therefore, the majority class penalty constant C1 and σ by
5.2. Performance comparison on artificial datasets

5.2.1. Influence of the relative density-based penalty weights
To intuitively illustrate the influence of the relative density-based penalty weights in the proposed DSMSM-SVDD approach on the classification boundary, we carried out the following comparative experiment with CSVDD on a two-dimensional artificial dataset, which contains 200 target class training instances generated from the Gaussian distribution with mean [−1.4, −1.4] and variance [0.6, 0.6], together with five noisy samples. For the convenience of comparison, we empirically set the nonnegative penalty constant factor C1 = 0.1 and the Gaussian kernel width σ = 4.5 for both CSVDD and the proposed method, the only difference being that the proposed method uses the relative density-based penalty weights. For the proposed DSMSM-SVDD, we set the weight factor ω = 2 and the smoothing parameter s = 10. In addition, to evaluate the effectiveness of the relative density-based penalty weights independently, the DSMSM-SVDD model used here does not contain the maximum soft margin regularization term, so M is set to 0. Classification boundaries obtained by CSVDD and DSMSM-SVDD on this artificial dataset are shown in Fig. 1, where the blue circle points denote the target class training instances. As the comparison in Fig. 1 shows, the relative density-based penalty weights keep the few noisy samples from dominating the resulting hypersphere, avoiding under-fitting and thus improving the generalization capacity of DSMSM-SVDD.

Fig. 1. The effect of relative density-based penalty weights on the classification results.

5.2.2. Influence of the maximum soft margin regularization term
Subsequently, to intuitively illustrate the necessity of adding the maximum soft margin regularization term to the optimization objective function of the proposed approach, we conducted the following comparative experiment with CSVDD on a two-dimensional artificial dataset containing two imbalanced classes. The majority class is the same as in the previously used dataset and is again regarded as the target class, while the minority class contains only 20 training instances generated from the Gaussian distribution with mean [1, 1] and variance [0.5, 0.5]. The imbalance ratio is approximately 10:1. For CSVDD and the proposed method, we conducted 5-fold stratified cross-validation to determine the best parameter configurations: preliminary experiments showed that the proposed method reaches promising performance when the values of C1 and σ are chosen from a specific scope. Therefore, the majority class penalty constant C1 and σ were selected by grid search from the sets {10^-3, 10^-2, 10^-1, 10^0, 10^1, 10^2} and {2^-3, 2^-2, 2^-1, 2^0, 2^1, 2^2, 2^3}, respectively. The G-Mean metric is selected as the cross-validation criterion since it is the only criterion that considers all values in the confusion matrix and can thus provide a more reliable measure. To highlight the essentiality of introducing the maximum soft margin regularization term in the following comparison experiments, for DSMSM-SVDD we set the same optimal parameters as those of CSVDD: ρi = ρj = 1, C = [C1, C2] = [0.1, 1], and σ = 1, with the exception of M = 10. The comparative classification results of CSVDD and DSMSM-SVDD are shown in Fig. 2. The blue circle points denote the target class training instances and the black plus points denote minority class instances in the two-dimensional feature space. The points labeled by black circles denote support vectors. The black solid lines represent the resulting classification boundaries of the hyperspheres optimized by CSVDD and DSMSM-SVDD, shown in Fig. 2(a) and (b), respectively.

Fig. 2. The effect of maximum soft margin regularization term on the classification results.

From the results shown in Fig. 2(a) and (b), we can find that although CSVDD can correctly classify both the majority class and the minority class training instances, its classification boundary is still tightly wrapped around the majority class training instances. This is mainly due to the fact that
the misclassification cost plays no role in improving the resulting hypersphere in this case, thus producing a poor generalization performance. In contrast, owing to the introduction of the maximum soft margin regularization term, which allows the proposed approach to use the rare minority training instances to refine the obtained hypersphere during training, the classification boundary of DSMSM-SVDD tends to shift toward the middle of the two classes of training instances, thus producing a better generalization performance. From the results shown in Fig. 2(a) and (b), we can conclude that when dealing with some imbalanced problems, especially those with only a few minority class instances, the penalty terms of misclassification cost play no role in adjusting the classification boundary since the sum of misclassification costs equals zero in such cases, thus producing an undesirable hypersphere. On the other hand, because the proposed approach depends not only on the penalty terms but also on the margin between the two classes of training instances to refine the classification boundary, the problem of the classification boundary skewing toward the majority class can be effectively avoided. Consequently, the generalization capability can be significantly improved even when dealing with severely imbalanced cases. Compared to CSVDD, DSMSM-SVDD behaves similarly to SVM due to the introduction of the maximum soft margin regularization term. However, compared to SVM, the advantage of DSMSM-SVDD is that it is not significantly affected by a small number of minority class instances, which makes it more suitable for dealing with imbalanced datasets, especially highly imbalanced ones.

5.2.3. The relationship between M and C
In order to intuitively demonstrate the relationship between the parameters M and C = [C1, C2], we carried out two separate experiments on the artificial dataset used in Section 5.2.2. The goal of the first experiment is to verify the effect of C on the range of M. We set C = [0.1, 1] and observe the change of the resulting classification boundary position and support vector distribution while gradually increasing M from 1 to 1000. The other parameters for the proposed method are the same as before. In addition, for the convenience of independently investigating the relationship between C and M, we let ρi = 1, ρj = 1 for all training instances in DSMSM-SVDD and only consider the influence of the maximum soft margin regularization term. In this case, according to Eq. (25), we can further obtain the following simpler expression for it:

$$C_1 P_m^+ + C_2 P_m^- < M < C_1\left(P_m^+ + Q_m^+\right) + C_2\left(P_m^- + Q_m^-\right) \qquad (31)$$

The experimental results are shown in Fig. 3.

Fig. 3. Classification results of DSMSM-SVDD change with different M values when fixed C.

As shown in Fig. 3, when the weight of the maximum soft margin regularization term is small, such as M = 1, the classification result of DSMSM-SVDD resembles that of CSVDD in terms of classification boundary position. This indicates that the maximum soft margin regularization term has little effect on classification performance due to the small M value. When M is increased from 1 to 10 or 30, the resulting classification boundary tends to shift toward the middle of the two classes of training instances and exhibits the same classification characteristics as those obtained by SVM on balanced datasets, reducing the possibility of misclassifying majority class
instances and thus improving the generalization ability of the proposed approach. However, with a further increase of M, such as M = 50 and M = 100, the resulting classification boundary is too severely skewed to be depicted in Fig. 3 and the distribution of the obtained support vectors becomes undesirable. According to inequality (31), we can further conclude that when the value of C is small, such as C = [0.1, 1], the number of support vectors tends to increase along with M in order to satisfy the constraint of the upper bound on M. As can be seen in Fig. 3, almost all the majority class training instances become support vectors when M = 50 or 100, which leads to the occurrence of over-fitting. In extreme cases, when M is too large, for example M = 1000, even if all the majority class training instances are regarded as support vectors, the constraints illustrated in (31) cannot be satisfied due to the effect of the large M value, which forces all the minority class training instances to become support vectors as well, resulting in complete over-fitting. Although in such cases the obtained classification boundary is likely to move toward the middle of the two classes of training instances, the computation and space complexities would be remarkably increased because nearly all training instances become support vectors. In summary, we can draw the following conclusion: the determination of M depends on the value of C. When C is small, M should be set as small as possible within a reasonable range. Otherwise, limited by the constraint of the upper bound on M, the number of support vectors would increase and consequently result in model over-fitting. The 5-fold cross-validation experimental results obtained by the proposed algorithm under different M value settings with respect to the G-Mean metric are quantitatively summarized in Fig. 4.

Fig. 4. The G-Mean metrics under different M values on the synthetic datasets.

Subsequently, to verify the effect of M on the range of C, we carried out the other experiment. In this experiment, we set M = 10 or 100 and then respectively observe the change of the resulting classification boundary and the obtained support vector distribution by gradually increasing the value of C in the two cases. The remaining parameter settings are the same as those of the previous experiment. The experimental results are shown in Fig. 5.

Fig. 5. Classification results of DSMSM-SVDD with different C when M = 10 or M = 100.

As illustrated in Fig. 5, when M = 10, C = [0.01, 0.1], the obtained classification boundary tends to shift toward the majority class and thus generates a large number of misclassified majority instances, affected by the relatively small misclassification penalty constant. The corresponding G-Mean metric obtained by the 5-fold cross-validation results is 0.852. This is because, if M is constant and simultaneously C is too small, especially for the majority class, more support vectors are required to satisfy the upper bound constraint of M, which eventually results in poor generalization performance. When M = 10, C = [0.1, 1], due to the effect of the relatively large misclassification penalty constant, additional support vectors are no longer required to satisfy the constraints of M, and thus the classification boundary gradually moves toward the middle of the two classes, influenced by the maximum soft margin regularization term. The corresponding G-Mean metric obtained by the 5-fold cross-validation results is 0.973. In this case, not only is the generalization performance improved but the computation and space complexities of the proposed approach are also significantly reduced. However, when C is further
increased, such as C = [1, 10] or C = [10, 100], the resulting classification boundary is almost unmodified compared with that at C = [0.1, 1]. The corresponding G-Mean metrics under these two parameter configurations obtained by the 5-fold cross-validation results are the same as for C = [0.1, 1]. On the contrary, when M = 100, the classification boundary does not move toward the middle of the two classes until C = [1, 10]. This is because, if M is relatively large and simultaneously C is small, more support vectors are required under the constraint of the upper bound on M, which can result in over-fitting and eventually deteriorate the generalization performance, as for C = [0.01, 0.1], M = 100 or C = [0.1, 1], M = 100. In summary, the theoretical and empirical results indicate that the adjustment range of C is influenced by the value of M. When the value of M is set to be relatively large, C must be accordingly increased in order to reduce the number of support vectors and thus avoid over-fitting.

In order to avoid the choice of two separate parameters, we give a simple strategy to set M according to C, which is usually required to be pre-specified along with the Gaussian width. To effectively deal with imbalanced datasets, we generally set a relatively large value for M in DSMSM-SVDD so as to enhance the effect of the maximum soft margin regularization term. Therefore, we only consider the upper bound constraint of M. Since ρi > 1, ρj > 1 for all training instances, we let

$$M = N C_1\,\frac{P_m^+ + Q_m^+}{N} + N \cdot IR \cdot C_1\,\frac{P_m^- + Q_m^-}{N}$$

which can still satisfy the upper bound shown in Eq. (25). Here $(P_m^+ + Q_m^+)/N$ and $(P_m^- + Q_m^-)/N$ denote the fractions of the majority class and minority class support vectors, respectively. Generally, in order to guarantee satisfactory generalization performance, we set both of them to 10%, such that M = 10% × 2 × C1 × N.

5.3. Performance comparison on real-world datasets

5.3.1. Experimental data configuration
To evaluate the proposed method, 15 imbalanced datasets available from the UCI Machine Learning Repository [69] (http://archive.ics.uci.edu/ml/datasets.html) are used in this study. More detailed information about the 15 datasets is shown in Table 3. Those datasets with more than two classes were converted into bi-class datasets by means of the one-versus-others strategy, where one of the classes with a relatively small size is labeled the minority one and the rest of the classes are merged into the majority class. In addition, in order to obtain imbalanced datasets with higher imbalance ratios, we select two UCI datasets with high dimensions and large sizes, Pageblock and Yeast, to generate some imbalanced datasets through different class combinations. The detailed class combinations for these generated datasets are indicated in the Category column of Table 3.

5.3.2. Influence of relative density parameters on DSMSM-SVDD
For the proposed DSMSM-SVDD, besides the M value, which is associated with the maximum soft margin regularization term, there are two other parameters that have a significant influence on the performance of the proposed method: the smoothing parameter s and the weight factor ω need to be pre-specified for the relative density-based penalty weights. In the previous cases, we discussed the relation between the M value and the penalty constant C, and gave a simple strategy for setting the M value. In this section, in order to choose appropriate values for the two parameters, we need to investigate the effect of different smoothing parameters s and weight factors ω on the classification performance of the proposed approach. We performed two separate classification experiments on 5 selected datasets from Table 3: Wine, Iris, Abalone, Pima and Ecoli. One is to analyze the classification performance of the proposed DSMSM-SVDD under different smoothing parameters s ranging from 0.1 to 30. Similarly, the other is to analyze the classification performance of the proposed DSMSM-SVDD by increasing the weight factor ω from 0 to 10 with interval 1. Note that a weight factor equal to 0 means that no relative densities are introduced into the proposed DSMSM-SVDD. For each parameter setting, we conducted 5-fold stratified cross-validation to determine the best parameter configurations: the majority class penalty constant C1 and σ were selected by grid search from the sets {10^-2, 10^-1, 10^0, 10^1, 10^2} and {2^-3, 2^-2, 2^-1, 2^0, 2^1, 2^2, 2^3}. M is set to 10% × 2 × C1 × N according to the previously given setting strategy. In order to preserve the original between-class ratios as much as possible, 5-fold stratified cross-validation was used and each experiment was repeated 3 times to report the averaged metric values, so as to avoid randomness influencing the results. The results of the two classification experiments in terms of the G-Mean, F-Measure and AUC metrics are shown in Figs. 6 and 7, respectively.
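As a concrete illustration of the tuning protocol just described (grid search over C1 and σ with repeated 5-fold stratified cross-validation, scored by G-Mean), the following Python sketch shows one way such a search could be organized. It is not the authors' MATLAB implementation; `train_dsmsm_svdd`-style functions are hypothetical stand-ins passed in as arguments, and only the cross-validation scaffolding uses real scikit-learn utilities.

```python
import numpy as np
from itertools import product
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import recall_score

def g_mean(y_true, y_pred):
    # Geometric mean of sensitivity (minority recall) and specificity.
    sens = recall_score(y_true, y_pred, pos_label=1)
    spec = recall_score(y_true, y_pred, pos_label=-1)
    return np.sqrt(sens * spec)

def grid_search_dsmsm(X, y, train_fn, predict_fn, n_repeats=3):
    """Pick (C1, sigma) by repeated 5-fold stratified CV on G-Mean.

    train_fn / predict_fn are placeholders for a DSMSM-SVDD implementation;
    y uses +1 for the minority class and -1 for the majority class.
    """
    C1_grid = [1e-2, 1e-1, 1e0, 1e1, 1e2]
    sigma_grid = [2.0 ** k for k in range(-3, 4)]
    best, best_score = None, -np.inf
    for C1, sigma in product(C1_grid, sigma_grid):
        M = 0.10 * 2 * C1 * len(X)      # setting strategy from Section 5.2.3
        scores = []
        for rep in range(n_repeats):
            skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=rep)
            for tr, te in skf.split(X, y):
                model = train_fn(X[tr], y[tr], C1=C1, sigma=sigma, M=M)
                scores.append(g_mean(y[te], predict_fn(model, X[te])))
        if np.mean(scores) > best_score:
            best, best_score = (C1, sigma), np.mean(scores)
    return best, best_score
```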

Table 3
Description of the experimental datasets.
Dataset Attribute Minority class/Majority class Category Imbalance ratio
Wine 13 48/130 1:others 1:2.71
Iris 4 50/100 2:others 1:2.00
Abalone 8 67/259 16: 6 1:3.86
Pima 8 268/500 1:0 1:1.87
Ecoli 7 52/284 ‘pp’:others 1:5.46
Libra 90 24/336 15:others 1:14.00
Vehicle 18 199/647 ‘van’:others 1:3.25
Balance 4 49/576 ‘B’:others 1:11.76
Haberman 3 81/225 2:1 1:2.78
Car 6 65/1663 ‘v-good’:others 1:25.58
Liver 6 145/200 1:2 1:1.38
Seed 7 70/140 1:others 1:2.00
Spect 22 55/212 1:2 1:3.85
Pageblock 10 560/4913 others:1 1:8.77
Pageblock1 10 329/5144 2:others 1:15.63
Yeast 8 429/1055 2:others 1:2.46
Yeast1 8 244/1240 3:others 1:5.08
Yeast2 8 163/1321 4:others 1:8.10
Yeast3 8 51/463 5:1 1:9.07
Yeast4 8 35/429 7:2 1:12.25
Yeast5 8 20/463 9:1 1:23.15
Yeast6 8 51/1433 5:others 1:28.09
Yeast7 8 44/1440 6:others 1: 32.72
Yeast8 8 35/1449 7:others 1:41.40
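The one-versus-others construction and the imbalance ratios reported in Table 3 can be reproduced with a few lines of NumPy; the snippet below is only an illustrative sketch (the 'pp' example is taken from the Ecoli row of Table 3, everything else, including the variable names, is ours).

```python
import numpy as np

def one_vs_others(labels, minority_classes):
    """Binarize a multi-class label vector: the listed class(es) become the
    minority (+1); every other class is merged into the majority (-1)."""
    labels = np.asarray(labels)
    y = np.where(np.isin(labels, minority_classes), 1, -1)
    imbalance_ratio = (y == -1).sum() / (y == 1).sum()
    return y, imbalance_ratio

# Example in the spirit of the Ecoli row of Table 3 ('pp' vs. others):
# y, ir = one_vs_others(ecoli_labels, ['pp'])   # ir should be close to 5.46
```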

Fig. 6. The change of three metrics of DSMSM-SVDD with different smoothing parameters.

Fig. 7. The change of three metrics of DSMSM-SVDD with different weight factors.

From the results, we can find that the averaged G-Mean, F-Measure and AUC metric values on all selected UCI datasets do not change significantly with increasing smoothing parameter and density weight, except for the density weight equal to 0, where all averaged performance metrics are relatively lower than those obtained with relative densities. Since a 0 weight factor indicates that no relative densities are used in the proposed DSMSM-SVDD model, the significant increase in the three metrics from 0 to 1 shows that the introduction of relative densities is helpful for improving the performance of the proposed algorithm. Moreover, the absence of significant performance changes under the other parameter settings is possibly because the smoothing parameter (magnification times) and the weight factor are both utilized just for the magnification or reduction of the absolute values of the relative densities of different samples, while their relative ratios remain unchanged. In addition, the change of absolute values can be compensated by the effect of the subsequently optimized penalty constant factors and thus has little effect on the classification performance. Therefore, according to the above results and discussion, to ensure good generalization performance for the proposed algorithm, the smoothing parameter can be set in the range from 2 to 10, and the density weight can be empirically set to 2 or 3.

5.3.3. Comparison of computational complexity with C-SVDD
In order to compare the computational complexity between the proposed DSMSM-SVDD and traditional C-SVDD, we provide
the averaged training time and the number of support vectors (SVs) in Table 4 for 15 selected UCI datasets. Since the class prediction of unknown instances by SVDD variants depends on the number of support vectors, as discussed in Section 4.6, this number indicates the computational complexity of the proposed DSMSM-SVDD in the execution phase. From the results, the averaged training times of the proposed DSMSM-SVDD are comparable to those of classical C-SVDD. However, the average number of support vectors in the proposed DSMSM-SVDD is significantly smaller than that of C-SVDD on all 15 datasets except Wine. Therefore, we can conclude that the computational complexity of the proposed DSMSM-SVDD in the training phase is comparable to that of C-SVDD and is much lower than that of C-SVDD in the execution phase.

5.3.4. Experimental results and discussion on UCI datasets
In order to verify the effectiveness of our proposed ensemble DSMSM-SVDD approach in handling imbalanced datasets, we performed comparative experiments on all datasets listed in Table 3 with other imbalanced classification techniques: (1) SVM, (2) CSVDD, (3) random under-sampling based SVM (RUSVM), (4) random over-sampling based SVM (ROSVM), (5) SMOTE over-sampling based SVM (SMOTESVM), (6) BSMOTE over-sampling based SVM (BSMOTESVM), (7) WKSMOTE over-sampling based SVM (WKSMOTESVM) proposed in [41], (8) cost-sensitive SVM (CSVM), (9) CSO-SDAENN, (10) AdaBoost SVM ensemble (AdaSVM), (11) AdaCost SVM ensemble (AdaCSVM), and (12) SVM-based AdaBoost with weights determined by instance categorization (AdaBoostIC) presented in [58]. As a representative bi-class classifier, conventional SVM serves as the baseline method. CSVDD is a classical cost-sensitive one-class classifier and is the method most closely related to the proposed DSMSM-SVDD. Five data pre-processing methods, which include random under-sampling (RU), random over-sampling (RO), the synthetic positive over-sampling technique (SMOTE), the borderline SMOTE (BSMOTE), and WKSMOTE, are selected to combine with an SVM algorithm. CSVM is a widely used algorithm-level imbalanced classification method based on the cost-sensitive strategy. Since the proposed ensemble DSMSM-SVDD approach belongs to ensemble learning, the benchmarking methods must include ensemble-like methods. Consequently, we selected three ensemble methods: AdaBoost with SVM (AdaSVM), AdaCost with SVM (AdaCSVM), and AdaBoostIC with SVM (AdaBoostICSVM). AdaSVM serves as the baseline ensemble method. AdaCSVM is selected since it is the most useful cost-sensitive boosting variant of AdaBoost. In addition, AdaBoostICSVM is a hybrid ensemble method which uses cost-sensitive SVM based on instance categorization as the base learner. The techniques were implemented using Matlab on a PC with a 64-bit operating system, 4.00 GB RAM, and a 3.40 GHz CPU. In order to preserve the original between-class ratios as much as possible, 5-fold stratified cross-validation was used and each experiment was repeated 3 times to report the averaged metric values, so as to avoid randomness influencing the results. For SMOTE, the number of nearest minority neighbors (NN) to be found for each minority instance to generate synthetic instances is selected among the values (3, 5, 7, 9). Similarly, the number of nearest neighbors (k) used to identify borderline instances for BSMOTE was also selected among the values (3, 5, 7, 9). For CSVDD and CSVM, the misclassification cost values for the majority class and minority class are set inversely proportional to the imbalance ratio. For AdaCost-SVM and AdaC2-SVM, the cost setting is the same as that for CSVM, and the only difference is that the cost values for AdaCost-SVM are normalized to lie within [0,1]. For all ensemble versions, the maximum iteration number is set to 15 so as to avoid over-fitting the minority class due to the effect of different cost values [56]. The majority class penalty constant C1 and σ are selected by grid search from the sets {10^-2, 10^-1, 10^0, 10^1, 10^2} and {2^-3, 2^-2, 2^-1, 2^0, 2^1, 2^2, 2^3}. M is set to 10% × 2 × C1 × N according to the previously given setting strategy. The parameters for all compared methods are optimized using stratified cross-validation on the training dataset based on the G-Mean performance measure. Table 5 shows the mean and standard deviation of the experimental results for all compared techniques on the 24 datasets. The best measures are highlighted in bold.

From the results shown in Table 5, we can find that conventional SVM performs poorly when dealing with imbalanced datasets, especially highly imbalanced ones such as Balance, Libra, Ecoli, Pageblock, Pageblock1, and all Yeast sub-datasets except Yeast2 and Yeast5. In particular, for Yeast6, Yeast7 and Yeast8, with imbalance ratios greater than 28:1, the classification performance of conventional SVM deteriorates significantly due to the extreme domination of the majority class. Compared to SVM, CSVDD works no better than conventional SVM on some imbalanced datasets with relatively small imbalance ratios, while it outperforms conventional SVM on some highly imbalanced datasets, including Pageblock1 and all Yeast sub-datasets except Yeast2 and Yeast5. This is mainly because, compared to conventional SVM, CSVDD is a one-class classifier and primarily depends on the majority class instances to learn the classification boundary, which makes it not significantly sensitive to the small number of minority class instances. In addition, from the results shown in Table 5, we can also find that CSVDD fails to obtain satisfactory results in terms of AUC, which shows that the CSVDD classifier is not more robust than the SVM classifier. Unlike CSVDD, the proposed DSMSM-SVDD algorithm incorporates the relative density-based penalty weights and the maximum soft margin regularization term into the SVDD model, which enables it to effectively avoid the effect of noise and to make use of the rare minority class instances to refine the resulting hypersphere, thus producing better performance in terms of the G-Mean, F-Measure and AUC metrics. Compared to the other algorithms, the proposed DSMSM-SVDD ensemble method obtains the best results in at least one of the performance measures on all datasets. In addition, the performance measures of the proposed method do not vary significantly over the five-fold cross-validation and 3 iterations on all datasets except Yeast5, indicating that the proposed method has good stability. In order to evaluate the classification performance of the different methods across multiple datasets, we calculate and analyze the mean rankings of the performance measures for the different methods on these datasets instead of directly comparing the obtained performance measures, following Ref. [70]. Fig. 8 shows the mean rankings of our proposed method and the other compared ones on the 24 datasets in terms of G-Mean, F-Measure and AUC, respectively. For each dataset, the method with the best performance is assigned a ranking of 1 while the worst performing method is assigned a ranking of 12. From the results in Fig. 8, we can find that conventional SVM produces greater (worse) mean rankings in terms of the G-Mean and F-Measure metrics than the other techniques, indicating that the classification performance of traditional SVM is severely deteriorated by the imbalanced data. This is due to the fact that, in the trained SVM model, the number of majority class support vectors is usually larger than that of minority class support vectors because of the domination of the majority class, so the classification boundary tends to skew toward the minority class or even over the region of the minority, thus producing low classification accuracy on the minority class. From the results indicated in Table 5, we find that, as a representative cost-sensitive method, CSVM cannot show good performance when dealing with imbalanced datasets due to its application dependence and the effect of improper misclassification cost values. In addition, compared to RUSVM, all oversampling methods such as ROSVM, SMOTESVM,
Table 4
The averaged training time (sec) and SVs by DSMSM-SVDD and C-SVDD.
Datasets Wine Iris Abalone Pima Ecoli
C-SVDD 0.028(50) 0.011(35) 0.023(65) 0.126(235) 0.021(71)
DSMSM-SVDD 0.031(50) 0.012(21) 0.023(37) 0.123(110) 0.022(41)
Datasets Libra Vehicle Balance Haberman Car
C-SVDD 1.23(156) 0.567(133) 0.103(51) 0.017(37) 0.813(512)
DSMSM-SVDD 1.25(117) 0.512(67) 0.101(25) 0.017(15) 0.803(131)
Datasets Liver Seed Spect Pageblock Yeast
C-SVDD 0.123(48) 0.015(32) 0.092(56) 11.133(787) 3.357(313)
DSMSM-SVDD 0.121(16) 0.015(11) 0.093(23) 11.216(315) 3.279(115)
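The execution-phase cost discussed in Section 5.3.3 comes from evaluating, for every test point, the kernel expansion of the squared distance to the hypersphere centre, which scales linearly with the number of support vectors. The sketch below is an illustration under our own assumptions (Gaussian kernel; expansion coefficients gamma = alpha*y restricted to the support vectors, as in Eqs. (A.25)–(A.26) of the Appendix), not the paper's code, and it makes that dependence explicit.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    # Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2)).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def squared_distance_to_centre(X_test, sv, gamma, sigma):
    """d_new^2(x) = k(x,x) - 2*sum_i gamma_i k(x, x_i)
                     + sum_{i,j} gamma_i gamma_j k(x_i, x_j).

    sv    : support vectors, shape (n_sv, d)
    gamma : their coefficients alpha_i * y_i, shape (n_sv,)
    Cost per test point is O(n_sv), which is why the smaller support sets
    reported in Table 4 translate into cheaper prediction for DSMSM-SVDD.
    """
    K_ts = rbf_kernel(X_test, sv, sigma)     # (n_test, n_sv)
    K_ss = rbf_kernel(sv, sv, sigma)         # (n_sv, n_sv)
    const = gamma @ K_ss @ gamma             # identical for every test point
    return 1.0 - 2.0 * K_ts @ gamma + const  # k(x,x) = 1 for the RBF kernel
```

A natural hard decision consistent with Eq. (29) is to assign a test point to the minority (positive) class when this squared distance exceeds R^2.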

Fig. 8. Mean ranking of all compared imbalanced classification techniques on all tested datasets.

BSMOTESVM, and WKSMOTESVM obtain better results in the mean rankings of all three performance metrics. This is primarily because RUSVM is an under-sampling technique and risks losing some informative majority instances, thus deteriorating the subsequent classification performance. Compared to the other methods, all ensemble variants show relatively good performance in dealing with imbalanced datasets due to the effect of the boosting scheme. This result indicates that the boosting strategy can improve the generalization performance of the SVM classifier. However, AdaSVM is accuracy-oriented and its weighting strategy may favor the majority class, since it contributes more to the overall classification accuracy, which can result in no significant improvement of the classification performance on some datasets. Although the AdaCSVM and AdaBoostICSVM methods can overcome the shortcomings of AdaSVM by adopting a cost-oriented boosting strategy, they still suffer from the inconsistency between the accuracy-oriented SVM base learner and the cost-oriented boosting scheme. Additionally, inappropriate cost-sensitive values also hinder the further improvement of their generalization performance [23]. Compared to the other ensemble methods, the proposed method significantly outperforms them with regard to the mean rankings of all evaluation metrics. This is primarily because the base classifier DSMSM-SVDD significantly outperforms SVM in dealing with imbalanced datasets and its classification boundary is not affected by the rare minority instances. In addition, the proposed method obtains a satisfactory AUC mean ranking, which indicates that the robustness of the proposed approach is significantly improved due to the effect of the boosting scheme.

In order to test whether the differences in mean rankings among the different methods are merely a matter of chance, we perform the Friedman test followed by Holm's test to verify the statistical significance of the proposed method compared to the other imbalanced classification methods with regard to the calculated mean rankings. The null hypothesis is that all compared methods perform similarly in mean rankings, without significant difference. After the Friedman test, the p-values for the three performance measures are 1.0513e−26, 4.6187e−18, and 5.1331e−19, respectively, which are much smaller than the significance level of α = 0.05, indicating that there is sufficient evidence to reject the hypothesis and thus that the compared methods do not all perform similarly. Since the null hypothesis is rejected for all performance measures, a post-hoc test is applied to make pairwise comparisons between the proposed method and the other imbalanced classification methods. Holm's test was used in this study, with the proposed method treated as the control method. Holm's test is a non-parametric equivalent of multiple t-tests that adjusts α to compensate for multiple comparisons in a step-down procedure. The null hypothesis is that the proposed method, as the control algorithm, does not perform better than each of the other methods. Table 6 shows all adjusted α values and the corresponding p-values for each comparison.

From the results, we can find that the null hypothesis is rejected for all pairwise comparisons at a significance level of α = 0.05, except for WKSMOTESVM regarding G-Mean, indicating that the proposed method outperforms the other compared methods with significant difference. Although the p-value of the pairwise comparison for WKSMOTESVM, 0.189, is higher than 0.05, the proposed algorithm still outperforms it with a weak predominance according to the above results. In addition, as shown in Table 6, CSVDD obtains higher adjusted α values and p-values in terms of G-Mean and F-Measure than SVM does with the proposed method as the control method, which indicates that CSVDD can perform relatively better than SVM on the G-Mean and F-Measure evaluation metrics when dealing with imbalanced datasets, especially highly imbalanced ones. Furthermore, the adjusted α values and p-values obtained by all oversampling methods, including ROSVM, SMOTESVM, BSMOTESVM and WKSMOTESVM, are higher than those of RUSVM when the proposed method is used as the control method. These results demonstrate that the oversampling techniques generally show superior performance in most cases compared to the under-sampling method, since the removal of majority instances may lead to the loss of some important information from the datasets, especially in cases where the dataset is small.
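For readers who want to reproduce this analysis, the following sketch (our illustration, not the authors' code) runs the Friedman test with SciPy and then applies the Holm step-down correction to the pairwise p-values, with the proposed method as the control. The pairwise p-values themselves would come from a suitable post-hoc comparison and are taken here as an input array.

```python
import numpy as np
from scipy.stats import friedmanchisquare

def friedman_on_scores(score_matrix):
    """score_matrix: shape (n_datasets, n_methods), one column per method.

    Returns the Friedman statistic and p-value computed from the
    per-dataset scores (e.g. G-Mean over the 24 datasets).
    """
    columns = [score_matrix[:, j] for j in range(score_matrix.shape[1])]
    return friedmanchisquare(*columns)

def holm_stepdown(p_values, alpha=0.05):
    """Holm's step-down procedure for pairwise comparisons vs. a control.

    Returns, for each comparison, the adjusted alpha it is tested against
    and whether its null hypothesis is rejected.
    """
    p = np.asarray(p_values, dtype=float)
    order = np.argsort(p)                  # most significant first
    k = len(p)
    adjusted_alpha = np.empty(k)
    reject = np.zeros(k, dtype=bool)
    stop = False
    for rank, idx in enumerate(order):
        adjusted_alpha[idx] = alpha / (k - rank)
        if not stop and p[idx] <= adjusted_alpha[idx]:
            reject[idx] = True
        else:
            stop = True                    # step-down: stop at the first failure
    return adjusted_alpha, reject
```

The per-comparison adjusted thresholds returned here correspond to the α0.05 column reported in Table 6.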

Table 5
Results of different imbalanced classification methods on all used datasets using SVM.
Dataset Wine Iris Abalone
Measure G-M F-M AUC G-M F-M AUC G-M F-M AUC
SVM 0.813 ± 0.096 0.737 ± 0.119 0.884 ± 0.073 0.969 ± 0.035 0.960 ± 0.039 0.996 ± 0.006 0.921 ± 0.069 0.891 ± 0.080 0.985 ± 0.013
CSVDD 0.674 ± 0.076 0.558 ± 0.092 0.299 ± 0.061 0.746 ± 0.048 0.687 ± 0.051 0.224 ± 0.041 0.886 ± 0.039 0.741 ± 0.060 0.111 ± 0.038
DSMSM-SVDD 0.8512 ± 0.031 0.767 ± 0.035 0.905 ± 0.033 0.968 ± 0.025 0.961 ± 0.016 0.986 ± 0.033 0.928 ± 0.023 0.892 ± 0.056 0.986 ± 0.011
ROSVM 0.845 ± 0.063 0.746 ± 0.073 0.922 ± 0.056 0.966 ± 0.027 0.950 ± 0.037 0.997 ± 0.003 0.932 ± 0.041 0.863 ± 0.068 0.987 ± 0.012
SMOTESVM 0.826 ± 0.066 0.723 ± 0.080 0.902 ± 0.072 0.966 ± 0.022 0.949 ± 0.028 0.998 ± 0.002 0.942 ± 0.037 0.886 ± 0.065 0.987 ± 0.010
BSMOTESVM 0.843 ± 0.069 0.700 ± 0.110 0.905 ± 0.079 0.944 ± 0.034 0.949 ± 0.056 0.995 ± 0.005 0.932 ± 0.046 0.898 ± 0.073 0.980 ± 0.016
WKSMOTESVM 0.866 ± 0.076 0.778 ± 0.092 0.917 ± 0.061 0.976 ± 0.048 0.957 ± 0.051 0.997 ± 0.041 0.946 ± 0.039 0.871 ± 0.060 0.987 ± 0.038
RUSVM 0.805 ± 0.092 0.741 ± 0.091 0.858 ± 0.084 0.964 ± 0.040 0.913 ± 0.046 0.997 ± 0.005 0.935 ± 0.056 0.870 ± 0.046 0.980 ± 0.022
CSVM 0.847 ± 0.060 0.745 ± 0.080 0.903 ± 0.053 0.950 ± 0.036 0.923 ± 0.057 0.996 ± 0.005 0.931 ± 0.035 0.837 ± 0.073 0.983 ± 0.014
CSO-SDAENN 0.807 ± 0.025 0.725 ± 0.031 0.883 ± 0.015 0.955 ± 0.037 0.933 ± 0.028 0.992 ± 0.015 0.933 ± 0.027 0.841 ± 0.015 0.985 ± 0.028
AdaSVM 0.855 ± 0.075 0.755 ± 0.100 0.903 ± 0.070 0.968 ± 0.031 0.947 ± 0.052 0.998 ± 0.002 0.923 ± 0.041 0.852 ± 0.076 0.987 ± 0.010
AdaCSVM 0.849 ± 0.063 0.751 ± 0.090 0.913 ± 0.076 0.971 ± 0.027 0.958 ± 0.035 0.997 ± 0.003 0.935 ± 0.028 0.869 ± 0.051 0.987 ± 0.011
AdaBoostICSVM 0.863 ± 0.037 0.774 ± 0.054 0.906 ± 0.053 0.968 ± 0.029 0.956 ± 0.035 0.996 ± 0.005 0.925 ± 0.059 0.879 ± 0.085 0.980 ± 0.027
Our method 0.873 ± 0.032 0.789 ± 0.029 0.920 ± 0.033 0.975 ± 0.033 0.961 ± 0.021 0.998 ± 0.017 0.940 ± 0.023 0.895 ± 0.025 0.991 ± 0.017
Dataset Pima Ecoli Libra
Measure G-M F-M AUC G-M F-M AUC G-M F-M AUC
SVM 0.660 ± 0.042 0.575 ± 0.056 0.800 ± 0.024 0.902 ± 0.053 0.826 ± 0.061 0.964 ± 0.030 0.845 ± 0.103 0.813 ± 0.129 0.996 ± 0.005
CSVDD 0.697 ± 0.037 0.642 ± 0.048 0.301 ± 0.036 0.929 ± 0.026 0.847 ± 0.048 0.696 ± 0.026 0.679 ± 0.256 0.614 ± 0.270 0.240 ± 0.143
DSMSM-SVDD 0.751 ± 0.027 0.651 ± 0.015 0.805 ± 0.023 0.925 ± 0.012 0.851 ± 0.016 0.966 ± 0.017 0.918 ± 0.016 0.882 ± 0.021 0.991 ± 0.025
ROSVM 0.729 ± 0.040 0.651 ± 0.053 0.805 ± 0.025 0.930 ± 0.041 0.846 ± 0.071 0.960 ± 0.032 0.911 ± 0.104 0.869 ± 0.125 0.995 ± 0.006
SMOTESVM 0.729 ± 0.055 0.641 ± 0.069 0.807 ± 0.035 0.940 ± 0.039 0.859 ± 0.050 0.959 ± 0.042 0.839 ± 0.149 0.802 ± 0.188 0.995 ± 0.007
BSMOTESVM 0.735 ± 0.037 0.618 ± 0.065 0.808 ± 0.035 0.934 ± 0.041 0.863 ± 0.050 0.962 ± 0.038 0.916 ± 0.078 0.840 ± 0.155 0.994 ± 0.013
WKSMOTESVM 0.747 ± 0.037 0.650 ± 0.048 0.815 ± 0.036 0.945 ± 0.026 0.857 ± 0.048 0.966 ± 0.026 0.919 ± 0.256 0.894 ± 0.270 0.996 ± 0.143
RUSVM 0.701 ± 0.048 0.658 ± 0.052 0.815 ± 0.036 0.938 ± 0.026 0.838 ± 0.058 0.960 ± 0.031 0.877 ± 0.133 0.887 ± 0.087 0.993 ± 0.009
CSVM 0.726 ± 0.043 0.647 ± 0.066 0.808 ± 0.033 0.937 ± 0.030 0.862 ± 0.051 0.962 ± 0.030 0.885 ± 0.120 0.857 ± 0.150 0.994 ± 0.009
CSO-SDAENN 0.731 ± 0.031 0.627 ± 0.025 0.813 ± 0.022 0.935 ± 0.032 0.853 ± 0.021 0.963 ± 0.035 0.883 ± 0.019 0.851 ± 0.025 0.989 ± 0.022
AdaSVM 0.723 ± 0.035 0.644 ± 0.045 0.810 ± 0.031 0.939 ± 0.029 0.857 ± 0.061 0.961 ± 0.027 0.908 ± 0.115 0.871 ± 0.169 0.998 ± 0.003
AdaCSVM 0.739 ± 0.054 0.647 ± 0.065 0.811 ± 0.040 0.936 ± 0.043 0.856 ± 0.056 0.958 ± 0.042 0.908 ± 0.110 0.883 ± 0.134 0.997 ± 0.004
AdaBoostICSVM 0.733 ± 0.047 0.643 ± 0.068 0.813 ± 0.031 0.932 ± 0.043 0.851 ± 0.053 0.957 ± 0.039 0.925 ± 0.094 0.902 ± 0.117 0.997 ± 0.004
Our method 0.779 ± 0.015 0.657 ± 0.038 0.827 ± 0.021 0.949 ± 0.017 0.861 ± 0.031 0.975 ± 0.027 0.931 ± 0.013 0.915 ± 0.077 0.996 ± 0.015
Dataset Vehicle Balance Haberman
Measure G-M F-M AUC G-M F-M AUC G-M F-M AUC
SVM 0.968 ± 0.016 0.950 ± 0.026 0.998 ± 0.001 0.000 ± 0.000 0.000 ± 0.000 0.928 ± 0.030 0.460 ± 0.116 0.299 ± 0.125 0.657 ± 0.086
CSVDD 0.773 ± 0.041 0.604 ± 0.048 0.207 ± 0.035 0.000 ± 0.000 0.000 ± 0.000 0.500 ± 0.000 0.656 ± 0.053 0.504 ± 0.064 0.336 ± 0.048
DSMSM-SVDD 0.971 ± 0.032 0.945 ± 0.023 0.985 ± 0.011 0.728 ± 0.025 0.361 ± 0.016 0.852 ± 0.015 0.660 ± 0.031 0.508 ± 0.022 0.686 ± 0.018
ROSVM 0.978 ± 0.011 0.951 ± 0.019 0.998 ± 0.001 0.710 ± 0.064 0.296 ± 0.056 0.842 ± 0.052 0.630 ± 0.102 0.482 ± 0.120 0.716 ± 0.094
SMOTESVM 0.984 ± 0.008 0.963 ± 0.015 0.998 ± 0.001 0.733 ± 0.154 0.352 ± 0.127 0.849 ± 0.068 0.637 ± 0.060 0.494 ± 0.075 0.720 ± 0.057
BSMOTESVM 0.973 ± 0.013 0.948 ± 0.027 0.997 ± 0.001 0.690 ± 0.129 0.308 ± 0.077 0.834 ± 0.059 0.644 ± 0.054 0.454 ± 0.089 0.720 ± 0.053
WKSMOTESVM 0.985 ± 0.041 0.955 ± 0.048 0.998 ± 0.035 0.747 ± 0.042 0.379 ± 0.031 0.851 ± 0.035 0.656 ± 0.053 0.504 ± 0.064 0.726 ± 0.048
RUSVM 0.973 ± 0.013 0.939 ± 0.027 0.997 ± 0.002 0.727 ± 0.086 0.283 ± 0.104 0.864 ± 0.058 0.593 ± 0.074 0.500 ± 0.082 0.707 ± 0.073
CSVM 0.980 ± 0.010 0.955 ± 0.023 0.997 ± 0.001 0.717 ± 0.059 0.309 ± 0.057 0.831 ± 0.051 0.635 ± 0.073 0.490 ± 0.087 0.707 ± 0.066
CSO-SDAENN 0.981 ± 0.036 0.947 ± 0.025 0.993 ± 0.023 0.733 ± 0.028 0.333 ± 0.031 0.863 ± 0.015 0.641 ± 0.019 0.501 ± 0.023 0.719 ± 0.026
AdaSVM 0.980 ± 0.009 0.956 ± 0.019 0.997 ± 0.002 0.721 ± 0.073 0.304 ± 0.069 0.830 ± 0.063 0.645 ± 0.057 0.502 ± 0.068 0.726 ± 0.062
AdaCSVM 0.978 ± 0.015 0.952 ± 0.033 0.997 ± 0.001 0.737 ± 0.100 0.379 ± 0.083 0.844 ± 0.057 0.637 ± 0.103 0.497 ± 0.119 0.727 ± 0.066
AdaBoostICSVM 0.973 ± 0.012 0.958 ± 0.021 0.997 ± 0.003 0.737 ± 0.082 0.370 ± 0.061 0.846 ± 0.064 0.625 ± 0.111 0.489 ± 0.123 0.722 ± 0.067
Our method 0.988 ± 0.017 0.958 ± 0.025 0.998 ± 0.007 0.765 ± 0.035 0.381 ± 0.051 0.865 ± 0.055 0.667 ± 0.035 0.519 ± 0.055 0.727 ± 0.062
Dataset Car Liver Seed
Measure G-M F-M AUC G-M F-M AUC G-M F-M AUC
SVM 0.977 ± 0.039 0.945 ± 0.045 0.999 ± 0.000 0.666 ± 0.068 0.599 ± 0.070 0.749 ± 0.062 0.902 ± 0.038 0.870 ± 0.050 0.979 ± 0.012
CSVDD 0.934 ± 0.014 0.940 ± 0.067 0.063 ± 0.015 0.557 ± 0.080 0.595 ± 0.078 0.407 ± 0.065 0.894 ± 0.036 0.840 ± 0.068 0.104 ± 0.035
DSMSM-SVDD 0.979 ± 0.025 0.956 ± 0.033 0.999 ± 0.012 0.657 ± 0.017 0.641 ± 0.019 0.737 ± 0.021 0.910 ± 0.027 0.889 ± 0.013 0.977 ± 0.028
ROSVM 0.992 ± 0.013 0.939 ± 0.024 0.999 ± 0.000 0.687 ± 0.041 0.646 ± 0.049 0.749 ± 0.025 0.923 ± 0.051 0.889 ± 0.057 0.977 ± 0.017
SMOTESVM 0.977 ± 0.047 0.973 ± 0.051 0.999 ± 0.001 0.686 ± 0.050 0.647 ± 0.066 0.755 ± 0.052 0.936 ± 0.042 0.905 ± 0.056 0.980 ± 0.018
BSMOTESVM 0.990 ± 0.016 0.955 ± 0.055 0.999 ± 0.001 0.693 ± 0.045 0.636 ± 0.080 0.754 ± 0.053 0.918 ± 0.033 0.882 ± 0.103 0.977 ± 0.011
WKSMOTESVM 0.994 ± 0.014 0.973 ± 0.067 0.999 ± 0.015 0.695 ± 0.080 0.659 ± 0.078 0.755 ± 0.065 0.934 ± 0.036 0.903 ± 0.068 0.974 ± 0.035
RUSVM 0.990 ± 0.017 0.938 ± 0.065 0.999 ± 0.000 0.684 ± 0.065 0.654 ± 0.047 0.745 ± 0.068 0.913 ± 0.068 0.868 ± 0.063 0.972 ± 0.026
CSVM 0.991 ± 0.015 0.933 ± 0.051 0.999 ± 0.000 0.685 ± 0.047 0.649 ± 0.051 0.752 ± 0.059 0.915 ± 0.070 0.877 ± 0.091 0.978 ± 0.019
CSO-SDAENN 0.991 ± 0.023 0.941 ± 0.015 0.999 ± 0.000 0.692 ± 0.027 0.643 ± 0.011 0.753 ± 0.035 0.917 ± 0.018 0.880 ± 0.026 0.977 ± 0.033
AdaSVM 0.991 ± 0.015 0.942 ± 0.047 0.999 ± 0.000 0.685 ± 0.043 0.647 ± 0.053 0.740 ± 0.052 0.922 ± 0.053 0.881 ± 0.067 0.978 ± 0.021
AdaCSVM 0.988 ± 0.020 0.932 ± 0.055 0.999 ± 0.000 0.684 ± 0.036 0.643 ± 0.050 0.755 ± 0.044 0.913 ± 0.045 0.874 ± 0.055 0.976 ± 0.020
AdaBoostICSVM 0.983 ± 0.032 0.938 ± 0.049 0.999 ± 0.000 0.684 ± 0.041 0.646 ± 0.048 0.742 ± 0.047 0.928 ± 0.031 0.898 ± 0.037 0.978 ± 0.009
Our method 0.995 ± 0.015 0.977 ± 0.023 0.999 ± 0.000 0.697 ± 0.047 0.667 ± 0.025 0.753 ± 0.033 0.933 ± 0.035 0.915 ± 0.031 0.985 ± 0.023
Dataset Spect Pageblock Pageblock1
Measure G-M F-M AUC G-M F-M AUC G-M F-M AUC
SVM 0.626 ± 0.110 0.500 ± 0.115 0.633 ± 0.100 0.227 ± 0.038 0.099 ± 0.031 0.914 ± 0.020 0.049 ± 0.062 0.011 ± 0.015 0.977 ± 0.009
CSVDD 0.524 ± 0.209 0.427 ± 0.218 0.352 ± 0.096 0.250 ± 0.011 0.113 ± 0.037 0.915 ± 0.016 0.507 ± 0.012 0.621 ± 0.014 0.428 ± 0.013
DSMSM-SVDD 0.721 ± 0.018 0.586 ± 0.023 0.801 ± 0.017 0.857 ± 0.017 0.627 ± 0.021 0.907 ± 0.035 0.915 ± 0.023 0.609 ± 0.016 0.951 ± 0.015
ROSVM 0.704 ± 0.062 0.532 ± 0.097 0.814 ± 0.040 0.851 ± 0.023 0.600 ± 0.034 0.912 ± 0.018 0.882 ± 0.030 0.606 ± 0.035 0.945 ± 0.018
SMOTESVM 0.744 ± 0.070 0.587 ± 0.114 0.812 ± 0.085 0.850 ± 0.020 0.604 ± 0.034 0.913 ± 0.016 0.881 ± 0.030 0.603 ± 0.043 0.946 ± 0.018
BSMOTESVM 0.724 ± 0.081 0.545 ± 0.090 0.820 ± 0.043 0.861 ± 0.014 0.632 ± 0.039 0.923 ± 0.013 0.916 ± 0.016 0.630 ± 0.040 0.972 ± 0.009
WKSMOTESVM 0.752 ± 0.209 0.607 ± 0.218 0.812 ± 0.096 0.857 ± 0.011 0.637 ± 0.037 0.915 ± 0.016 0.917 ± 0.012 0.621 ± 0.014 0.948 ± 0.013
RUSVM 0.702 ± 0.076 0.550 ± 0.091 0.793 ± 0.062 0.814 ± 0.034 0.569 ± 0.030 0.902 ± 0.024 0.865 ± 0.031 0.596 ± 0.036 0.941 ± 0.015
CSVM 0.706 ± 0.085 0.540 ± 0.103 0.794 ± 0.051 0.849 ± 0.018 0.597 ± 0.031 0.911 ± 0.015 0.882 ± 0.022 0.605 ± 0.036 0.947 ± 0.019
CSO-SDAENN 0.711 ± 0.021 0.548 ± 0.017 0.806 ± 0.026 0.851 ± 0.023 0.603 ± 0.018 0.913 ± 0.025 0.897 ± 0.026 0.612 ± 0.017 0.946 ± 0.013
AdaSVM 0.727 ± 0.059 0.567 ± 0.096 0.816 ± 0.053 0.850 ± 0.020 0.601 ± 0.031 0.911 ± 0.017 0.881 ± 0.018 0.605 ± 0.032 0.945 ± 0.018
AdaCSVM 0.743 ± 0.086 0.584 ± 0.132 0.822 ± 0.065 0.847 ± 0.044 0.612 ± 0.0575 0.897 ± 0.024 0.908 ± 0.029 0.613 ± 0.044 0.953 ± 0.019
AdaBoostICSVM 0.729 ± 0.111 0.556 ± 0.126 0.836 ± 0.077 0.856 ± 0.022 0.621 ± 0.013 0.903 ± 0.033 0.909 ± 0.041 0.623 ± 0.027 0.963 ± 0.051
Our method 0.771 ± 0.053 0.625 ± 0.037 0.827 ± 0.021 0.860 ± 0.021 0.639 ± 0.027 0.921 ± 0.019 0.925 ± 0.035 0.627 ± 0.051 0.976 ± 0.026
Dataset Yeast Yeast1 Yeast2
Measure G-M F-M AUC G-M F-M AUC G-M F-M AUC
SVM 0.611 ± 0.041 0.510 ± 0.051 0.791 ± 0.033 0.663 ± 0.040 0.564 ± 0.048 0.858 ± 0.034 0.853 ± 0.049 0.776 ± 0.060 0.976 ± 0.009
CSVDD 0.673 ± 0.033 0.545 ± 0.041 0.325 ± 0.033 0.765 ± 0.032 0.569 ± 0.061 0.232 ± 0.031 0.893 ± 0.031 0.778 ± 0.040 0.106 ± 0.031
DSMSM-SVDD 0.712 ± 0.022 0.596 ± 0.019 0.805 ± 0.021 0.772 ± 0.023 0.601 ± 0.011 0.860 ± 0.021 0.925 ± 0.016 0.779 ± 0.025 0.971 ± 0.031
ROSVM 0.712 ± 0.024 0.589 ± 0.040 0.794 ± 0.022 0.778 ± 0.033 0.575 ± 0.062 0.858 ± 0.031 0.914 ± 0.037 0.746 ± 0.044 0.975 ± 0.007
SMOTESVM 0.716 ± 0.032 0.593 ± 0.041 0.798 ± 0.025 0.777 ± 0.045 0.575 ± 0.064 0.858 ± 0.031 0.923 ± 0.019 0.744 ± 0.055 0.976 ± 0.007
BSMOTESVM 0.710 ± 0.028 0.591 ± 0.063 0.796 ± 0.019 0.783 ± 0.037 0.634 ± 0.034 0.860 ± 0.033 0.926 ± 0.023 0.795 ± 0.048 0.973 ± 0.009
WKSMOTESVM 0.721 ± 0.033 0.595 ± 0.041 0.799 ± 0.033 0.785 ± 0.032 0.596 ± 0.061 0.858 ± 0.031 0.926 ± 0.031 0.788 ± 0.040 0.976 ± 0.033
RUSVM 0.708 ± 0.046 0.589 ± 0.032 0.796 ± 0.037 0.770 ± 0.029 0.557 ± 0.050 0.854 ± 0.030 0.890 ± 0.045 0.735 ± 0.052 0.973 ± 0.012
CSVM 0.706 ± 0.024 0.583 ± 0.040 0.797 ± 0.018 0.780 ± 0.038 0.576 ± 0.037 0.854 ± 0.030 0.914 ± 0.037 0.749 ± 0.055 0.974 ± 0.008
CSO-SDAENN 0.710 ± 0.013 0.588 ± 0.025 0.801 ± 0.012 0.781 ± 0.018 0.583 ± 0.029 0.855 ± 0.020 0.917 ± 0.017 0.752 ± 0.021 0.973 ± 0.023
AdaSVM 0.709 ± 0.025 0.586 ± 0.047 0.795 ± 0.021 0.782 ± 0.044 0.577 ± 0.050 0.860 ± 0.046 0.919 ± 0.032 0.765 ± 0.047 0.975 ± 0.012
AdaCSVM 0.714 ± 0.026 0.591 ± 0.034 0.795 ± 0.027 0.786 ± 0.036 0.597 ± 0.051 0.858 ± 0.029 0.918 ± 0.029 0.752 ± 0.048 0.975 ± 0.008
AdaBoostICSVM 0.717 ± 0.051 0.594 ± 0.051 0.799 ± 0.022 0.783 ± 0.029 0.588 ± 0.046 0.859 ± 0.043 0.919 ± 0.033 0.753 ± 0.056 0.974 ± 0.011
Our method 0.727 ± 0.025 0.619 ± 0.033 0.815 ± 0.027 0.789 ± 0.041 0.613 ± 0.063 0.867 ± 0.032 0.933 ± 0.017 0.781 ± 0.076 0.975 ± 0.015
Dataset Yeast3 Yeast4 Yeast5
Measure G-M F-M AUC G-M F-M AUC G-M F-M AUC
SVM 0.832 ± 0.092 0.766 ± 0.114 0.982 ± 0.010 0.923 ± 0.066 0.747 ± 0.073 0.957 ± 0.043 0.711 ± 0.235 0.663 ± 0.240 0.785 ± 0.157
CSVDD 0.868 ± 0.056 0.769 ± 0.104 0.129 ± 0.054 0.925 ± 0.051 0.841 ± 0.098 0.072 ± 0.049 0.721 ± 0.100 0.665 ± 0.144 0.221 ± 0.127
DSMSM-SVDD 0.882 ± 0.022 0.773 ± 0.018 0.971 ± 0.029 0.920 ± 0.037 0.851 ± 0.021 0.961 ± 0.015 0.725 ± 0.017 0.661 ± 0.025 0.819 ± 0.013
ROSVM 0.888 ± 0.057 0.706 ± 0.103 0.981 ± 0.010 0.910 ± 0.059 0.737 ± 0.118 0.969 ± 0.031 0.725 ± 0.226 0.460 ± 0.200 0.817 ± 0.152
SMOTESVM 0.898 ± 0.043 0.708 ± 0.097 0.981 ± 0.016 0.914 ± 0.067 0.708 ± 0.120 0.974 ± 0.026 0.710 ± 0.233 0.565 ± 0.231 0.837 ± 0.085
BSMOTESVM 0.903 ± 0.055 0.789 ± 0.116 0.974 ± 0.027 0.883 ± 0.048 0.832 ± 0.130 0.963 ± 0.015 0.663 ± 0.116 0.639 ± 0.290 0.774 ± 0.152
WKSMOTESVM 0.898 ± 0.056 0.769 ± 0.104 0.979 ± 0.054 0.925 ± 0.051 0.841 ± 0.098 0.972 ± 0.049 0.721 ± 0.100 0.665 ± 0.144 0.821 ± 0.127
RUSVM 0.873 ± 0.077 0.684 ± 0.109 0.958 ± 0.040 0.910 ± 0.081 0.484 ± 0.127 0.965 ± 0.036 0.679 ± 0.295 0.188 ± 0.137 0.768 ± 0.185
CSVM 0.888 ± 0.061 0.716 ± 0.072 0.979 ± 0.015 0.886 ± 0.114 0.728 ± 0.141 0.976 ± 0.022 0.658 ± 0.293 0.479 ± 0.237 0.798 ± 0.155
CSO-SDAENN 0.890 ± 0.021 0.718 ± 0.015 0.971 ± 0.013 0.891 ± 0.018 0.733 ± 0.029 0.875 ± 0.025 0.717 ± 0.015 0.496 ± 0.021 0.797 ± 0.023
AdaSVM 0.897 ± 0.081 0.725 ± 0.084 0.979 ± 0.021 0.918 ± 0.050 0.737 ± 0.087 0.974 ± 0.024 0.718 ± 0.320 0.492 ± 0.299 0.811 ± 0.194
AdaCSVM 0.894 ± 0.043 0.709 ± 0.105 0.982 ± 0.011 0.913 ± 0.050 0.786 ± 0.071 0.973 ± 0.026 0.712 ± 0.223 0.531 ± 0.194 0.804 ± 0.176
AdaBoostICSVM 0.886 ± 0.054 0.717 ± 0.064 0.977 ± 0.013 0.912 ± 0.084 0.757 ± 0.106 0.971 ± 0.036 0.720 ± 0.139 0.542 ± 0.122 0.816 ± 0.212
Our method 0.896 ± 0.013 0.797 ± 0.035 0.982 ± 0.031 0.927 ± 0.076 0.879 ± 0.051 0.977 ± 0.031 0.742 ± 0.190 0.667 ± 0.113 0.830 ± 0.168
Dataset Yeast6 Yeast7 Yeast8
Measure G-M F-M AUC G-M F-M AUC G-M F-M AUC
SVM 0.112 ± 0.170 0.068 ± 0.110 0.805 ± 0.069 0.766 ± 0.122 0.617 ± 0.035 0.986 ± 0.011 0.653 ± 0.131 0.493 ± 0.164 0.842 ± 0.107
CSVDD 0.822 ± 0.043 0.261 ± 0.061 0.175 ± 0.042 0.956 ± 0.009 0.631 ± 0.069 0.042 ± 0.009 0.784 ± 0.105 0.497 ± 0.118 0.191 ± 0.082
DSMSM-SVDD 0.823 ± 0.013 0.303 ± 0.011 0.891 ± 0.029 0.957 ± 0.037 0.621 ± 0.021 0.981 ± 0.015 0.855 ± 0.017 0.501 ± 0.021 0.947 ± 0.023
ROSVM 0.818 ± 0.092 0.291 ± 0.064 0.893 ± 0.066 0.944 ± 0.046 0.533 ± 0.107 0.985 ± 0.007 0.853 ± 0.081 0.324 ± 0.084 0.948 ± 0.038
SMOTESVM 0.809 ± 0.068 0.271 ± 0.058 0.888 ± 0.059 0.965 ± 0.020 0.543 ± 0.101 0.986 ± 0.006 0.855 ± 0.082 0.285 ± 0.074 0.938 ± 0.034
BSMOTESVM 0.811 ± 0.069 0.304 ± 0.067 0.891 ± 0.065 0.933 ± 0.049 0.587 ± 0.116 0.985 ± 0.006 0.841 ± 0.101 0.516 ± 0.192 0.938 ± 0.033
WKSMOTESVM 0.825 ± 0.043 0.309 ± 0.061 0.895 ± 0.042 0.956 ± 0.009 0.631 ± 0.069 0.985 ± 0.009 0.860 ± 0.105 0.507 ± 0.118 0.951 ± 0.082
RUSVM 0.768 ± 0.092 0.299 ± 0.070 0.872 ± 0.072 0.896 ± 0.072 0.520 ± 0.106 0.983 ± 0.008 0.802 ± 0.231 0.302 ± 0.131 0.935 ± 0.069
CSVM 0.819 ± 0.106 0.277 ± 0.082 0.877 ± 0.077 0.949 ± 0.044 0.539 ± 0.131 0.987 ± 0.007 0.858 ± 0.065 0.303 ± 0.062 0.943 ± 0.030
CSO-SDAENN 0.820 ± 0.017 0.289 ± 0.025 0.897 ± 0.013 0.951 ± 0.013 0.543 ± 0.029 0.983 ± 0.017 0.857 ± 0.025 0.416 ± 0.027 0.937 ± 0.013
AdaSVM 0.822 ± 0.079 0.291 ± 0.070 0.903 ± 0.051 0.950 ± 0.037 0.539 ± 0.071 0.986 ± 0.006 0.848 ± 0.087 0.314 ± 0.106 0.942 ± 0.035
AdaCSVM 0.829 ± 0.066 0.296 ± 0.055 0.893 ± 0.050 0.950 ± 0.035 0.541 ± 0.081 0.986 ± 0.006 0.854 ± 0.071 0.325 ± 0.067 0.937 ± 0.043
AdaBoostICSVM 0.827 ± 0.088 0.296 ± 0.084 0.891 ± 0.070 0.954 ± 0.047 0.538 ± 0.081 0.985 ± 0.007 0.859 ± 0.072 0.318 ± 0.112 0.940 ± 0.048
Our method 0.837 ± 0.013 0.335 ± 0.023 0.899 ± 0.053 0.959 ± 0.105 0.636 ± 0.163 0.987 ± 0.006 0.863 ± 0.032 0.513 ± 0.027 0.955 ± 0.015

Table 6
Results of Holm's test with the proposed method as the control algorithm.
Classification methods α0.05 p-value
Performance metric G-Mean
SVM 0.0046522 2.8269e−33
CSVDD 0.0051162 1.2499e−28
DSMSM-SVDD 0.010206 1.3731e−10
ROSVM 0.0073008 2.7777e−13
SMOTESVM 0.012741 3.3328e−10
BSMOTESVM 0.0085124 2.7777e−13
WKSMOTESVM 0.05 0.18908
RUSVM 0.005683 8.152e−26
CSVM 0.0063912 1.6881e−18
CSO-SDAENN 0.0073008 1.4524e−12
AdaSVM 0.010206 3.6364e−11
AdaCSVM 0.016952 3.9932e−10
AdaBoostICSVM 0.025321 9.7701e−10
Performance metric F-Measure
SVM 0.0051162 1.0852e−17
CSVDD 0.005683 7.131e−17
DSMSM-SVDD 0.025321 3.0008e−04
ROSVM 0.0063912 6.5656e−16
SMOTESVM 0.0085124 1.2835e−11
BSMOTESVM 0.025321 1.081e−06
WKSMOTESVM 0.05 0.031592
RUSVM 0.0046522 9.0035e−19
CSVM 0.0073008 9.4641e−16
CSO-SDAENN 0.0063912 5.9605e−08
AdaSVM 0.010206 1.51e−11
AdaCSVM 0.012741 1.8883e−09
AdaBoostICSVM 0.016952 5.8994e−08
Performance metric AUC
SVM 0.0063912 4.5067e−09
CSVDD 0.0046522 5.0011e−29
DSMSM-SVDD 0.0051162 1.0469e−11
ROSVM 0.010206 5.6478e−08
SMOTESVM 0.025321 3.7995e−05
BSMOTESVM 0.005683 3.1069e−10
WKSMOTESVM 0.05 0.0017855
RUSVM 0.0051162 1.8488e−17
CSVM 0.0073008 9.6286e−09
CSO-SDAENN 0.0046522 5.8436e−12
AdaSVM 0.016952 3.3396e−06
AdaCSVM 0.012741 2.585e−06
AdaBoostICSVM 0.0085124 3.655e−08

As a classical algorithm-level imbalanced classification technique, CSVM needs to determine the cost-sensitive coefficient in advance, which is difficult in most cases and is usually application-dependent, so its improvement in generalization performance is not obvious when the cost-sensitive coefficient values are improper. From the results shown in Table 6, we can also find that all SVM-ensemble methods obtain considerably higher adjusted α values and p-values than the other methods when the proposed method is used as the control method in Holm's test. This result indicates that the boosting strategy can improve the generalization performance of the SVM classifier. Compared with the other imbalanced classification techniques, the proposed DSMSM-SVDD ensemble approach obtains p-values much smaller than the significance level of α = 0.05 in terms of the G-Mean, F-Measure and AUC metrics, indicating that the proposed approach significantly outperforms the other ones when dealing with imbalanced datasets. This is mainly because, as a base classifier, DSMSM-SVDD not only considers the effect of the different distributions of the training instances on the performance but also adopts the maximum soft margin regularization term to effectively utilize the valuable minority class instances, which leads to a more reasonable classification boundary than the other techniques and thus improves the generalization performance. Additionally, unlike a conventional one-class classifier such as CSVDD, which usually obtains unsatisfactory AUC values and possesses poor stability, the proposed approach shows superior classification performance in terms of the AUC measure and achieves improved robustness with the help of the boosting ensemble.

6. Conclusions

In order to deal with imbalanced classification problems, a novel ensemble of density-sensitive SVDD classifiers based on maximum soft margin is proposed. In the proposed DSMSM-SVDD method, the relative density-based penalty weights are introduced to reflect the importance of different training instances. The introduction of the relative density-based penalty weights enables the training instances with high relative densities to fall into the hypersphere more easily than those with low relative densities, which is beneficial for producing a more appropriate classification boundary and thus improving the generalization capability. In addition, to fully utilize the rare and valuable minority class instances to refine the classification boundary, the maximum soft margin regularization term is incorporated into the optimization objective function of the proposed DSMSM-SVDD approach. The comparative results on artificial datasets demonstrate that, compared to CSVDD, DSMSM-SVDD can avoid under-fitting and effectively improve the generalization capacity due to the effect of the relative density-based penalty weights. In addition, with the introduction of the maximum soft margin regularization term, DSMSM-SVDD behaves similarly to SVM while simultaneously not being affected by the rare minority instances, which makes it more suitable for dealing with imbalanced datasets, especially highly imbalanced ones. To further improve the generalization
performance and enhance the robustness, we developed an ensemble scheme based on the AdaBoost strategy using DSMSM-SVDD as the base classifier. The extensive experimental results show that the proposed DSMSM-SVDD ensemble approach can outperform the other compared imbalanced classification techniques with significant difference and has good stability. Finally, we also provide theoretical analyses and empirical results on the relationship between the margin parameter M and the nonnegative penalty constant factors C, and give a simple strategy for setting M according to the C values so as to eliminate the difficulty of setting its value. Furthermore, the effects of the other parameters related to the relative density-based penalty weights are also discussed in the experiments and some suggestions for setting them are given. It is worth noting that, although the proposed algorithm performs well on most of the tested datasets, we find in the experiments that it often cannot obtain satisfactory performance when multi-modality exists in the majority class or within-class imbalance occurs in the minority class. In addition, the effect of the Gaussian kernel parameter on the performance of the proposed algorithm was not discussed in this study; this parameter is difficult to tune, especially on highly skewed imbalanced high-dimensional datasets, because only a few minority instances are available for cross-validation. Therefore, how to improve the performance under multi-modal or within-class imbalanced conditions and how to set appropriate kernel parameters for the proposed algorithm on highly skewed imbalanced high-dimensional datasets should be among our future research focuses.

CRediT authorship contribution statement

Xinmin Tao: Conceptualization, Methodology, Software. Wei Chen: Data curation, Writing - original draft. Xiangke Li: Visualization, Investigation. Xiaohan Zhang: Supervision. Yetong Li: Software, Validation. Jie Guo: Writing - review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the Fundamental Research Funds for the Central Universities, China [grant number 2572017EB02], the Innovative talent fund of Harbin science and technology Bureau, China [grant number 2017RAXXJ018], and the Double first-class scientific research foundation of Northeast Forestry University, China [grant number 411112438]. The authors are grateful to the anonymous reviewers for their valuable comments and suggestions, which were very helpful in improving the quality and presentation of this paper.

Appendix

By introducing the Lagrange multipliers, the primal problem of finding an optimal domain description can be formulated as:

$$L\left(a,R,d,\xi_i,\xi_j,\alpha,\beta\right) = R^2 - Md^2 + C_1\sum_{i,y_i=1}\rho_i\xi_i + C_2\sum_{j,y_j=-1}\rho_j\xi_j + \sum_{i=1}^{N+\bar{N}}\alpha_i\left[d^2 - \xi_i - y_i\left(R^2 - \langle\phi(x_i)-a,\phi(x_i)-a\rangle\right)\right] - \sum_{i=1}^{N+\bar{N}}\beta_i\xi_i \qquad (A.1)$$

where $\alpha = \left(\alpha_1,\alpha_2,\ldots,\alpha_{N+\bar{N}}\right)$, $\alpha_i \geq 0$, and $\beta = \left(\beta_1,\beta_2,\ldots,\beta_{N+\bar{N}}\right)$, $\beta_i \geq 0$, are the Lagrange multiplier vectors.

According to the KKT conditions, setting the partial derivatives with respect to $R$, $d$, $a$, $\xi_i$, $\xi_j$ to zero, we obtain:

$$\frac{\partial L}{\partial R} = 2R - 2\sum_{i=1}^{N+\bar{N}}\alpha_i y_i R = 0 \qquad (A.2)$$

$$\frac{\partial L}{\partial d} = -2Md + 2\sum_{i=1}^{N+\bar{N}}\alpha_i d = 0 \qquad (A.3)$$

$$\frac{\partial L}{\partial a} = -2\sum_{i=1}^{N+\bar{N}}\alpha_i y_i\left(\phi(x_i)-a\right) = 0 \qquad (A.4)$$

$$\frac{\partial L}{\partial \xi_i} = C_1\rho_i - \alpha_i - \beta_i = 0, \quad \forall i,\ y_i = 1 \qquad (A.5)$$

$$\frac{\partial L}{\partial \xi_j} = C_2\rho_j - \alpha_j - \beta_j = 0, \quad \forall j,\ y_j = -1 \qquad (A.6)$$

According to the complementary slackness condition, we obtain:

$$\alpha_i\left[d^2 - \xi_i - y_i\left(R^2 - \langle\phi(x_i)-a,\phi(x_i)-a\rangle\right)\right] = 0 \qquad (A.7)$$

$$\alpha_i \geq 0 \qquad (A.8)$$

$$-\beta_i\xi_i = 0 \qquad (A.9)$$

$$\beta_i \geq 0 \qquad (A.10)$$

Combining the above derived constraints, we can further obtain the following conditions. From Eq. (A.2), we obtain:

$$\sum_{i=1}^{N+\bar{N}}\alpha_i y_i = 1 \qquad (A.11)$$

From Eq. (A.3), we obtain:

$$\sum_{i=1}^{N+\bar{N}}\alpha_i = M \qquad (A.12)$$

From Eqs. (A.5) and (A.8), we obtain:

$$0 \leq \alpha_i \leq C_1\rho_i, \quad \forall i,\ y_i = 1 \qquad (A.13)$$

From Eqs. (A.6) and (A.8), we obtain:

$$0 \leq \alpha_j \leq C_2\rho_j, \quad \forall j,\ y_j = -1 \qquad (A.14)$$

From Eqs. (A.4) and (A.11), we obtain:

$$a = \sum_{i=1}^{N+\bar{N}}\alpha_i y_i\,\phi(x_i) \qquad (A.15)$$

Substituting Eqs. (A.5), (A.6), (A.11), (A.12), (A.15) into Eq. (A.1), the corresponding dual problem is expressed as:

$$\begin{aligned}
L\left(a,R,d,\xi_i,\xi_j,\alpha,\beta\right) &= \sum_{i=1}^{N+\bar{N}}\alpha_i y_i\left\|\phi(x_i)-a\right\|^2 = \sum_{i=1}^{N+\bar{N}}\alpha_i y_i\left\|\phi(x_i)-\sum_{i=1}^{N+\bar{N}}\alpha_i y_i\,\phi(x_i)\right\|^2 \\
&= \sum_{i=1}^{N+\bar{N}}\alpha_i y_i\left(\phi(x_i)\cdot\phi(x_i) - 2\,\phi(x_i)\cdot\sum_{i=1}^{N+\bar{N}}\alpha_i y_i\,\phi(x_i) + \sum_{i,j=1}^{N+\bar{N}}\alpha_i\alpha_j y_i y_j\,\phi(x_i)\cdot\phi(x_j)\right) \\
&= \sum_{i=1}^{N+\bar{N}}\alpha_i y_i\,\phi(x_i)\cdot\phi(x_i) - 2\sum_{i,j=1}^{N+\bar{N}}\alpha_i\alpha_j y_i y_j\,\phi(x_i)\cdot\phi(x_j) + \sum_{i,j=1}^{N+\bar{N}}\alpha_i\alpha_j y_i y_j\,\phi(x_i)\cdot\phi(x_j) \\
&= \sum_{i=1}^{N+\bar{N}}\alpha_i y_i\,\phi(x_i)\cdot\phi(x_i) - \sum_{i,j=1}^{N+\bar{N}}\alpha_i\alpha_j y_i y_j\,\phi(x_i)\cdot\phi(x_j)
\end{aligned} \qquad (A.16)$$

According to the kernel trick, the inner product $\phi(x_i)\cdot\phi(x_j)$ between two input feature vectors can be replaced by a kernel function $k\left(x_i,x_j\right)$ satisfying Mercer's condition, for instance the Gaussian kernel function $k\left(x_i,x_j\right) = \exp\left(-\left\|x_i-x_j\right\|^2/2\sigma^2\right)$. Replacing all inner products with the kernel function, the corresponding dual problem can alternatively be expressed as:

$$\max_{\alpha}\left\{\sum_{i=1}^{N+\bar{N}}\alpha_i y_i\, k\left(x_i,x_i\right) - \sum_{i,j=1}^{N+\bar{N}}\alpha_i\alpha_j y_i y_j\, k\left(x_i,x_j\right)\right\} \qquad (A.17)$$

s.t.

$$\sum_{i=1}^{N+\bar{N}}\alpha_i y_i = 1 \qquad (A.18)$$

$$\sum_{i=1}^{N+\bar{N}}\alpha_i = M \qquad (A.19)$$

$$\alpha_i \geq 0, \quad \forall i = 1,2,\ldots,N+\bar{N} \qquad (A.20)$$

$$0 \leq \alpha_i \leq C_1\rho_i, \quad \forall i,\ y_i = 1 \qquad (A.21)$$

$$0 \leq \alpha_j \leq C_2\rho_j, \quad \forall j,\ y_j = -1 \qquad (A.22)$$
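The dual in (A.17)–(A.22) is a quadratic program in α with two equality constraints and per-instance box constraints, so any generic QP/NLP solver can be used. The sketch below is an illustration under our own assumptions (labels y ∈ {+1, −1}; the per-instance upper bounds already include the relative density weights); it is not the authors' MATLAB implementation and uses SciPy's SLSQP solver for compactness rather than a dedicated QP routine.

```python
import numpy as np
from scipy.optimize import minimize

def solve_dsmsm_dual(K, y, upper, M):
    """Maximize Eq. (A.17) subject to (A.18)-(A.22).

    K     : (n, n) kernel matrix, K[i, j] = k(x_i, x_j)
    y     : (n,) labels in {+1, -1}
    upper : (n,) box upper bounds, C1*rho_i for majority, C2*rho_j for minority
    M     : margin parameter (the required sum of the alphas)
    """
    n = len(y)
    q = y * np.diag(K)           # linear term: alpha_i * y_i * k(x_i, x_i)
    Q = np.outer(y, y) * K       # quadratic term (symmetric)

    def neg_dual(a):             # minimize the negative dual objective
        return -(q @ a - a @ Q @ a)

    def neg_dual_grad(a):
        return -(q - 2.0 * Q @ a)

    cons = ({"type": "eq", "fun": lambda a: a @ y - 1.0},   # (A.18)
            {"type": "eq", "fun": lambda a: a.sum() - M})   # (A.19)
    bounds = [(0.0, ub) for ub in upper]                     # (A.20)-(A.22)
    a0 = np.full(n, M / n)                                   # simple starting point
    res = minimize(neg_dual, a0, jac=neg_dual_grad,
                   bounds=bounds, constraints=cons, method="SLSQP")
    return res.x
```

For larger problems a dedicated convex QP solver (or an SMO-style decomposition) would be preferable; the point of the sketch is only the mapping from (A.17)–(A.22) to a standard constrained optimization problem.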
⎡ ⎛ ⎞
N +N N +N Nn N +N N +N
1
⎣ 1
∑ ∑ ∑
⎝k (xl , xl ) − 2 αi yi k (xl , xi ) + αi αj yi yj k xi , xj ⎠
∑ ∑ ( )
αi yi ϕ (xi ) · ϕ (xi ) − 2 αi αj yi yj ϕ (xi ) · ϕ xj
( )
= =
2 Nn
l=1 i=1 i,j=1
i=1 i,j=1 ⎛ ⎞⎤
Na N +N N +N
N +N 1 ∑ ∑ ∑
⎝k ( x m , x m ) − 2 αi yi k (xm , xi ) + αi αj yi yj k xi , xj ⎠⎦
( )
∑ +
αi αj yi yj ϕ (xi ) · ϕ xj
( )
+ Na
m=1 i=1 i,j=1
i,j=1
(A.24)
N +N N +N
Let γi = αi yi , ∀i = 1, 2, . . . , N + N. We obtain:
∑ ∑ ∗
αi yi ϕ (xi ) · ϕ (xi ) − αi αj yi yj ϕ (xi ) · ϕ xj
( )
= (A.16)
i=1 i,j=1 N +N

According to kernel trick, the inner product ϕ (xi ) · ϕ (xi ) be- a= γi∗ ϕ (xi ) (A.25)
tween(two input feature vectors can be replaced by a kernel func- i=1

tion k xi , xj satisfying Mercer’s condition. For instance,


)
Gaussian ⎛ ⎡
( 2 Nn N +N
kernel function: kernel xi , xj = k xi , xj 1 1
( ) ( ) 
= exp − xi − xj 
∑ ∑
R2 = ⎣ ⎝k (xl , xl ) − 2 γi∗ k (xl , xi )
) 2 Nn
/2σ 2 . Replacing all inner products with the kernel function, the l=1 i=1

corresponding dual problem can alternatively expressed by N +N

γi∗ γj∗ k xi , xj ⎠
( )
⎧ ⎫ +
⎨N∑
+N N +N
∑ )⎬ i,j=1
αi yi k (xi , xi ) − αi αj yi yj k xi , xj
(
max (A.17)
α

Na N +N
i,j=1
⎩ ⎭
i=1 1 ∑ ∑
+ ⎝k (xm , xm ) − 2 γi∗ k (xm , xi )
s.t. Na
m=1 i=1
⎞⎤
N +N N +N

αi yi = 1

(A.18) γi γj k xi , xj ⎠⎦
∗ ∗
( )
+
i=1 i,j=1
N +N ⎡ ⎛ ⎞
∑ Nn N +N N +N
αi = M (A.19) 1 1 ∑ ∑ ∑
γi∗ k (xl , xi ) + γi∗ γj∗ k xi , xj ⎠
( )
= ⎣ ⎝1 − 2
i=1 2 Nn
l=1 i=1 i,j=1
αi ≥ 0, ∀i = 1, 2, . . . , N + N (A.20) ⎛ ⎞⎤
Na N +N N +N
0 ≤ αi ≤ C1 ρi , ∀i, yi = 1 (A.21) 1 ∑ ∑ ∑
γi∗ k (xm , xi ) + γi∗ γj∗ k xi , xj ⎠⎦
( )
+ ⎝1 − 2
Na
0 ≤ αj ≤ C2 ρj , ∀j, yj = −1 (A.22) m=1 i=1 i,j=1

(A.26)
Through solving the dual quadratic programming problem, we (  )
can obtain the Lagrange multiplier vectors α. And thus the center
2
where k xi , xj = exp − xi − xj  /2σ 2 ; k (xl , xl ) = exp
( )
a and the radius R of the minimum enclosing hypersphere S can
− ∥xl − xl ∥2 /2σ 2 = 1; Nn represents the number of majority
( )
be expressed as follows: class support vectors; Na represents the number of minority class
N +N support vectors; xl denotes the majority class support vector; xm
denotes the minority class support vector.

a= αi yi ϕ (xi ) (A.23)
According to the complementary slackness condition of KKT,
i=1
A majority (minority) class training instance xi (xj ) and its corre-
[ ] sponding αi (αj ) satisfies the following conditions:
Nn Na
1 1 ∑( 1 ∑(
R^2 = \frac{1}{2} \left[ \frac{1}{N_n} \sum_{l=1}^{N_n} \| \phi(x_l) - a \|^2 + \frac{1}{N_a} \sum_{m=1}^{N_a} \| \phi(x_m) - a \|^2 \right]

= \frac{1}{2} \left[ \frac{1}{N_n} \sum_{l=1}^{N_n} \left\| \phi(x_l) - \sum_{i=1}^{N^+ + N^-} \alpha_i y_i \phi(x_i) \right\|^2 + \frac{1}{N_a} \sum_{m=1}^{N_a} \left\| \phi(x_m) - \sum_{i=1}^{N^+ + N^-} \alpha_i y_i \phi(x_i) \right\|^2 \right]

= \frac{1}{2} \left[ \frac{1}{N_n} \sum_{l=1}^{N_n} \left( \phi(x_l) \cdot \phi(x_l) - 2 \sum_{i=1}^{N^+ + N^-} \alpha_i y_i \, \phi(x_l) \cdot \phi(x_i) + \sum_{i,j=1}^{N^+ + N^-} \alpha_i \alpha_j y_i y_j \, \phi(x_i) \cdot \phi(x_j) \right) + \frac{1}{N_a} \sum_{m=1}^{N_a} \left( \phi(x_m) \cdot \phi(x_m) - 2 \sum_{i=1}^{N^+ + N^-} \alpha_i y_i \, \phi(x_m) \cdot \phi(x_i) + \sum_{i,j=1}^{N^+ + N^-} \alpha_i \alpha_j y_i y_j \, \phi(x_i) \cdot \phi(x_j) \right) \right]

= \frac{1}{2} \left[ \frac{1}{N_n} \sum_{l=1}^{N_n} \left( k(x_l, x_l) - 2 \sum_{i=1}^{N^+ + N^-} \alpha_i y_i k(x_l, x_i) + \sum_{i,j=1}^{N^+ + N^-} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \right) + \frac{1}{N_a} \sum_{m=1}^{N_a} \left( k(x_m, x_m) - 2 \sum_{i=1}^{N^+ + N^-} \alpha_i y_i k(x_m, x_i) + \sum_{i,j=1}^{N^+ + N^-} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \right) \right]    (A.24)

Let \gamma_i^* = \alpha_i y_i, \forall i = 1, 2, \ldots, N^+ + N^-. We obtain:

a = \sum_{i=1}^{N^+ + N^-} \gamma_i^* \phi(x_i)    (A.25)

R^2 = \frac{1}{2} \left[ \frac{1}{N_n} \sum_{l=1}^{N_n} \left( k(x_l, x_l) - 2 \sum_{i=1}^{N^+ + N^-} \gamma_i^* k(x_l, x_i) + \sum_{i,j=1}^{N^+ + N^-} \gamma_i^* \gamma_j^* k(x_i, x_j) \right) + \frac{1}{N_a} \sum_{m=1}^{N_a} \left( k(x_m, x_m) - 2 \sum_{i=1}^{N^+ + N^-} \gamma_i^* k(x_m, x_i) + \sum_{i,j=1}^{N^+ + N^-} \gamma_i^* \gamma_j^* k(x_i, x_j) \right) \right]

= \frac{1}{2} \left[ \frac{1}{N_n} \sum_{l=1}^{N_n} \left( 1 - 2 \sum_{i=1}^{N^+ + N^-} \gamma_i^* k(x_l, x_i) + \sum_{i,j=1}^{N^+ + N^-} \gamma_i^* \gamma_j^* k(x_i, x_j) \right) + \frac{1}{N_a} \sum_{m=1}^{N_a} \left( 1 - 2 \sum_{i=1}^{N^+ + N^-} \gamma_i^* k(x_m, x_i) + \sum_{i,j=1}^{N^+ + N^-} \gamma_i^* \gamma_j^* k(x_i, x_j) \right) \right]    (A.26)

where k(x_i, x_j) = \exp(-\| x_i - x_j \|^2 / 2\sigma^2); k(x_l, x_l) = \exp(-\| x_l - x_l \|^2 / 2\sigma^2) = 1; N_n represents the number of majority class support vectors; N_a represents the number of minority class support vectors; x_l denotes a majority class support vector; and x_m denotes a minority class support vector.
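Assuming the multipliers have already been obtained and the products \gamma_i^* = \alpha_i y_i have been formed, Eq. (A.26) can be evaluated numerically as in the sketch below. The helper names (gaussian_kernel, squared_radius, sv_major, sv_minor) are illustrative, not part of the original method description.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2)) for every row pair of A and B."""
    d2 = (np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def squared_radius(X, gamma_star, sv_major, sv_minor, sigma):
    """Evaluate Eq. (A.26): R^2 as the averaged squared kernel distance from the
    majority-class (x_l) and minority-class (x_m) support vectors to the center.

    X          : (n, d) training matrix
    gamma_star : gamma_i^* = alpha_i * y_i, length n
    sv_major   : indices of the majority class support vectors (the x_l)
    sv_minor   : indices of the minority class support vectors (the x_m)
    """
    # Constant term sum_ij gamma_i^* gamma_j^* k(x_i, x_j), shared by every summand
    quad = gamma_star @ gaussian_kernel(X, X, sigma) @ gamma_star

    def mean_dist2(idx):
        # (1/|idx|) * sum_s [ 1 - 2 sum_i gamma_i^* k(x_s, x_i) + quad ]
        cross = gaussian_kernel(X[idx], X, sigma) @ gamma_star
        return np.mean(1.0 - 2.0 * cross + quad)

    return 0.5 * (mean_dist2(sv_major) + mean_dist2(sv_minor))
```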
According to the complementary slackness condition of the KKT conditions, a majority (minority) class training instance x_i (x_j) and its corresponding multiplier \alpha_i (\alpha_j) satisfy the following conditions:

(1) If \alpha_i = 0 for a majority class instance (y_i = 1): according to C_1 \rho_i - \alpha_i - \beta_i = 0 and -\beta_i \xi_i = 0, we obtain \beta_i = C_1 \rho_i > 0 and \xi_i = 0. According to \alpha_i [d^2 - \xi_i - y_i (R^2 - \langle \phi(x_i) - a, \phi(x_i) - a \rangle)] = 0, we obtain d^2 - \xi_i - y_i (R^2 - \langle \phi(x_i) - a, \phi(x_i) - a \rangle) < 0, i.e., \| \phi(x_i) - a \|^2 < R^2 - d^2. That is, x_i is classified correctly.
If \alpha_j = 0 for a minority class instance (y_j = -1): according to C_2 \rho_j - \alpha_j - \beta_j = 0 and -\beta_j \xi_j = 0, we obtain \beta_j = C_2 \rho_j > 0 and \xi_j = 0. According to \alpha_j [d^2 - \xi_j - y_j (R^2 - \langle \phi(x_j) - a, \phi(x_j) - a \rangle)] = 0, we obtain d^2 - \xi_j - y_j (R^2 - \langle \phi(x_j) - a, \phi(x_j) - a \rangle) < 0, i.e., \| \phi(x_j) - a \|^2 > R^2 + d^2. That is, x_j is classified correctly.

(2) If 0 < \alpha_i < C_1 \rho_i (y_i = 1): according to C_1 \rho_i - \alpha_i - \beta_i = 0 and -\beta_i \xi_i = 0, we obtain \beta_i = C_1 \rho_i - \alpha_i > 0 and \xi_i = 0. According to \alpha_i [d^2 - \xi_i - y_i (R^2 - \langle \phi(x_i) - a, \phi(x_i) - a \rangle)] = 0, we obtain d^2 - \xi_i - y_i (R^2 - \langle \phi(x_i) - a, \phi(x_i) - a \rangle) = 0, i.e., \| \phi(x_i) - a \|^2 = R^2 - d^2. That is, x_i is correctly classified and is also a majority class support vector.
If 0 < \alpha_j < C_2 \rho_j (y_j = -1): according to C_2 \rho_j - \alpha_j - \beta_j = 0 and -\beta_j \xi_j = 0, we obtain \beta_j = C_2 \rho_j - \alpha_j > 0 and \xi_j = 0. According to \alpha_j [d^2 - \xi_j - y_j (R^2 - \langle \phi(x_j) - a, \phi(x_j) - a \rangle)] = 0, we obtain d^2 - \xi_j - y_j (R^2 - \langle \phi(x_j) - a, \phi(x_j) - a \rangle) = 0, i.e., \| \phi(x_j) - a \|^2 = R^2 + d^2. That is, x_j is correctly classified and is also a minority class support vector.

(3) If \alpha_i = C_1 \rho_i (y_i = 1): according to C_1 \rho_i - \alpha_i - \beta_i = 0 and -\beta_i \xi_i = 0, we obtain \beta_i = 0 and \xi_i > 0. That is, x_i belongs to the majority class yet is misclassified as the minority class.
If \alpha_j = C_2 \rho_j (y_j = -1): according to C_2 \rho_j - \alpha_j - \beta_j = 0 and -\beta_j \xi_j = 0, we obtain \beta_j = 0 and \xi_j > 0. That is, x_j belongs to the minority class yet is misclassified as the majority class.
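The three cases above amount to a simple bookkeeping step once the multipliers are available. The following sketch partitions the training indices accordingly; the function name, the tolerance eps and the argument names are hypothetical and only mirror the case analysis.

```python
import numpy as np

def categorize_by_kkt(alpha, y, rho, C1, C2, eps=1e-8):
    """Partition training instances according to cases (1)-(3):
    case (1): alpha ~ 0          -> strictly correctly classified instances,
    case (2): 0 < alpha < C*rho  -> unbounded support vectors on the boundary,
    case (3): alpha ~ C*rho      -> bound (misclassified) instances."""
    upper = np.where(y == 1, C1 * rho, C2 * rho)   # class-dependent box bound
    case1 = np.where(alpha <= eps)[0]
    case2 = np.where((alpha > eps) & (alpha < upper - eps))[0]
    case3 = np.where(alpha >= upper - eps)[0]
    return case1, case2, case3
```

The indices falling under case (2), split by label, give the majority (x_l) and minority (x_m) support vector sets that enter Eq. (A.26).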
To determine whether an unseen test instance x_new lies within the hypersphere, the distance from x_new to the center of the hypersphere S is calculated as follows:

d_{new}^2 = \left\| \phi(x_{new}) - \sum_{i=1}^{N^+ + N^-} \gamma_i^* \phi(x_i) \right\|^2 = 1 + \sum_{i,j=1}^{N^+ + N^-} \gamma_i^* \gamma_j^* k(x_i, x_j) - 2 \sum_{i=1}^{N^+ + N^-} \gamma_i^* k(x_{new}, x_i)    (A.27)

An unseen test sample x_new is accepted as belonging to the majority class when its distance to the center of the hypersphere is smaller than the radius R, that is, d_{new}^2 \le R^2; otherwise, it is assigned to the minority class.
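Finally, the decision rule of Eq. (A.27) reduces to a few lines of code. The sketch below is illustrative only, with an inline Gaussian kernel helper and hypothetical argument names (X_new, X, gamma_star, R2, sigma).

```python
import numpy as np

def predict_majority(X_new, X, gamma_star, R2, sigma):
    """Apply Eq. (A.27): accept a test sample as majority class when
    d_new^2 = 1 + sum_ij g_i g_j k(x_i, x_j) - 2 sum_i g_i k(x_new, x_i) <= R^2."""
    def kernel(A, B):
        d2 = (np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    quad = gamma_star @ kernel(X, X) @ gamma_star          # constant for all test points
    cross = kernel(np.atleast_2d(X_new), X) @ gamma_star   # sum_i g_i k(x_new, x_i), per row
    d2_new = 1.0 + quad - 2.0 * cross
    return d2_new <= R2                                    # True -> majority, False -> minority
```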