ABSTRACT Classification of class-imbalanced data has drawn significant interest in medical applications. Most existing methods are prone to categorizing samples into the majority class, resulting in bias and, in particular, insufficient identification of the minority class. A novel approach, the class weights random forest, is introduced to address this problem by assigning individual weights to each class instead of a single weight. Validation tests on UCI data sets demonstrate that, for imbalanced medical data, the proposed method enhances the overall performance of the classifier while producing high accuracy in identifying both the majority and minority classes.
INDEX TERMS Class imbalance, random forest, weighted voting, class weights voting.
2169-3536 © 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
VOLUME 6, 2018
M. Zhu et al.: Class Weights Random Forest Algorithm for Processing Class Imbalanced Medical Data
built by Bayesian and novel naive Bayes methods is proposed [58]. A combination method based on the Dempster–Shafer theory of evidence is built [59], and an evidence-based combining classifiers method is proposed for brain signal analysis [60]. Dempster's rule is used to combine multiple classifiers for text categorization [61], and Dempster–Shafer fusion is used to combine models of prostate cancer [62]. Fuzzy integrals and genetic algorithms are used to combine multiple neural network classifiers [63]. However, Bayesian and Dempster–Shafer methods require a priori knowledge, and the calculation of the fuzzy measure function is very complex [58]–[63]. To build the model in a more convenient way, we chose the voting method, which uses a distinctive mechanism to integrate classifiers. Furthermore, the calculation is simple and does not require an auxiliary combiner or preset function; it is not limited to decision trees with axis-parallel splits, and it is applicable to any type of classifier [64].

The most popular voting method is majority voting, on which the random forest (RF) algorithm is based. Random forest classifiers can achieve high accuracy in data classification compared with many standard classification methods [65]–[68]; RF can minimize the overall classification error rate and has the ability to handle class-imbalanced data [4], [56], [69]. However, when the imbalance rate increases (e.g., to 15%), the classification ability weakens [56], [70], [71], because each classifier carries the same weight when the classifiers are combined [70], [72]. Therefore, methods that combine classifiers with different weights have been proposed to solve this problem. In weighted voting random forest (WRF), the decision of each classifier is multiplied by a weight that reflects the individual confidence of that decision [73]–[75]. A weighted majority voting method based on class-conditional independence of the classifier outputs is proposed [76]. Endogenous voting weights are discussed for elected representatives and redistricting [64]. Power in weighted voting games with super-increasing weights is analyzed [77]. An RF method with weighted voting for the task of anomaly detection is presented [72], [78]. The internal out-of-bag (OOB) error metric is used as a tree weight in RF [79]. However, these approaches have not dramatically improved predictive ability. Each classifier still has only a single weight when the classifiers are combined [80], [81], which does not adequately distinguish between the majority class and the minority class; the class weights (CWs), which refer to the different weights per class of each classifier, are not obtained. Based on this observation, classifiers require multiple weights, and introducing the CWs of each classifier has the potential to improve the overall predictive performance [82]. Therefore, CWs should be assigned to better represent the classifier's ability to distinguish between the majority class and the minority class [21], [83]–[86].

Therefore, we propose a class weights random forest (CWsRF) algorithm based on the RF algorithm, which contains a class weights voting approach (CWsV) and trains a collection of classifiers with different weights per class to combine the output of each classifier into an ultimate prediction [87]. The different weights per class are obtained from the empirical error of the classifiers. The algorithm assigns individual weights to each class instead of a single weight and focuses on the problem of effectively identifying the minority class. It can improve recognition performance for the minority class while maintaining that for the majority class.

II. METHOD
The proposed algorithm, CWsRF, comprises three procedures, as shown in Fig. 1: building the RF model, building CWsV, and classifying votes.

FIGURE 1. The framework: CWsRF.

A. THE FRAMEWORK OF CWsRF
RF assigns the same weight to different classes of samples and combines them by majority voting, which makes the classifiers sensitive to the majority class (MAJ), so that classification performance on the minority class (MIN) decreases when RF faces imbalanced data. Therefore, the CWsV approach is designed to distinguish MAJ and MIN.

There are three procedures. 1) Building the RF model: the votes of the RF are obtained. 2) Building the CWsV method: this is the key procedure of the proposed algorithm (CWsRF). It has two steps: a) the most important step, in which the different weights per class are calculated, so that each classifier obtains two weights (a minority weight and a majority weight); b) the votes of the samples are calculated. 3) Classifying votes: the improved votes are classified using a threshold, the aggregating probability (AP).

B. THE PROCEDURES OF CWsRF
1) BUILDING RF MODEL
A traditional RF is built to obtain the votes of the classifiers. First, v_i,j,c and v_te_i,j,c, the labels assigned by the jth classifier to the ith sample of the training set and the test set, respectively, need to be first
obtained; they have one state, either MIN or MAJ, as shown in TABLE 1 and TABLE 2. Here, j indexes the classifiers, i indexes the samples, c is either MIN or MAJ, and x_i is the original label of the ith sample of the training set.

TABLE 1. v_i,j,c: The labels assigned by the jth classifier to the ith sample (training set).

TABLE 2. v_te_i,j: The labels assigned by the jth classifier to the ith sample (test set).

2) BUILDING CWsV
CWsV, the key procedure of CWsRF, is presented in two steps, as shown in Fig. 2.

FIGURE 2. Different weights per class of classifiers.

The classification capability of each classifier is often used to evaluate its weight; therefore, the classifier's prior accuracy (ACC) is used to measure the different weight per class (W ∝ ACC), and the weights are then used to calculate the votes.

(1) Calculating Different Weights per Class
Step 1 (Calculating score_i,j,c): score_i,j,c, the score of each classifier for each sample, takes one of two values, 1 or 0. The equation is given as follows:

score_i,j,c = 1 if v_i,j,c = x_i = c, else 0 (1)

Step 2 (Calculating ACC_j,c): ACC_j,c, the accuracy of the cth class in the jth classifier, captures the different accuracy of each class per classifier; thus, each classifier has two ACC_j,c values. n_MAJ is the number of MAJ samples; n_MIN is the number of MIN samples. H_MIN are the classifiers for which v_i,j,c ∈ MIN; H_MAJ are the classifiers for which v_i,j,c ∈ MAJ.

ACC_j,MIN = ( Σ_{i=1..n_MAJ} Σ_{j=1..H_MAJ} score_i,j,MAJ ) / n_MAJ (2)

ACC_j,MAJ = ( Σ_{i=1..n_MIN} Σ_{j=1..H_MIN} score_i,j,MIN ) / n_MIN (3)

Step 3 (Calculating w_j,c): w_j,c, the weights of each classifier per class, are calculated from ACC_j,c to obtain new voting results. They also take two values:

w_j,MIN = ACC_j,MIN (4)

w_j,MAJ = ACC_j,MAJ (5)

(2) Calculating Votes
The votes of the training set and the test set are obtained, and each sample gets two votes. There are two steps: calculating vtr_i,j and vte_i,j, and calculating vtrain_i,c and vtest_i,c.

vtr_i,j and vte_i,j correspond to the training set and the test set, respectively; they are the votes of the jth classifier for the ith sample after the weights w_j,c are applied. They have one state, either MIN or MAJ:

vtr_i,j,c = v_i,j,c × w_j,c (6)

vte_i,j,c = v_te_i,j,c × w_j,c (7)

vtrain_i,c and vtest_i,c correspond to the training set and the test set, respectively; they are the total votes for the ith sample in MIN and MAJ:

vtrain_i,c = Σ_{j=1..H_c} vtr_i,j,c (8)

vtest_i,c = Σ_{j=1..H_c} vte_i,j,c (9)

3) CLASSIFYING VOTES
Threshold voting is used instead of the majority voting of traditional RF. There are two steps.

(1) Calculating vtr_new_i,j,c
vtr_new_i,j,c is the vote of the ith sample under the different thresholds j; it has one state, either MIN or MAJ.

Algorithm 1 Pseudo-Code of Aggregating Probability
for i ∈ [1 . . . n]
    for j ∈ [−H . . . H]
        if vtrain_i,MIN − vtrain_i,MAJ > j
            Output: vtr_new_i,j,c ← MIN (c ∈ MIN)
        else
            Output: vtr_new_i,j,c ← MAJ (c ∈ MAJ)
    end for
end for

vtr_new_i,j,c is shown in TABLE 3.

TABLE 3. vtr_new_i,j,c: The result of the vote for the ith sample under different j.

(2) Obtaining AP

TABLE 4. Data basic information.
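Under one natural reading of Steps 1-3 and Algorithm 1, the weight and vote calculations can be sketched on a tiny made-up ensemble. The data, the classifier outputs, and the use of balanced accuracy to select the threshold (the paper selects it via AUC) are all illustrative assumptions, not the authors' code:

```python
# Sketch of the CWsV procedure on a toy ensemble (illustrative data only).
MIN, MAJ = 1, 0

# v[i][j]: label given by classifier j to training sample i; x[i]: true label.
x = [MIN, MIN, MAJ, MAJ, MAJ, MAJ]
v = [
    [MIN, MIN, MAJ],
    [MIN, MAJ, MAJ],
    [MAJ, MAJ, MAJ],
    [MAJ, MAJ, MIN],
    [MAJ, MIN, MAJ],
    [MAJ, MAJ, MAJ],
]
n, H = len(x), len(v[0])

# Step 1: score[i][j] = 1 iff classifier j labels sample i correctly (Eq. (1)).
score = [[1 if v[i][j] == x[i] else 0 for j in range(H)] for i in range(n)]

# Step 2: ACC_{j,c} read here as the accuracy of classifier j on class c.
def class_acc(j, c):
    idx = [i for i in range(n) if x[i] == c]
    return sum(score[i][j] for i in idx) / len(idx)

# Step 3: w[j][c] = ACC[j][c] (Eqs. (4)-(5)); two weights per classifier.
w = [{c: class_acc(j, c) for c in (MIN, MAJ)} for j in range(H)]

# Calculating votes (Eqs. (6)-(9)): total weighted vote per class per sample.
def totals(i):
    vmin = sum(w[j][MIN] for j in range(H) if v[i][j] == MIN)
    vmaj = sum(w[j][MAJ] for j in range(H) if v[i][j] == MAJ)
    return vmin, vmaj

# Classifying votes (Algorithm 1): sweep thresholds and keep the one (AP)
# that maximizes balanced accuracy on the training set (a stand-in for the
# paper's AUC-based selection).
def labels_at(t):
    return [MIN if totals(i)[0] - totals(i)[1] > t else MAJ for i in range(n)]

def balanced_acc(pred):
    per_class = []
    for c in (MIN, MAJ):
        idx = [i for i in range(n) if x[i] == c]
        per_class.append(sum(pred[i] == c for i in idx) / len(idx))
    return sum(per_class) / len(per_class)

thresholds = [t / 10 for t in range(-10 * H, 10 * H + 1)]
AP = max(thresholds, key=lambda t: balanced_acc(labels_at(t)))
print("AP =", AP, "prediction:", labels_at(AP))
```

Because the weighted vote totals are real-valued, the sweep uses a fractional grid in place of Algorithm 1's integer range [−H . . . H]; the chosen AP recovers both minority samples that plain majority voting would miss.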
TABLE 6. The results of AUC, F1-score, and Recall for different IRs, where IR = 25%, 20%, 15%, and 10%.
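As a reminder of how the scores reported in TABLE 6 are defined, they can be computed from first principles on a small made-up prediction set (the labels and scores below are illustrative, not the paper's data). AUC is computed via its rank interpretation: the probability that a randomly chosen positive sample is scored above a randomly chosen negative one:

```python
# Recall, F1 and AUC computed from first principles on made-up predictions.
# y: true labels (1 = minority/positive), p: hard predictions, s: scores.
y = [1, 1, 1, 0, 0, 0, 0, 0]
p = [1, 1, 0, 0, 0, 1, 0, 0]
s = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.35]

tp = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi == 1)
fp = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi == 1)
fn = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi == 0)

recall = tp / (tp + fn)            # completeness: share of positives found
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# AUC as the probability that a random positive outranks a random negative.
pos = [si for yi, si in zip(y, s) if yi == 1]
neg = [si for yi, si in zip(y, s) if yi == 0]
auc = sum((sp > sn) + 0.5 * (sp == sn)
          for sp in pos for sn in neg) / (len(pos) * len(neg))

print(round(recall, 3), round(f1, 3), round(auc, 3))  # → 0.667 0.667 0.933
```

Note how the hard predictions give identical recall and F1 here, while AUC, being threshold-free, rewards the fact that most positives are scored above most negatives.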
than 20-25% [90], [91]. Therefore, the considerations above were combined to show the versatility of the algorithm, and the datasets were altered to different imbalance rates (IRs): the minority class was set to 25%, 20%, 15%, or 10% of the majority class, respectively. Moreover, the IRs and the incidences of the datasets used in this paper were matched. These findings show that the algorithm presented here has high practical significance.

B. PARAMETER SETUP
The number of runs of the algorithm (t) is set to 50: the data are randomly selected to construct different IR datasets 50 times, and the average results are taken as the final outcome. Random forest is an ensemble algorithm with good performance, so the running parameters are set in accordance with tradition: the number of classifiers (H) is set to 300 [92], [93]. The number of classes is 2, with 1 representing MIN and 0 representing MAJ. Our algorithm is implemented in C++ and Matlab. The classification accuracy, AUC, F1-score, Recall, and the accuracies on MIN and MAJ are used to analyze effectiveness, and the results of RF, WRF, and CWsRF are compared.

IV. RESULTS
A. AUC, F1-SCORE, RECALL
AUC, F1, and Recall are commonly used when the performance of a classifier must be evaluated on a dataset with a high proportion of minority instances. AUC is beneficial because it is independent of class distribution and cost. Recall is a quality measure of completeness, which intuitively reflects the proportion of positive samples that are correctly identified. The F-score (F1) is the harmonic mean of precision and recall, which can be interpreted as a weighted average of precision and recall. These measures can distinguish the performances of classifiers when processing imbalanced data [94]–[96]. The scores reach their optimum value at 1 and their worst value at 0.

As shown in TABLE 6, the AUC, F1, and Recall of CWsRF are higher than those of RF and WRF. Taking CO as an example, the AUC score of CWsRF performs increasingly better as the IR grows. When IR = 25%, RF achieves an AUC of 0.66, an F1 of 0.76, and a Recall of 0.69; WRF achieves an AUC of 0.73, an F1 of 0.81, and a Recall of 0.75; whereas CWsRF achieves an AUC of 0.83, an F1 of 0.85, and a Recall of 0.88. Although CWsRF improves on the others, it does not yet show a particular advantage. When IR = 20%, the AUC score of CWsRF is 0.17 and 0.09 higher than those of RF and WRF, the F1 score is 0.06 and 0.02 higher, and the Recall score is 0.21 and 0.13 higher. When IR = 15% and 10%, CWsRF has marked advantages: its AUC and Recall scores are approximately 0.30 higher than those of RF and WRF, and its F1 score is nearly 0.20 higher. Therefore, CWsRF has clear advantages over RF and WRF in dealing with imbalanced data.

B. ACCURACY OF THE MINORITY AND MAJORITY SAMPLES
In medical diagnostic classification, the classes of interest are often scarce; in such an unbalanced situation, the accuracy on the minority samples (ACC_MIN) and the accuracy on the majority samples (ACC_MAJ) are more important than the overall accuracy. Therefore, we observed the changes in these values, especially in ACC_MIN.

As shown in TABLE 7, when the imbalance increases, ACC_MAJ, the ability to recognize the majority class, changes little, whereas ACC_MIN, the ability to recognize the minority class, decreases. However, the ACC_MIN of CWsRF is less affected and is better than those of RF and WRF for every IR. Considering the CO samples as an example, as the IR increases, the ACC_MIN of CWsRF shows a distinct advantage. When IR = 25%, the ACC_MIN of RF and WRF are 0.68 and 0.76, respectively, whereas that of CWsRF is 0.87.

TABLE 7. ACC_MIN and ACC_MAJ at different IRs, where IR = 25%, 20%, 15%, and 10%.

FIGURE 4. Δ_CWsRF−RF (ACC_MIN improvement (%) between CWsRF and RF): Δ_CWsRF−RF = (ACC_MIN_CWsRF − ACC_MIN_RF)/ACC_MIN_RF.

FIGURE 5. Δ_CWsRF−WRF (ACC_MIN improvement (%) between CWsRF and WRF): Δ_CWsRF−WRF = (ACC_MIN_CWsRF − ACC_MIN_WRF)/ACC_MIN_WRF. (a) SPE. (b) WD. (c) MA. (d) CO. (e) OST.

When IR = 20%, the ACC_MIN
of RF and WRF are 0.59 and 0.67, respectively, whereas that of CWsRF is 0.79. When IR = 15% and 10%, the ACC_MIN of RF and WRF decrease more obviously, to approximately 0.40, whereas that of CWsRF remains near 0.70. With increasing imbalance, CWsRF is far less sensitive than RF and WRF. Therefore, CWsRF can better identify the minority class.

Fig. 4 and Fig. 5 show that all the results are positive, so CWsRF performed better than RF and WRF. Additionally, with increasing imbalance, especially for IR = 15% and 10%, Δ_CWsRF−RF and Δ_CWsRF−WRF increase; thus, CWsRF is more advantageous than RF and WRF.

C. ACC
As shown in Fig. 6, the accuracy of all the algorithms remains approximately 80%-90% as the imbalance increases, even when the performance is not sufficient to identify the data well; the imbalance itself is reflected outwardly as high accuracy. For example, when there are 100 data points, 90 of which belong to the majority class and 10 to the minority class, a classifier that assigns all 100 points to the majority class still achieves a correct rate of 90%. Hence, high accuracy does not mean good performance, and it is necessary to also consider the classification accuracies of the majority class and the minority class. An algorithm can be considered a good class-imbalanced classification algorithm if it fulfills the following conditions: accuracy without loss (or with little loss), increased AUC, and accurate classification of both the minority and majority samples. Since the accuracy remains high (greater than 80%) combined with high AUC and ACC_MIN, CWsRF achieves better performance.

FIGURE 6. Accuracy of different algorithms per IR: (a) IR = 25%, (b) IR = 20%, (c) IR = 15% and (d) IR = 10%.

V. DISCUSSION
The performance and the complexity are discussed in this section.

A. PERFORMANCE
The performance on MIN increases while the accuracy on MAJ is maintained (shown in TABLE 6, Section IV). Since the performance on MIN improves significantly, it is discussed here. (Due to limited space, the performance on MAJ is not discussed.)

According to classification theory, the distance (D) of a sample to the classification line is used to evaluate the performance of the algorithms. D is calculated from the measured point to the threshold line; larger distances lead to less misclassification.

Consider a sample P(x0, y0), whose location is determined by (X, Y), where X is the number of votes for MAJ and Y is the number of votes for MIN. The line L is Ax + By + C = 0. The sample P(x0, y0) can be characterized by its distance D from (x0, y0) to the line L. Q, the foot of the perpendicular from P to L, can be expressed as

Q = ( (B²x0 − ABy0 − AC) / (A² + B²), (A²y0 − ABx0 − BC) / (A² + B²) ) (11)

|PQ|² = ( (B²x0 − ABy0 − AC)/(A² + B²) − x0 )² + ( (A²y0 − ABx0 − BC)/(A² + B²) − y0 )² (12)

D = |PQ| = (Ax0 + By0 + C) / √(A² + B²) (13)

Generally, L divides the samples into two classes: if D > 0, the sample is classified as MIN; otherwise, it is classified as MAJ. Larger distances between the samples and L lead to less misclassification. Therefore, if D_CWsRF − D_RF > 0, CWsRF performs better than RF on the minority class. The equation is:

D_CWsRF = (−1) × vtrain_i,MAJ + (1) × vtrain_i,MIN + AP (14)

D is calculated from the training set, which is used to build the model; x0 is vtrain_i,MAJ, y0 is vtrain_i,MIN, C is AP, A is −1, and B is 1.

Substituting (8) into (14) gives:

D_CWsRF = − Σ_{j=1..H_MAJ} v_i,j,MAJ × w_j,MAJ + Σ_{j=1..H_MIN} v_i,j,MIN × w_j,MIN + AP (15)

D_RF = − Σ_{j=1..H_MAJ} v_i,j,MAJ + Σ_{j=1..H_MIN} v_i,j,MIN (16)

D_WRF = − Σ_{j=1..H_MAJ} v_i,j,MAJ × w_j + Σ_{j=1..H_MIN} v_i,j,MIN × w_j (17)

Comparing D_CWsRF and D_RF, we write:

D_CWsRF − D_RF = − Σ_{j=1..H_MAJ} v_i,j,MAJ × w_j,MAJ + Σ_{j=1..H_MIN} v_i,j,MIN × w_j,MIN + AP + Σ_{j=1..H_MAJ} v_i,j,MAJ − Σ_{j=1..H_MIN} v_i,j,MIN

Since H_c denotes the classifiers that vote for class c, v_i,j,MAJ and v_i,j,MIN can be replaced by 1, leading to:

D_CWsRF − D_RF = Σ_{j=1..H_MAJ} 1 − Σ_{j=1..H_MAJ} w_j,MAJ + Σ_{j=1..H_MIN} w_j,MIN − Σ_{j=1..H_MIN} 1 + AP (18)

Determining whether n_MIN is greater than Σ_{i=1..n_MIN} Σ_{j=1..H_MIN} score_i,j,MIN leads to:

Σ_{i=1..n_MIN} Σ_{j=1..H_MIN} score_i,j,MIN − Σ_{i=1..n_MIN} 1 = Σ_{i=1..n_MIN} ( Σ_{j=1..H_MIN} score_i,j,MIN − 1 )
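The effect of Eqs. (14)-(16) can be checked on a toy vote configuration. The vote counts, the weights, and the AP value below are illustrative assumptions (a single shared weight per class is used for brevity, rather than per-classifier weights):

```python
# Toy check of D_RF vs. D_CWsRF (Eqs. (14)-(16)) for one minority sample.
MIN, MAJ = 1, 0
votes = [MIN, MIN, MAJ, MAJ, MAJ]   # 5 classifiers, only 2 vote MIN
w_min, w_maj = 0.9, 0.5             # shared per-class weights (illustrative)
AP = 0.4                            # threshold chosen by CWsRF (illustrative)

# D_RF: unweighted vote difference (Eq. (16)); every classifier counts as 1.
d_rf = -sum(1 for v in votes if v == MAJ) + sum(1 for v in votes if v == MIN)

# D_CWsRF: class-weighted vote difference plus AP (Eqs. (14)-(15)).
vtrain_maj = sum(w_maj for v in votes if v == MAJ)
vtrain_min = sum(w_min for v in votes if v == MIN)
d_cwsrf = -vtrain_maj + vtrain_min + AP

# D > 0 classifies the sample as MIN; a larger D means a safer margin.
print(d_rf, round(d_cwsrf, 2), d_cwsrf - d_rf > 0)  # → -1 0.7 True
```

Here majority voting misclassifies the minority sample (D_RF < 0), while the class weights and the AP threshold shift the margin positive (D_CWsRF > 0), which is exactly the sign condition D_CWsRF − D_RF > 0 used in the derivation above.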
FIGURE 9. Improved distance between the different algorithms per IR on the minority class samples of CO: Δ_CWsRF−RF = D_CWsRF − D_RF, Δ_CWsRF−WRF = D_CWsRF − D_WRF, where (a), (b), (c), and (d) represent the improved distances for IR = 25%, 20%, 15%, and 10%, respectively.

B. COMPLEXITY
One approach is added in WRF compared to RF: calculating the weight of each classifier, whose cost is mainly determined by the number of classifiers and the number of samples. Its complexity is O(Hn). Therefore, the complexity of WRF is O(tHfd · log n + tHn).

Two approaches are added in CWsRF compared to RF: (1) the weights of each classifier per class, and (2) the APs based on the number of classifiers.

(1) The first approach is mainly determined by the number of classifiers, the number of samples, and the number of classes, so its complexity is O(Hnc). (2) The second approach involves a sorting process that selects the maximum AUC value and is mainly determined by the number of classifiers, so its complexity is O(H · log H).

Therefore, the complexity of CWsRF is O(t · (Hfd · log n + Hnc + H · log H)). The values n, k, and t influence the efficiency of CWsRF.

In addition, to evaluate the algorithms' complexity, the average running time on the five different datasets was calculated; the results are shown in TABLE 8. The experiments were conducted on a PC (Intel Core i7-3537U, 2.5-GHz CPU, and 4 GB of memory).

TABLE 8. The time cost of the algorithms (seconds).

TABLE 8 shows similar runtimes on the different datasets for the same algorithm. The shortest running time of CWsRF is 2.20 s (on the dataset with 962 instances and 5 attributes), while the longest is 2.41 s (on the dataset with 313 instances and 85 attributes). These results are acceptable in the practice of medical diagnosis, and large complex data will be tested in a further study.

VI. CONCLUSION
The classification of class-imbalanced data is a new research topic and represents an urgent problem to be solved. An algorithm (CWsRF) that develops class weights for processing imbalanced medical data was proposed. In this study, the empirical error is taken as the measurement from which the class weights of the classifiers are obtained. The algorithm yields performance superior to that of the other schemes, achieving very high classification accuracy, AUC, F1, and Recall.

This paper is an attempt to improve the RF ensemble learning algorithm for binary classification; the approach could be extended to ensemble learning with other algorithms and to multi-class problems.

REFERENCES
[1] Y. Zhu and J. Fang, "Logistic regression-based trichotomous classification tree and its application in medical diagnosis," Med. Decision Making, vol. 36, no. 8, pp. 973–989, 2016.
[2] S. D. Zhao, "Integrative genetic risk prediction using non-parametric empirical Bayes classification," Biometrics, vol. 73, no. 2, pp. 582–592, 2017.
[3] E. Dong, C. Li, L. Li, S. Du, A. N. Belkacem, and C. Chen, "Classification of multi-class motor imagery with a novel hierarchical SVM algorithm for brain-computer interfaces," Med. Biol. Eng. Comput., vol. 55, no. 10, pp. 1809–1818, 2017.
[4] M. Zhu, J. Xia, M. Yan, G. Cai, J. Yan, and G. Ning, "Dimensionality reduction in complex medical data: Improved self-adaptive niche genetic algorithm," Comput. Math. Methods Med., vol. 2015, Oct. 2015, Art. no. 794586, doi: 10.1155/2015/794586.
[5] M. Durgadevi and R. Kalpana, "Medical distress prediction based on classification rule discovery using ant-miner algorithm," in Proc. 11th Int. Conf. Intell. Syst. Control (ISCO), Jan. 2017, pp. 88–92.
[6] M. Chen, Y. Hao, K. Hwang, L. Wang, and L. Wang, "Disease prediction by machine learning over big data from healthcare communities," IEEE Access, vol. 5, no. 1, pp. 8869–8879, 2017.
[7] M. Chen, X. Shi, Y. Zhang, D. Wu, and M. Guizani, "Deep features learning for medical image analysis with convolutional autoencoder neural network," IEEE Trans. Big Data, 2017, doi: 10.1109/TBDATA.2017.2717439.
[8] B. A. Bak and J. L. Jensen, "High dimensional classifiers in the imbalanced case," Comput. Stat. Data Anal., vol. 98, pp. 46–59, Jun. 2016.
[9] Y. Zhang, P. Fu, W. Liu, and G. Chen, "Imbalanced data classification based on scaling kernel-based support vector machine," Neural Comput. Appl., vol. 25, nos. 3–4, pp. 927–935, 2014.
[10] C. K. Maurya, D. Toshniwal, and G. V. Venkoparao, "Online sparse class imbalance learning on big data," Neurocomputing, vol. 216, pp. 250–260, Dec. 2016.
[11] S. Al-Stouhi and C. K. Reddy, "Transfer learning for class imbalance problems with inadequate data," Knowl. Inf. Syst., vol. 48, no. 1, pp. 201–228, 2016.
[12] M. El-Banna, "Modified Mahalanobis Taguchi system for imbalance data classification," Comput. Intell. Neurosci., vol. 2017, Jul. 2017, Art. no. 5874896.
[13] B. Mirza et al., "Efficient representation learning for high-dimensional imbalance data," in Proc. IEEE Int. Conf. Digit. Signal Process. (DSP), Oct. 2016, pp. 511–515.
[14] M. M. Al-Rifaie and H. A. Alhakbani, "Handling class imbalance in direct marketing dataset using a hybrid data and algorithmic level solutions," in Proc. SAI Comput. Conf. (SAI), Jul. 2016, pp. 446–451.
[15] H. Lu, K. Yang, and J. Shi, "Constraining the water imbalance in a land data assimilation system through a recursive assimilation scheme," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2016, pp. 2993–2996.
[16] S. Pouyanfar and S.-C. Chen, "Automatic video event detection for imbalance data using enhanced ensemble deep learning," Int. J. Semantic Comput., vol. 11, no. 1, pp. 85–109, 2017.
[17] W. Mao, J. Wang, and Z. Xue, "An ELM-based model with sparse-weighting strategy for sequential data imbalance problem," Int. J. Mach. Learn. Cybern., vol. 8, no. 4, pp. 1333–1345, 2017.
[18] J. Zhai, S. Zhang, and C. Wang, "The classification of imbalanced large data sets based on MapReduce and ensemble of ELM classifiers," Int. J. Mach. Learn. Cybern., vol. 8, no. 3, pp. 1009–1017, 2017.
[19] A. Amin et al., "Comparing oversampling techniques to handle the class imbalance problem: A customer churn prediction case study," IEEE Access, vol. 4, pp. 7940–7957, 2016.
[20] K. Jiang, J. Lu, and K. Xia, "A novel algorithm for imbalance data classification based on genetic algorithm improved SMOTE," Arabian J. Sci. Eng., vol. 41, no. 8, pp. 3255–3266, 2016.
[21] X. Zhang, Q. Song, G. Wang, K. Zhang, L. He, and X. Jia, "A dissimilarity-based imbalance data classification algorithm," Appl. Intell., vol. 42, no. 3, pp. 544–565, 2015.
[22] J. Wang, J. Z. Feng, and Z. Han, "Discriminative feature selection based on imbalance SVDD for fault detection of semiconductor manufacturing processes," J. Circuits, Syst. Comput., vol. 25, no. 11, p. 1650143, 2016.
[23] P. Vorraboot, S. Rasmequan, K. Chinnasarn, and C. Lursinsap, "Improving classification rate constrained to imbalanced data between overlapped and non-overlapped regions by hybrid algorithms," Neurocomputing, vol. 152, pp. 429–443, Mar. 2015.
[24] P. Du, A. Samat, B. Waske, S. Liu, and Z. Li, "Random forest and rotation forest for fully polarized SAR image classification using polarimetric and spatial features," ISPRS J. Photogramm. Remote Sens., vol. 105, pp. 38–53, Jul. 2015.
[25] M. Belgiu and L. Drǎguţ, "Random forest in remote sensing: A review of applications and future directions," ISPRS J. Photogramm. Remote Sens., vol. 114, pp. 24–31, Apr. 2016.
[26] L.-I. Tong, K.-H. Chang, P.-Y. Wu, and Y.-C. Chan, "Using dual response surface methodology as a benchmark to process multi-class imbalanced data," J. Ind. Prod. Eng., vol. 34, no. 2, pp. 147–158, 2017.
[27] H. Lee, E. Kim, and S. Kim, "Anomalous propagation echo classification of imbalanced radar data with support vector machine," Adv. Meteorol., vol. 2016, pp. 1–13, Jan. 2016.
[28] M. J. Fernández-Gómez, G. Asencio-Cortés, A. Troncoso, and F. Martínez-Álvarez, "Large earthquake magnitude prediction in Chile with imbalanced classifiers and ensemble learning," Appl. Sci., vol. 7, no. 6, p. 625, 2017.
[29] J. Jia, Z. Liu, X. Xiao, B. Liu, and K. C. Chou, "iPPBS-Opt: A sequence-based ensemble classifier for identifying protein-protein binding sites by optimizing imbalanced training datasets," Molecules, vol. 21, no. 1, p. 95, 2016.
[30] U. R. Salunkhe and S. N. Mali, "Classifier ensemble design for imbalanced data classification: A hybrid approach," Procedia Comput. Sci., vol. 85, pp. 725–732, May 2016.
[31] J.-H. Xue and P. Hall, "Why does rebalancing class-unbalanced data improve AUC for linear discriminant analysis?" IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 5, pp. 1109–1112, May 2015.
[32] M. Thakong, S. Phimoltares, S. Jaiyen, and C. Lursinsap, "Fast learning and testing for imbalanced multi-class changes in streaming data by dynamic multi-stratum network," IEEE Access, vol. 5, pp. 10633–10648, 2017.
[33] B. Wang and J. Pineau, "Online bagging and boosting for imbalanced data streams," IEEE Trans. Knowl. Data Eng., vol. 28, no. 12, pp. 3353–3366, Dec. 2016.
[34] M. N. Haque, N. Noman, R. Berretta, and P. Moscato, "Heterogeneous ensemble combination search using genetic algorithm for class imbalanced data classification," PLoS ONE, vol. 11, no. 1, p. e0146116, 2016.
[35] P. Blonda, C. Tarantino, A. D'Addabbo, G. Satalino, and G. Pasquariello, "Combination of multiple classifiers by fuzzy integrals: An application to synthetic aperture radar (SAR) data," in Proc. IEEE Int. Fuzzy Syst. Conf., vol. 3, Dec. 2001, pp. 944–947.
[36] Q. Wang et al., "A novel ensemble method for imbalanced data learning: Bagging of extrapolation-SMOTE SVM," Comput. Intell. Neurosci., vol. 2017, pp. 1–11, Jan. 2017.
[37] H. Li, Y. Wang, H. Wang, and B. Zhou, "Multi-window based ensemble learning for classification of imbalanced streaming data," World Wide Web, vol. 20, no. 6, pp. 1507–1525, 2017.
[38] S. Ali, A. Majid, S. G. Javed, and M. Sattar, "Can-CSC-GBE: Developing cost-sensitive classifier with GentleBoost ensemble for breast cancer classification using protein amino acids and imbalanced data," Comput. Biol. Med., vol. 73, pp. 38–46, Jun. 2016.
[39] J. Kevric, S. Jukic, and A. Subasi, "An effective combining classifier approach using tree algorithms for network intrusion detection," Neural Comput. Appl., vol. 28, pp. 1051–1058, Dec. 2016.
[40] H. Noçairi, C. Gomes, M. Thomas, and G. Saporta, "Improving stacking methodology for combining classifiers; applications to cosmetic industry," Electron. J. Appl. Stat. Anal., vol. 9, no. 2, pp. 340–361, 2016.
[41] T.-M. Chan, Y. Li, C.-C. Chiau, J. Zhu, J. Jiang, and Y. Huo, "Imbalanced target prediction with pattern discovery on clinical data repositories," BMC Med. Inform. Decision Making, vol. 17, p. 47, Apr. 2017.
[42] Y. Zhang, J. Ren, and J. Jiang, "Combining MLC and SVM classifiers for learning based decision making: Analysis and evaluations," Comput. Intell. Neurosci., vol. 2015, pp. 1–8, May 2015.
[43] D. Zhou, Y. Tang, and W. Jiang, "A modified belief entropy in Dempster–Shafer framework," PLoS ONE, vol. 12, no. 5, p. e0176832, 2017.
[44] X. Deng, W. Jiang, and J. Zhang, "Zero-sum matrix game with payoffs of Dempster–Shafer belief structures and its applications on sensors," Sensors, vol. 17, no. 4, p. 922, 2017.
[45] W. Chen, H. R. Pourghasemi, and Z. Zhao, "A GIS-based comparative study of Dempster–Shafer, logistic regression and artificial neural network models for landslide susceptibility mapping," Geocarto Int., vol. 32, no. 4, pp. 367–385, 2017.
[46] A. M. Al-Abadi, "The application of Dempster–Shafer theory of evidence for assessing groundwater vulnerability at Galal Badra basin, Wasit governorate, east of Iraq," Appl. Water Sci., vol. 7, no. 4, pp. 1725–1740, 2017.
[47] J. Wang, K. Qiao, Z. Zhang, and F. Xiang, "A new conflict management method in Dempster–Shafer theory," Int. J. Distrib. Sensor Netw., vol. 13, no. 3, pp. 1–11, 2017.
[48] C.-D. Zheng, Y. Zhang, and Z. Wang, "Novel stability condition of stochastic fuzzy neural networks with Markovian jumping under impulsive perturbations," Int. J. Mach. Learn. Cybern., vol. 7, no. 5, pp. 795–803, 2016.
[49] P. Chen and D. Zhang, "Constructing support vector machines ensemble classification method for imbalanced datasets based on fuzzy integral," in Modern Advances in Applied Intelligence (Lecture Notes in Computer Science), vol. 7. Berlin, Germany: Springer, 2014, pp. 70–76.
[50] R. Tang, Y. Zhu, and G. Chen, "Imbalanced data classification method based on clustering and voting mechanism," in Proc. Int. Conf. Inf., 2013, pp. 667–674.
[51] R. Hidayati, K. Kanamori, L. Feng, and H. Ohwada, "Implementing majority voting rule to classify corporate value based on environmental efforts," in Data Mining and Big Data (Lecture Notes in Computer Science), vol. 211. Berlin, Germany: Springer, 2016, pp. 59–66.
[52] M. C. Çolaka, C. Çolakb, N. Erdila, and A. K. Arslan, "Investigating optimal number of cross validation on the prediction of postoperative atrial fibrillation by voting ensemble strategy," Turkiye Klinikleri J. Biostat., vol. 8, no. 1, pp. 30–35, 2016.
[53] A. Tamvakis, G. E. Tsekouras, A. Rigos, C. Kalloniatis, C. N. Anagnostopoulos, and G. Anastassopoulos, "A methodology to carry out voting classification tasks using a particle swarm optimization-based neuro-fuzzy competitive learning network," Evolving Syst., vol. 8, no. 1, pp. 49–69, 2017.
[54] S. Abbasi, A. Shahriari, and Y. Nemati, "Retracted: A novel voting mathematical rule classification for image recognition," in Computational Science and Its Applications—ICCSA (Lecture Notes in Computer Science), vol. 8. Berlin, Germany: Springer, 2016, pp. 257–270.
[55] T. Subbulakshmi and R. R. Raja, "An ensemble approach for sentiment classification: Voting for classes and against them," ICTACT J. Soft Comput., vol. 6, no. 4, pp. 1281–1286, 2016.
[56] B. Xia, H. Jiang, H. Liu, and D. Yi, "A novel hepatocellular carcinoma image classification method based on voting ranking random forests," Comput. Math. Methods Med., vol. 2016, Apr. 2016, Art. no. 2628463.
[57] A. Linden and P. R. Yarnold, "Using classification tree analysis to generate propensity score weights," J. Eval. Clin. Pract., vol. 23, no. 4, pp. 703–712, 2017.
[58] C. De Stefano, F. Fontanella, and A. S. di Freca, "A novel naive Bayes voting strategy for combining classifiers," in Proc. Int. Conf. Frontiers Handwriting Recognit., Sep. 2012, pp. 467–472.
[59] G. Rogova, "Combining the results of several neural network classifiers," in Classic Works of the Dempster-Shafer Theory of Belief Functions, vol. 219. Manchester, U.K.: IEEE, 2008, pp. 683–692.
[60] S. R. Kheradpisheh, A. Nowzari-Dalini, R. Ebrahimpour, and M. Ganjtabesh, "An evidence-based combining classifier for brain signal analysis," PLoS ONE, vol. 9, no. 1, p. e84341, 2014.
[61] Y. Bi, D. Bell, H. Wang, G. Guo, and K. Greer, "Combining multiple classifiers using Dempster's rule of combination for text categorization," in Proc. Int. Conf. Modeling Decisions Artif. Intell., 2004, pp. 127–138.
[62] S. Chandana, H. Leung, and K. Trpkov, "Staging of prostate cancer using automatic feature selection, sampling and Dempster–Shafer fusion," Cancer Inform., vol. 2009, no. 7, pp. 57–73, Feb. 2009.
[63] Y. Li and J. Jingping, "New algorithm for combining classifiers based on fuzzy integral and genetic algorithms," Proc. SPIE, vol. 4554, pp. 176–181, Sep. 2001.
[64] J. Svec and J. Hamilton, "Endogenous voting weights for elected representatives and redistricting," Constitutional Political Economy, vol. 26, no. 4, pp. 434–441, 2015.
[65] Q. Wu, Y. Ye, Y. Liu, and M. K. Ng, "SNP selection and classification of genome-wide SNP data using stratified sampling random forests," IEEE
[85] C. Tang, C. Hou, P. Wang, and Z. Song, "Salient object detection using color spatial distribution and minimum spanning tree weight," Multimed Tools Appl., vol. 75, no. 12, pp. 6963–6978, 2016.
[86] J. Chiquet, G. Rigaill, and P. Gutierrez, "Fast tree inference with weighted fusion penalties," J. Comput. Graph. Stat., vol. 26, no. 1, pp. 205–216, 2017.
[87] Z. Xu, C. Voichita, S. Drǎghici, and R. Romero, "Z-bag: A classification ensemble system with posterior probabilistic outputs," Comput. Intell., vol. 29, no. 2, pp. 310–330, 2013.
[88] C. E. DeSantis, F. Bray, J. Ferlay, J. Lortet-Tieulent, B. O. Anderson, and A. Jemal, "International variation in female breast cancer incidence
Trans. Nanobiosci., vol. 11, no. 3, pp. 216–226, Sep. 2012. and mortality rates,’’ Cancer Epidemiol. Biomarkers Prevention, vol. 24,
no. 10, pp. 1495–1506, 2015.
[66] Y. Ye, Q. Wu, J. Z. Huang, M. K. Ng, and X. Li, ‘‘Stratified sampling for
[89] O. Johnell and J. Kanis, ‘‘Epidemiology of osteoporotic fractures,’’ Osteo-
feature subspace selection in random forests for high dimensional data,’’
porosis Int., vol. 16, pp. S3–S7, Mar. 2005.
Pattern Recognit., vol. 46, no. 3, pp. 769–787, 2013.
[90] S. Janitza, C. Strobl, and A.-L. Boulesteix, ‘‘An AUC-based permutation
[67] J. Sun, G. Zhong, J. Dong, H. Saeeda, and Q. Zhang, ‘‘Cooperative profit
variable importance measure for random forests,’’ BMC Bioinf., vol. 14,
random forests with application in ocean front recognition,’’ IEEE Access,
pp. 119–130, Apr. 2013.
vol. 5, pp. 1398–1408, 2017.
[91] A. I. Marqués, V. García, and J. S. Sánchez, ‘‘On the suitability of
[68] W. Lin, Z. Wu, L. Lin, A. Wen, and J. Li, ‘‘An ensemble random resampling techniques for the class imbalance problem in credit scoring,’’
forest algorithm for insurance big data analysis,’’ IEEE Access, vol. 5, J. Oper. Res. Soc., vol. 64, no. 13, pp. 1060–1070, 2013.
pp. 16568–16575, 2017. [92] A. Cuzzocrea, S. L. Francis, and M. M. Gaber, ‘‘An information-theoretic
[69] L. Breiman, ‘‘Random forests,’’ Mach. Learn., vol. 45, no. 1, pp. 5–32, approach for setting the optimal number of decision trees in random
2001. forests,’’ in Proc. IEEE Int. Conf. Syst., Man, Cybern., vol. 177. Oct. 2013,
[70] M. E. H. Daho, N. Settouti, M. E. A. Lazouni, and M. E. A. Chikh, pp. 1013–1019.
‘‘Weighted vote for trees aggregation in random forest,’’ in Proc. Int. Conf. [93] P. Latinne, O. Debeir, and C. Decaestecker, ‘‘Limiting the number of trees
Multimedia Comput. Syst. (ICMCS), Apr. 2014, pp. 438–443. in random forests,’’ in Multiple Classifier Systems, vol. 2013. Manchester,
[71] T. Perry and M. Bader-El-Den, ‘‘Imbalanced classification using geneti- U.K.: IEEE, 2001, pp. 178–187.
cally optimized random forests,’’ in Proc. Companion Publication Annu. [94] R. K. Shahzad, M. Fatima, N. Lavesson, and M. Boldt, ‘‘Consensus deci-
Conf. Gen. Evol. Comput., 2015, vol. 15. no. 7, pp. 1453–1454. sion making in random forests,’’ in Machine Learning, Optimization, and
[72] C. A. Ronao and S.-B. Cho, ‘‘Random forests with weighted voting for Big Data (Lecture Notes in Computer Science). Berlin, Germany: Springer
anomalous query access detection in relational databases,’’ in Artificial 2015, pp. 347–358.
Intelligence and Soft Computing (Lecture Notes in Computer Science), [95] J. Hu, ‘‘Automated detection of driver fatigue based on AdaBoost classifier
vol. 2015. New York, NY, USA: ACM, 2015, pp. 36–48. with EEG signals,’’ Frontiers Comput. Neurosci., vol. 11, no. 72, pp. 1–10,
[73] S. A. Naghibi, K. Ahmadi, and A. Daneshi, ‘‘Application of support 2017.
vector machine, random forest, and genetic algorithm optimized random [96] J. Hu, ‘‘Automated detection of driver fatigue based on AdaBoost classifier
forest models in groundwater potential mapping,’’ Water Resour. Manage., with EEG signals,’’ Frontiers Comput. Neurosci., vol. 11, no. 8, pp. 1–10,
vol. 31, no. 9, pp. 2761–2775, 2017. 2017.
[74] A. M. Youssef, H. R. Pourghasemi, Z. S. Pourtaghi, and M. M. Al-Katheeri, [97] G. Biau, ‘‘Analysis of a random forests model,’’ J. Mach. Learn. Res.,
‘‘Landslide susceptibility mapping using random forest, boosted regression vol. 13, pp. 1063–1095, Apr. 2012.
tree, classification and regression tree, and general linear models and
comparison of their performance at Wadi Tayyah Basin, Asir Region, Saudi
Arabia,’’ Landslides, vol. 13, no. 5, pp. 839–856, 2016.
[75] S. A. Naghibi, H. R. Pourghasemi, and B. Dixon, ‘‘GIS-based groundwater
potential mapping using boosted regression tree, classification and regres-
sion tree, and random forest machine learning models in iran,’’ Environ.
Monitor. Assessment, vol. 188, no. 1, p. 44, 2016. MIN ZHU received the M.S. degree from the
[76] L. I. Kuncheva and J. J. Rodríguez, ‘‘A weighted voting framework for College of Computer Science and Technology,
classifiers ensembles,’’ Knowl. Inf. Syst., vol. 38, no. 2, pp. 259–275, 2014.
Guizhou University, Guiyang, China, in 2006. She
[77] Y. Bachrach, Y. Filmus, J. Oren, and Y. Zick, ‘‘Analyzing power in is currently working toward the Ph.D. degree at
weighted voting games with super-increasing weights,’’ in Proc. Int. Symp.
Zhejiang University, Hangzhou, China. She is cur-
Algorithmic Game Theory, 2012, pp. 169–181.
rently a Senior Engineer at the Guizhou Key Lab-
[78] T. Hayes, S. Usami, R. Jacobucci, and J. J. McArdle, ‘‘Using clas-
oratory of Agricultural Bioengineering, Guizhou
sification and regression trees (CART) and random forests to analyze
University. Her research interests include data
attrition: Results from two simulations,’’ Psychol. Aging, vol. 30, no. 4,
pp. 911–929, 2015. mining, pattern recognition, and classification.
[79] H. Ishwaran, U. B. Kogalur, E. H. Blackstone, and H. Lauer, ‘‘Random
survival forests,’’ Ann. Appl. Stat., vol. 2, no. 3, pp. 841–860, 2008.
[80] N. Tóth and B. Pataki, ‘‘Classification confidence weighted majority voting
using decision tree classifiers,’’ Int. J. Intell. Comput. Cybern., vol. 1, no. 2,
pp. 169–192, 2008.
[81] A. F. R. Rahman and M. C. Fairhurst, ‘‘Multiple classifier decision com-
bination strategies for character recognition: A review,’’ Document Anal.
Recognit., vol. 5, no. 4, pp. 166–194, 2003. JING XIA received the B.S. degree in biomedi-
[82] A. Arnaiz-González, J. F. Díez-Pastor, C. García-Osorio, and cal engineering from Zhejiang University, China,
J. J. Rodríguez, ‘‘Random feature weights for regression trees,’’ Progr. in 2013. She is currently working toward the Ph.D.
Artif. Intell., vol. 5, no. 2, pp. 91–103, 2016. degree in biomedical engineering at Zhejiang Uni-
[83] S. J. Winham, R. R. Freimuth, and J. M. Biernacka, ‘‘A weighted random versity, China. Her research interests focus on
forests approach to improve predictive performance,’’ Stat. Anal. Data intelligent medical diagnosis.
Mining, vol. 6, no. 6, pp. 496–505, 2013.
[84] M. Das and S. Bhattacharya, ‘‘A modified history based weighted average
voting with soft-dynamic threshold,’’ in Proc. Int. Conf. Adv. Comput. Eng.,
Jun. 2010, pp. 217–222.
XIAOQING JIN received the Ph.D. degree from the Department of Acupuncture, Zhejiang Chinese Medical University, China. She is currently the Head of the Department of Acupuncture, Zhejiang Hospital, China.

JING YAN received the M.S. degree from the Department of Cardiology, Zhejiang University, China. He is currently the Dean of Zhejiang Hospital, China.

GUOLONG CAI received the M.S. degree from the Department of Cardiology, Zhejiang University, China. He is currently a Physician at the Department of ICU, Zhejiang Hospital, China.

GANGMIN NING received the Dr.-Ing. degree from the Department of Biomedical Engineering, TU Ilmenau, Germany. He is currently a Professor at the Department of Biomedical Engineering, Zhejiang University, China.