You are on page 1of 27

Received June 26, 2020, accepted July 3, 2020, date of publication July 10, 2020, date of current version

July 22, 2020.


Digital Object Identifier 10.1109/ACCESS.2020.3008416

Variable Importance Analysis in Imbalanced


Datasets: A New Approach
ISMAEL AHRAZEM DFUF 1 , JOAQUÍN FORTE PÉREZ-MINAYO2 , JOSÉ MANUEL MIRA MCWILLIAMS2 ,
AND CAMINO GONZÁLEZ FERNÁNDEZ2
1 ETSIT, Universidad Politécnica de Madrid, 28040 Madrid, Spain
2 ETSII, Universidad Politécnica de Madrid, 28006 Madrid, Spain
Corresponding author: Ismael Ahrazem Dfuf (ismael.ahrazem.dfuf@alumnos.upm.es)

ABSTRACT Decision-making using machine learning requires a deep understanding of the model under
analysis. Variable importance analysis provides the tools to assess the importance of input variables
when dealing with complex interactions, making the machine learning model more interpretable and
computationally more efficient. In classification problems with imbalanced datasets, this task is even more
challenging. In this article, we present two variable importance techniques, a nonparametric solution, called
mh-χ 2 , and a parametric method based on Global Sensitivity Analysis. The mh-χ 2 employs a multivariate
continuous response framework to deal with the multiclass classification problem. Based on the permutation
importance framework, the proposed mh-χ 2 algorithm captures the dissimilarities between the distribution
of misclassification errors generated by the base learner, Conditional Inference Tree, before and after
permuting the values of the input variable under analysis. The GSA solution is based on the Covariance
decomposition methodology for multivariate output models. Both solutions will be assessed in a comparative
study of several Random Forest-based techniques with emphasis in the multiclass classification problem with
different imbalanced scenarios. We apply the proposed techniques in two real application cases in order first,
to quantify the importance of the 35 companies listed in the Spanish market index IBEX35 on the economic,
political and social uncertainties reflected in economic newspapers in Spain during the first quadrimester
of 2020 due to the COVID-19 pandemic and second, to assess the impact of energy factors on the occurrence
of spike prices on the Spanish electricity market.

INDEX TERMS Covid-19 pandemic, electricity market, global sensitivity analysis, multiclass classification
problem, multivariate response scenario, variable importance analysis.

I. INTRODUCTION From a machine learning perspective [2], this analysis is


Variable importance analysis (VIA) has gained attention in performed to identify the predictive power of each individual
many practical applications [1] due to the complexity of inter- input variable in regression or classification models. Global
actions among variables on large datasets. Nowadays, in any Sensitivity Analysis (GSA) [3], on the other hand, aims to
classification or regression problem, VIA is a crucial task quantify the output variability due to the variance of each
whose primary goal is to improve the model interpretability, input variable. In both cases, as a result, the analysis permits
reduce the computational cost, optimize the data storage and us to rank and select the most influential input variables and
ultimately provide a smaller number of relevant input vari- drop from the model those that are irrelevant. Regardless of
ables without losing explanatory prediction power. Statisti- the type of VIA, we need to determine the relationship that
cally speaking, VIA refers to a set of techniques that allow maps the input variables on the output from which we can
us to assess numerically the impact or relative importance perform the study. This relationship can be characterized by
of input variables (factors, features), described by an input a parametric model, Y = f (X ), or a nonparametric model that
vector X ∈ [x1 , . . . , xp ], on a model output variable Y . needs to be fit, for instance, using some machine learning
technique (Neural Network, Support Vector Machine, etc).
The associate editor coordinating the review of this manuscript and Depending on the nature of the output variable, we deal with
approving it for publication was Yongming Li . a classification problem if the output variable is categorical

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
127404 VOLUME 8, 2020
I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

(binary or multiclass) or a regression problem in the case of a categorized into Boosting and Bagging methods. In Boosting,
continuous output variable (univariate (scalar) or multivariate the set of base learners are trained sequentially using random
(vector output)). samples. These samples are selected with replacement of
In the case of a classification machine learning algorithm, over-weighted data in which misclassified observations are
we aim to predict (after a training process), with the highest assigned higher weights and then passed to the next base
accuracy, the class of a new observation Xnew . However, learner. The most popular Boosting methods are, AdaBoost,
the accuracy of these algorithms can be influenced by the XGBoost or Gradient Boost. On the other hand, in Bagging
imbalanced nature of the data [4], [5], where observations of (Bootstrap aggregating) methods, the training of each base
one class outnumber the other class(es) (imbalance ratio of at learner is performed independently. Popular methods are
least 5:1 for the binary case [6]). This skewed distribution C4.5, CART, and most recently developed, the Conditional
of classes makes the goal of predicting very challenging Inference Trees (CIT) algorithm [16]. The CIT technique will
since the cost of misclassification of a minority class can be be used in this work as a base learner for the proposed variable
more expensive than misclassifying an observation from the importance algorithm when applied to the imbalanced multi-
majority class (see for instance medical diagnosis cases [7]). class problem due to its capability of handling multivariate
In order to improve the prediction accuracy and lower the response models.
computation expense, a previous identification and quantifi- To overcome the imbalanced classification problem, sev-
cation of the most relevant input variables of the model is eral solutions based on bagging methods were proposed. [17]
always highly advised. This challenge, increases even more explore the idea of modeling the Random Forest (RF) algo-
when coping with multiclass problems with an imbalanced rithm by incorporating a cost-sensitive learning as well as a
dataset. resampling technique in order to turn the imbalanced binary
To address the imbalanced problem, several approaches dataset into a more balanced one. Reference [18] modified
have been considered, which can be grouped into four the splitting criterion of the C4.5 algorithm by measuring
categories: the probability divergence between the joint probability of
1) Preprocessing strategies: Resampling techniques features, the imbalanced binary target class Y (p(x,y)) and
and/or variable importance analysis. the marginal distributions (p(x) and p(y)). Reference [19]
2) Cost-sensitive learning methods. followed a similar approach using the Hellinger distance
3) Adaptation of machine learning techniques. in the splitting phase and various sampling methods. When
4) Hybrid solutions: Combination of the previous three dealing not only with high-dimensional settings (p  n), but
approaches. also Bigdata scenarios, [20] used the MapReduce framework
Preprocessing strategies based on resampling techniques to first partition the original dataset and then perform sepa-
can be divided into two subcategories, oversampling (i.e. rate analysis to the imbalanced binary classification problem
process of duplicating minority samples, see SMOTE or combining various sampling techniques along with the RF
synthetic minority oversampling technique [8]) or undersam- algorithm. The Roughly Balanced Bagging (RBBag) tech-
pling methods (i.e. removing a subset of majority classes, nique [21] was applied in combination with a random under-
see RUS or Random Under-Sampling [9]). A comparison of sampling procedure by [22] showing a better performance
sampling methods applied to the imbalanced classification related to over-sampling procedures. On the other hand,
problem can be found in [10]–[12]. The strategy behind boosting methods were also adapted to cope with the imbal-
the cost-sensitive learning is to adjust the machine learning anced problem. To overcome the bias behavior toward major-
technique in order to minimize the misclassification cost. Dif- ity classes seen with Gradient Boosting when classifying rare
ferent solutions implementing cost-sensitive learning tech- events (only for the binary imbalanced scenario), [23] pre-
niques can be found in [6], [13]. Researchers also adapted sented three different undersampling techniques combined
parametric models [14] (i.e. mainly Logistic Regression and with a modified boosting method that consisted of an early
Support Vector Machine (SVM) algorithms) and nonpara- stopping and shrinkage process. Reference [24] extended the
metric methods (Decision Tree algorithms), to find solutions Adaptive Boosting algorithm (AdaBoost) [25], to tackle the
to the imbalanced multiclass classification problem. When imbalanced multiclass scenario. They developed a new algo-
applied as variable importance techniques, SVM techniques rithm, AdaBoost.M1, adding a cost-sensitive learning method
have been used extensively to overcome the imbalance prob- to the boosting process. Reference [26] first upgraded their
lem, however, Ensemble Methods (EM) based on Decision initial binary class algorithm, AdaBoost.NC [27] to handle
Trees techniques have shown better performance compared the multiclass scenario. Their study showed that the class
to parametric-based classifiers [15]. decomposition strategy (i.e. converting the multiclass prob-
EM refers to the collection of techniques in which a group lem into binary problem, known as one-against-all, OAA)
of models, belonging to the same algorithm (base learner), did not perform well, showing the effectiveness of the simul-
are trained together in order to obtain a better performance taneous processing when dealing with the multiclass prob-
compared to the performance achieved with a single algo- lem. They improved the accuracy of classification through a
rithm. The technique helps us to reduce the variance and combination of different over-sampling methods along with
bias of the single learner, improving its accuracy. EM are the upgraded AdaBoost.NC algorithm. Hybrid solutions try

VOLUME 8, 2020 127405


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

to combine the different approaches aforementioned to deal the Euclidean distance. This metric has been extensively
with the imbalanced issue and improve the classification applied in clustering problems for outlier detection. In our
accuracy. Researchers are referred to [28]–[35] for detailed case, we leverage the properties (sensitivity to class distribu-
descriptions of these hybrid solutions. It is worth mentioning tion) of the Mahalanobis distance to analyze the supervised
that the majority of these solutions belong to the family problem with an imbalanced dataset. Additionally, we sup-
of ensemble-based methods showing an overall advantage port our importance analysis by obtaining a nonparametric
of bagging methods combined with sampling methods over test statistic based on the null hypothesis of equal multivariate
boosting methods and their extensions. cumulative density functions. The parametric alternative is
Whereas many performance metrics (Precision, based on the covariance decomposition multioutput Global
F-measures, Recall, etc.) are used to assess the accuracy of Sensitivity Analysis technique applied to the fitted multi-
classifiers, only the Area Under the Curve (AUC) [36] and output model Yc = fc (X ). The proposed methods will be
G-Mean [37] have shown, due to their robustness, to be compared to existing variable importance techniques based
adequate in the imbalanced case [38]. In this work, we apply on RF algorithms since they are designed for the binary as
the extension of the AUC, the Volume Under the Surface well as the multiclass scenario as opposed to the rest of the
(VUS) [39] to analyze and visualize the importance of fea- models (SVM, NN or Logistic Regression).
tures in multiclass classification problems. This paper is organized as follows: Section 2 presents
Another effective approach to deal with the imbalanced the state of the art related to variable importance meth-
classification problem is to first run a preprocessing step ods with emphasis on the imbalanced multiclass problem.
based on variable importance analysis in order to remove non- Section 3 describes the theoretical concepts and proposed
influential variables. Removing irrelevant features lowers the methodologies. Simulated examples consisting of nonlin-
risk of misclassification, avoids overfitting, bias towards the ear functions with different imbalanced ratios are tested in
majority class and the possibility of considering the minority Section 4. In Section 4, we also apply our proposed method-
class(es) as noise [40]. The variable importance analysis ologies to two real-world problems. In the first case, we inves-
gets more challenging when dealing with the multiclass sce- tigate the impact of the 35 companies listed in the Spanish
nario since the class decomposition strategy overall underper- index IBEX35 on a newly proposed uncertainty index created
forms [26] when compared to the simultaneous processing of using the Natural Language Processing Word2Vec algorithm.
all classes. The index, called IdW2V, captures the political, economic
In this paper, we investigate two VIA approaches to and social uncertainty in Spain due to the COVID-19 pan-
deal with imbalance classification problems. The presented demic during the first quadrimester of 2020. In the second
methodologies can be implemented using either a parametric real application, we aim to identify and quantify the impact
or a nonparametric framework. The key idea is to convert of several factors on extreme prices in the Spanish electricity
a multiclass classification problem into a continuous multi- market. Section 5 is dedicated to discussing the contributions
output regression problem and analyze through the proposed and limitations of the proposed techniques. Final conclusions
importance techniques the influence of the input variables. are presented in Section 6.
The nonparametric approach applies the permutation impor-
tance technique to measure the influence of each predictor. II. RELATED WORK
The changes seen on the predictive output vector (predicted Several Variable Importance (VI) methods have been pro-
probabilities, Yc = fc (X ), Yc ∈ [Y1 , Y2 , . . . , Yl ], posed to tackle the imbalanced multiclass classification prob-
P l
i=1 Yi = 1) are summarized in a distribution error matrix lem. These methods can be grouped into 3 categories [2]:
that is computed as the row-wise difference between the Filter, wrapper and embedded methods. Filter methods are
observed and the predicted matrices. The resulting matrix is model-independent techniques (i.e. no need to have a map-
then vectorize through the computation of the Mahanalobis ping function that maps features to classes). They use sta-
distance between its rows, which leads to a vector of distri- tistical measures (χ 2 -stastistic, Fisher Criterion Score or
bution of errors. Compared to other importance techniques Information-theory based measures) to capture the level of
where only a summary of the available information of errors dependency of the classes on a group of features. Although,
is used, our approach considers all the information related they underperform their competition when dealing with non-
to errors generated by the predictive model. Working with linearities between the features [42], they are computation-
probability distributions has shown to be more robust and ally less expensive and hence suitable for high-dimensional
suitable in skewed class distribution settings since we work settings [43]. Wrapper and embedded methods, on the other
with the entire information related to the class membership. hand, use classification models along with search methods.
Finally, the dissimilarities between the Mahalanobis distance They measure the importance of each feature by assessing its
vectors before and after permuting the predictor under assess- predictive power. Despite their ability to capture interactions
ment are quantified using the χ 2 -distance (symmetric met- among features, their tendency to overfit and their computa-
ric). Although a similar solution was presented in [41], in this tional cost make them less attractive models.
work we improve the algorithm and adapt it to the imbalanced Mutual Information (MI) was used in [44], [45] as a score
classification problem by using the Mahalanobis instead of function to assess both the impact of the single feature as

127406 VOLUME 8, 2020


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

well as the feature interactions on the multiclass output. its competitors. Reference [59] used the permutation frame-
Reference [46] presented an entropy-based solution to reduce work technique [60] to assess the importance of vari-
the computation costs associated to filter methods in multi- ables (features) in imbalanced scenarios. In their work,
class problems. Their key idea consisted of calculating the the importance of each input variable is measured by com-
dependencies between each feature and a preselected subset puting the difference between the AUC before and after ran-
of classes. Binary Relevance and Label Powerset approaches domly permuting the values of each input variable at each
were considered by [47] to transform the multiclass prob- base learner (CIT algorithm).
lem into a single class problem, to subsequently apply the The feature selection method for imbalanced binary classi-
Relief and Information Gain (IG) as feature selection meth- fication problems proposed by [61] uses a sequential forward
ods to the transformed problem. Reference [48] proposed an technique where features are only added to the classifier
Iterative Ensemble Feature Selection method (IEFS). First, based on clustering and outlier detection if the mean of sen-
the OAA strategy1 was applied to convert the multiclass sitivity and specificity of the classifier increase. The feature
data set into two-class sub-datasets. Then, iteratively for 0 T 0 selection technique, asBagging-FSS [62], aims to remove
times, the Fast-correlation based filter method was applied irrelevant and redundant features by applying hierarchical
to each subdata set, which was previously sampled using the clustering that collect similar features using the Pearson cor-
SMOTE method. More recent feature selection filter-based relation coefficient across two features. A more recent sur-
methods that dealt with the multiclass domain are discussed vey of preprocessing techniques, including resampling and
in [49], [50]. feature selection methods for imbalanced binary biomedical
Parametric models such as SVM were also used datasets are discussed in [63]. A user guide of resampling
as VI methods. For instance, [51] extended the work of [52] techniques as well as feature selection methods according to
by proposing an embedded VI method based on SVM and the type of dataset is also provided. An overview of feature
a recursive feature elimination procedure to deal with the selection techniques can be found in [64].
multiclass scenario. Reference [53] used a similar strategy Few papers have proposed variable selection solutions for
applying a backward elimination technique with SVM as classification problems based on Global Sensitivity Analysis.
a classifier to cope with high dimensional settings. Refer- Reference [65] proposed the FAST (Fourier Amplitud Sen-
ence [54] adapted the SVM algorithm with a cost-sensitive sitivity Test) combined with a trained Feedforward Neural
learning method using the Quasi-Newton-based optimization Network (FNN) as a new feature selection method for clas-
scheme. To overcome the loss of class information when the sification problems. The total effect2 of each input variable
OAA strategy is applied, [55] proposed a normalization pro- (Xi ) on the univariate response Yl = P(cl |X ) (where each
cedure weighing the output of each two-class SVM classifier response represents a probability of a specific class when
with a reliability measure. an observation X is evaluated using the predictive model,
When dealing with skewed class distributions, probabilis- FNN), is calculated as the sum of the individual total effect
tic approaches have shown better performance in imbalanced with respect to all output probability values. The approach
scenarios. Reference [56] presented a VI method based on followed by [66] trains a Neural Network and finds the Sobol
the computation of the Hellinger distance between the two- decomposition of the fitted functions. The sensitivity index
class distributions using the CART and SVM algorithms as attached to each input variable is determined by the difference
base learners. Reference [57] followed a similar solution among the activation functions that described the NN sys-
and proposed the Density-Based Feature Selection algorithm tem. Both solutions fail to consider the possible relationships
(DBFS). The algorithm estimates the probability density of among the probability distributions of the classes, are limited
the analyzed feature in each class PDF(Ci ) and then computes to scenarios with independent continuous variables (assump-
the overlapping of this probability with respect to the rest of tions on the input space) and need a known mathematical
the classes. Reference [58] compared three alternatives for expression that maps the input on the output(s).
the multiclass scenario based on the Random Forest algo-
rithm. In the first two, the Binary Relevance (BRRF) and A. RANDOM FOREST-BASED VARIABLE IMPORTANCE
Label Powerset (RFRF) techniques were first applied to con- TECHNIQUES
vert the multiclass dataset into various single-class datasets, In this subsection, we present several proposed techniques for
while in the third alternative the Random Forest Predictive VIA based on Random Forest that will be compared to our
Clustering Tree (RFPCT) was applied to handle the multi- proposed methodologies.
class data simultaneously. In their study, BRRF outperformed
1) AUC EXTENSION
1 A widely used strategy is the so-called class decomposition technique Reference [59] used the permutation approach [60] to assess
or One-Against-All (OAA) technique. The OAA solution consists of trans- the importance of input variables for the binary output
forming a multiclass dataset with l classes into l different binary sub-dataset.
However, when applying this strategy, we may face a possible loss of class model when considering different imbalanced scenarios.
information, which could potentially lead to an inadequate learning of the
algorithm. For this reason, considering all classes simultaneously could be a
more efficient procedure in the variable importance analysis. 2 Total effect = Main (single) + Interactions effects

VOLUME 8, 2020 127407


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

Based on the same framework,3 we apply the extended ver- the remaining variables. The selected set of variables will be
sion of AUC, the Volume Under the Surface (VUS) for the the one that leads to the smallest error rate. The proportion
3-class case and the Hypervolume Under the Surface (HUM) of variables to eliminate is an arbitrary parameter on this
for the n-class case. These metrics have shown to be insen- method, and does not depend on the data.
sitive to changes in class distributions and hence suitable in This procedure provides the smallest possible set of vari-
imbalance scenarios. ables that can still lead to good predictive performance.
The level of relevance of each predictor (VIM (Xi ) is mea-
sured as the difference between the performance metric of 3) BORUTA
each tree in the forest (ntrees) when predicting the Out of Based on the permutation importance technique, the pro-
Bag (OOB) observations before (Mti , in our case VUS or cedure ([70] and R Package [71]) adds randomness to the
HUM) and after (MPti ) permuting the values of the predictor dataset and evaluates, in a recursive way, the new data using
under analysis (Xi ). the Random Forest algorithm. At each step irrelevant input
ntrees variables are removed.
1 X
The Boruta algorithm duplicates the original dataset, and
VIM (Xi ) = · (MPti − Mti ) (1)
ntrees shuffles the values for each feature. The decision of keeping
t=1
or discarding a given variable will be based on the probability
Evaluating an OOB observation XOOB by using any classi-
that this feature is ranked higher or lower than the added
results in a vector of probabilities, Yc ∈ [Y1 , Y2 , . . . , Yl ]
fier P
random variables in terms of VI. The threshold that will lead
and li=1 Yi = 1, where each element of the vector represents
this decision will be the top VI value of all random variables
the level of membership to each class of the observation. If no
added to the original model. That is, the algorithm will check
assumption is made about the distribution of the predicted
for each of the real features if they have a higher importance
probabilities, an unbiased nonparametric estimator for VUS
than the defined threshold. Only if this requirement is met,
is defined as (3-class case) [67]:
will they be considered as significant in the model.
n1 X n3
n2 X
1 X
VUS = I (Y1i , Y2j , Y3k ) (2) 4) VSURF
n1 n2 n3
i=1 j=1 k=1
Consisting of a two-stage strategy, VSURF technique ( [72]
where, and R Package [73]) first applies the permutation technique to
eliminate the irrelevant input variables whose importance is
if Y1i < Y2j < Y3k

1


1
 lower than a predefined threshold. The model ranks the vari-

 if Y1i = Y2j < Y3k or Y1i < Y2j = Y3k ables by sorting the Variable Importance in descending order.
I (Y1i , Y2j , Y3k ) = 21 The set of eliminated variables will be based on this sorted

 if Y1i = Y2j = Y3k order. The calculation of the VI threshold is based on Variable
6



0 otherwise Importance’s standard deviations, and only those variables
(3) whose importance exceeds this value will be considered in
the model.
Analogously, for the n-class case, the non-parametric Second, a variable selection step is performed. A sequence
HUM estimate is defined as: of RF models are built with the variables that produce the
1
n1
X nn
X smallest OOB error. As result of this step, two sets of vari-
HUM = ... I (Y1i , . . . , Ynk ) (4) ables are produced. The first set counts for the variables that
n1 . . . nn
i=1 k=1 are highly related to the response, including the redundant
ones. The second set is formed by variables with low redun-
2) VarSelRF dancy and the highest prediction accuracy.
Based on an iterative elimination algorithm, the VarSelRF
procedure ( [68] and R Package [69]) builds recursively each III. THEORETICAL CONCEPTS AND PROPOSED
new forest (using the RF algorithm) with variables that pro- METHODOLOGIES
duce the smallest error rate, removing the irrelevant variables In this section we describe the theoretical concepts from
in the process. which we build our proposed variable importance techniques
To select those variables that will be included in the model, for the imbalanced multiclass scenario.
this algorithm builds a new model at each iteration, discarding The importance of input variables implicated in a input-
the variables associated with the smallest variable impor- output model can be generally assessed using different
tance. At each step, 20% of the variables with the smallest approaches, within which Machine Learning (ML) tech-
importance will be eliminated, and a new model is built with niques and Global Sensitivity Analysis (GSA) are the most
popular. By applying ML techniques, we aim to identify
3 This technique will be applied in this work as a visualization method the most important variables in terms of their predictive
(minor contribution) in order to show the difference between the volume power, while GSA provides the set of variables that contribute
under the surface when the input variable is permuted [59]. the most to the output variability. In both cases, a set of

127408 VOLUME 8, 2020


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

FIGURE 1. Schematic design of the proposed methodologies for variable importance analysis in multiclass classification problems based on mh-χ 2 and
multiresponse GSA. (*) Transform the multiclass response with l labels into a continuous output vector with l elements, which represent the probabilities
of membership to each class. VI(Xi ) variable importance of input variable Xi . Si and STi are the main and total effects of Xi . PI: Permutation Importance.
C.I.T.: Conditional Inference Tree. D.T.: Decision Tree.

influential input variables is extracted and ranked. A detailed imbalanced multiclass classification problems using a con-
comparison and connection between these techniques for the tinuous multivariate output framework.
continuous univariate response model can be found in [74]. A schematic design of the proposed methodologies for
In this paper, we apply a continuous multivariate out- VIA in multiclass classification problems are displayed
put framework to analyze the importance of input variables in Figure 1.
involved in a multiclasss classification output model with
imbalanced class distribution (Figure 1). Two solutions are A. GSA: THEORETICAL CONCEPTS
investigated. From a ML perspective, the importance analysis Given a multioutput computational model Yc = fc (X ) (fc :
is based on the permutation importance method (Figure 2) Rp − > Rl p, l are integers) where X ∈ [X1 , X2 , . . . , Xp ]
along with a φ-divergence measure. The covariance decom- are assumed independent input variables defined on
position approach is proposed as a solution for the GSA some probability space (, P), Yc a multioutput vector
methodology. To the best of our knowledge, this is the Yc ∈ [Y1 , . . . , Yl ] with a semi positive definite covariance
first work analyzing the importance of input variables in matrix C = Cov(Y1 , . . . , Yl ) and Yr = gr (X1 , . . . , Xp ) a

VOLUME 8, 2020 127409


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

representing the covariance of the multioutput response (6)


as a sum of partial covariance matrices (Equ. 7).
6 = Cv + C∼v + Cv,∼v (7)
In order to find the Sobol indices, the previously com-
puted covariance matrices (Equ. 7) are projected into a scalar
value by multiplying the covariance matrix by the identity
matrix (I ) and taking the trace:
Tr(I 6) = Tr(ICv ) + Tr(IC∼v ) + Tr(ICv,∼v ) (8)
The generalized Sobol indices are:
FIGURE 2. Permutation importance framework for variable importance
analysis. Tr(ICv )
Sv (I ; fc ) = ;
Tr(I 6)
deterministic function. To assess the impact of the uncer- Tr(IC∼v )
tainty introduced by each input variable on a multioutput S∼v (I ; fc ) = ;
Tr(I 6)
model, GSA techniques can be divided into three types of Tr(ICv,∼v )
methods: the output decomposition method, the covariance Sv,∼v (I ; fc ) = (9)
Tr(I 6)
decomposition method (both moment dependent) and a prob-
abilistic approach (or moment independent method). A brief where, Sv + S∼v + Sv,∼v = 1.
description of these techniques is reviewed in the following Similarly to the Sobol indices for the univariate output
subsections. Special attention is given to the covariance model (Equ. 5), the Sobol indices Sv in the multioutput case,
decomposition approach since this will be used as a method- can also be estimated using the Monte-Carlo pick-freeze
ology proposed to analyze the importance of input variables method described in [77].
in a multiclass classification problem with an imbalanced
dataset. 2) OUTPUT DECOMPOSITION METHOD
The output decomposition approach [78] can be described
1) COVARIANCE DECOMPOSITION METHOD in two steps. First, an orthogonal decomposition is per-
Through the Hoeffding decomposition method [75], we can formed on the multioutput response. Second, a sensitivity
partition the variance of a univariate output into partial vari- analysis (Sobol decomposition [79], FAST [80]) to each
ances. Each partial variance measures the uncertainty of the scalar output is applied. For instance, for the multioutput
model output due to each input variable. Under the assump- scenario, [81] compared three alternatives combining the
tion of independent input variables, the importance of each Principal Component Analysis technique with different GSA
input variable is assessed by computing the so-called Sobol methods (Sobol decomposition, extended FAST and Frac-
indices [76]. As a result, we can compute the Single effect tional Factorial design).
Si or, contribution of Xi to the output variability, the Total In a multivariate response problem, if the elements of
effect STi or join contribution of single (main) plus interaction the output vector (Yi ∈ i = 1, . . . , l) are found to be
effects of Xi and the Interaction effects (second order effect) independent, we can analyze the problem of variable impor-
Sij or, contribution of Xi and Xj to the output variability tance by considering l independent univariate response global
(Equ. 5). sensitivity analysis and then averaging the computed inde-
V [E(Y |Xi ) V [E(Y |X∼i ) ces (Si , STi ). In cases where the independence condition is
Si = ; STi = 1 − (5) not met, the problem must be solved by applying a multi-
V (Y ) V (Y )
variate output GSA in order to consider the dependencies
Reference [77] extended the analysis to the multivariate (i.e. correlations) between the output variables.
output case and generalized the computation of the Sobol
indices. 3) A MOMENT-FREE APPROACH
For the multivariate output, let the Hoeffding decomposi- When the variance is not an accurate measure of uncertainty
tion of fc (X1 , X2 , . . . , Xp ) be: and/or the output distribution is highly skewed, the use of
fc (X1 , X2 , . . . , Xp ) = k + fv (Xv )+f∼v (X∼v )+fv,∼v (Xv , X∼v ) density-based methods to compute the sensitivity indices may
be more adequate since we work with the entire distribution
(6)
of the response and not only a summary of it. Also, moment-
where k ∈ Rl , v a non-empty r-subset of [1, . . . , p], Xv = free approaches based on φ-divergence measures can capture
(Xi , i ∈ v) and the complement X∼v = (Xi , i ∈ [1, . . . , p] \ the dissimilarities between the output distributions and take
v), fv : Rr → Rl , f∼v : Rp−r → Rl and fv,∼v : Rp → Rl . into account the interactions between the response variables.
If we compute the covariance matrix of each side of Equ. 6, The first moment-free approach used as importance metric
we have the covariance decomposition of the model response, for VIA was introduced by [82]. For the univariate response

127410 VOLUME 8, 2020


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

problem, the single effect Si is computed as the average area and total sensitivity indices. Given the fitted model
under the conditional Probability Density Function (PDF) Yc = fc (X ), we obtain two new model outputs as
when Xi is fixed fY |Xi (y) and the unconditional PDF fY (y). follows:
Reference [83] extended the previous method to the multi- a) Yc (i) = fc (Xi0 , X(i) ), where Xi0 is an independent
variate response scenario. copy of Xi (i.e. freeze all input variables except
1
Z ∞ Xi and sample Xi ). Yc (i) is the model output when
Si = EXi
2
|fY (y) − fY |Xi (y)|dy (10) evaluating all N observations (Xi0 , X(i) ) on fc (·).
−∞ (i) (i) (i)
Yc (i) = [Yc1 , Yc2 , . . . , Ycl ] is a matrix of size
Due to the difficulty of approximating the PDF func- (i)
Nxl. Each element Yck (k ∈ [1, ..l]) of the matrix
tion, [84] employed instead the Cumulative Density Func- (i)
Yc is a vector of size Nx1 that represents the
tion (CDF) and computed the Kolmogorov Smirnov (KS)
vector of probabilities of class k.
distance between the CDFs (FY |Xi (y) and FY (y)). In this case,
b) Yc i = fc (Xi , X(i)
0 ), where X 0 is an independent
(i)
the sensitivity indexes are computed as the distance metric
copy of X(i) (i.e. freeze Xi and sample the rest of
between the corresponding conditional and unconditional
the input variables). Yc i is the model output when
CDFs.
evaluating all N observations (Xi , X(i) 0 ) on f (·).
c
Si = EXi maxy |FY (y) − FY |Xi (y)| (11) Yc = [Yc1 , Yc2 , . . . , Ycl ] is a matrix of size Nxl.
i i i i

Each element Yck i (k ∈ [1, ..l]) of the matrix Y i


c
Similarly, [85] applied the CDF-based solution to measure is a vector of size Nx1 that represents the vector
the contribution of each input variable in multivariate output of probabilities of class k.
models while also taking into consideration the correlations
The estimator of STi (similarly for Si using Yc i ) is
between the outputs.
defined as [77]:
Z 1
1
ηi = |kV (v) − kV |Xi |dv; Si = EXi [ηi ] (12) Pl PN (i) 1 PN Yck,j+Yck,j 2
(i)

0 2 k=1 ( j=1 Yck,j · Yck,j − N ( j=1 2 ) )


STi =
where kV (v) = P(V ≤ FY (Y1 , . . . , Yl )) represents the Pl PN Yc2 +Yck,j
(i)2
PN (i)
Yck,j+Yck,j
Probability Integral Transformation (PIT) and FY the CDF k=1 ( j=1
k,j
2 )− N1 ( j=1 2 )2 )
(kV |Xi is PIT when Xi are fixed). (13)
Although the sensitivity indices based on computing den-
Apart from time efficiency, using the covariance decompo-
sity dissimilarities take into account the correlation between
sition method permits us to break down the contribution made
the outputs, their computational cost makes them unfeasible
by each input variable to the variability of the output variable.
and limited to non-expensive models.
This is achieved by obtaining the single (main) Si , total STi
and interaction effects as STi − Si of each variable. Addition-
B. PROPOSED VIT BASED ON THE COVARIANCE
ally, in contrast to [65], we can consider the possible relation-
DECOMPOSITION GSA TECHNIQUE
ships among the responses in the analysis. On the other hand,
The methodology for VIA in multiclass classification prob-
the use of the GSA technique is limited to problems where the
lems, based on the covariance decomposition multioutput
assumptions of independence between input variables is met
GSA technique, consists of the following steps:
(although an orthogonalization procedure could be applied to
1) Transform the multiclass output dataset [Y ; X ] (Y ∈ meet the independence criterion.), the distributions of input
[1, . . . , l]) into a multioutput regression dataset [Yc ; X ] variables is known and the approximated model Yc = fc (X )
(Yc ∈ [Yc1 , . . . , Ycl ], li=1 Yci = 1, Yci ≥ 0) by:
P
is well fitted.
a) Fit the multiclass output dataset [Y ; X ] into a
multiclass classification algorithm (i.e. SVM, NN C. PROPOSED VIT BASED ON A MACHINE LEARNING
or RF). APPROACH: THE mh-χ 2 ALGORITHM
b) Compute the predicted probabilities P(cl |X ) The method for VIA for imbalanced multiclass classification
using the previous classification algorithm and problems is built using the CIT algorithm [16] as a base
the observations X . Result: [Yc ; X ], where learner, the permutation importance framework [87] and a
Yc = [P(c1 |X ), P(c2 |X ), . . . , P(cl |X )]. Each col- φ-divergence measure. With Decision Tree techniques (DT)
umn of Yc represents the probability of a class (for we do not need to assume independence among input vari-
instance, P(cl |X = xi ) is the level of membership ables and no previous variable prescaling is needed. They
of a realization xi ∈ X to each class l). have shown high performance in datasets with highly nonlin-
2) Find a multivariate output regression model Yc = fc (X ) ear interactions in high dimensional problems. The permuta-
for the compositional dataset [86] that describes the tion importance technique measures the association between
relationship between Yc and X . the output and the permuted input variable. If some asso-
3) Apply the covariance decomposition GSA method for ciation exists between the model output Y and the Xi , per-
the multioutput response scenario to compute the single muting Xi values while keeping the rest of the variables

VOLUME 8, 2020 127411


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

fixed, will break the relationship, measuring this way the


relative importance of the input variable under analysis. The
φ-divergence measure is suitable for highly skewed output
distributions since it allows us to capture the dissimilarities
between probability distributions.
In our approach, we study the relevance of input variables
by analyzing the dissimilarities between the distribution of
errors made by the CIT model before and after permuting
the variable under assessment. Working with the entire dis-
tribution of errors instead of only a summary of the errors
(G-mean, AUC) permits us to work with all the information
generated by the predictive model. This information enables
us to capture the influence of input variables when dealing
with skewed distributions which characterize the imbalance
problem.
Below, we present the notation and some definitions used
in the proposed algorithm.
1) Transform the multiclass output dataset [Y ; X ] (Y ∈
[1, . . . , l]) into a multioutput Pregression dataset [Yc ; X ]
(Yc ∈ [Y1 , . . . , Yl ] ∈ Rl , li=1 Yi = 1, Yi ≥ 0) and
X = [X1 , . . . , Xp ] ∈ Rp using a multiclass classifica-
tion algorithm.
2) From the [Yc ; X ] dataset, we sample the In of Bag
(IOB) (i.e. [YcIOB ; X IOB ]) and Out of Bag (OOB) (i.e.
[YcOOB ; X OOB ]) datasets. Fit the CIT predictive model
to the IOB dataset (i.e. YcIOB = fCIT (X IOB ), fCIT : Rp
-> Rl ).
3) Compute the ŶBP OOB = f OOB ) ∈ Rl (i.e.
CIT (X
the predicted matrix computed when evaluating
the OOB observations using the CIT algorithm j
FIGURE 3. These figures illustrate the distribution of realizations (εBP
fCIT : Rp -> Rl before permuting (BP) Xi ). Similarly j OOB = Y OOB − Ŷ OOB and
and εAP ) represented by the matrices EBP c BP
OOB = f OOB l (i )
ŶAP (i) CIT (X(i) ) ∈ R after permuting (AP) Xi . OOB = Y OOB − Ŷ OOB for 3 balance datasets [100,100,100],[200,90,10]
EAP c AP
(i ) (i )
4) The predictive power of the input variable Xi is and [10,280,10]. Each realization measures the error generated by the CIT
assessed by permuting its values (permutation impor- before and after permuting Xi . From top to bottom we consider the
balanced, the moderately balanced and the imbalanced scenarios. Right
tance framework). This assessment consists of sev- column represents the distributions before permuting Xi and the left
eral steps: We first measure the errors introduced by column after doing so.
the base learner (CIT) before and after permuting Xi
through the computation of the error matrices EBP OOB =

YcOOB − ŶBP OOB and E OOB = Y OOB − Ŷ OOB . Both outliers are being affected by the permutation of the
AP(i) c AP(i)
matrices provide us with the necessary information to input variable.
j
assess the importance of the variable under analysis. 5) In order to capture how the realizations εBP and
j
For instance, see Figure 3, if Xi has any predictive εAP(i) change their locations when permuting the input
power, when permuting its values, the distribution of variable, we measure the distance between these
j
errors will be affected (i.e. the association between realizations εBP and the distribution EBP
OOB while tak-

classes and the predictors will change according to the ing into account the correlations among the realiza-
power of the variable). When dealing with imbalanced tions, before and after permuting Xi (i.e. same with
j
datasets ([200 : 90 : 10] and [10 : 280 : 10] datasets), εAP(i) and EAP
OOB ). This is performed computing the
(i)
j j
the realizations εBP and εAP(i) , j = 1, . . . , nOOB (rows Mahanalobis distance. When performed for all real-
OOB and E OOB ) corresponding to the izations, we are vectorizing the matrices EBPOOB and
of the matrices EBP AP(i) OOB
EAP(i) leading to the distributions of Mahanalobis
minority classes appear as outliers in the graphs. Some
of these outliers change their positions while others distances.
remain almost invariant, which may suggest that the r
permutation impacts one class more than the other. The j j
(εBP − µ
j
BP ) · S OBB · (εBP − µ
BP T −1
dmh (εBP ) = d dBP )
permutation also impacts the majority and minority EBP
classes differently. The goal is to measure how these (14)

127412 VOLUME 8, 2020


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

where j = 1, . . . , nOOB , µ
d −1
BP and S OBB are the column- where ε̄ is the mean vector and the covariance matrix:
EBP
OOB . OOB + n
nOOB · SBP OOB
wise mean and inverse covariance matrix of EBP OOB · SAP(i)
AP j Su = (18)
Similarly for dmh (i) (εAP(i) ).4 (N − 2)
j
BP (ε ) and d (i) (ε AP j
6) dmh BP mh AP(i) ) are the Mahanalobis dis-
OOB and E OOB respectively.
of EBP AP(i)
tances between each realization before and after Now, consider the hypothesis:
permuting Xi . The distribution of these distances OOB OOB
BP ∼ P and d AP(i) ∼ P
(dmh Ho : F(εBP ) = G(εAP )
d mh d(i) ) incorporate information (i)

about the errors generated by the CIT base learner for OOB OOB
Ha : F(εBP ) 6= G(εAP(i)
) (19)
each OOB observation.
7) The importance of the input variable under analysis Since we expect that the alternative hypothesis is true if Xi
Xi is then measured as the χ 2 histogram distance [88] is relevant, U can be used as a test statistic for this purpose.
BP and d AP(i) .5 If U is large, the Ho is rejected and hence Xi is influential.
between the random variables dmh mh OOB ) =
On the contrary, if the Ho is not rejected, then F(εBP
OOB OOB OOB OOB
G(εAP(i) ), which implies that EBP and EAP(i) yield to the
1 X (Pdj − Pd(i)j )2
χd2 (Pd , Pd(i) ) = AP
2 (Pdj + Pd(i)j ) same distribution of errors, hence χt2 (dmh
BP , d (i) ) = 0.
t mht
j=1
Although the U test does not provide an importance rank-
The importance metric VI (Xi ) for Xi is defined as: ing, it can be used as an assessment tool to select variables in
AP a multioutput scenario. This could be used as a preprocessing
VI (Xi ) = E[χd2 (dmh
BP
t
, dmht(i) )] step to then perform a variable importance analysis where a
ntree
1 X 2 t t ranking of importance is generated.
= χd (Pd , Pd(i) ) (15) The U test is nonparametric since the null distribution
ntree
t=1 does not rely on the underlying distribution of the posterior
where ntree is the number of trees in the Random Forest and predicted probabilities of each class as these probabilities do
χd2 (Ptd , Ptd(i) ) the χ 2 -distance between the random variables not follow a known distribution. In imbalanced scenarios we
AP
BP and d (i) evaluated in each tree t. may encounter cases where the predicted probabilities are
dmh t mht
Some properties of VI (Xi ): hard to fit and hence this test could be more suitable. The
accuracy of the U for variable importance analysis is highly
1) VI (Xi ) = 0 if Xi is irrelevant.
dependent on the sample size nOOB . The larger the nOOB the
2) 0 ≤ VI (Xi ) ≤ 1.
more accurate the test is.7
3) VI (Xj ) > VI (Xi ) if Xj is more relevant than Xi .
OOB = E OOB due to Ŷ OOB =
The pseudo-code describing the methodology proposed for
If Xi is irrelevant, then EBP AP(i) BP VIA in multiclass classification problems based on the non-
AP
OOB , thus d BP = d (i) , which implies χ 2 (d BP , d (i) ) = 0 AP
ŶAP mht mht t mht mht
parametric approach is shown in the Algorithm 1.
(i)
(i.e. Ptd X = Ptd X ).
(i) IV. SIMULATED EXAMPLES AND REAL
The relevance of Xi can also be tested using the nonpara-
CASE APPLICATIONS
metric test for the bivariate two-sample problem 6 [89]. Let
j j We discuss the performance of the proposed methodolo-
εBP and εAP(i) be independent random samples (realizations)
gies based on GSA and Machine Learning techniques
extracted from EBPOOB and E OOB with cumulative density func-
AP(i) (sections III-B, III-C) and compare them to methods based
OOB
tions F(εBP ) and G(εAP OOB ). Given a U test statistic defined
(i)
on Random Forest (VarSelRF section II-A2 and Boruta
as: section II-A3, VSURF section II-A4). We choose these tech-
(N − 1) · T 2 niques for the following reasons: a) All methods use deci-
U= (16) sion tree learners, hence they all have intrinsic capability for
(N − 2)
variable importance analysis, no variable scaling is needed
where N = 2nOOB , T 2 is the Hotelling’s two-sample T 2 and mixed data (categorical and continuous) is allowed. b)
statistic defined as: Variable importance analysis for binary and multiclass classi-
fication problems can be performed with the same algorithm.
T 2 = nOOB (ε̄BP
OOB
− ε̄AP
OOB T
) · Su−1 (ε̄BP
OOB
− ε̄AP
OOB
) (17)
(i) (i) c) All methods use the permutation importance framework as
4 The Mahanalobis distance measures the distance between a point (in a tool to assess the importance of variables under analysis.
our case a realization) and the set of points characterized by a mean and d) The compared methods provide numerical estimates of
a covariance matrix. It reduces to the Euclidean distance if the covariance importance. To help with the comparison and interpretability
matrix is the identity matrix. This distance is widely used in clustering of the importance values, we standardize (i.e. PpVI (X i)
%)
problems to assess the importance of outliers in a distribution. i=1 VI (Xi )
5 As a distance metric, it satisfies the conditions: non-negativity the importance values provided by each method. e) All meth-
(χd2 (A, B) ≥ 0), symmetry (χd2 (A, B) = χd2 (B, A)) and subadditivity ods have an open code implementation.
(χd2 (A, B) ≤ χd2 (A, C) + χd2 (C, B))
OOB → ∞, it can be proved that under Ho asymtotically U ∼ χ2 .
6 For non-normal data, which is our case. 7 As n 2

VOLUME 8, 2020 127413


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

Algorithm 1 mh-χ 2 : Imbalanced MultiClass Algorithm TABLE 1. Balance/Imbalance classification model (Class distributions for
2 sample sizes (n=150, 900): Balanced [33%,33%,33%], Moderate
1: Original Multi-Class output dataset [Y ; X ] Balancing [60%,30%,10%] and [80%,10%,10%], Imbalanced
2: Transformed continuous multi-output dataset [Yc ; X ]. [96%,2%,2%], [2%,96%,2%], [2%,2%,96%]. For example, in a model with
900 observations with moderate balancing 60% of observations (i.e
3: Variable selection: U Nonparametric test. 540 observations) belong to class 1, 30% of observations (i.e. 270) to
4: for Xi in 1 : col(X ) do class 2 and 10% (i.e 90) to class 3).
5: for treet in 1 : ntrees do
6: Build tree CITt using [YcIOB ; X IOB ]t .
Compute ŶBP OOB = f OOB ) and E OOB =
7:
t CITt (Xt BPt
YcOOB OOB .
− ŶBP
t t
BPt j
8: Calculate dmh (εBPt ), j = 1, . . . , nOOB (dmh
BP ∼
t
t
Pd X ).
9: Permute the values of the variable under analysis Xi .

OOB = f OOB OOB


A. CLASSIFICATION MODEL
10: Compute ŶAP CITt (X(it) ) and EAP(it) =
(it) The simulated classification model is designed so that the
YcOOB OOB .
− ŶAP
t (it) level of association of the input variables with the multiclass
APit j AP
11: Calculate dmh (εAPit ), j = 1, . . . , nOOB (dmht(i) ∼ response decreases gradually. If a variable xi is relevant, then
t
Pd X ). this predictor will be able to separate the classes. In our
(i) model, xa and xb have the largest effect as they clearly allow
12: Compute χd2 (Ptd , Ptd(i) ). us to distinguish among classes, xc and xd have a moderate
13: end for effect and xe and xf a small impact. Variables xg to xj are
1 Pntree 2 t
t=1 χd (Pd , Pd(i) )
14: VI (Xi ) = ntree t
added to the design as noisy predictors (no association with
15: end for the multiclass response) (see Table 1). The simulations are
16: Return relative importance ranking. performed for two sample sizes, n=150 and 900 and different
class distributions (imbalance levels) over 100 dataset. In this
first model we do not consider the GSA method since the
The efficiency of the different methodologies will be independence between inputs Xi is violated. We only compare
assessed by answering the following questions: the DT-based VITs.
1) Speed: What is the computational cost of each VIT? Figure 4 shows the mean of the VI values of each input
2) Accuracy (effectiveness): Are the most influential input variable computed over 100 simulated datasets for sample
variables properly captured and ranked? sizes n=150 and 900 and class distributions [300:300:300],
3) Stability: How does the importance ranking provided [18:864:18] and [3:3:144]. For both sample sizes and mod-
by each VIT change over 100 bootstrap replicate at erate balance datasets (see Appendix A for [50:50:50] and
different sample sizes? [90:45:15] for n=150, [300:300:300] and [540:270:90] for
4) Balance ratio behavior: How does each VIT perform n=900), all evaluated techniques generate similar ranking
over different imbalanced scenarios? values. Since the importance of variables is defined in
decreasing order two by two, all VITs show the same behavior
We consider two simulated examples and two real appli-
except VSURF, where this pattern is not followed. If we
cation cases. The first design corresponds to a classification
consider importance values below 5% as residual, only the
model with 10 continuous input variables whose effect on
mh-χ 2 algorithm seems to meet this threshold, outperforming
the multiclass response (3-class in our example) varies in
the rest of the algorithms in highly imbalanced scenarios
terms of their predictive power (the original design can be
(see figures n=150 [3:3:144] and n=900 [18:864:18]). This
found in [59]). The second example is a nonlinear model
is explained by the fact that mh-χ 2 takes into consideration
described by Friedman et al. [90] that consists of 10 inde-
the entire distribution of missclassification errors generated
pendent input variables and a continuous response that is
by the base learner. The importance of each input variable
then discretized. For the real cases, we first assess the impact
is obtained as follows: Given the dataset [YcOOB ], where
of the IBEX358 components on a newly created index that
YcOOB = [POOB (c1|X ), POOB (c2|X ), .., POOB (c2|X )] is the
capture the economic, political and social uncertainty during
observed matrix of probabilities of each class (size noob x l),
the COVID-19 pandemic in Spain. In the second example,
we predict using the fitted model YcIOB = fCIT (X IOB ) the
we will examine seven sources of uncertainty, identifying and
predicted matrix of probabilities ŶcOOB = fCIT (X OOB ) and
measuring their relative impact on electricity price spikes in
substract row-wise both matrices. This results in a matrix
the Spanish electricity market. OOB with information about misclassification errors made
EBP
All experiments have been evaluated on an Intel Core CPU OOB corresponds to a different
in each class (each column of EBP
i7-8550U 1.8GHz, RAM 16GB and under the R version 3.4.3.
class). We then obtain the Mahanalobis distance of each
j OOB (i.e. we perform a vec-
8 Benchmark stock market index of the Bolsa de Madrid, Spain’s principal observation (BP ) of the matrix EBP
stock exchange. OOB
torization of the matrix EBP , maintaining the information

127414 VOLUME 8, 2020


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

observation level (difference between the histograms dmh BP and


AP(i)
dmh bin-wise). On the other hand, since the compared VITs
(VarSelRF, VSURF and Boruta) rely on misclassification
error rates to assess the importance of Xi , with this metric the
minority class might be under-represented since this metric
only provides a summary of the errors (errors information is
diluted in one value).
Since U ∼ χ22 as n → ∞ (see [89]), using the U test as a
preprocessing variable selection step is highly dependent on
the sample size. Therefore, we can apply the U test when cop-
ing with large high-moderate balanced domains (see Figure 4,
graphs n=900 [300:300:300]). In this scenario, the results
of the variable importance analysis can be supported by the
U test values. For instance, the U values for xa , xb , xc and
xd are 14.87, 23.33, 4.97 and 7.83 which are significant
since 90% of χ22 values are less than the upper-tail critical
value 4.605. Although the interpretability of the U test values
rely on the sample size and the balance ratio, they are still
informative when it comes to discarding irrelevant variables
in imbalanced cases as the U values attached to irrelevant
variables will always be unsignificant (very small) compared
to relevant variables.
While computationally very expensive, the VUS technique
can be applied to visualize the relevance of the input model
variables (Figure 5). Same as the GSA methodology and
mh-χ 2 , VUS technique uses the predicted probabilities of
each class to build the importance measure. In Figure 5,
we visualize the importance of three types of input variables
(xa , xc and xj , left to right) when dealing with an imbalanced
scenario [18 : 864 : 18]. The more relevant the permuted
variable is, the less evident the overlap between the ROC
surfaces becomes.
FIGURE 4. Comparison of the DT-based VITs (VSURF, Boruta, VarSelRF, The mh-χ 2 algorithm involves nvar iterations (once for
mh-χ 2 ) for the simulated classification model IV-A. We compare the each analyzed input variable) which can make the process
mean of the relative importance of each input variable computed using
the compared VITs. The mean results from calculating the average of the computationally more expensive than the fastest technique
VI values over 100 simulated dataset for sample sizes n=150 and 900. For Boruta. However, the process can be sped up by applying
each simulated sample size we simulated 6 balance ratio scenarios with
different class distributions. The U test values for each input variable are
tools such as parallel programming, coding some of the
also included (values under the input variable names) (The larger the U implemented functions in C++ and integrating these routines
value the more relevant the predictor is). Find the rest of VIA graphs full in our R code.
page image in Appendix A.

B. NONLINEAR MODEL
about misclassification errors made at each OOB observa- The nonlinear model described in [90] maps 10 independent
tion). In other words, the information related to the misclas- uniformly distributed input variables (Xi ∼ U (0, 1)) to a
sification error of each OOB observation is now embedded in continuous output Y = f (X ) which is then discretized to build
the corresponding Mahanalobis distance. Finally, as a result, the classification model.
we have the distribution of Mahanalobis distances dmh BP with
Y = 10sin(πXA XB ) + 20(XC − 0.5)2 + 10XD + 5XE + i
information about the majority as well as the minority classes
BP captures
(outliers in this distribution). The distribution of dmh where i ∼ N (0, 0.012 )
the misclassification errors of both the majority and minority Three dataset sizes are tested, n=100, 500 and 1000.
classes regardless of how small the number of classes is. This For each size, 100 randomly generated dataset are created.
procedure is then performed for permuted Xi providing us Notice that only the first five input variables have a real
AP
with dmh (i) . The importance of Xi is then computed using impact on the response with an expected impact ranking
the χ 2 -distance between dmh BP and d AP(i) which allows us to of XD > XB > XA > XE > XC . The discretiza-
mh
measure how permuting Xi has affected the distributions of tion process of Y consists of setting thresholds to the out-
Mahanalobis distances of minority and majority classes at an put variable. For the 3-class problem, by modifying the

VOLUME 8, 2020 127415


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

FIGURE 5. Comparison of the Volume Under the Surface before and after permuting the analyzed input variables xa , xc , xj (left to right) for the
simulated classification model IV-A and class distributions [18:864:18] (highly imbalanced case).

values of two thresholds [th1, th2], we can configure dif-


ferent imbalanced scenarios, each one defined by its bal-
ance ratio (BR) [BR(th1, th2)n = [ num c1 numc2 numc3
numc1 , numc1 , numc1 ]]. For
instance, the configuration for the 3-class problem consid-
ering n=1000, th1=5 and th2=10 leads to the following
classification setting:

C1 if f (X ) < th1

y = C2 if th1 6 f (X ) < th2

C3 if f (X ) > th2

The class distribution for this setting [BR(5, 10)1000 ] is


[numC1 : numC2 : numC3 ]1000 = [23 : 798 : 179]1000 .
Since the input variables are defined to be independent,
we can evaluate the VIT based on GSA as well as the mh-
χ 2 algorithm against the Decision Tree-based techniques
(VSURF, Boruta, VarSelRF, VUS).
Appendix B shows the comparison between the method-
ologies. In general, at different sample sizes n=100, 500 and
1000 (with the exception of VSURF) all the VITs perform
relatively similar. At the lowest sample size n=100 and
imbalanced scenario [6 : 38 : 56], only mh-χ 2 ranks the
input variables as expected, distinguishing clearly between
influential (XD > XB > XA > XE > XC ) and irrelevant ones
(the rest).The fact that mh-χ 2 considers the entire distribution
of errors corresponding to error misclassifications, even when
the sample size is small, makes this solution more robust.
Followed by mh-χ 2 , the Covariance decomposition GSA-
based technique provides a similar ranking. As shown in the
table (see Appendix B), the sample size factor (n) plays a
significant role in both accuracy and stability. The bigger
the sample size, the more stable and accurate the rankings
are. Nonparametric methods (Boruta, VarSelRF and mh-χ 2 )
FIGURE 6. Comparison of the performance of the proposed
show more stability compared to the GSA parametric method methodologies based on GSA (Single and Total effect) and mh-χ 2 against
(based on the Covariance decomposition method). It is worth the Decision Tree-based techniques (VSURF, Boruta, VarSelRF) for the
nonlinear model IV-B. The relative importance values are computed for
noting that the total effect value STi provided by the GSA tech- 3 sample sizes n=100, 500 and 1000 (left to right) over 100 randomly
nique is close to the values given by the Random Forest tech- simulated dataset. The figures correspond to the analysis performed for
the 3-class classification problem when the higher and lower threshold
niques (see [74] for a comparison and connection between the values are set to th1=5, th2=10 (class distribution [3, 21, 76]100 ,
GSA-based and the Random Forest-based VITs). Although [14, 90, 396]500 , [23, 179, 798]1000 ).
the use of GSA techniques can provide more information such
as the relevance of interactions effects on the model output of independence between Xi is met and the approximated
(STi − Si ), its use is limited to problems where the assumption model that describe the relationship between the response and

127416 VOLUME 8, 2020


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

FIGURE 7. Comparison of the Volume Under the Surface before and after permuting the analyzed input variables XD , XE , XJ (left to right) for the
simulated non-linear model IV-B.

the input variables is well defined. In terms of computation In order to perform the VIA we need to construct our
performance, the Boruta and Covariance decomposition GSA dataset. The following workflow describes the process:
techniques are the fastest methods. On the contrary, VSURF 1) Daily adjusted close prices of the 35 companies listed
and PI(VUS) have shown to be computationally inefficient. in the IBEX35 were extracted for the period January 01,
Figure 6 displays the comparison among the techniques for 2020 to April 30, 2020 (daily returns were computed).9
the setting n=100, 500, 1000 and th1=5, th2=10. At sample 2) We find the clusters of companies with similar
size n=100 and class distribution [3 : 21 : 76], mh-χ 2 identi- movement patterns [91], [92] using the unsupervised
fies all relevant and irrelevant variables, with importance val- machine learning algorithm SOM (Self-Organizing
ues of the irrelevant predictors close to zero in comparison to Map) [93]. Variable clustering based on SOM will
the remaining techniques. On average, VITs generate similar allow us to reduce the dimensionality of the problem
outputs (accuracy and stability) as the sample size increases. gathering the variables (i.e. companies) with similar
Figure 7 visualizes the effect of permuting the input vari- information (i.e. movement return patterns) in clusters
able under analysis. Important variables create more separa- [C1, C2, . . . , Ck].
tion between the ROC surfaces leading to a bigger difference 3) Each cluster will be represented by the first principal
VUSBP −VUSAP , which is in concordance with their influence component PCCi when performing a PCA to the set of
on the multiclass response companies that are in the cluster. This synthetic variable
will explain the variability of the corresponding cluster.
C. REAL APPLICATIONS Steps 2 and 3 help us to generate the input variables of
We evaluate the proposed VIA algorithm mh-χ 2 on two real our dataset X = [PCC1 , PCC2 , . . . , PCCk ].
case problems. In the first analysis, we quantify the impact 4) We use the Natural Language Processing (NLP) algo-
of the 35 components listed in the market capitalization rithm Word2Vec (W2V) [94] to create an uncertainty
weighted index IBEX35 on the Spanish economic, political index that captures the daily variability reflected in
and social uncertainty reflected in newspapers during the economic newspapers (IdW 2V ) due to the politi-
first quadrimester of 2020. During this exceptional period cal, economic and social uncertainty during the first
we generated an uncertainty index that captures the uncer- quadrimester of 2020.
tainty reflected in the media due to the market behavior. 5) Perform the VIA to the dataset [IdW 2V ; X ] applying
VIA will allow us to understand and quantify the factors that the proposed algorithm mh-χ 2 .
contributed to that uncertainty, assessing which among those
Figure 8 and Table 2 show the clusters created by SOM.
factors have more impact in the occurrence of high/low levels
At a glance, we can clearly see that GRF (Pharmaceuti-
of uncertainty (i.e. imbalance problem). In the second case,
cal sector) and VIS (Food industry) have been clustered far
we study the impact of energy factors on the occurrence of
from the remaining companies (Figure 8). Both companies
spike prices on the Spanish electricity market.
have shown the best performance (highest returns) during
this crisis. The high volatility seen in the oil market and the
1) SPANISH MARKET UNCERTAINTY REFLECTED IN
dependency of the rest of sectors on the energy sector caused
NEWSPAPERS IN THE HEIGHT OF COVID-19: VIA
similar patterns among these companies. Clusters C7 and
Exceptional time periods are normally characterized by high
C8 gather most of the financial companies. We apply PCA
levels of uncertainty. The goal in this section is to analyze
to characterize the variability of each cluster. Only the first
the impact of the uncertainty created by different types of
principal components are kept (PC1 explained at least 70%
companies clustered by their return movement patterns on the
of the total variability of the cluster).
Spanish written media during the first four months of 2020.
The study will allow us to assess how returns of stock prices
shaped the uncertainty in the news. 9 The price quotes were extracted from Yahoo finance news website.

VOLUME 8, 2020 127417


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

Wt : Total number of words in the newspapers on a specific


day t. wiUt : Uncertainty term found in a text on a specific day
t (total k).
Index based on Word2Vec algorithm:
The W2V algorithm is a one layer artifical neural network
that learns to predict words from a given context. Given
the set of daily news (corpus), we preprocessed the data
removing stopwords, punctuation, digits, lowering case and
tokenize the data. We then trained the W2V network in order
to find numerical representations of the words (n-dimensional
vectors). Once the algorithm is trained, it provides a weight
matrix with as many rows as unique words and columns
as the size of the word vector representation (normally
100-300 size). Having the numerical representation of words
we can perform mathematical operations such as the cosine
similarity between words.
−→
u .−
→ v
sim(−
→u ,−
→v)= − → −
→ (21)
| u || v |
The idea behind the W2V-based uncertainty index is as fol-
lows: If the keyword = [coronavirus, covid] creates uncer-
FIGURE 8. Clusters of the IBEX35 companies grouped by their returns
movement patterns during the period January 01, 2020 to April 30, tainty in the market, then this uncertainty will be reflected in
2020 using SOM algorithm. the economic newspapers, in the form of greater use of terms
of uncertainty and negative terms (terms found in the dictio-
TABLE 2. Created clusters of the IBEX35 companies. Abbreviations: Ener:
Energy, Elec: Electricity, Telec: Telecommunications, Fin: Financial, Trans: nary). If, on a given day, the market shows high volatility due
Transportation, Infr: Infrastructure, Tex: Textil, RE: Real Estate, Ins: to the keyword (coronavirus/covid), then we assume that for
Insurance, Met: Metallurgy, Tur: Turism, Petro: Petrochemical.
that day more uncertainty/negative terms will be used and
the cosine similarity between those terms and the keyword
will be high (i.e. uncertainty words will appear close to the
keyword). Once the W2V is trained (weight matrix of words
is created), we will be able to measure the similarity between
the uncertainty terms and the keyword.13
TUW TPW
j
X X
Id2W 2Vt = sim(kw, wiUTt ) − sim(kw, wPTt ) (22)
i=1 j=1

In order to quantify the level of uncertainty perceived where, kw stands for keyword, wiUTt is the uncer-
in the writen media, we first parsed daily news from the tainty/negative term that appeared on a specific day t in the
two most read economic and financial Spanish newspapers newspaper text (TUW total number of uncertainty/negative
j
(CincoDias10 and Expansion11 ) for the period January 01, terms day t), wPTt is the positive term that appeared
2020 to April 30, 2020. We then used the financial dictionary on a specific day t (TPW total positive terms day t).
created by (Loughran and McDonald, 2011) [95] composed keyword = [coronavirus, covid].
by 297 uncertainty risk related terms, 2373 words with nega- Figure 9 displays the evolution of the news measure uncer-
tive meaning and 371 positive meaning words.12 tainty index based on W2V (IdW 2Vt ) compared to the his-
Two uncertainty indices were generated: Index based on torical volatility (HV) 14 of the IBEX35. Until February,
counting daily uncertainty terms and a new index based on IdW 2V presents negative values which can be interpreted as
the Word2Vec algorithm. certain political, economic and social stability. While there is
Index based on counting uncertainty terms [96]: a delay between the two time series, the HV and the news
articles index show an increasing trend with a correlation
k
1 X i 13 Some similarity values are: sim(coronavirus, vacuna) = 0.06,
Id1t = wUt (20) sim(coronavirus, volatilidad) = 0.388, sim(coronavirus, crisis) = 0.68,
Wt
i=1 sim(coronavirus, dudas) = 0.614, sim(coronavirus, tratamiento) = 0.232,
sim(coronavirus, normalidad) = 0.311
14 Historical volatility is a measure of how much the stock price fluctuates
10 https://cincodias.elpais.com
during a given period. The formula used in this analysis for the HV is:
11 https://www.expansion.com/
σHVt = 12 (ln( H t 2 Ct 2
Lt )) − (2ln(2) − 1)(ln( Ot )) , Ht : High price, Lt : Low price,
12 The lists were translated from English to Spanish. Ct : Close price and Ot : Open price

127418 VOLUME 8, 2020


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

period for different reasons. While the pharmaceutical sec-


tor generated positive shocks (positive shocks in Idw2v) as
rumours related to the development of effective treatments
appeared in the news, energy (PCc4) and financial companies
(PCc7) lead to negative spikes (negative shocks in Idw2v),
increasing the feeling of more instability.

2) STRUCTURAL TOPIC MODELLING: COVID-19 EFFECTS IN


THE SPANISH ECONOMY
Parallelly to the VIA performed to analyze the impact of
FIGURE 9. Historical volatility IBEX35 vs uncertainty index based on W2V uncertainties created by different Spanish companies on the
during the period January 01, 2020 to April 30, 2020. (ene: January, feb: Spanish media, structural topic modeling [97], [98] will be
February, mar: March, abr: April, may: May).
used to estimate and visualize the uncertainty created by the
Covid-19 pandemic over the same time period. We com-
value of 0.34. When the stock market volatility hits a peake plete our analysis estimating the uncertainty created by
by March 15th, the news index reacts similarly by the end of the Covid-19 pandemic reflected in the Spanish economic
that month. While the market reacts immediately to the cur- newspapers. We apply Structural Topic Modelling (STM)
rent situation showing high fluctuations, the news absorb the (see [97], [98] for STM and listed examples) to the lon-
situation and reacts later incorporating not only the economic gitudinal dataset analyzed previously. This study allows us
but also the political and social uncertainties perceived in the to visualize the relationships (i.e. correlations) over time
environment. The IdW 2V could be used as a broader measure among groups of topics addressed during the COVID-19
of uncertainty in the current situation as it also accounts for pandemic (such us, Market, Economic impact, Growth,
social and political uncertainties. Coronavirus, etc).
Given the clusters of IBEX35 companies represented by Once the text data is extracted and preprocessed, it is
the variability of each cluster as input variables and the arranged in a dataframe with as many rows as analyzed
IdW 2V as the output, we want to distinguish the sectors days (each rows is represented by the corresponding news
responsible for negative and positive shocks in IdW 2V during articles, a rating variable [Optimistic, Neutral, Pesimistic]
the Covid-19 pandemic. We will compare the results of this defined according to the economic sentiment of the day and
VIA to the impact of these companies on the volatility of the the date). We train the STM model and generate a set of
IBEX35 index. topics characterized by a function (linear or nonlinear) that
We categorized 15 the time series, IdW 2V and σHV , to per- represents the relationship between the created topic and the
form the VIA using mh-χ 2 to the datasets [IdW 2V ; X ] and covariates ‘‘date’’ and ‘‘rating’’ Topic = f (Covariates). For
[σHV ; X ]. The categorization process converts the problem instance, Topic = rating+date or Topic = rating∗date allow
from continuous output to a classification problem with us to estimate the effect of the date on each topic (variable
imbalanced class distribution. rating is kept at its sample median).
The relative importance of each cluster (represented by In Figure 11, we display the relationship between time
the 1st PC) responsible for high and low spikes in IdW 2V and some selected topics 2 (Pandemic), 5 (Economy) and
is represented in Figure 10 for four scenarios. The first two 14 (Growth) with 95% confidence intervals. Topic 2 ‘‘Cri-
graphs compare the importance of the companies on positive sis + Pandemic’’, reaches its peak March 15th and starts
and negative shocks in the market volatility versus the news to decline by the end of April. For topic 5 ‘‘Economy’’,
articles index. For both σHV and IdW 2V , cluster 4 (PCc4, pri- the topic fluctuates over the analyzed period of time, with
marily energy companies) had the biggest impact, which goes a clear drop and a wider 95% confidence interval, which
in concordance with the expected behavior in any developed is an indicative of more uncertainty related to this topic.
economy as energy is the primary source of which the rest of The temporal evolution of the remaining topics is shown in
the sectors depend. Its influence accounted for almost 25% Appendix C. Topics 2 (Pandemic), 4 (Social measures), 12
in the volatility of the market and uncertainty captured by the (Coronavirus) and 16 (Social impact) had a similar pattern
media. The differences arise for the rest of the importance over time with a clear increase seen by March 15th (The
rankings. Cluster 6 seems also relevant for σHV . PCc6 sum- Spanish government declared the State of Emergency). At the
marizes the variability of a wide range of company sectors, same time, we see the opposite pattern with a strong drop
which largely represent the total variability of the IBEX35 in topics 1 (Financial system) and 11 (Companies’ profits).
σHV (the benchmark). On the contrary, the pharmaceutical Topic 14 (Growth) shows an interesting behavior. While the
sector (PCc2) along with energy companies (PCc4, PCc5) curve declined by March 15th, coinciding with the start of
attracted most of the news coverage during the pandemic the recession period, we can also see that by the end of April,
the curve started to rise suggesting a possible optimism in the
15 The categorization process is described in IV-C3 economy (possible beginning of recovery).

VOLUME 8, 2020 127419


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

FIGURE 10. Relative importance of each cluster responsible for high/low spikes in IdW 2V and the market volatility.

FIGURE 11. Estimate effect of topics 2 (Pandemic + Crisis), 5 (Spanish economy) and 14 (Spanish growth) during the Pandemic Period reflected in the
Spanish economic newspapers. Please see Appendix C for the rest of Estimate effects of the remaining topics.

3) IMPACT OF ENERGY FACTORS ON SPIKE PRICES ON THE • Superior spike: Price that surpasses the value of a
SPANISH ELECTRICITY MARKET: VIA limit price previously fixed (th2 or high price threshold)
The study presented in this section investigates the impact defined as th2 = µ + 2σ .
that certain factors have on the occurrence of spike prices on where, µ = mean(Pt )|t=Tt=1 and σ = std(Pt )|t=1 .
t=T

the Spanish electricity market. Understanding and measuring The previous definitions allow us to categorize each Pt as
the relationship between extreme electricity prices and the belonging to one of three categories: superior (class 3, C3),
factors that drive them may have important implications when normal (class 2, C2) and inferior price spike (class 1, C1).
it comes to forecast those prices. If normality is assumed by Pt ∼ N (0, 1), the defined clas-
Before any prediction of price or demand of electricity, sification thresholds will include approximately 95% of total
it is important to identify the exogenous variables that might prices (normal prices), leaving 5% of them to be considered
influence the spikes in prices. The goal of this analysis is extreme prices. This allows us to convert the original contin-
to identify the relative impact of some factors on extreme uous response problem into a classification problem with an
electricity prices (i.e. spike prices) in both cases, abnormally imbalanced domain.
high and low prices. Figure 12 shows the evolution of the hourly Spanish
A spike price is a price Pt that is significantly dif- electricity prices during the year 2016 (from January to
ferent from the previous one Pt−1 . Reference [99] estab- December) as well as the classification thresholds for each
lished thresholds to define the occurrence of spike prices. trimester (T1, T2, T3 and T4).
The proposed definitions distinguish two groups of spike Spike prices can occur due to diverse effects. Although,
prices: in an ideal market superior/inferior spikes should only be
• Inferior spike: Price that falls under the value of a attributed to high/low levels of demand, reality is far from
limit price previously fixed (th1 or low price threshold) ideality and other factors might play a significant role on the
defined as th1 = µ − 2σ . appearance of these rare prices. Among others, the hour of

127420 VOLUME 8, 2020


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

TABLE 3. Variables included in the electricity variable importance


analysis.

FIGURE 12. The figure shows the evolution of the hourly Spanish
electricity prices during the year 2016 along with the piecewise graphs
per trimester (T1 to T4) using thresholds (low and high) computed as:
th1 = µ − 2σ and th2 = µ + 2σ .

the day, the temperature, natural phenomenon disasters or


extreme meteorological conditions can be relevant factors.
Reference [99] applied the SVM algorithm to the Australian
electricity market to analyze the importance of the demand of
electricity, the energy production, the hour of the day, the net
energy exchanged and the seasonality of the electricity price.
Reference [100] showed the high impact of demand, energy
reserve and production on the prediction of spike prices.
As for the Spanish electricity market, its pricing offer uses
the so-called pool market16 where prices are fixed at that
which the last energy producer covered the demand of energy.
Although some producers might offer their energy at zero cost
(mainly, renewable, nuclear and hydraulic energies), they still FIGURE 13. Correlation matrix of input variables used in the electricity
market case.
get paid the final price at which the last producer covered
the existing demand. Among obvious reasons, the energy
system allows for the presence of renewable energies in dependency between the input variables is the reason why
order to ensure the 100% absorption of the produced energy. Covariance decomposition GSA model will not be taken into
Hence, the use of renewable energy (especially, wind energy consideration in the real Spanish Electricity market study.
production which accounts for 22.8% of the total produced Figure 14 shows the variable importance analysis per-
energy17 might have a direct effect on low prices when formed using the methodologies mh-χ 2 , Boruta, VarSelRF,
both the demand and other energies are high. This could VSURF and VSURF. We ranked the factors in terms of
explain the appearance of inferior spikes. On the other hand, their impact on extreme low and high electricity prices.
the difficulty of energy storage, production capacity or energy As the Spanish electricity market shows a strong seasonality,
transportation are important factors to consider with respect the study has been divided on a quarterly basis (Figure 14 only
to the vulnerability of the electricity market. Occasionally, displays results of the fourth trimester T4, the rest of VIA can
through their previligious position in the market, large energy be found in Appendix D).
providers can benefit from the idiosyncrasy of the Spanish For the first quarter (T1), measures of mh-χ 2 , Boruta,
electricity price system and influence the hourly evolution of VarSelRF are very close, with the three methods giving a
prices (i.e. influence of previous prices). This could explain higher importance to the price lagged one hour factor, which
the appearance of extremely high prices. is congruous with the idiosyncracy of the pricing system in
We discuss the application of nonparametric VITs the Spanish electricity market. This trend continues across
(VSURF, VarSelRF, Boruta and mh-χ 2 ) to understand and the remaining quarters. Both Boruta and VarSelRF provide
quantify the impact of certain factors on extreme electricity similar rankings for all periods. When transitioning from
prices. The dataset will be composed by the main Spanish the first two quarters (T1 and T2) to the third quarter (T3),
Electricity Market variables during 2016 (see Table 3). First, the importance values provided by all methods change con-
we performed a correlation analysis (Figure 13) to find the siderably. In T3, factors such as the Price1T, Hidraulic Reser-
relationship among the studied input variables. The strong voir and Wind Energy Production seem to have more impact
on extreme prices, which agrees with what would happen
16 http://www.omie.es/inicio when the demand of electricity increases (summer season).
17 https://www.esios.ree.es/es For the last quarter (T4, see Figure 14), it is important to

VOLUME 8, 2020 127421


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

FIGURE 14. Importance values of each factor responsible for extreme prices during the fourth trimester applying the VITs: mh-χ 2 , Boruta and VSURF. Find
rest of VIA per trimester on full page image in Appendix D.

note that after the Price1T factor, the Wind Energy Produc-
tion is identified as one of the most relevant factors by mh-
χ 2 and Boruta. This is consistent with the analysis of the
price electricity behavior shown in the evolution of electricity
prices on that period (Figure 12) where we can see a severe
negative spike (low prices are influenced by an increased use
of renewable energy).

V. DISCUSSION AND LIMITATIONS


So far we have assessed and compared the performance of
the nonparametric algorithm for VIA, mh-χ 2 , as well as the
parametric solution based on the covariance decomposition FIGURE 15. The misclassification error information is embedded in the
multioutput GSA methodology. While all the discussed Deci- Mahanalobis distances obtained using the matrices of probabilities of the
classes (majority and minority) before and after permuting Xi (each class
sion Tree-based methods (VarSelRF, Boruta and VSURF) corresponds to a column in the matrix). This figure illustrates the effect
apply the permutation importance framework as the tool to on the Mahanalobis distances when permuting the values of Xi when
dealing with imbalanced datasets (class
assess the importance of the analyzed variable (Xi ), the dif- distribution [10,280,10]).
ference with mh-χ 2 lies in how the information related to the
misclassification errors produced by the base learner before
and after permuting Xi is used. VarSelRF method applies a discard variables. In the case of the Boruta method, the impor-
backwards elimination technique where iteratively a Random tance of an input variable is computed using a Z score. This
Forest is fit with 80% of the input variables that have not is defined as the ratio between the average and standard
been dropped from the previous RF. The discarded variables deviation of the decrease of classification accuracy due to the
are those with the smallest importance values. In this case, permutation of the values of Xi over all trees that use Xi in
the importance value is computed as the difference between the forest. In VarSelRF, VSURF and Boruta, the importance
misclassification error rates before and after permuting Xi measures are based on a summary of the distribution of
when using the.632+ bootstrap method [101]. This strategy misclassification errors introduced by the base learner. Distri-
does not allow us to distinguish the errors suffered by the bution estimators such as average and/or standard deviation
different classes (majority and minority) due to the permu- of the errors are used to characterize the errors. These
tation of Xi since it uses a summary (misclassification error estimators do not provide us with sufficient information
rate) of the misclassification errors made by each base learner related to errors observed in the outliers (minority classes)
on the set of chosen observations. VSURF applies a similar when permuting the values of Xi . In a classification problem
importance method to VarSelRF where first all input variables with imbalanced dataset, the minority class is the class of
are ranked in order of decreasing importance. Only variables interest. Therefore, we want to be able to assess and measure
whose standard deviation of the variable importance (VI) is the influence of the permutation on the minority class since
larger than a defined threshold are kept (m variables), the rest the average or the standard deviation only tell us part of
are removed. In a second step, nested RFs are built with k the story about the distribution of misclassification errors.
input variables (k = 1 to m) and only the RF with the smallest Hence, we need an importance measure that considers the
OOB error is chosen. The m0 input variables involved in the entire distribution of errors. Our proposal for VIA, the mh-χ 2
selected RF are selected. Finally, using a step-wise forward algorithm, considers the misclassification error of each OOB
OOB ) and after
strategy, the m0 input variables are introduced in the RF only observation introduced by the tree before (EBP
OOB
(EAP(i) ) permuting Xi . Working with the entire distribution of
if the misclasification error rate is larger than a specified
threshold. In VSURF again, the average of misclassifica- errors can help us explain the importance of input variables
tion errors over the OOB observations is used to select or when dealing with imbalanced datasets. Take for example

127422 VOLUME 8, 2020


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

FIGURE 16. Comparison of execution time (in seconds) between the proposed algorithms, mh-χ 2 and GSA based on covariance decomposition,
versus VarSelRF, VSURF and Boruta. The hyperparameters that drive the Decision Tree algorithms have been set to default values.

j
Figure 15, each misclassification error before (i.e. εBP a row will not have any effect on the computation cost of the
OOB ) and after (i.e. ε j
in matrix EBP OOB mh-χ 2 algorithm.
AP(i) a row in matrix EAP(i) )
permuting Xi is represented by the Mahanalobis distances The proposed parametric solution for VIA for imbalanced
BP and d AP(i) . The Mahanalobis distance d BP represents the
dmh classification problems is based on the covariance decom-
j mhj mhj
j position GSA method. We first transform the original mul-
distance between each BP (j ∈ [1, 2, . . . , noob]) and the
OOB . The information related ticlass classification problem into a multioutput continuous
set of distribution of errors EBP
regression problem by fitting a classification algorithm (NN,
to misclassification errors is embedded in the Mahanalobis
SVM or RF) and evaluating all observations of X . The result
distances. We can distinguish 3 clusters corresponding to
of this evaluation is a multioutput continuous dataset [Yc , X ]
misclassification errors made in the majority and minority
where each column of Yc represents the probability of each
classes. The goal is to measure the dissimilarities between
class, Yc = [P(c1 |X ), P(c2 |X ), . . . , P(cl |X )] (i.e Yc |xi =
the distribution of these misclassification errors represented
[P(c1 |xi ), P(c2 |xi ), . . . , P(cl |xi )] is a vector of probabilities
by the Mahanalobis distances through the computation of the
where each element represents the probability of a specific
χ 2 -distance between their histograms bin-wise. The result of
class. In other words, P(cl |xi ) is the level of membership
the χ 2 -distance is our proposed measure of importance for
of the realization xi ∈ X to the class l). We then find the
imbalanced classification problems.
multivariate regression model to link the multivariate output
The performance of the mh-χ 2 algorithm relies on an
and the input variables [Yc , X ] (i.e. Yc = fc (X )). Given
appropriate tuning of the hyperparameters that govern the
the model Yc = fc (X ), we can now apply the covariance
base learner CIT. Five hyperparameters are used to tune
decomposition GSA to assess the influence of each input vari-
the CIT algorithm [16]: ntree, mtry, Maxdepth, minsplit,
able on the multivariate output. One way of quantifying the
minbucket. Reference [41] performed a sensitivity analysis
influence is through the computation of the so-called Sobol
based on the computation of Sobol indices in order to assess
indices. These importance indices measure the uncertainty of
the importance of these hyperparameters on the accuracy
the output(s) caused by each input variable uncertainty (see
of the CIT algoritm when used as base leaner for VIA.
Equ. 9). In practice, the Sobol indices are estimated using
The study showed that mtry has the highest impact on the
for instance the Monte-carlo Pick-Freeze methodology (see
accuracy of the proposed algorithm for VIA. In this paper,
Equ. 13). For interpretability purposes, we rewrite Equ. 13
we choose the default values for all the hyperparameters
for the 3-class (l = 3) problem as:
except mtry, which is set to the total number of input vari-
ables, allowing us to evaluate the association between each 3 (i)
input variable and the output vector in the splitting pro-
X Yck · Yck − Kck
STi = (i)
(23)
cess of the CIT. As shown in Figure 16, this mtry selection k=1 Bk

VOLUME 8, 2020 127423


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

(i) TABLE 4. CD-GSA: Covariance decomposition method. Scenario 1:


where, Bk is the denominator in Equ. 13, Kck =
(i) Classification (Binary/Multiclass) and Regression (univariate/multivariate
1 PN Yck,j +Yck,j 2 (i) output) problems, 2: Classification (Binary/Multiclass) and Regression
N ( j=1 2 ) and Yck · Yck is the scalar product of the univariate output problems, 3: Classification (Binary/Multiclass)
(i)
vector Yck 18 and the vector Yck .19 The goal is to measure problems. CC (Computation Cost) A: High efficiency, B: Moderate
efficiency and C: Low efficiency. B/I (Balance/Imbalanced dataset).
how freezing certain variables (one variable to compute Si
and all input variables except Xi to obtain STi ) changes the
(i)
model outputs from Yc to Yc for STi (similarly for Si ).
In other words, measure how the probabilities of each class
are affected by freezing certain input variables. Each sum
in Equ. 23 represents the influence on each class of holding
the corresponding values of the input variable(s) fixed. This
allows us to analyze the impact of Xi by class (minority and
majority). Notice again, that in order to obtain the importance cution time is proportional to the number of trees in the
of Xi , we work with the vector of probabilities of the classes forest.
and not only a summary of these probabilities. The total In practice, mh-χ 2 and the covariance decomposition GSA
impact or total effect of Xi on each class k (k ∈ [1, 2, . . . , l]) algorithms both have limitations when dealing with imbal-
(i)
is captured by the scalar product Yck · Yck . If Xi is relevant anced classification problems: Table 4 shows a comparison
for class k, then high (low) probability values of Yck will of the discussed VITs.
(i)
be multiplied by high (low) probability values of Yck which • Computationally, the mh-χ 2 algorithm is more time-
will result in a high value of the scalar product. On the other consuming for large datasets (high number of input vari-
hand, if Xi is not important then high and low probability ables), ranking behind the most efficient nonparametric
(i)
values of Yck and Yck are randomly multiplied. The value methods, Boruta and VarSelRF. However, in order to
of STi will result in evaluating these scalar products for all fairly compare mh-χ 2 to the benchmark Boruta method,
classes. the mh-χ 2 algorithm can be optimized using paralleliza-
Other useful information to consider when comparing VIA tion and distributed computing.
algorithms is their execution time. In the case of the mh-χ 2 • Unlike VSURF, mh-χ 2 also needs to be adapted to
algorithm, nvar (number of input variables) and ntree (num- scenarios with highly correlated input variables since it
ber of trees in the forest) drive the computation complexity of relies on the permutation importance technique. Variable
the algorithm. The algorithm contains two nested for loops importance values may be biased toward correlated input
(see Algorithm 1). The outer for loop executes nvar times, variables when the permutation importance technique is
while the inner loop does it ntree times. The operations inside applied [102].
the inner loop execute nvar · ntree times. Therefore, the time • mh-χ 2 , as the compared Decision Tree-based VITs, only
complexity of the proposed algorithm is O(nvar · ntree). It is provides the total effect of Xi on the output variables
worth mentioning that the mh-χ 2 algorithm can be further (total effect = single effect + interaction effects). No dis-
optimized through parallelization and distributed computing. aggregated information about the possible interaction
Figure 16 shows the runtime of the compared algorithms with effects is obtained.
respect to the number of input variables (top left) and the • In order for the sensitivity indices to be interpretable,
number of observations (top right). The proposed parametric the orthogonality assumption between the input vari-
solution based on GSA is the most computationally efficient ables needs to be satisfied: Although the GSA solution
method in terms of both the number of input variables and can provide a disaggregation of importance values into
observations, closely followed by the nonparametric algo- total and single effects, the provided values can only
rithms VarSelRF and Boruta. The runtime for mh-χ 2 is almost be interpreted as importance values if the independence
a linear function of the number of variables. Apart from the between the input variables is met. A possible solu-
VSURF, all algorithms are less sensitive to the number of tion when the input variables are dependent is a previ-
observations of the problem, again the GSA solution being ous orthogonalization process, however, the computed
the least time consuming. We also show the computation importance values will be related to the orthogonalized
cost of mh-χ 2 with respect to the hyperparameters mtry (bot- variables and not the original ones.
tom right) and ntree (bottom left). As it can be observed, In imbalanced classification problems, the minority
except for ntree, varying the rest of the hyperparameters class(es) is the class of interest. In order to imporve the accu-
will not have a significant impact on the runtime. The exe- racy of a classification algorithm that deals with imbalanced
datasets, a previous variable importance analysis is recom-
mended to identify and remove irrelevant input variables.
18 Y
ck = P(ck|X ) vector of probabilities of class k which results from Using the mh-χ 2 and the covariance decomposition GSA
evaluating the input dataset X on fc (·)
19 Y (i) = P(ck|(X 0 , X )) vector of probabilities of class k which results method allow us to take into account the minority class(es)
i (i)
ck
from evaluating the input dataset (Xi0 , X(i) ) on fc (·), freeze all input variables by measuring the dissimilarities (χ 2 -distance) between the
except Xi and sample Xi distribution of misclassification errors before and after per-

127424 VOLUME 8, 2020


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

FIGURE 17. Comparison of the DT-based VITs (VSURF, Boruta, VarSelRF, mh-χ 2 ) for the simulated classification model IV-A. We compare the mean of the
relative importances of each input variable computed using the compared VITs. The mean results from calculating the average of the VI values over
100 simulated dataset for sample sizes n=150 and 900. For each simulated sample size we simulated 6 balance ratio scenarios with different class
distributions. The U test values for each input variable are also included (values under the input variable names) (The larger the U value the more relevant
the predictor is).

muting the Xi in the case of mh-χ 2 and by measuring how the importance metric for VIA. This solution allows us
probabilities of each specific class change when computing to visualize and measure the impact of the Xi on the
the scalar product between the vector of probabilities of the minority class.
classes before and after freezing the input variable(s) under • The parametric solution based on covariance decom-
assessment. The mh-χ 2 algorithm and the covariance decom- position GSA method allows us to break down the
position GSA method outperform the discussed methods in contribution of each input variable into single (Si ),
the literature in the following: total (STi ) P
and interacting effects (STi − Si ). While
• The nonparametric mh-χ 2 algorithm can be seen as large 1 − Si indicates the presence of interactions
an extension of Decision Tree-based VITs (VarSelRF, among the input variables, STi − Si allow us to quantify
VSURF and Boruta) since it provides a better accuracy these interactions. The GSA method also shows to be
when dealing with imbalanced dataset while at the same the most computationally efficient solution among its
time matches the accuracy when working with balanced competitors.
dataset. While the compared VITs base their VI analysis • Scalability: Since mh-χ 2 algorithm uses the CIT algo-
on a summary of the misclassification errors, mh-χ 2 rithm as the base learner (capability of handling multi-
considers the entire distribution of errors, propos- variate response models) and covariance decomposition
ing a probabilistic distance measure (χ 2 -distance) as GSA method employ a multivariate continuous response

VOLUME 8, 2020 127425


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

VOLUME 8, 2020
FIGURE 18. Comparison of the VITs evaluated over 100 dataset for the non-linear model IV-B when the thresholds are set to th1=7 and th2=14 (proposed methodologies based on GSA (Single and Total
effects) and ML (mh-χ 2 ) against the DT-based methods (Boruta, VarSelRF and VSURF)). The performance of each technique is described by µ(σ )[A,B,C ] [Ranking] where µ is the mean, σ the standard
Pnvar
1
i =1 σi . C.T(s): computation time in seconds. Each VIT is compared in terms of the accuracy (the most
deviation and [A, B, C ] = [100, 500, 1000] are the sample sizes. The stability is computed as nvar
relevant input variable is highlighted), stability, balance ratio behavior and speed.

127426
I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

framework, both can be applied as variable importance regression (univariate and multivariate response) problems.
techniques in classification problems with binary or The new techniques were applied in simulated as well as real
multiclass response (balanced or imbalanced) as well problems, providing relative importance values of the input
as regression problems with univariate or multivariate variables involved in the model. We assessed the impact of the
output. 35 companies listed in the IBEX35 index on the political, eco-
nomic and social uncertainty captured by two highly regarded
Spanish economic newspapers during the Covid-19 pandemic
VI. CONCLUSION
and also measured the effect of a set of energy factors on
In this work, we presented the nonparametric mh-χ 2 variable
electricity price spikes in the Spanish electricity market. Due
importance technique that, employing a multivariate contin-
to the fact that mh-χ 2 computationally is less efficient than
uous response framework, allows us to select and rank the
the fastest method Boruta, the mh-χ 2 technique could be
most relevant input variables when dealing with different bal-
optimized and made more competitive by implementing some
ance scenarios. The method captures the importance of each
of the functions in C++, parallelizing and distributing the
variable (total effect) by measuring the dissimilarities using
computation process. An extension of the mh-χ 2 solution
the χ 2 -distance between the distribution of errors generated
can be towards studying the importance of input variables in
by the base learner (Conditional Inference Tree) before and
a multioutput mixed response (continuous and categorical)
after permuting the variable under analysis. We showed that
scenario and the analysis of the interaction effects among
the proposed technique overall outperformed its competitors
input variables on the output.
based on Random Forests since it uses the entire distribution
of errors, which incorporates all the information needed for
APPENDIX A
the variable importance analysis in contrast to those which
use only a summary of the errors information. The paramet- See Figure 17.
ric approach applies the Covariance decomposition Global
Sensitivity Analysis method, where the importance of the APPENDIX B
input variable is estimated using the Pick-Freeze Monte-carlo See Figure 18.
technique. While Global Sensitivity Analysis allows us to
break down the effect (single, total and interaction) of each APPENDIX C
input variable on the output, the assumption of independence See Figure 19.
between the input variables needs to be met. Both variable
importance techniques can be used in classification (binary APPENDIX D
and multiclass with balanced and imbalanced datasets) and See Figure 20.

FIGURE 19. Estimate effect of topics 1 (Spanish market), 2 (Pandemic + Crisis), 4 (Social measures), 5 (Spanish Economy), 11 (Companies Profit),
12 (Coronavirus), 14 (Spanish Growth), 16 (Social impact) and 18 (Employment) during the Pandemic Period reflected in the Spanish economic
newspapers (First quadrimester: From January 1st to April 30th, 2020).

VOLUME 8, 2020 127427


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

FIGURE 20. Each pie chart displays the ranking value of each factor responsible for extreme prices when using different Variable importance techniques
(per columns left to right, mh-χ 2 , Boruta, VarSelRF, VSURF and VSURF). The results are presented by trimester (per row top to bottom, T1, T2, T3, T4).

REFERENCES [13] Y. Sun, M. S. Kamel, A. K. C. Wong, and Y. Wang, ‘‘Cost-sensitive


boosting for classification of imbalanced data,’’ Pattern Recognit., vol. 40,
[1] F. Ferretti, A. Saltelli, and S. Tarantola, ‘‘Trends in sensitivity analysis
no. 12, pp. 3358–3378, Dec. 2007.
practice in the last decade,’’ Sci. Total Environ., vol. 568, pp. 666–670,
Oct. 2016. [14] C.-W. Hsu and C.-J. Lin, ‘‘A comparison of methods for multiclass
support vector machines,’’ IEEE Trans. Neural Netw., vol. 13, no. 2,
[2] I. Guyon, S. Gunn, M. Nikravesh, and L. A. Zadeh, Feature Extraction:
pp. 415–425, Mar. 2002.
Foundations and Applications, vol. 207. Berlin, Germany: Springer, 2008.
[3] A. Saltelli, M. Ratto, T. Andres, F. Campolongo, J. Cariboni, D. Gatelli, [15] N. V. Chawla, ‘‘C4. 5 and imbalanced data sets: Investigating the effect
M. Saisana, and S. Tarantola, Global Sensitivity Analysis: The Primer. of sampling method, probabilistic estimate, and decision tree structure,’’
Hoboken, NJ, USA: Wiley, 2008. in Proc. ICML, vol. 3, Aug. 2003, p. 66.
[4] H. He and Y. Ma, Imbalanced Learning: Foundations, Algorithms, and [16] T. Hothorn, K. Hornik, and A. Zeileis, ‘‘Unbiased recursive partitioning:
Applications. Hoboken, NJ, USA: Wiley, 2013. A conditional inference framework,’’ J. Comput. Graph. Statist., vol. 15,
no. 3, pp. 651–674, Sep. 2006.
[5] V. López, A. Fernández, S. García, V. Palade, and F. Herrera, ‘‘An insight
into classification with imbalanced data: Empirical results and cur- [17] C. Chen, A. Liaw, and L. Breiman, ‘‘Using random forest to learn
rent trends on using data intrinsic characteristics,’’ Inf. Sci., vol. 250, imbalanced data,’’ Univ. California, Berkeley, Berkeley, CA, USA,
pp. 113–141, Nov. 2013. Tech. Rep. 6, 2004, vol. 110, nos. 1–12, p. 24.
[6] H. He and E. A. Garcia, ‘‘Learning from imbalanced data,’’ IEEE Trans. [18] Y. Park and J. Ghosh, ‘‘Ensembles of (α)-trees for imbalanced clas-
Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009. sification problems,’’ IEEE Trans. Knowl. Data Eng., vol. 26, no. 1,
[7] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing, pp. 131–143, Jan. 2014.
‘‘Learning from class-imbalanced data: Review of methods and applica- [19] D. A. Cieslak, T. R. Hoens, N. V. Chawla, and W. P. Kegelmeyer,
tions,’’ Expert Syst. Appl., vol. 73, pp. 220–239, May 2017. ‘‘Hellinger distance decision trees are robust and skew-insensitive,’’ Data
[8] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, Mining Knowl. Discovery, vol. 24, no. 1, pp. 136–158, Jan. 2012.
‘‘SMOTE: Synthetic minority over-sampling technique,’’ J. Artif. Intell. [20] S. del Río, V. López, J. M. Benítez, and F. Herrera, ‘‘On the use of
Res., vol. 16, pp. 321–357, Jun. 2002. MapReduce for imbalanced big data using random forest,’’ Inf. Sci.,
[9] M. A. Tahir, J. Kittler, K. Mikolajczyk, and F. Yan, ‘‘A multiple expert vol. 285, pp. 112–137, Nov. 2014.
approach to the class imbalance problem using inverse random under [21] M. Lango and J. Stefanowski, ‘‘Multi-class and feature selection exten-
sampling,’’ in Proc. Int. Workshop Multiple Classifier Syst. Berlin, sions of roughly balanced bagging for imbalanced data,’’ J. Intell. Inf.
Germany: Springer, 2009, pp. 82–91. Syst., vol. 50, no. 1, pp. 97–127, Feb. 2018.
[10] R. Singh and R. Raut, ‘‘Review on class imbalance learning: Binary and [22] S. Hido, H. Kashima, and Y. Takahashi, ‘‘Roughly balanced bagging for
multiclass,’’ Int. J. Comput. Appl., vol. 131, no. 16, pp. 4–8, Dec. 2015. imbalanced data,’’ Stat. Anal. Data Mining, ASA Data Sci. J., vol. 2,
[11] A. Amin, S. Anwar, A. Adnan, M. Nawaz, N. Howard, J. Qadir, nos. 5–6, pp. 412–426, 2009.
A. Hawalah, and A. Hussain, ‘‘Comparing oversampling techniques to [23] R. Blagus and L. Lusa, ‘‘Gradient boosting for high-dimensional predic-
handle the class imbalance problem: A customer churn prediction case tion of rare events,’’ Comput. Statist. Data Anal., vol. 113, pp. 19–37,
study,’’ IEEE Access, vol. 4, pp. 7940–7957, 2016. Sep. 2017.
[12] B. Zhu, B. Baesens, and S. K. L. M. vanden Broucke, ‘‘An empirical [24] Y. Freund and R. E. Schapire, ‘‘A decision-theoretic generalization of
comparison of techniques for the class imbalance problem in churn on-line learning and an application to boosting,’’ J. Comput. Syst. Sci.,
prediction,’’ Inf. Sci., vol. 408, pp. 84–99, Oct. 2017. vol. 55, no. 1, pp. 119–139, Aug. 1997.

127428 VOLUME 8, 2020


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

[25] Y. Sun, M. Kamel, and Y. Wang, ‘‘Boosting for learning multiple classes [47] N. Spolaôr, E. A. Cherman, M. C. Monard, and H. D. Lee, ‘‘A comparison
with imbalanced class distribution,’’ in Proc. 6th Int. Conf. Data Mining of multi-label feature selection methods using the problem transformation
(ICDM), Dec. 2006, pp. 592–602. approach,’’ Electron. Notes Theor. Comput. Sci., vol. 292, pp. 135–151,
[26] S. Wang and X. Yao, ‘‘Multiclass imbalance problems: Analysis and Mar. 2013.
potential solutions,’’ IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, [48] J. Yang, J. Zhou, Z. Zhu, X. Ma, and Z. Ji, ‘‘Iterative ensemble feature
no. 4, pp. 1119–1130, Aug. 2012. selection for multiclass classification of imbalanced microarray data,’’
[27] S. Wang, H. Chen, and X. Yao, ‘‘Negative correlation learning for clas- J. Biol. Res.-Thessaloniki, vol. 23, no. S1, p. 13, May 2016.
sification ensembles,’’ in Proc. Int. Joint Conf. Neural Netw. (IJCNN), [49] G. Georgiev, I. Valova, and N. Gueorguieva, ‘‘Feature selection for mul-
Jul. 2010, pp. 1–8. ticlass problems based on information weights,’’ Procedia Comput. Sci.,
[28] S. García, A. Fernández, and F. Herrera, ‘‘Enhancing the effectiveness vol. 6, pp. 189–194, 2011.
and interpretability of decision tree and rule induction classifiers with [50] G. Y. Wong, F. H. F. Leung, and S.-H. Ling, ‘‘A hybrid evolutionary
evolutionary training set selection over imbalanced problems,’’ Appl. Soft preprocessing method for imbalanced datasets,’’ Inf. Sci., vols. 454–455,
Comput., vol. 9, no. 4, pp. 1304–1314, Sep. 2009. pp. 161–177, Jul. 2018.
[29] Z. Sun, Q. Song, X. Zhu, H. Sun, B. Xu, and Y. Zhou, ‘‘A novel ensemble [51] O. Chapelle and S. S. Keerthi, ‘‘Multi-class feature selection with support
method for classifying imbalanced data,’’ Pattern Recognit., vol. 48, no. 5, vector machines,’’ in Proc. Amer. Stat. Assoc., vol. 58, 2008.
pp. 1623–1637, May 2015. [52] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, ‘‘Gene selection for can-
[30] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, ‘‘SmoteBoost: cer classification using support vector machines,’’ Mach. Learn., vol. 46,
Improving prediction of the minority class in boosting,’’ in Proc. Eur. nos. 1–3, pp. 389–422, 2002.
Conf. Princ. Data Mining Knowl. Discovery. Berlin, Germany: Springer, [53] S. Maldonado, R. Weber, and F. Famili, ‘‘Feature selection for high-
2003, pp. 107–119. dimensional class-imbalanced data sets using support vector machines,’’
[31] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, Inf. Sci., vol. 286, pp. 228–246, Dec. 2014.
‘‘RUSBoost: A hybrid approach to alleviating class imbalance,’’ IEEE [54] S. Maldonado and J. López, ‘‘Dealing with high-dimensional class-
Trans. Syst., Man, Cybern. A, Syst. Humans, vol. 40, no. 1, pp. 185–197, imbalanced datasets: Embedded feature selection for SVM classifica-
Jan. 2010. tion,’’ Appl. Soft Comput., vol. 67, pp. 94–105, Jun. 2018.
[32] X.-Y. Liu, J. Wu, and Z.-H. Zhou, ‘‘Exploratory undersampling for class- [55] G. de Lannoy, D. François, and M. Verleysen, ‘‘Class-specific fea-
imbalance learning,’’ IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, ture selection for one-against-all multiclass SVMs,’’ in Proc. ESANN.
no. 2, pp. 539–550, Apr. 2009. New York, NY, USA: Citeseer, 2011, pp. 263–268.
[33] S. Wang and X. Yao, ‘‘Diversity analysis on imbalanced data sets by using [56] L. Yin, Y. Ge, K. Xiao, X. Wang, and X. Quan, ‘‘Feature selection for
ensemble models,’’ in Proc. IEEE Symp. Comput. Intell. Data Mining, high-dimensional imbalanced data,’’ Neurocomputing, vol. 105, pp. 3–11,
Mar. 2009, pp. 324–331. Apr. 2013.
[57] M. Alibeigi, S. Hashemi, and A. Hamzeh, ‘‘DBFS: An effective density
[34] W. Feng, W. Huang, and J. Ren, ‘‘Class imbalance ensemble learn-
based feature selection scheme for small sample size and high dimen-
ing based on the margin theory,’’ Appl. Sci., vol. 8, no. 5, p. 815,
sional imbalanced data sets,’’ Data Knowl. Eng., vols. 81–82, pp. 67–103,
May 2018.
Nov. 2012.
[35] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera,
[58] O. Gharroudi, H. Elghazel, and A. Aussem, ‘‘A comparison of multi-label
‘‘A review on ensembles for the class imbalance problem: Bagging-
feature selection methods using the random forest paradigm,’’ in Proc.
, Boosting-, and hybrid-based approaches,’’ IEEE Trans. Syst., Man,
Can. Conf. Artif. Intell. Berlin, Germany: Springer, 2014, pp. 95–106.
Cybern. C, Appl. Rev., vol. 42, no. 4, pp. 463–484, Jul. 2012.
[59] S. Janitza, C. Strobl, and A.-L. Boulesteix, ‘‘An AUC-based permutation
[36] A. P. Bradley, ‘‘The use of the area under the ROC curve in the evalua-
variable importance measure for random forests,’’ BMC Bioinf., vol. 14,
tion of machine learning algorithms,’’ Pattern Recognit., vol. 30, no. 7,
no. 1, p. 119, Dec. 2013.
pp. 1145–1159, Jul. 1997.
[60] L. Breiman, Classification and Regression Trees. Evanston, IL, USA:
[37] M. Kubat, R. Holte, and S. Matwin, ‘‘Learning when negative examples
Routledge, 2017.
abound,’’ in Proc. Eur. Conf. Mach. Learn. Berlin, Germany: Springer,
[61] C. Beyan and R. Fisher, ‘‘Classifying imbalanced data sets using similar-
1997, pp. 146–153.
ity based hierarchical decomposition,’’ Pattern Recognit., vol. 48, no. 5,
[38] N. V. Chawla, N. Japkowicz, and A. Kotcz, ‘‘Editorial: Special issue on pp. 1653–1672, May 2015.
learning from imbalanced data sets,’’ ACM SIGKDD Explor. Newslett., [62] H. Yu and J. Ni, ‘‘An improved ensemble learning method for classifying
vol. 6, no. 1, pp. 1–6, Jun. 2004. high-dimensional and imbalanced biomedicine data,’’ IEEE/ACM Trans.
[39] C. T. Nakas, ‘‘Developments in roc surface analysis and assessment of Comput. Biol. Bioinf., vol. 11, no. 4, pp. 657–666, Jul. 2014.
diagnostic markers in three-class classification problems,’’ REVSTAT- [63] S. Liu, J. Zhang, Y. Xiang, W. Zhou, and D. Xiang, ‘‘A study of
Stat. J., vol. 12, no. 1, pp. 43–65, 2014. data pre-processing techniques for imbalanced biomedical data classi-
[40] L. Yijing, G. Haixiang, L. Xiao, L. Yanan, and L. Jinling, ‘‘Adapted fication,’’ 2019, arXiv:1911.00996. [Online]. Available: http://arxiv.org/
ensemble classification algorithm based on multiple classifier system and abs/1911.00996
feature selection for classifying multi-class imbalanced data,’’ Knowl.- [64] F. Shakeel, A. S. Sabhitha, and S. Sharma, ‘‘Exploratory review on class
Based Syst., vol. 94, pp. 88–104, Feb. 2016. imbalance problem: An overview,’’ in Proc. 8th Int. Conf. Comput.,
[41] I. Ahrazem Dfuf, J. Mira McWilliams, and M. González Fernández, Commun. Netw. Technol. (ICCCNT), Jul. 2017, pp. 1–8.
‘‘Multi-output conditional inference trees applied to the electricity mar- [65] S. Sadeghyan, ‘‘A new robust feature selection method using variance-
ket: Variable importance analysis,’’ Energies, vol. 12, no. 6, p. 1097, based sensitivity analysis,’’ 2018, arXiv:1804.05092. [Online]. Available:
Mar. 2019. http://arxiv.org/abs/1804.05092
[42] D. Chen, D. Hua, J. Reifman, and X. Cheng, ‘‘Gene selection for [66] F. Fernandez-Navarro, M. Carbonero-Ruz, D. B. Alonso, and
multi-class prediction of microarray data,’’ in Proc. CSB, vol. 3, 2003, M. Torres-Jimenez, ‘‘Global sensitivity estimates for neural network
p. 492. classifiers,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 11,
[43] A. Bommert, X. Sun, B. Bischl, J. Rahnenführer, and M. Lang, ‘‘Bench- pp. 2592–2604, Nov. 2017.
mark for filter methods for feature selection in high-dimensional clas- [67] S. Dreiseitl, L. Ohno-Machado, and M. Binder, ‘‘Comparing three-
sification data,’’ Comput. Statist. Data Anal., vol. 143, Mar. 2020, class diagnostic tests by three-way ROC analysis,’’ Med. Decis. Making,
Art. no. 106839. vol. 20, no. 3, pp. 323–331, Jul. 2000.
[44] J. Lee and D.-W. Kim, ‘‘Mutual information-based multi-label feature [68] R. Díaz-Uriarte and S. A. D. Andres, ‘‘Gene selection and classification
selection using interaction information,’’ Expert Syst. Appl., vol. 42, no. 4, of microarray data using random forest,’’ BMC Bioinf., vol. 7, no. 1, p. 3,
pp. 2013–2025, Mar. 2015. 2006.
[45] J. Lee and D.-W. Kim, ‘‘Fast multi-label feature selection based on [69] R. Diaz-Uriarte, ‘‘GeneSrF and varSelRF: A Web-based tool and r pack-
information-theoretic feature ranking,’’ Pattern Recognit., vol. 48, no. 9, age for gene selection and classification using random forest,’’ BMC
pp. 2761–2771, Sep. 2015. Bioinf., vol. 8, no. 1, p. 328, Dec. 2007.
[46] J. Lee and D.-W. Kim, ‘‘Efficient multi-label feature selection using [70] H. Stoppiglia, G. Dreyfus, R. Dubois, and Y. Oussar, ‘‘Ranking a random
entropy-based label selection,’’ Entropy, vol. 18, no. 11, p. 405, feature for variable and feature selection,’’ J. Mach. Learn. Res., vol. 3,
Nov. 2016. pp. 1399–1414, Mar. 2003.

VOLUME 8, 2020 127429


I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

[71] M. B. Kursa and W. R. Rudnicki, ‘‘Feature selection with the Boruta [98] M. E. Roberts, B. M. Stewart, and D. Tingley, ‘‘STM: R package for
package,’’ J. Stat. Softw., vol. 36, no. 11, pp. 1–13, 2010. structural topic models,’’ J. Stat. Softw., vol. 10, no. 2, pp. 1–40, 2014.
[72] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, ‘‘Variable selection using [99] J. H. Zhao, Z. Y. Dong, X. Li, and K. P. Wong, ‘‘A framework for
random forests,’’ Pattern Recognit. Lett., vol. 31, no. 14, pp. 2225–2236, electricity price spike analysis with advanced data mining methods,’’
Oct. 2010. IEEE Trans. Power Syst., vol. 22, no. 1, pp. 376–385, Feb. 2007.
[73] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, ‘‘VSURF: An R package [100] X. Lu, Z. Dong, and X. Li, ‘‘Electricity market price spike forecast
for variable selection using random forests,’’ R J., vol. 7, no. 2, p. 19, with data mining techniques,’’ Electr. Power Syst. Res., vol. 73, no. 1,
2015. pp. 19–29, Jan. 2005.
[74] P. Wei, Z. Lu, and J. Song, ‘‘A comprehensive comparison of two variable [101] C. Ambroise and G. J. McLachlan, ‘‘Selection bias in gene extraction on
importance analysis techniques in high dimensions: Application to an the basis of microarray gene-expression data,’’ Proc. Nat. Acad. Sci. USA,
environmental multi-indicators system,’’ Environ. Model. Softw., vol. 70, vol. 99, no. 10, pp. 6562–6566, May 2002.
pp. 178–190, Aug. 2015. [102] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, ‘‘Bias in random
[75] W. Hoeffding, ‘‘A class of statistics with asymptotically normal distribu- forest variable importance measures: Illustrations, sources and a solu-
tion,’’ in Breakthroughs in Statistics. Berlin, Germany: Springer, 1992, tion,’’ BMC Bioinf., vol. 8, no. 1, p. 25, Dec. 2007.
pp. 308–334.
[76] I. M. Sobol, ‘‘Global sensitivity indices for nonlinear mathematical mod-
els and their Monte Carlo estimates,’’ Math. Comput. Simul., vol. 55,
nos. 1–3, pp. 271–280, Feb. 2001.
ISMAEL AHRAZEM DFUF received the B.S.
[77] F. Gamboa, A. Janon, T. Klein, and A. Lagnoux, ‘‘Sensitivity indices
for multivariate outputs,’’ 2013, arXiv:1303.3574. [Online]. Available:
degree in telecommunication engineering from
http://arxiv.org/abs/1303.3574 the University of Sevilla, Spain, in 2012, and
[78] K. Campbell, M. D. McKay, and B. J. Williams, ‘‘Sensitivity analy- the master’s degree in quantitative finance from
sis when model outputs are functions,’’ Rel. Eng. Syst. Saf., vol. 91, The University of Sydney, Australia, in 2017.
nos. 10–11, pp. 1468–1472, Oct. 2006. He is currently pursuing the Ph.D. degree in math-
[79] A. Saltelli, P. Annoni, I. Azzini, F. Campolongo, M. Ratto, and ematical engineering, statistics and operations
S. Tarantola, ‘‘Variance based sensitivity analysis of model output. research (IMEIO) with the Universidad Politéc-
Design and estimator for the total sensitivity index,’’ Comput. Phys. nica de Madrid (UPM), Madrid, Spain. In 2019,
Commun., vol. 181, no. 2, pp. 259–270, Feb. 2010. he was a Research Assistant with the Instituto de
[80] A. Saltelli and R. Bolado, ‘‘An alternative way to compute Fourier ampli- Empresa (IE Business School) and UPM. His research interests include
tude sensitivity test (FAST),’’ Comput. Statist. Data Anal., vol. 26, no. 4, application of machine learning techniques to variable importance analysis
pp. 445–460, Feb. 1998. in multivariate response scenarios, clustering algorithms for pattern recogni-
[81] M. Lamboni, H. Monod, and D. Makowski, ‘‘Multivariate sensitivity tion, and natural language processing algorithms.
analysis to measure global contribution of input factors in dynamic mod-
els,’’ Rel. Eng. Syst. Saf., vol. 96, no. 4, pp. 450–459, Apr. 2011.
[82] E. Borgonovo, ‘‘A new uncertainty importance measure,’’ Rel. Eng. Syst.
Saf., vol. 92, no. 6, pp. 771–784, Jun. 2007.
[83] L. Cui, Z. Lu, and X. Zhao, ‘‘Sensitivity indices of basic variable under JOAQUÍN FORTE PÉREZ-MINAYO received
multiple failure modes and their solutions,’’ Sin. Phys., Mech. Astron., the degree in industrial engineering from the
vol. 40, no. 12, pp. 1532–1541, 2010. ETSII, Universidad Politécnica de Madrid
[84] Q. Liu and T. Homma, ‘‘A new computational method of a moment- (UPM), in 2017. He is currently an IT Audi-
independent uncertainty importance measure,’’ Rel. Eng. Syst. Saf., tor with Pricewaterhouse Coopers. His current
vol. 94, no. 7, pp. 1205–1211, Jul. 2009. research interest includes machine learning tech-
[85] L. Li, Z. Lu, and D. Wu, ‘‘A new kind of sensitivity index for multivariate niques applied to the electricity market.
output,’’ Rel. Eng. Syst. Saf., vol. 147, pp. 123–131, Mar. 2016.
[86] J. Aitchison, The Statistical Analysis of Compositional Data. New York,
NY, USA: Chapman & Hall, 1986.
[87] L. Breiman, ‘‘Random forests,’’ Mach. Learn., vol. 45, no. 1, pp. 5–32,
2001.
[88] G. W. Snedecor and W. G. Cochran, Statistical Methods, 8th ed. Ames,
IA, USA: Iowa State Univ. Press, 1989. JOSÉ MANUEL MIRA MCWILLIAMS received
[89] K. Marola, J. Kbnt, and J. Bibly, Multivariate Analysis. London, U.K.: the master’s degree in nuclear engineering and the
Academic, 1979. Ph.D. degree in applied statistics from the Univer-
[90] J. H. Friedman, ‘‘Multivariate adaptive regression splines,’’ Annu. Statist., sidad Politécnica de Madrid. He is currently an
vol. 19, no. 1, pp. 1–67, Mar. 1991. Associate Professor of statistics with the Univer-
[91] T.-C. Fu, F. Chung, V. Ng, and R. Luk, ‘‘Pattern discovery from stock sidad Politécnica de Madrid. His current research
time series using self-organizing maps,’’ in Proc. Workshop Notes KDD interests include machine learning, Monte Carlo
Workshop Temporal Data Mining, 2001, pp. 26–29. simulations with applications to road safety, and
[92] J. Joseph and I. Indratmo, ‘‘Visualizing stock market data with self- electricity market.
organizing map,’’ in Proc. 26th Int. FLAIRS Conf., 2013, pp. 488–491.
[93] T. Kohonen, ‘‘Self-organizing maps of massive databases,’’ Int. J. Eng.
Intell. Syst. for Electr. Eng. Commun., vol. 9, no. 4, pp. 179–186, 2001.
[94] T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘‘Efficient estimation of
word representations in vector space,’’ 2013, arXiv:1301.3781. [Online].
CAMINO GONZÁLEZ FERNÁNDEZ received
Available: http://arxiv.org/abs/1301.3781
[95] T. Loughran and B. Mcdonald, ‘‘When is a liability not a liability? Textual
the Ph.D. degree in nuclear safety from the Uni-
analysis, dictionaries, and 10-ks,’’ J. Finance, vol. 66, no. 1, pp. 35–65, versidad Politécnica de Madrid (UPM), in 1993.
Feb. 2011. She is currently an Associate Professor with the
[96] S. R. Baker, N. Bloom, and S. J. Davis, ‘‘Measuring economic policy Statistics Department, UPM. She has publications
uncertainty,’’ SSRN Electron. J., vol. 131, pp. 1593–1636, 2016. in highly ranking journals. Her current research
[97] M. E. Roberts, B. M. Stewart, D. Tingley, C. Lucas, J. Leder-Luis, interests include data mining techniques and pat-
S. K. Gadarian, B. Albertson, and D. G. Rand, ‘‘Structural topic models tern recognition methods applied to energy, trans-
for open-ended survey responses,’’ Amer. J. Political Sci., vol. 58, no. 4, port, health, and sustainability.
pp. 1064–1082, Oct. 2014.

127430 VOLUME 8, 2020

You might also like