ABSTRACT Decision-making using machine learning requires a deep understanding of the model under analysis. Variable importance analysis provides the tools to assess the influence of input variables in the presence of complex interactions, making the machine learning model more interpretable and computationally more efficient. In classification problems with imbalanced datasets, this task is even more challenging. In this article, we present two variable importance techniques: a nonparametric solution, called mh-χ², and a parametric method based on Global Sensitivity Analysis (GSA). The mh-χ² method employs a continuous multivariate response framework to deal with the multiclass classification problem. Built on the permutation importance framework, the proposed mh-χ² algorithm captures the dissimilarities between the distributions of misclassification errors generated by the base learner, the Conditional Inference Tree, before and after permuting the values of the input variable under analysis. The GSA solution is based on the covariance decomposition methodology for multivariate output models. Both solutions are assessed in a comparative study against several Random Forest-based techniques, with emphasis on the multiclass classification problem under different imbalance scenarios. We apply the proposed techniques to two real application cases: first, to quantify the influence of the 35 companies listed in the Spanish market index IBEX35 on the economic, political and social uncertainty reflected in Spanish economic newspapers during the first four months of 2020 due to the COVID-19 pandemic; and second, to assess the impact of energy factors on the occurrence of price spikes in the Spanish electricity market.
INDEX TERMS COVID-19 pandemic, electricity market, global sensitivity analysis, multiclass classification problem, multivariate response scenario, variable importance analysis.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
127404 VOLUME 8, 2020
I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets
(binary or multiclass) or a regression problem in the case of a continuous output variable (univariate (scalar) or multivariate (vector output)).

In the case of a classification machine learning algorithm, we aim to predict (after a training process), with the highest accuracy, the class of a new observation Xnew. However, the accuracy of these algorithms can be influenced by the imbalanced nature of the data [4], [5], where observations of one class outnumber the other class(es) (an imbalance ratio of at least 5:1 for the binary case [6]). This skewed distribution of classes makes the goal of predicting very challenging, since the cost of misclassifying a minority-class observation can be higher than that of misclassifying an observation from the majority class (see for instance medical diagnosis cases [7]). In order to improve the prediction accuracy and lower the computational expense, a prior identification and quantification of the most relevant input variables of the model is always highly advised. This challenge increases even more when coping with multiclass problems with an imbalanced dataset.

To address the imbalanced problem, several approaches have been considered, which can be grouped into four categories:
1) Preprocessing strategies: Resampling techniques and/or variable importance analysis.
2) Cost-sensitive learning methods.
3) Adaptation of machine learning techniques.
4) Hybrid solutions: Combination of the previous three approaches.

Preprocessing strategies based on resampling techniques can be divided into two subcategories: oversampling methods (i.e. duplicating minority samples; see SMOTE, the synthetic minority oversampling technique [8]) and undersampling methods (i.e. removing a subset of the majority class(es); see RUS, or Random Under-Sampling [9]). A comparison of sampling methods applied to the imbalanced classification problem can be found in [10]–[12]. The strategy behind cost-sensitive learning is to adjust the machine learning technique in order to minimize the misclassification cost. Different solutions implementing cost-sensitive learning techniques can be found in [6], [13]. Researchers have also adapted parametric models [14] (i.e. mainly Logistic Regression and Support Vector Machine (SVM) algorithms) and nonparametric methods (Decision Tree algorithms) to find solutions to the imbalanced multiclass classification problem. When applied as variable importance techniques, SVMs have been used extensively to overcome the imbalance problem; however, Ensemble Methods (EM) based on Decision Tree techniques have shown better performance compared to parametric-based classifiers [15].

EM refers to the collection of techniques in which a group of models, belonging to the same algorithm (base learner), are trained together in order to obtain a better performance than that achieved with a single algorithm. The technique helps to reduce the variance and bias of the single learner, improving its accuracy. EM are categorized into Boosting and Bagging methods. In Boosting, the set of base learners is trained sequentially using random samples. These samples are selected with replacement from over-weighted data in which misclassified observations are assigned higher weights and then passed to the next base learner. The most popular Boosting methods are AdaBoost, XGBoost and Gradient Boosting. On the other hand, in Bagging (Bootstrap aggregating) methods, the training of each base learner is performed independently. Popular methods are C4.5, CART and, most recently developed, the Conditional Inference Trees (CIT) algorithm [16]. The CIT technique will be used in this work as the base learner for the proposed variable importance algorithm when applied to the imbalanced multiclass problem, due to its capability of handling multivariate response models.

To overcome the imbalanced classification problem, several solutions based on bagging methods have been proposed. Reference [17] explores the idea of modeling the Random Forest (RF) algorithm by incorporating cost-sensitive learning as well as a resampling technique in order to turn the imbalanced binary dataset into a more balanced one. Reference [18] modified the splitting criterion of the C4.5 algorithm by measuring the probability divergence between the joint probability of the features and the imbalanced binary target class Y (p(x,y)) and the marginal distributions (p(x) and p(y)). Reference [19] followed a similar approach using the Hellinger distance in the splitting phase and various sampling methods. When dealing not only with high-dimensional settings (p ≫ n) but also Big Data scenarios, [20] used the MapReduce framework to first partition the original dataset and then perform separate analyses of the imbalanced binary classification problem, combining various sampling techniques along with the RF algorithm. The Roughly Balanced Bagging (RBBag) technique [21] was applied in combination with a random undersampling procedure by [22], showing a better performance relative to over-sampling procedures. On the other hand, boosting methods have also been adapted to cope with the imbalanced problem. To overcome the bias toward majority classes seen with Gradient Boosting when classifying rare events (only for the binary imbalanced scenario), [23] presented three different undersampling techniques combined with a modified boosting method consisting of an early stopping and shrinkage process. Reference [24] extended the Adaptive Boosting algorithm (AdaBoost) [25] to tackle the imbalanced multiclass scenario. They developed a new algorithm, AdaBoost.M1, adding a cost-sensitive learning method to the boosting process. Reference [26] first upgraded their initial binary-class algorithm, AdaBoost.NC [27], to handle the multiclass scenario. Their study showed that the class decomposition strategy (i.e. converting the multiclass problem into binary problems, known as one-against-all, OAA) did not perform well, showing the effectiveness of simultaneous processing when dealing with the multiclass problem. They improved the classification accuracy through a combination of different over-sampling methods along with the upgraded AdaBoost.NC algorithm. Hybrid solutions try
to combine the different approaches mentioned above to deal with the imbalanced issue and improve the classification accuracy. Readers are referred to [28]–[35] for detailed descriptions of these hybrid solutions. It is worth mentioning that the majority of these solutions belong to the family of ensemble-based methods, showing an overall advantage of bagging methods combined with sampling methods over boosting methods and their extensions.

Whereas many performance metrics (Precision, F-measures, Recall, etc.) are used to assess the accuracy of classifiers, only the Area Under the Curve (AUC) [36] and the G-Mean [37] have shown, due to their robustness, to be adequate in the imbalanced case [38]. In this work, we apply the extension of the AUC, the Volume Under the Surface (VUS) [39], to analyze and visualize the importance of features in multiclass classification problems.

Another effective approach to deal with the imbalanced classification problem is to first run a preprocessing step based on variable importance analysis in order to remove non-influential variables. Removing irrelevant features lowers the risk of misclassification and avoids overfitting, bias towards the majority class and the possibility of treating the minority class(es) as noise [40]. The variable importance analysis becomes more challenging when dealing with the multiclass scenario, since the class decomposition strategy overall underperforms [26] when compared to the simultaneous processing of all classes.

In this paper, we investigate two VIA approaches to deal with imbalanced classification problems. The presented methodologies can be implemented using either a parametric or a nonparametric framework. The key idea is to convert a multiclass classification problem into a continuous multi-output regression problem and to analyze, through the proposed importance techniques, the influence of the input variables. The nonparametric approach applies the permutation importance technique to measure the influence of each predictor. The changes seen in the predicted output vector (predicted probabilities, Yc = fc(X), Yc ∈ [Y1, Y2, ..., Yl], with Σ_{i=1}^{l} Yi = 1) are summarized in a distribution error matrix, computed as the row-wise difference between the observed and the predicted matrices. The resulting matrix is then vectorized through the computation of the Mahalanobis distance between its rows, which leads to a vector of the distribution of errors. Compared to other importance techniques, where only a summary of the available error information is used, our approach considers all the information related to the errors generated by the predictive model. Working with probability distributions has shown to be more robust and suitable in skewed class distribution settings, since we work with the entire information related to class membership. Finally, the dissimilarities between the Mahalanobis distance vectors before and after permuting the predictor under assessment are quantified using the χ²-distance (a symmetric metric). Although a similar solution was presented in [41], in this work we improve the algorithm and adapt it to the imbalanced classification problem by using the Mahalanobis instead of the Euclidean distance. This metric has been extensively applied in clustering problems for outlier detection. In our case, we leverage the properties (sensitivity to class distribution) of the Mahalanobis distance to analyze the supervised problem with an imbalanced dataset. Additionally, we support our importance analysis by obtaining a nonparametric test statistic based on the null hypothesis of equal multivariate cumulative density functions. The parametric alternative is based on the covariance decomposition multioutput Global Sensitivity Analysis technique applied to the fitted multioutput model Yc = fc(X). The proposed methods will be compared to existing variable importance techniques based on RF algorithms, since these are designed for the binary as well as the multiclass scenario, as opposed to the rest of the models (SVM, NN or Logistic Regression).

This paper is organized as follows: Section 2 presents the state of the art related to variable importance methods, with emphasis on the imbalanced multiclass problem. Section 3 describes the theoretical concepts and proposed methodologies. Simulated examples consisting of nonlinear functions with different imbalance ratios are tested in Section 4. In Section 4, we also apply our proposed methodologies to two real-world problems. In the first case, we investigate the impact of the 35 companies listed in the Spanish index IBEX35 on a newly proposed uncertainty index created using the Natural Language Processing Word2Vec algorithm. The index, called IdW2V, captures the political, economic and social uncertainty in Spain due to the COVID-19 pandemic during the first four months of 2020. In the second real application, we aim to identify and quantify the impact of several factors on extreme prices in the Spanish electricity market. Section 5 is dedicated to discussing the contributions and limitations of the proposed techniques. Final conclusions are presented in Section 6.

II. RELATED WORK
Several Variable Importance (VI) methods have been proposed to tackle the imbalanced multiclass classification problem. These methods can be grouped into three categories [2]: filter, wrapper and embedded methods. Filter methods are model-independent techniques (i.e. they do not need a mapping function that maps features to classes). They use statistical measures (the χ²-statistic, the Fisher Criterion Score or information-theoretic measures) to capture the level of dependency of the classes on a group of features. Although they underperform their competitors when dealing with nonlinearities between the features [42], they are computationally less expensive and hence suitable for high-dimensional settings [43]. Wrapper and embedded methods, on the other hand, use classification models along with search methods. They measure the importance of each feature by assessing its predictive power. Despite their ability to capture interactions among features, their tendency to overfit and their computational cost make them less attractive models.

Mutual Information (MI) was used in [44], [45] as a score function to assess both the impact of the single feature as
well as the feature interactions on the multiclass output. Reference [46] presented an entropy-based solution to reduce the computation costs associated with filter methods in multiclass problems. Their key idea consisted of calculating the dependencies between each feature and a preselected subset of classes. Binary Relevance and Label Powerset approaches were considered by [47] to transform the multiclass problem into a single-class problem, and to subsequently apply Relief and Information Gain (IG) as feature selection methods to the transformed problem. Reference [48] proposed an Iterative Ensemble Feature Selection method (IEFS). First, the OAA strategy¹ was applied to convert the multiclass dataset into two-class sub-datasets. Then, iteratively for T times, the fast correlation-based filter method was applied to each sub-dataset, which was previously sampled using the SMOTE method. More recent filter-based feature selection methods that deal with the multiclass domain are discussed in [49], [50].

Parametric models such as SVM have also been used as VI methods. For instance, [51] extended the work of [52] by proposing an embedded VI method based on SVM and a recursive feature elimination procedure to deal with the multiclass scenario. Reference [53] used a similar strategy, applying a backward elimination technique with SVM as a classifier to cope with high-dimensional settings. Reference [54] adapted the SVM algorithm with a cost-sensitive learning method using a Quasi-Newton-based optimization scheme. To overcome the loss of class information when the OAA strategy is applied, [55] proposed a normalization procedure weighing the output of each two-class SVM classifier with a reliability measure.

When dealing with skewed class distributions, probabilistic approaches have shown better performance in imbalanced scenarios. Reference [56] presented a VI method based on the computation of the Hellinger distance between the two class distributions, using the CART and SVM algorithms as base learners. Reference [57] followed a similar solution and proposed the Density-Based Feature Selection algorithm (DBFS). The algorithm estimates the probability density of the analyzed feature in each class, PDF(Ci), and then computes the overlap of this probability with respect to the rest of the classes. Reference [58] compared three alternatives for the multiclass scenario based on the Random Forest algorithm. In the first two, the Binary Relevance (BRRF) and Label Powerset (RFRF) techniques were first applied to convert the multiclass dataset into various single-class datasets, while in the third alternative the Random Forest Predictive Clustering Tree (RFPCT) was applied to handle the multiclass data simultaneously. In their study, BRRF outperformed its competitors. Reference [59] used the permutation framework technique [60] to assess the importance of variables (features) in imbalanced scenarios. In their work, the importance of each input variable is measured by computing the difference between the AUC before and after randomly permuting the values of each input variable at each base learner (the CIT algorithm).

The feature selection method for imbalanced binary classification problems proposed by [61] uses a sequential forward technique where features are only added to the classifier (which is based on clustering and outlier detection) if the mean of the sensitivity and specificity of the classifier increases. The feature selection technique asBagging-FSS [62] aims to remove irrelevant and redundant features by applying hierarchical clustering that groups similar features using the Pearson correlation coefficient between two features. A more recent survey of preprocessing techniques, including resampling and feature selection methods for imbalanced binary biomedical datasets, is presented in [63]. A user guide of resampling techniques as well as feature selection methods according to the type of dataset is also provided there. An overview of feature selection techniques can be found in [64].

Few papers have proposed variable selection solutions for classification problems based on Global Sensitivity Analysis. Reference [65] proposed the FAST (Fourier Amplitude Sensitivity Test) combined with a trained Feedforward Neural Network (FNN) as a new feature selection method for classification problems. The total effect² of each input variable (Xi) on the univariate response Yl = P(cl|X) (where each response represents the probability of a specific class when an observation X is evaluated using the predictive model, the FNN) is calculated as the sum of the individual total effects with respect to all output probability values. The approach followed by [66] trains a Neural Network and finds the Sobol decomposition of the fitted functions. The sensitivity index attached to each input variable is determined by the difference among the activation functions that describe the NN system. Both solutions fail to consider the possible relationships among the probability distributions of the classes, are limited to scenarios with independent continuous input variables (assumptions on the input space) and need a known mathematical expression that maps the inputs onto the output(s).

A. RANDOM FOREST-BASED VARIABLE IMPORTANCE TECHNIQUES
In this subsection, we present several proposed techniques for VIA based on Random Forest that will be compared to our proposed methodologies.

1) AUC EXTENSION
Reference [59] used the permutation approach [60] to assess the importance of input variables for the binary output model when considering different imbalanced scenarios.

¹ A widely used strategy is the so-called class decomposition technique, or One-Against-All (OAA) technique. The OAA solution consists of transforming a multiclass dataset with l classes into l different binary sub-datasets. However, when applying this strategy, we may face a possible loss of class information, which could potentially lead to inadequate learning by the algorithm. For this reason, considering all classes simultaneously can be a more efficient procedure in the variable importance analysis.
² Total effect = Main (single) effect + Interaction effects.
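The permutation framework underlying these RF-based techniques can be sketched in a few lines. The snippet below is an illustrative, model-agnostic sketch, not the authors' implementation: it scores a fitted predictor on a dataset, shuffles one feature column at a time, and reports the average drop in the score as that variable's importance. The names `permutation_importance`, `predict` and `score`, and the toy data, are hypothetical stand-ins.

```python
import random

def permutation_importance(predict, X, y, score, n_repeats=30, seed=0):
    """Generic permutation importance: mean drop in `score` after
    shuffling each column of X (a list of rows). `predict` maps a list
    of rows to predicted labels; `score` compares truth vs. prediction."""
    rng = random.Random(seed)
    base = score(y, predict(X))
    p = len(X[0])
    importances = []
    for j in range(p):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the association between X_j and y
            Xp = [row[:j] + [col[i]] + row[j + 1:] for i, row in enumerate(X)]
            drops.append(base - score(y, predict(Xp)))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy check: y depends only on the first feature, so only X[:,0] matters.
accuracy = lambda y, yhat: sum(a == b for a, b in zip(y, yhat)) / len(y)
X = [[i % 2, (i * 7) % 3] for i in range(200)]
y = [row[0] for row in X]
predict = lambda rows: [row[0] for row in rows]  # "model" that uses feature 0
vi = permutation_importance(predict, X, y, accuracy)
```

As expected, shuffling the informative column degrades the score (a clearly positive importance), while shuffling the ignored column leaves the predictions, and hence the score, unchanged.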
Based on the same framework,³ we apply the extended versions of the AUC, the Volume Under the Surface (VUS) for the 3-class case and the Hypervolume Under the Surface (HUM) for the n-class case. These metrics have been shown to be insensitive to changes in class distributions and hence suitable in imbalance scenarios.

The level of relevance of each predictor, VIM(Xi), is measured as the difference between the performance metric of each tree in the forest (ntrees) when predicting the Out of Bag (OOB) observations before (M_t^i, in our case VUS or HUM) and after (MP_t^i) permuting the values of the predictor under analysis (Xi):

VIM(Xi) = (1/ntrees) · Σ_{t=1}^{ntrees} (MP_t^i − M_t^i)    (1)

Evaluating an OOB observation X_OOB by using any classifier results in a vector of probabilities, Yc ∈ [Y1, Y2, ..., Yl] with Σ_{i=1}^{l} Yi = 1, where each element of the vector represents the level of membership of the observation to each class. If no assumption is made about the distribution of the predicted probabilities, an unbiased nonparametric estimator for the VUS is defined as (3-class case) [67]:

VUS = (1/(n1·n2·n3)) · Σ_{i=1}^{n1} Σ_{j=1}^{n2} Σ_{k=1}^{n3} I(Y_1i, Y_2j, Y_3k)    (2)

where

I(Y_1i, Y_2j, Y_3k) = 1,   if Y_1i < Y_2j < Y_3k
                    = 1/2, if Y_1i = Y_2j < Y_3k or Y_1i < Y_2j = Y_3k
                    = 1/6, if Y_1i = Y_2j = Y_3k
                    = 0,   otherwise    (3)

Analogously, for the n-class case, the nonparametric HUM estimate is defined as:

HUM = (1/(n1 · ... · nn)) · Σ_{i=1}^{n1} ... Σ_{k=1}^{nn} I(Y_1i, ..., Y_nk)    (4)

2) VarSelRF
Based on an iterative elimination algorithm, the VarSelRF procedure ([68] and R package [69]) recursively builds each new forest (using the RF algorithm) with the variables that produce the smallest error rate, removing the irrelevant variables in the process.

To select the variables that will be included in the model, this algorithm builds a new model at each iteration, discarding the variables associated with the smallest variable importance. At each step, the 20% of variables with the smallest importance are eliminated, and a new model is built with the remaining variables. The selected set of variables will be the one that leads to the smallest error rate. The proportion of variables to eliminate is an arbitrary parameter of this method and does not depend on the data.

This procedure provides the smallest possible set of variables that can still lead to good predictive performance.

3) BORUTA
Based on the permutation importance technique, this procedure ([70] and R package [71]) adds randomness to the dataset and evaluates, in a recursive way, the new data using the Random Forest algorithm. At each step, irrelevant input variables are removed.

The Boruta algorithm duplicates the original dataset and shuffles the values of each duplicated feature. The decision to keep or discard a given variable is based on the probability that this feature is ranked higher or lower, in terms of VI, than the added random variables. The threshold that drives this decision is the top VI value among all random variables added to the original model. That is, the algorithm checks, for each of the real features, whether it has a higher importance than the defined threshold. Only if this requirement is met will the feature be considered significant in the model.

4) VSURF
Consisting of a two-stage strategy, the VSURF technique ([72] and R package [73]) first applies the permutation technique to eliminate the irrelevant input variables whose importance is lower than a predefined threshold. The model ranks the variables by sorting the variable importance in descending order. The set of eliminated variables is based on this sorted order. The calculation of the VI threshold is based on the standard deviations of the variable importances, and only those variables whose importance exceeds this value are considered in the model.

Second, a variable selection step is performed. A sequence of RF models is built with the variables that produce the smallest OOB error. As a result of this step, two sets of variables are produced. The first set accounts for the variables that are highly related to the response, including the redundant ones. The second set is formed by variables with low redundancy and the highest prediction accuracy.

III. THEORETICAL CONCEPTS AND PROPOSED METHODOLOGIES
In this section we describe the theoretical concepts from which we build our proposed variable importance techniques for the imbalanced multiclass scenario.

The importance of the input variables implicated in an input-output model can generally be assessed using different approaches, among which Machine Learning (ML) techniques and Global Sensitivity Analysis (GSA) are the most popular. By applying ML techniques, we aim to identify the most important variables in terms of their predictive power, while GSA provides the set of variables that contribute the most to the output variability. In both cases, a set of

³ This technique will be applied in this work as a visualization method (minor contribution) in order to show the difference between the volumes under the surface when the input variable is permuted [59].
FIGURE 1. Schematic design of the proposed methodologies for variable importance analysis in multiclass classification problems based on mh-χ² and multiresponse GSA. (*) Transform the multiclass response with l labels into a continuous output vector with l elements, which represent the probabilities of membership to each class. VI(Xi): variable importance of input variable Xi. Si and STi: main and total effects of Xi. PI: Permutation Importance. C.I.T.: Conditional Inference Tree. D.T.: Decision Tree.
influential input variables is extracted and ranked. A detailed comparison of, and the connections between, these techniques for the continuous univariate response model can be found in [74].

In this paper, we apply a continuous multivariate output framework to analyze the importance of the input variables involved in a multiclass classification output model with an imbalanced class distribution (Figure 1). Two solutions are investigated. From a ML perspective, the importance analysis is based on the permutation importance method (Figure 2) along with a φ-divergence measure. The covariance decomposition approach is proposed as the solution for the GSA methodology. To the best of our knowledge, this is the first work analyzing the importance of input variables in imbalanced multiclass classification problems using a continuous multivariate output framework.

A schematic design of the proposed methodologies for VIA in multiclass classification problems is displayed in Figure 1.

A. GSA: THEORETICAL CONCEPTS
Given a multioutput computational model Yc = fc(X) (fc: R^p → R^l, where p, l are integers), where X ∈ [X1, X2, ..., Xp] are assumed independent input variables defined on some probability space (Ω, P), Yc is a multioutput vector Yc ∈ [Y1, ..., Yl] with a positive semi-definite covariance matrix C = Cov(Y1, ..., Yl), and Yr = gr(X1, ..., Xp) a
problem, the single effect Si is computed as the average area between the conditional Probability Density Function (PDF) when Xi is fixed, f_{Y|Xi}(y), and the unconditional PDF, f_Y(y). Reference [83] extended the previous method to the multivariate response scenario.

Si = (1/2) · E_{Xi}[ ∫_{−∞}^{∞} |f_Y(y) − f_{Y|Xi}(y)| dy ]    (10)

Due to the difficulty of approximating the PDF, [84] employed instead the Cumulative Density Function (CDF) and computed the Kolmogorov-Smirnov (KS) distance between the CDFs (F_{Y|Xi}(y) and F_Y(y)). In this case, the sensitivity indexes are computed as the distance metric between the corresponding conditional and unconditional CDFs.

Si = E_{Xi}[ max_y |F_Y(y) − F_{Y|Xi}(y)| ]    (11)

and total sensitivity indices. Given the fitted model Yc = fc(X), we obtain two new model outputs as follows:
a) Yc^(i) = fc(Xi′, X_(i)), where Xi′ is an independent copy of Xi (i.e. freeze all input variables except Xi and sample Xi). Yc^(i) is the model output when evaluating all N observations (Xi′, X_(i)) on fc(·). Yc^(i) = [Yc1^(i), Yc2^(i), ..., Ycl^(i)] is a matrix of size N×l. Each element Yck^(i) (k ∈ [1, ..., l]) of the matrix Yc^(i) is a vector of size N×1 that represents the vector of probabilities of class k.
b) Yc^i = fc(Xi, X_(i)′), where X_(i)′ is an independent copy of X_(i) (i.e. freeze Xi and sample the rest of the input variables). Yc^i is the model output when evaluating all N observations (Xi, X_(i)′) on fc(·). Yc^i = [Yc1^i, Yc2^i, ..., Ycl^i] is a matrix of size N×l.
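The two pick-freeze evaluations a) and b) can be sketched as follows. This is a minimal illustration under simplifying assumptions: the "independent copies" are drawn by resampling rows of the available design, and `pick_freeze_outputs`, `fc` and the toy data are hypothetical names, not the authors' code.

```python
import random

def pick_freeze_outputs(fc, X, i, seed=0):
    """Pick-freeze sampling for input X_i (column index i): one output
    matrix uses an independent copy of X_i with the other columns fixed
    (case a), the other freezes X_i and resamples every other column
    (case b). `fc` maps one input row to a length-l probability vector;
    X is an N x p list of rows."""
    rng = random.Random(seed)
    N = len(X)
    # Case a): independent copy of column i, other columns frozen.
    col_copy = [X[rng.randrange(N)][i] for _ in range(N)]
    Y_resample_i = [fc(row[:i] + [col_copy[j]] + row[i + 1:])
                    for j, row in enumerate(X)]          # Yc^(i): N x l
    # Case b): freeze X_i, resample all the other columns.
    Y_freeze_i = []
    for row in X:
        donor = X[rng.randrange(N)]
        new = donor[:]       # resample the whole row...
        new[i] = row[i]      # ...but keep X_i at its original value
        Y_freeze_i.append(fc(new))                       # Yc^i: N x l
    return Y_resample_i, Y_freeze_i

# Toy 2-class model whose probabilities depend only on the first input.
X = [[j / 10.0, (j % 3) / 2.0] for j in range(30)]
fc = lambda row: [row[0] / 3.0, 1 - row[0] / 3.0]
A, B = pick_freeze_outputs(fc, X, 0)
```

Both returned matrices are N×l with rows summing to one; since the toy model ignores every input except X_0, the "freeze X_0" output B reproduces the original predictions exactly, which is the intuition the covariance decomposition exploits.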
E_BP^OOB = Yc^OOB − Ŷ_BP^OOB and E_AP(i)^OOB = Yc^OOB − Ŷ_AP(i)^OOB. Both matrices provide us with the necessary information to assess the importance of the variable under analysis. For instance (see Figure 3), if Xi has any predictive power, then, when permuting its values, the distribution of errors will be affected (i.e. the association between the classes and the predictors will change according to the power of the variable). When dealing with imbalanced datasets (the [200:90:10] and [10:280:10] datasets), the realizations ε_BP^j and ε_AP(i)^j, j = 1, ..., n_OOB (rows of the matrices E_BP^OOB and E_AP(i)^OOB), corresponding to the minority classes appear as outliers in the graphs. Some of these outliers change their positions while others remain almost invariant, which may suggest that the permutation impacts one class more than the other. The permutation also impacts the majority and minority classes differently. The goal is to measure how these outliers are affected by the permutation of the input variable.

5) In order to capture how the realizations ε_BP^j and ε_AP(i)^j change their locations when permuting the input variable, we measure the distance between each realization ε_BP^j and the distribution E_BP^OOB, while taking into account the correlations among the realizations, before and after permuting Xi (i.e. the same with ε_AP(i)^j and E_AP(i)^OOB). This is performed by computing the Mahalanobis distance. When performed for all realizations, we are vectorizing the matrices E_BP^OOB and E_AP(i)^OOB, leading to distributions of Mahalanobis distances.

d_mh^BP(ε_BP^j) = √( (ε_BP^j − µ̂_BP) · Ŝ_{E_BP^OOB}^{−1} · (ε_BP^j − µ̂_BP)^T )    (14)

where j = 1, ..., n_OOB, and µ̂_BP and Ŝ_{E_BP^OOB}^{−1} are the column-wise mean and the inverse covariance matrix of E_BP^OOB, respectively. Similarly for d_mh^AP(i)(ε_AP(i)^j).⁴

6) d_mh^BP(ε_BP^j) and d_mh^AP(i)(ε_AP(i)^j) are the Mahalanobis distances between each realization of E_BP^OOB and E_AP(i)^OOB, respectively, before and after permuting Xi. The distributions of these distances (d_mh^BP ∼ P_d and d_mh^AP(i) ∼ P_d(i)) incorporate information about the errors generated by the CIT base learner for each OOB observation.

7) The importance of the input variable under analysis, Xi, is then measured as the χ² histogram distance [88] between the random variables d_mh^BP and d_mh^AP(i):⁵

χ_d²(P_d, P_d(i)) = (1/2) · Σ_j (P_dj − P_d(i)j)² / (P_dj + P_d(i)j)

The importance metric VI(Xi) for Xi is defined as:

VI(Xi) = E[χ_d²(d_mh_t^BP, d_mh_t^AP(i))] = (1/ntree) · Σ_{t=1}^{ntree} χ_d²(P_d^t, P_d(i)^t)    (15)

where ntree is the number of trees in the Random Forest and χ_d²(P_d^t, P_d(i)^t) is the χ²-distance between the random variables d_mh_t^BP and d_mh_t^AP(i) evaluated in each tree t.

Some properties of VI(Xi):
1) VI(Xi) = 0 if Xi is irrelevant.
2) 0 ≤ VI(Xi) ≤ 1.
3) VI(Xj) > VI(Xi) if Xj is more relevant than Xi.

If Xi is irrelevant, then E_BP^OOB = E_AP(i)^OOB due to Ŷ_BP^OOB = Ŷ_AP(i)^OOB, thus d_mh^BP = d_mh^AP(i), which implies χ²(d_mh^BP, d_mh^AP(i)) = 0.

where ε̄ is the mean vector, and the covariance matrix:

S_u = (n_OOB · S_BP^OOB + n_OOB · S_AP(i)^OOB) / (N − 2)    (18)

Now, consider the hypothesis:

H0: F(ε_BP^OOB) = G(ε_AP(i)^OOB)
Ha: F(ε_BP^OOB) ≠ G(ε_AP(i)^OOB)    (19)

Since we expect the alternative hypothesis to be true if Xi is relevant, U can be used as a test statistic for this purpose. If U is large, H0 is rejected and hence Xi is influential. On the contrary, if H0 is not rejected, then F(ε_BP^OOB) = G(ε_AP(i)^OOB), which implies that E_BP^OOB and E_AP(i)^OOB yield the same distribution of errors, hence χ_t²(d_mh^BP, d_mh^AP(i)) = 0.

Although the U test does not provide an importance ranking, it can be used as an assessment tool to select variables in a multioutput scenario. It could serve as a preprocessing step before performing a variable importance analysis in which a ranking of importance is generated. The U test is nonparametric, since the null distribution does not rely on the underlying distribution of the posterior predicted probabilities of each class, as these probabilities do not follow a known distribution. In imbalanced scenarios we may encounter cases where the predicted probabilities are hard to fit, and hence this test could be more suitable. The accuracy of the U test for variable importance analysis is highly dependent on the sample size n_OOB: the larger n_OOB is, the more accurate the test.⁷

The pseudo-code describing the methodology proposed for VIA in multiclass classification problems based on the non-
ŶAP mht mht t mht mht
parametric approach is shown in the Algorithm 1.
(i)
(i.e. Ptd X = Ptd X ).
(i) IV. SIMULATED EXAMPLES AND REAL
The relevance of Xi can also be tested using the nonpara-
CASE APPLICATIONS
metric test for the bivariate two-sample problem 6 [89]. Let
j j We discuss the performance of the proposed methodolo-
εBP and εAP(i) be independent random samples (realizations)
gies based on GSA and Machine Learning techniques
extracted from EBPOOB and E OOB with cumulative density func-
AP(i) (sections III-B, III-C) and compare them to methods based
OOB
tions F(εBP ) and G(εAP OOB ). Given a U test statistic defined
(i)
on Random Forest (VarSelRF section II-A2 and Boruta
as: section II-A3, VSURF section II-A4). We choose these tech-
(N − 1) · T 2 niques for the following reasons: a) All methods use deci-
U= (16) sion tree learners, hence they all have intrinsic capability for
(N − 2)
variable importance analysis, no variable scaling is needed
where N = 2nOOB , T 2 is the Hotelling’s two-sample T 2 and mixed data (categorical and continuous) is allowed. b)
statistic defined as: Variable importance analysis for binary and multiclass classi-
fication problems can be performed with the same algorithm.
T 2 = nOOB (ε̄BP
OOB
− ε̄AP
OOB T
) · Su−1 (ε̄BP
OOB
− ε̄AP
OOB
) (17)
(i) (i) c) All methods use the permutation importance framework as
4 The Mahanalobis distance measures the distance between a point (in a tool to assess the importance of variables under analysis.
our case a realization) and the set of points characterized by a mean and d) The compared methods provide numerical estimates of
a covariance matrix. It reduces to the Euclidean distance if the covariance importance. To help with the comparison and interpretability
matrix is the identity matrix. This distance is widely used in clustering of the importance values, we standardize (i.e. PpVI (X i)
%)
problems to assess the importance of outliers in a distribution. i=1 VI (Xi )
5 As a distance metric, it satisfies the conditions: non-negativity the importance values provided by each method. e) All meth-
(χd2 (A, B) ≥ 0), symmetry (χd2 (A, B) = χd2 (B, A)) and subadditivity ods have an open code implementation.
(χd2 (A, B) ≤ χd2 (A, C) + χd2 (C, B))
OOB → ∞, it can be proved that under Ho asymtotically U ∼ χ2 .
6 For non-normal data, which is our case. 7 As n 2
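The Mahalanobis and χ²-distance computations of steps 5-7 can be sketched in a few lines. This is a minimal numpy illustration, not the paper's implementation: the synthetic error matrices (including an outlier cluster standing in for minority-class errors), the number of histogram bins and the use of a pseudo-inverse for numerical stability are all assumptions made for the example.

```python
import numpy as np

def mahalanobis_distances(E):
    """Distance of each row of E (one OOB error realization) to the
    column-wise mean of E, accounting for correlations (Eq. 14)."""
    mu = E.mean(axis=0)
    S_inv = np.linalg.pinv(np.cov(E, rowvar=False))  # pseudo-inverse for stability
    diff = E - mu
    # row-wise quadratic form diff_j^T S^-1 diff_j
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, S_inv, diff))

def chi2_histogram_distance(d_bp, d_ap, bins=20):
    """Chi-squared distance between the histograms of the two
    Mahalanobis-distance distributions (before/after permutation)."""
    lo = min(d_bp.min(), d_ap.min())
    hi = max(d_bp.max(), d_ap.max())
    p, _ = np.histogram(d_bp, bins=bins, range=(lo, hi))
    q, _ = np.histogram(d_ap, bins=bins, range=(lo, hi))
    p = p / p.sum()                          # normalized histograms
    q = q / q.sum()
    denom = p + q
    mask = denom > 0                         # skip empty bin pairs
    return 0.5 * np.sum((p[mask] - q[mask]) ** 2 / denom[mask])

rng = np.random.default_rng(0)
E_bp = rng.normal(0.0, 1.0, size=(300, 3))   # errors before permutation
E_ap = np.vstack([rng.normal(0.0, 1.0, size=(250, 3)),
                  rng.normal(4.0, 0.3, size=(50, 3))])  # outliers after permutation
d_bp = mahalanobis_distances(E_bp)
d_ap = mahalanobis_distances(E_ap)
vi = chi2_histogram_distance(d_bp, d_ap)     # one tree's contribution to Eq. 15
```

Averaging this quantity over the trees of the forest would give the estimate of Eq. (15); by construction the value lies in [0, 1] and is exactly 0 when both distance distributions coincide.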
Algorithm 1 mh-χ²: Imbalanced MultiClass Algorithm
1: Original multi-class output dataset [Y; X]
2: Transformed continuous multi-output dataset [Yc; X].
3: Variable selection: U nonparametric test.
4: for Xi in 1 : col(X) do
5:   for tree t in 1 : ntrees do
6:     Build tree CITt using [Yc^IOB; X^IOB]_t.
7:     Compute Ŷ^OOB_BPt = f_CITt(X^OOB_t) and E^OOB_BPt = Yc^OOB_t − Ŷ^OOB_BPt.
8:     Calculate d^BPt_mh(ε^j_BPt), j = 1, ..., n_OOB (d^BPt_mh ∼ P^t_d).
9:     Permute the values of the variable under analysis Xi.

TABLE 1. Balance/imbalance classification model. Class distributions for 2 sample sizes (n = 150, 900): Balanced [33%, 33%, 33%]; Moderate balancing [60%, 30%, 10%] and [80%, 10%, 10%]; Imbalanced [96%, 2%, 2%], [2%, 96%, 2%], [2%, 2%, 96%]. For example, in a model with 900 observations with moderate balancing, 60% of the observations (i.e. 540) belong to class 1, 30% (i.e. 270) to class 2 and 10% (i.e. 90) to class 3.
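The variable-selection step 3 of Algorithm 1 (the U test of Eqs. 16-19) can be sketched as follows. This is a minimal numpy transcription of the formulas as stated in the text, applied to synthetic bivariate error samples; the fixed asymptotic χ²₂ cutoff of footnote 7 and the equal sample sizes are assumptions made for the illustration.

```python
import numpy as np

CHI2_2_CRIT_95 = 5.991  # 95th percentile of chi-squared, 2 d.o.f. (footnote 7)

def u_statistic(E_bp, E_ap):
    """U test statistic of Eqs. (16)-(18): a Hotelling-type two-sample
    statistic comparing OOB error samples before/after permuting X_i."""
    n = E_bp.shape[0]                       # n_OOB (equal sample sizes assumed)
    N = 2 * n
    diff = E_bp.mean(axis=0) - E_ap.mean(axis=0)
    # pooled covariance matrix, Eq. (18)
    S_u = (n * np.cov(E_bp, rowvar=False)
           + n * np.cov(E_ap, rowvar=False)) / (N - 2)
    T2 = n * diff @ np.linalg.pinv(S_u) @ diff   # Eq. (17)
    return (N - 1) * T2 / (N - 2)                # Eq. (16)

rng = np.random.default_rng(1)
# relevant X_i: permutation shifts the error distribution
relevant = u_statistic(rng.normal(0.0, 1.0, (400, 2)),
                       rng.normal(0.4, 1.0, (400, 2)))
# irrelevant X_i: same error distribution before and after
irrelevant = u_statistic(rng.normal(0.0, 1.0, (400, 2)),
                         rng.normal(0.0, 1.0, (400, 2)))
# a large U rejects H0 (same error distribution), flagging X_i as influential
```

With these synthetic samples, `relevant` comfortably exceeds the 5.991 cutoff while `irrelevant` stays small, mirroring the decision rule described in the text.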
about misclassification errors made at each OOB observation). In other words, the information related to the misclassification error of each OOB observation is now embedded in the corresponding Mahalanobis distance. Finally, as a result, we have the distribution of Mahalanobis distances $d^{BP}_{mh}$ with information about the majority as well as the minority classes (outliers in this distribution). The distribution of $d^{BP}_{mh}$ captures the misclassification errors of both the majority and minority classes regardless of how small the minority classes are. This procedure is then performed for the permuted $X_i$, providing us with $d^{AP(i)}_{mh}$. The importance of $X_i$ is then computed using the $\chi^2$-distance between $d^{BP}_{mh}$ and $d^{AP(i)}_{mh}$, which allows us to measure how permuting $X_i$ has affected the distributions of Mahalanobis distances of the minority and majority classes.

B. NONLINEAR MODEL
The nonlinear model described in [90] maps 10 independent, uniformly distributed input variables ($X_i \sim U(0,1)$) to a continuous output $Y = f(X)$, which is then discretized to build the classification model:

$$Y = 10\sin(\pi X_A X_B) + 20(X_C - 0.5)^2 + 10X_D + 5X_E + \epsilon_i, \qquad \epsilon_i \sim N(0, 0.01^2)$$

Three dataset sizes are tested: n = 100, 500 and 1000. For each size, 100 randomly generated datasets are created. Notice that only the first five input variables have a real impact on the response, with an expected importance ranking of $X_D > X_B > X_A > X_E > X_C$. The discretization of $Y$ consists of setting thresholds on the output variable. For the 3-class problem, by modifying the
FIGURE 5. Comparison of the Volume Under the Surface before and after permuting the analyzed input variables $X_A$, $X_C$, $X_J$ (left to right) for the simulated classification model IV-A and class distributions [18:864:18] (highly imbalanced case).
FIGURE 7. Comparison of the Volume Under the Surface before and after permuting the analyzed input variables $X_D$, $X_E$, $X_J$ (left to right) for the simulated non-linear model IV-B.
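The data-generating process of the nonlinear model of Section IV-B can be reproduced in a few lines. The seed, the sample size and the th1 = 7, th2 = 14 thresholds below are illustrative choices (the paper also uses other threshold settings to vary the class balance):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
X = rng.uniform(0.0, 1.0, size=(n, 10))   # X_A ... X_J; only the first five matter

# Friedman-type nonlinear model of Section IV-B
A, B, C, D, E = X[:, 0], X[:, 1], X[:, 2], X[:, 3], X[:, 4]
Y = (10 * np.sin(np.pi * A * B) + 20 * (C - 0.5) ** 2
     + 10 * D + 5 * E + rng.normal(0.0, 0.01, n))

# Discretize the continuous response with two thresholds to obtain a
# 3-class problem; shifting the thresholds controls the imbalance.
th1, th2 = 7.0, 14.0
y_class = np.digitize(Y, bins=[th1, th2]) + 1   # classes 1, 2, 3
counts = np.bincount(y_class, minlength=4)[1:]  # per-class sample counts
```

Because the last five columns of `X` never enter `Y`, any variable importance technique applied to `[y_class; X]` should assign them near-zero importance.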
the input variables is well defined. In terms of computational performance, the Boruta and covariance decomposition GSA techniques are the fastest methods. On the contrary, VSURF and PI(VUS) have shown to be computationally inefficient.

Figure 6 displays the comparison among the techniques for the settings n = 100, 500, 1000 and th1 = 5, th2 = 10. At sample size n = 100 and class distribution [3:21:76], mh-χ² identifies all relevant and irrelevant variables, with importance values of the irrelevant predictors close to zero in comparison to the remaining techniques. On average, the VITs generate similar outputs (accuracy and stability) as the sample size increases.

Figure 7 visualizes the effect of permuting the input variable under analysis. Important variables create more separation between the ROC surfaces, leading to a bigger difference $VUS_{BP} - VUS_{AP}$, which is in concordance with their influence on the multiclass response.

C. REAL APPLICATIONS
We evaluate the proposed VIA algorithm mh-χ² on two real case problems. In the first analysis, we quantify the impact of the 35 components listed in the market capitalization weighted index IBEX35 on the Spanish economic, political and social uncertainty reflected in newspapers during the first quadrimester of 2020. During this exceptional period we generated an uncertainty index that captures the uncertainty reflected in the media due to the market behavior. VIA will allow us to understand and quantify the factors that contributed to that uncertainty, assessing which among those factors have more impact on the occurrence of high/low levels of uncertainty (i.e. an imbalance problem). In the second case, we study the impact of energy factors on the occurrence of spike prices on the Spanish electricity market.

1) SPANISH MARKET UNCERTAINTY REFLECTED IN NEWSPAPERS IN THE HEIGHT OF COVID-19: VIA
Exceptional time periods are normally characterized by high levels of uncertainty. The goal in this section is to analyze the impact of the uncertainty created by different types of companies, clustered by their return movement patterns, on the Spanish written media during the first four months of 2020. The study will allow us to assess how returns of stock prices shaped the uncertainty in the news.

In order to perform the VIA we need to construct our dataset. The following workflow describes the process:
1) Daily adjusted close prices of the 35 companies listed in the IBEX35 were extracted for the period January 01, 2020 to April 30, 2020 (daily returns were computed).⁹
2) We find the clusters of companies with similar movement patterns [91], [92] using the unsupervised machine learning algorithm SOM (Self-Organizing Map) [93]. Variable clustering based on SOM allows us to reduce the dimensionality of the problem, gathering the variables (i.e. companies) with similar information (i.e. movement return patterns) in clusters [C1, C2, ..., Ck].
3) Each cluster is represented by its first principal component $PC_{C_i}$, obtained by performing a PCA on the set of companies in the cluster. This synthetic variable explains the variability of the corresponding cluster. Steps 2 and 3 generate the input variables of our dataset $X = [PC_{C_1}, PC_{C_2}, \ldots, PC_{C_k}]$.
4) We use the Natural Language Processing (NLP) algorithm Word2Vec (W2V) [94] to create an uncertainty index that captures the daily variability reflected in economic newspapers ($Id_{W2V}$) due to the political, economic and social uncertainty during the first quadrimester of 2020.
5) Perform the VIA on the dataset $[Id_{W2V}; X]$ applying the proposed algorithm mh-χ².

Figure 8 and Table 2 show the clusters created by SOM. At a glance, we can clearly see that GRF (pharmaceutical sector) and VIS (food industry) have been clustered far from the remaining companies (Figure 8). Both companies have shown the best performance (highest returns) during this crisis. The high volatility seen in the oil market and the dependency of the rest of the sectors on the energy sector caused similar patterns among those companies. Clusters C7 and C8 gather most of the financial companies. We apply PCA to characterize the variability of each cluster. Only the first principal components are kept (PC1 explained at least 70% of the total variability of the cluster).

⁹ The price quotes were extracted from the Yahoo Finance website.
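Step 3 of the workflow (representing each cluster by its first principal component) can be sketched as follows. The toy return matrix below stands in for one real SOM cluster, and the SOM clustering itself (step 2) is assumed to have been done already; the PCA is computed with a plain numpy SVD rather than any particular library.

```python
import numpy as np

def first_principal_component(returns):
    """First principal component scores of a (days x companies) return
    matrix: the synthetic variable PC_Ci representing one cluster."""
    Xc = returns - returns.mean(axis=0)          # center each company's series
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[0]                          # projection on the 1st PC
    explained = s[0] ** 2 / np.sum(s ** 2)       # share of cluster variance
    return scores, explained

# Toy cluster: 4 companies whose daily returns share one common factor
rng = np.random.default_rng(7)
common = rng.normal(0.0, 0.02, 80)                                  # 80 trading days
cluster = np.column_stack([common + rng.normal(0.0, 0.005, 80)
                           for _ in range(4)])
pc_scores, ratio = first_principal_component(cluster)
# 'ratio' should exceed the 70% variance-retention criterion used in the paper
```

The `pc_scores` series would become one column of the input matrix $X$ fed to the VIA.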
In order to quantify the level of uncertainty perceived in the written media, we first parsed daily news from the two most read economic and financial Spanish newspapers (CincoDias¹⁰ and Expansion¹¹) for the period January 01, 2020 to April 30, 2020. We then used the financial dictionary created by Loughran and McDonald (2011) [95], composed of 297 uncertainty risk related terms, 2373 words with negative meaning and 371 positive meaning words.¹²

Two uncertainty indices were generated: an index based on counting daily uncertainty terms and a new index based on the Word2Vec algorithm.

Index based on counting uncertainty terms [96]:

$$Id1_t = \frac{1}{W_t} \sum_{i=1}^{k} w^{i}_{U_t} \quad (20)$$

where $kw$ stands for keyword, $w^{i}_{UT_t}$ is an uncertainty/negative term that appeared on a specific day $t$ in the newspaper text ($TUW$: total number of uncertainty/negative terms on day $t$), and $w^{j}_{PT_t}$ is a positive term that appeared on a specific day $t$ ($TPW$: total positive terms on day $t$); keyword = [coronavirus, covid].¹³

Figure 9 displays the evolution of the news-based uncertainty index based on W2V ($Id_{W2V_t}$) compared to the historical volatility (HV)¹⁴ of the IBEX35. Until February, $Id_{W2V}$ presents negative values, which can be interpreted as certain political, economic and social stability. While there is a delay between the two time series, the HV and the news articles index show an increasing trend with a correlation

¹⁰ https://cincodias.elpais.com
¹¹ https://www.expansion.com/
¹² The lists were translated from English to Spanish.
¹³ Some similarity values are: sim(coronavirus, vacuna) = 0.06, sim(coronavirus, volatilidad) = 0.388, sim(coronavirus, crisis) = 0.68, sim(coronavirus, dudas) = 0.614, sim(coronavirus, tratamiento) = 0.232, sim(coronavirus, normalidad) = 0.311.
¹⁴ Historical volatility is a measure of how much the stock price fluctuates during a given period. The formula used in this analysis for the HV is: $\sigma_{HV_t} = \sqrt{\frac{1}{2}\left(\ln\frac{H_t}{L_t}\right)^2 - (2\ln 2 - 1)\left(\ln\frac{C_t}{O_t}\right)^2}$, where $H_t$ is the high price, $L_t$ the low price, $C_t$ the close price and $O_t$ the open price.
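Equation (20) is a simple ratio of dictionary hits to the total word count of a day's news. A minimal sketch follows; the short Spanish term list is an illustrative stand-in for the translated Loughran-McDonald dictionary used in the paper.

```python
# Illustrative stand-in for the uncertainty/negative term dictionary
uncertainty_terms = {"incertidumbre", "riesgo", "crisis", "dudas", "volatilidad"}

def id1(tokens):
    """Daily uncertainty index of Eq. (20): fraction of the day's words
    (tokens, lower-cased) that belong to the uncertainty dictionary."""
    w_t = len(tokens)                       # W_t: total words on day t
    hits = sum(1 for w in tokens if w in uncertainty_terms)
    return hits / w_t if w_t else 0.0

day_text = "la crisis genera dudas y volatilidad en el mercado".split()
index_value = id1(day_text)                 # 3 dictionary hits out of 9 words
```

Computed over every day of the sample period, this yields the $Id1_t$ time series that is later thresholded into high/low uncertainty classes.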
FIGURE 10. Relative importance of each cluster responsible for high/low spikes in IdW 2V and the market volatility.
FIGURE 11. Estimated effect of topics 2 (Pandemic + Crisis), 5 (Spanish economy) and 14 (Spanish growth) during the pandemic period as reflected in the Spanish economic newspapers. Please see Appendix C for the estimated effects of the remaining topics.
3) IMPACT OF ENERGY FACTORS ON SPIKE PRICES ON THE SPANISH ELECTRICITY MARKET: VIA
The study presented in this section investigates the impact that certain factors have on the occurrence of spike prices on the Spanish electricity market. Understanding and measuring the relationship between extreme electricity prices and the factors that drive them may have important implications when it comes to forecasting those prices.

Before any prediction of the price or demand of electricity, it is important to identify the exogenous variables that might influence the spikes in prices. The goal of this analysis is to identify the relative impact of some factors on extreme electricity prices (i.e. spike prices) in both cases, abnormally high and abnormally low prices.

A spike price is a price $P_t$ that is significantly different from the previous one, $P_{t-1}$. Reference [99] established thresholds to define the occurrence of spike prices. The proposed definitions distinguish two groups of spikes:
• Inferior spike: a price that falls under the value of a previously fixed limit price (th1, or low price threshold), defined as $th1 = \mu - 2\sigma$.
• Superior spike: a price that surpasses the value of a previously fixed limit price (th2, or high price threshold), defined as $th2 = \mu + 2\sigma$,
where $\mu = \text{mean}(P_t)|_{t=1}^{t=T}$ and $\sigma = \text{std}(P_t)|_{t=1}^{t=T}$.

These definitions allow us to categorize each $P_t$ as belonging to one of three categories: superior spike (class 3, C3), normal price (class 2, C2) and inferior spike (class 1, C1). If normality is assumed, $P_t \sim N(0,1)$, the defined classification thresholds will include approximately 95% of the total prices (normal prices), leaving 5% of them to be considered extreme prices. This allows us to convert the original continuous response problem into a classification problem with an imbalanced domain.

Figure 12 shows the evolution of the hourly Spanish electricity prices during the year 2016 (from January to December), as well as the classification thresholds for each trimester (T1, T2, T3 and T4).

Spike prices can occur due to diverse effects. Although in an ideal market superior/inferior spikes should only be attributed to high/low levels of demand, reality is far from ideal and other factors might play a significant role in the appearance of these rare prices. Among others, the hour of
FIGURE 12. The figure shows the evolution of the hourly Spanish
electricity prices during the year 2016 along with the piecewise graphs
per trimester (T1 to T4) using thresholds (low and high) computed as:
th1 = µ − 2σ and th2 = µ + 2σ .
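The spike labeling described above is a direct thresholding rule. A minimal sketch follows; the simulated Gaussian price series is only an illustration (real hourly electricity prices are not Gaussian, which is precisely why the empirical spike share per trimester matters).

```python
import numpy as np

def label_spikes(prices):
    """Classify each price as inferior spike (1), normal (2) or superior
    spike (3) using th1 = mu - 2*sigma and th2 = mu + 2*sigma."""
    mu, sigma = prices.mean(), prices.std()
    th1, th2 = mu - 2 * sigma, mu + 2 * sigma
    labels = np.full(prices.shape, 2)       # class 2 (C2): normal prices
    labels[prices < th1] = 1                # class 1 (C1): inferior spikes
    labels[prices > th2] = 3                # class 3 (C3): superior spikes
    return labels, th1, th2

rng = np.random.default_rng(3)
hourly_prices = rng.normal(50.0, 8.0, 24 * 90)   # one simulated trimester
labels, th1, th2 = label_spikes(hourly_prices)
spike_share = np.mean(labels != 2)               # roughly 5% under normality
```

Applying the rule per trimester, as in Figure 12, keeps the thresholds adapted to seasonal price levels.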
FIGURE 14. Importance values of each factor responsible for extreme prices during the fourth trimester applying the VITs mh-χ², Boruta and VSURF. The rest of the VIA per trimester can be found in the full-page image in Appendix D.
note that after the Price1T factor, Wind Energy Production is identified as one of the most relevant factors by mh-χ² and Boruta. This is consistent with the electricity price behavior shown in the evolution of prices over that period (Figure 12), where we can see a severe negative spike (low prices are influenced by an increased use of renewable energy).
FIGURE 16. Comparison of execution time (in seconds) between the proposed algorithms, mh-χ 2 and GSA based on covariance decomposition,
versus VarSelRF, VSURF and Boruta. The hyperparameters that drive the Decision Tree algorithms have been set to default values.
Figure 15, each misclassification error before (i.e. $\varepsilon^{j}_{BP}$, a row in matrix $E^{OOB}_{BP}$) and after (i.e. $\varepsilon^{j}_{AP(i)}$, a row in matrix $E^{OOB}_{AP(i)}$) permuting $X_i$ is represented by the Mahalanobis distances $d^{BP}_{mh_j}$ and $d^{AP(i)}_{mh_j}$. The Mahalanobis distance $d^{BP}_{mh_j}$ represents the distance between each $\varepsilon^{j}_{BP}$ ($j \in [1, 2, \ldots, n_{OOB}]$) and the distribution of errors $E^{OOB}_{BP}$. The information related to the misclassification errors is embedded in the Mahalanobis distances. We can distinguish 3 clusters corresponding to misclassification errors made in the majority and minority classes. The goal is to measure the dissimilarities between the distributions of these misclassification errors, represented by the Mahalanobis distances, through the bin-wise computation of the $\chi^2$-distance between their histograms. The result of the $\chi^2$-distance is our proposed measure of importance for imbalanced classification problems.

The performance of the mh-χ² algorithm relies on an appropriate tuning of the hyperparameters that govern the base learner CIT. Five hyperparameters are used to tune the CIT algorithm [16]: ntree, mtry, Maxdepth, minsplit and minbucket. Reference [41] performed a sensitivity analysis based on the computation of Sobol indices in order to assess the importance of these hyperparameters for the accuracy of the CIT algorithm when used as base learner for VIA. The study showed that mtry has the highest impact on the accuracy of the proposed algorithm for VIA. In this paper, we choose the default values for all the hyperparameters except mtry, which is set to the total number of input variables, allowing us to evaluate the association between each input variable and the output vector in the splitting process of the CIT. As shown in Figure 16, this mtry selection does not have any effect on the computation cost of the mh-χ² algorithm.

The proposed parametric solution for VIA in imbalanced classification problems is based on the covariance decomposition GSA method. We first transform the original multiclass classification problem into a multioutput continuous regression problem by fitting a classification algorithm (NN, SVM or RF) and evaluating all observations of $X$. The result of this evaluation is a multioutput continuous dataset $[Y_c, X]$ where each column of $Y_c$ represents the probability of one class, $Y_c = [P(c_1|X), P(c_2|X), \ldots, P(c_l|X)]$ (i.e. $Y_c|_{x_i} = [P(c_1|x_i), P(c_2|x_i), \ldots, P(c_l|x_i)]$ is a vector of probabilities where each element represents the probability of a specific class; in other words, $P(c_l|x_i)$ is the level of membership of the realization $x_i \in X$ to class $l$). We then find the multivariate regression model linking the multivariate output and the input variables $[Y_c, X]$ (i.e. $Y_c = f_c(X)$). Given the model $Y_c = f_c(X)$, we can now apply the covariance decomposition GSA to assess the influence of each input variable on the multivariate output. One way of quantifying this influence is through the computation of the so-called Sobol indices. These importance indices measure the uncertainty of the output(s) caused by the uncertainty of each input variable (see Eq. 9). In practice, the Sobol indices are estimated using, for instance, the Monte Carlo Pick-Freeze methodology (see Eq. 13). For interpretability purposes, we rewrite Eq. 13 for the 3-class ($l = 3$) problem as:

$$S_{T_i} = \sum_{k=1}^{3} \frac{Y_{c_k} \cdot Y_{c_k}^{(i)} - K_{c_k}}{B_{k}^{(i)}} \quad (23)$$
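Equation (23) is the authors' covariance decomposition form of the total index. As a generic illustration of the Pick-Freeze idea it relies on, the following sketch estimates a total Sobol index for a toy two-output model using Jansen's estimator, aggregated over output components; this is not the paper's exact estimator, and the model, sample size and seed are assumptions made for the example.

```python
import numpy as np

def total_sobol_pick_freeze(f, d, i, n=20000, rng=None):
    """Monte Carlo estimate of the total Sobol index S_Ti of input i for a
    (possibly multivariate) model f: re-pick only column i between the two
    evaluations and apply Jansen's estimator, summed over outputs."""
    if rng is None:
        rng = np.random.default_rng(0)
    A = rng.uniform(0.0, 1.0, (n, d))
    A_i = A.copy()
    A_i[:, i] = rng.uniform(0.0, 1.0, n)    # "re-pick" X_i, freeze the rest
    YA, YAi = f(A), f(A_i)
    num = 0.5 * np.mean((YA - YAi) ** 2, axis=0)   # per output component
    den = np.var(YA, axis=0)
    return np.sum(num) / np.sum(den)

# Additive toy model with two continuous outputs (class-probability stand-ins):
# X0 drives both outputs, X2 is completely irrelevant.
def f(X):
    return np.column_stack([4 * X[:, 0] + X[:, 1], 2 * X[:, 0]])

st0 = total_sobol_pick_freeze(f, d=3, i=0)
st2 = total_sobol_pick_freeze(f, d=3, i=2)   # exactly 0 for the unused input
```

As with the $U$ test and $VI(X_i)$, an irrelevant input yields a zero index while an influential one yields a large value, here close to 1 for the dominant input `X0`.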
FIGURE 17. Comparison of the DT-based VITs (VSURF, Boruta, VarSelRF, mh-χ²) for the simulated classification model IV-A. We compare the mean of the relative importances of each input variable computed using the compared VITs. The mean results from averaging the VI values over 100 simulated datasets for sample sizes n = 150 and 900. For each sample size we simulated 6 balance ratio scenarios with different class distributions. The U test values for each input variable are also included (values under the input variable names); the larger the U value, the more relevant the predictor.
muting the $X_i$ in the case of mh-χ², and by measuring how the probabilities of each specific class change when computing the scalar product between the vectors of class probabilities before and after freezing the input variable(s) under assessment. The mh-χ² algorithm and the covariance decomposition GSA method outperform the methods discussed in the literature in the following respects:
• The nonparametric mh-χ² algorithm can be seen as an extension of Decision Tree-based VITs (VarSelRF, VSURF and Boruta) since it provides better accuracy when dealing with imbalanced datasets while matching their accuracy when working with balanced datasets. While the compared VITs base their VI analysis on a summary of the misclassification errors, mh-χ² considers the entire distribution of errors, proposing a probabilistic distance measure (the $\chi^2$-distance) as the importance metric for VIA. This solution allows us to visualize and measure the impact of $X_i$ on the minority class.
• The parametric solution based on the covariance decomposition GSA method allows us to break down the contribution of each input variable into single ($S_i$), total ($S_{T_i}$) and interaction effects ($S_{T_i} - S_i$). While a large $1 - \sum S_i$ indicates the presence of interactions among the input variables, $S_{T_i} - S_i$ allows us to quantify these interactions. The GSA method also proves to be the most computationally efficient solution among its competitors.
• Scalability: since the mh-χ² algorithm uses the CIT algorithm as base learner (capable of handling multivariate response models) and the covariance decomposition GSA method employs a multivariate continuous response framework, both can be applied as variable importance techniques in classification problems with binary or multiclass response (balanced or imbalanced) as well as regression problems with univariate or multivariate output.

VI. CONCLUSION
In this work, we presented the nonparametric mh-χ² variable importance technique that, employing a multivariate continuous response framework, allows us to select and rank the most relevant input variables under different balance scenarios. The method captures the importance of each variable (total effect) by measuring the dissimilarities, using the $\chi^2$-distance, between the distributions of errors generated by the base learner (Conditional Inference Tree) before and after permuting the variable under analysis. We showed that the proposed technique overall outperformed its competitors based on Random Forests, since it uses the entire distribution of errors, which incorporates all the information needed for the variable importance analysis, in contrast to those that use only a summary of the error information. The parametric approach applies the covariance decomposition Global Sensitivity Analysis method, where the importance of the input variable is estimated using the Pick-Freeze Monte Carlo technique. While Global Sensitivity Analysis allows us to break down the effect (single, total and interaction) of each input variable on the output, the assumption of independence between the input variables needs to be met. Both variable importance techniques can be used in classification (binary and multiclass, with balanced and imbalanced datasets) and regression (univariate and multivariate response) problems. The new techniques were applied to simulated as well as real problems, providing relative importance values of the input variables involved in the model. We assessed the impact of the 35 companies listed in the IBEX35 index on the political, economic and social uncertainty captured by two highly regarded Spanish economic newspapers during the Covid-19 pandemic, and also measured the effect of a set of energy factors on electricity price spikes in the Spanish electricity market. Since mh-χ² is computationally less efficient than the fastest method, Boruta, the mh-χ² technique could be optimized and made more competitive by implementing some of the functions in C++ and by parallelizing and distributing the computation. An extension of the mh-χ² solution could address the importance of input variables in a multioutput mixed response (continuous and categorical) scenario, as well as the analysis of the interaction effects among input variables on the output.

APPENDIX A
See Figure 17.

APPENDIX B
See Figure 18.

APPENDIX C
See Figure 19.

APPENDIX D
See Figure 20.

VOLUME 8, 2020
I. Ahrazem Dfuf et al.: VIA in Imbalanced Datasets

FIGURE 18. Comparison of the VITs evaluated over 100 datasets for the non-linear model IV-B when the thresholds are set to th1=7 and th2=14 (proposed methodologies based on GSA (single and total effects) and ML (mh-χ²) against the DT-based methods (Boruta, VarSelRF and VSURF)). The performance of each technique is described by $\mu(\sigma)[A,B,C]$ [Ranking], where $\mu$ is the mean, $\sigma$ the standard deviation and $[A, B, C] = [100, 500, 1000]$ are the sample sizes. The stability is computed as $\frac{1}{nvar}\sum_{i=1}^{nvar}\sigma_i$. C.T(s): computation time in seconds. Each VIT is compared in terms of accuracy (the most relevant input variable is highlighted), stability, balance ratio behavior and speed.
FIGURE 19. Estimated effect of topics 1 (Spanish market), 2 (Pandemic + Crisis), 4 (Social measures), 5 (Spanish economy), 11 (Companies' profit), 12 (Coronavirus), 14 (Spanish growth), 16 (Social impact) and 18 (Employment) during the pandemic period as reflected in the Spanish economic newspapers (first quadrimester: from January 1st to April 30th, 2020).
FIGURE 20. Each pie chart displays the ranking value of each factor responsible for extreme prices when using the different variable importance techniques (per column, left to right: mh-χ², Boruta, VarSelRF, VSURF and the covariance decomposition GSA method). The results are presented by trimester (per row, top to bottom: T1, T2, T3, T4).
[25] Y. Sun, M. Kamel, and Y. Wang, "Boosting for learning multiple classes with imbalanced class distribution," in Proc. 6th Int. Conf. Data Mining (ICDM), Dec. 2006, pp. 592–602.
[26] S. Wang and X. Yao, "Multiclass imbalance problems: Analysis and potential solutions," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 4, pp. 1119–1130, Aug. 2012.
[27] S. Wang, H. Chen, and X. Yao, "Negative correlation learning for classification ensembles," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2010, pp. 1–8.
[28] S. García, A. Fernández, and F. Herrera, "Enhancing the effectiveness and interpretability of decision tree and rule induction classifiers with evolutionary training set selection over imbalanced problems," Appl. Soft Comput., vol. 9, no. 4, pp. 1304–1314, Sep. 2009.
[29] Z. Sun, Q. Song, X. Zhu, H. Sun, B. Xu, and Y. Zhou, "A novel ensemble method for classifying imbalanced data," Pattern Recognit., vol. 48, no. 5, pp. 1623–1637, May 2015.
[30] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, "SmoteBoost: Improving prediction of the minority class in boosting," in Proc. Eur. Conf. Princ. Data Mining Knowl. Discovery. Berlin, Germany: Springer, 2003, pp. 107–119.
[31] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, "RUSBoost: A hybrid approach to alleviating class imbalance," IEEE Trans. Syst., Man, Cybern. A, Syst. Humans, vol. 40, no. 1, pp. 185–197, Jan. 2010.
[32] X.-Y. Liu, J. Wu, and Z.-H. Zhou, "Exploratory undersampling for class-imbalance learning," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 2, pp. 539–550, Apr. 2009.
[33] S. Wang and X. Yao, "Diversity analysis on imbalanced data sets by using ensemble models," in Proc. IEEE Symp. Comput. Intell. Data Mining, Mar. 2009, pp. 324–331.
[34] W. Feng, W. Huang, and J. Ren, "Class imbalance ensemble learning based on the margin theory," Appl. Sci., vol. 8, no. 5, p. 815, May 2018.
[35] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 42, no. 4, pp. 463–484, Jul. 2012.
[36] A. P. Bradley, "The use of the area under the ROC curve in the evaluation of machine learning algorithms," Pattern Recognit., vol. 30, no. 7, pp. 1145–1159, Jul. 1997.
[37] M. Kubat, R. Holte, and S. Matwin, "Learning when negative examples abound," in Proc. Eur. Conf. Mach. Learn. Berlin, Germany: Springer, 1997, pp. 146–153.
[38] N. V. Chawla, N. Japkowicz, and A. Kotcz, "Editorial: Special issue on learning from imbalanced data sets," ACM SIGKDD Explor. Newslett., vol. 6, no. 1, pp. 1–6, Jun. 2004.
[39] C. T. Nakas, "Developments in ROC surface analysis and assessment of diagnostic markers in three-class classification problems," REVSTAT-Stat. J., vol. 12, no. 1, pp. 43–65, 2014.
[40] L. Yijing, G. Haixiang, L. Xiao, L. Yanan, and L. Jinling, "Adapted ensemble classification algorithm based on multiple classifier system and feature selection for classifying multi-class imbalanced data," Knowl.-Based Syst., vol. 94, pp. 88–104, Feb. 2016.
[41] I. Ahrazem Dfuf, J. Mira McWilliams, and M. González Fernández,
[47] N. Spolaôr, E. A. Cherman, M. C. Monard, and H. D. Lee, "A comparison of multi-label feature selection methods using the problem transformation approach," Electron. Notes Theor. Comput. Sci., vol. 292, pp. 135–151, Mar. 2013.
[48] J. Yang, J. Zhou, Z. Zhu, X. Ma, and Z. Ji, "Iterative ensemble feature selection for multiclass classification of imbalanced microarray data," J. Biol. Res.-Thessaloniki, vol. 23, no. S1, p. 13, May 2016.
[49] G. Georgiev, I. Valova, and N. Gueorguieva, "Feature selection for multiclass problems based on information weights," Procedia Comput. Sci., vol. 6, pp. 189–194, 2011.
[50] G. Y. Wong, F. H. F. Leung, and S.-H. Ling, "A hybrid evolutionary preprocessing method for imbalanced datasets," Inf. Sci., vols. 454–455, pp. 161–177, Jul. 2018.
[51] O. Chapelle and S. S. Keerthi, "Multi-class feature selection with support vector machines," in Proc. Amer. Stat. Assoc., vol. 58, 2008.
[52] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Mach. Learn., vol. 46, nos. 1–3, pp. 389–422, 2002.
[53] S. Maldonado, R. Weber, and F. Famili, "Feature selection for high-dimensional class-imbalanced data sets using support vector machines," Inf. Sci., vol. 286, pp. 228–246, Dec. 2014.
[54] S. Maldonado and J. López, "Dealing with high-dimensional class-imbalanced datasets: Embedded feature selection for SVM classification," Appl. Soft Comput., vol. 67, pp. 94–105, Jun. 2018.
[55] G. de Lannoy, D. François, and M. Verleysen, "Class-specific feature selection for one-against-all multiclass SVMs," in Proc. ESANN. New York, NY, USA: Citeseer, 2011, pp. 263–268.
[56] L. Yin, Y. Ge, K. Xiao, X. Wang, and X. Quan, "Feature selection for high-dimensional imbalanced data," Neurocomputing, vol. 105, pp. 3–11, Apr. 2013.
[57] M. Alibeigi, S. Hashemi, and A. Hamzeh, "DBFS: An effective density based feature selection scheme for small sample size and high dimensional imbalanced data sets," Data Knowl. Eng., vols. 81–82, pp. 67–103, Nov. 2012.
[58] O. Gharroudi, H. Elghazel, and A. Aussem, "A comparison of multi-label feature selection methods using the random forest paradigm," in Proc. Can. Conf. Artif. Intell. Berlin, Germany: Springer, 2014, pp. 95–106.
[59] S. Janitza, C. Strobl, and A.-L. Boulesteix, "An AUC-based permutation variable importance measure for random forests," BMC Bioinf., vol. 14, no. 1, p. 119, Dec. 2013.
[60] L. Breiman, Classification and Regression Trees. Evanston, IL, USA: Routledge, 2017.
[61] C. Beyan and R. Fisher, "Classifying imbalanced data sets using similarity based hierarchical decomposition," Pattern Recognit., vol. 48, no. 5, pp. 1653–1672, May 2015.
[62] H. Yu and J. Ni, "An improved ensemble learning method for classifying high-dimensional and imbalanced biomedicine data," IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 11, no. 4, pp. 657–666, Jul. 2014.
[63] S. Liu, J. Zhang, Y. Xiang, W. Zhou, and D. Xiang, "A study of data pre-processing techniques for imbalanced biomedical data classification," 2019, arXiv:1911.00996.
[64] F. Shakeel, A. S. Sabhitha, and S. Sharma, "Exploratory review on class imbalance problem: An overview," in Proc. 8th Int. Conf. Comput., Commun. Netw. Technol. (ICCCNT), Jul. 2017, pp. 1–8.
‘‘Multi-output conditional inference trees applied to the electricity mar- [65] S. Sadeghyan, ‘‘A new robust feature selection method using variance-
ket: Variable importance analysis,’’ Energies, vol. 12, no. 6, p. 1097, based sensitivity analysis,’’ 2018, arXiv:1804.05092. [Online]. Available:
Mar. 2019. http://arxiv.org/abs/1804.05092
[42] D. Chen, D. Hua, J. Reifman, and X. Cheng, ‘‘Gene selection for [66] F. Fernandez-Navarro, M. Carbonero-Ruz, D. B. Alonso, and
multi-class prediction of microarray data,’’ in Proc. CSB, vol. 3, 2003, M. Torres-Jimenez, ‘‘Global sensitivity estimates for neural network
p. 492. classifiers,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 11,
[43] A. Bommert, X. Sun, B. Bischl, J. Rahnenführer, and M. Lang, ‘‘Bench- pp. 2592–2604, Nov. 2017.
mark for filter methods for feature selection in high-dimensional clas- [67] S. Dreiseitl, L. Ohno-Machado, and M. Binder, ‘‘Comparing three-
sification data,’’ Comput. Statist. Data Anal., vol. 143, Mar. 2020, class diagnostic tests by three-way ROC analysis,’’ Med. Decis. Making,
Art. no. 106839. vol. 20, no. 3, pp. 323–331, Jul. 2000.
[44] J. Lee and D.-W. Kim, ‘‘Mutual information-based multi-label feature [68] R. Díaz-Uriarte and S. A. D. Andres, ‘‘Gene selection and classification
selection using interaction information,’’ Expert Syst. Appl., vol. 42, no. 4, of microarray data using random forest,’’ BMC Bioinf., vol. 7, no. 1, p. 3,
pp. 2013–2025, Mar. 2015. 2006.
[45] J. Lee and D.-W. Kim, ‘‘Fast multi-label feature selection based on [69] R. Diaz-Uriarte, ‘‘GeneSrF and varSelRF: A Web-based tool and r pack-
information-theoretic feature ranking,’’ Pattern Recognit., vol. 48, no. 9, age for gene selection and classification using random forest,’’ BMC
pp. 2761–2771, Sep. 2015. Bioinf., vol. 8, no. 1, p. 328, Dec. 2007.
[46] J. Lee and D.-W. Kim, ‘‘Efficient multi-label feature selection using [70] H. Stoppiglia, G. Dreyfus, R. Dubois, and Y. Oussar, ‘‘Ranking a random
entropy-based label selection,’’ Entropy, vol. 18, no. 11, p. 405, feature for variable and feature selection,’’ J. Mach. Learn. Res., vol. 3,
Nov. 2016. pp. 1399–1414, Mar. 2003.
[71] M. B. Kursa and W. R. Rudnicki, ‘‘Feature selection with the Boruta package,’’ J. Stat. Softw., vol. 36, no. 11, pp. 1–13, 2010.
[72] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, ‘‘Variable selection using random forests,’’ Pattern Recognit. Lett., vol. 31, no. 14, pp. 2225–2236, Oct. 2010.
[73] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, ‘‘VSURF: An R package for variable selection using random forests,’’ R J., vol. 7, no. 2, p. 19, 2015.
[74] P. Wei, Z. Lu, and J. Song, ‘‘A comprehensive comparison of two variable importance analysis techniques in high dimensions: Application to an environmental multi-indicators system,’’ Environ. Model. Softw., vol. 70, pp. 178–190, Aug. 2015.
[75] W. Hoeffding, ‘‘A class of statistics with asymptotically normal distribution,’’ in Breakthroughs in Statistics. Berlin, Germany: Springer, 1992, pp. 308–334.
[76] I. M. Sobol, ‘‘Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates,’’ Math. Comput. Simul., vol. 55, nos. 1–3, pp. 271–280, Feb. 2001.
[77] F. Gamboa, A. Janon, T. Klein, and A. Lagnoux, ‘‘Sensitivity indices for multivariate outputs,’’ 2013, arXiv:1303.3574. [Online]. Available: http://arxiv.org/abs/1303.3574
[78] K. Campbell, M. D. McKay, and B. J. Williams, ‘‘Sensitivity analysis when model outputs are functions,’’ Rel. Eng. Syst. Saf., vol. 91, nos. 10–11, pp. 1468–1472, Oct. 2006.
[79] A. Saltelli, P. Annoni, I. Azzini, F. Campolongo, M. Ratto, and S. Tarantola, ‘‘Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index,’’ Comput. Phys. Commun., vol. 181, no. 2, pp. 259–270, Feb. 2010.
[80] A. Saltelli and R. Bolado, ‘‘An alternative way to compute Fourier amplitude sensitivity test (FAST),’’ Comput. Statist. Data Anal., vol. 26, no. 4, pp. 445–460, Feb. 1998.
[81] M. Lamboni, H. Monod, and D. Makowski, ‘‘Multivariate sensitivity analysis to measure global contribution of input factors in dynamic models,’’ Rel. Eng. Syst. Saf., vol. 96, no. 4, pp. 450–459, Apr. 2011.
[82] E. Borgonovo, ‘‘A new uncertainty importance measure,’’ Rel. Eng. Syst. Saf., vol. 92, no. 6, pp. 771–784, Jun. 2007.
[83] L. Cui, Z. Lu, and X. Zhao, ‘‘Sensitivity indices of basic variable under multiple failure modes and their solutions,’’ Sin. Phys., Mech. Astron., vol. 40, no. 12, pp. 1532–1541, 2010.
[84] Q. Liu and T. Homma, ‘‘A new computational method of a moment-independent uncertainty importance measure,’’ Rel. Eng. Syst. Saf., vol. 94, no. 7, pp. 1205–1211, Jul. 2009.
[85] L. Li, Z. Lu, and D. Wu, ‘‘A new kind of sensitivity index for multivariate output,’’ Rel. Eng. Syst. Saf., vol. 147, pp. 123–131, Mar. 2016.
[86] J. Aitchison, The Statistical Analysis of Compositional Data. New York, NY, USA: Chapman & Hall, 1986.
[87] L. Breiman, ‘‘Random forests,’’ Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[88] G. W. Snedecor and W. G. Cochran, Statistical Methods, 8th ed. Ames, IA, USA: Iowa State Univ. Press, 1989.
[89] K. V. Mardia, J. T. Kent, and J. M. Bibby, Multivariate Analysis. London, U.K.: Academic, 1979.
[90] J. H. Friedman, ‘‘Multivariate adaptive regression splines,’’ Ann. Statist., vol. 19, no. 1, pp. 1–67, Mar. 1991.
[91] T.-C. Fu, F. Chung, V. Ng, and R. Luk, ‘‘Pattern discovery from stock time series using self-organizing maps,’’ in Proc. Workshop Notes KDD Workshop Temporal Data Mining, 2001, pp. 26–29.
[92] J. Joseph and I. Indratmo, ‘‘Visualizing stock market data with self-organizing map,’’ in Proc. 26th Int. FLAIRS Conf., 2013, pp. 488–491.
[93] T. Kohonen, ‘‘Self-organizing maps of massive databases,’’ Int. J. Eng. Intell. Syst. Electr. Eng. Commun., vol. 9, no. 4, pp. 179–186, 2001.
[94] T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘‘Efficient estimation of word representations in vector space,’’ 2013, arXiv:1301.3781. [Online]. Available: http://arxiv.org/abs/1301.3781
[95] T. Loughran and B. McDonald, ‘‘When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks,’’ J. Finance, vol. 66, no. 1, pp. 35–65, Feb. 2011.
[96] S. R. Baker, N. Bloom, and S. J. Davis, ‘‘Measuring economic policy uncertainty,’’ SSRN Electron. J., vol. 131, pp. 1593–1636, 2016.
[97] M. E. Roberts, B. M. Stewart, D. Tingley, C. Lucas, J. Leder-Luis, S. K. Gadarian, B. Albertson, and D. G. Rand, ‘‘Structural topic models for open-ended survey responses,’’ Amer. J. Political Sci., vol. 58, no. 4, pp. 1064–1082, Oct. 2014.
[98] M. E. Roberts, B. M. Stewart, and D. Tingley, ‘‘STM: R package for structural topic models,’’ J. Stat. Softw., vol. 10, no. 2, pp. 1–40, 2014.
[99] J. H. Zhao, Z. Y. Dong, X. Li, and K. P. Wong, ‘‘A framework for electricity price spike analysis with advanced data mining methods,’’ IEEE Trans. Power Syst., vol. 22, no. 1, pp. 376–385, Feb. 2007.
[100] X. Lu, Z. Dong, and X. Li, ‘‘Electricity market price spike forecast with data mining techniques,’’ Electr. Power Syst. Res., vol. 73, no. 1, pp. 19–29, Jan. 2005.
[101] C. Ambroise and G. J. McLachlan, ‘‘Selection bias in gene extraction on the basis of microarray gene-expression data,’’ Proc. Nat. Acad. Sci. USA, vol. 99, no. 10, pp. 6562–6566, May 2002.
[102] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, ‘‘Bias in random forest variable importance measures: Illustrations, sources and a solution,’’ BMC Bioinf., vol. 8, no. 1, p. 25, Dec. 2007.

ISMAEL AHRAZEM DFUF received the B.S. degree in telecommunication engineering from the University of Sevilla, Spain, in 2012, and the master's degree in quantitative finance from The University of Sydney, Australia, in 2017. He is currently pursuing the Ph.D. degree in mathematical engineering, statistics, and operations research (IMEIO) with the Universidad Politécnica de Madrid (UPM), Madrid, Spain. In 2019, he was a Research Assistant with the Instituto de Empresa (IE Business School) and UPM. His research interests include the application of machine learning techniques to variable importance analysis in multivariate response scenarios, clustering algorithms for pattern recognition, and natural language processing algorithms.

JOAQUÍN FORTE PÉREZ-MINAYO received the degree in industrial engineering from the ETSII, Universidad Politécnica de Madrid (UPM), in 2017. He is currently an IT Auditor with PricewaterhouseCoopers. His current research interest includes machine learning techniques applied to the electricity market.

JOSÉ MANUEL MIRA MCWILLIAMS received the master's degree in nuclear engineering and the Ph.D. degree in applied statistics from the Universidad Politécnica de Madrid. He is currently an Associate Professor of statistics with the Universidad Politécnica de Madrid. His current research interests include machine learning, Monte Carlo simulations with applications to road safety, and the electricity market.

CAMINO GONZÁLEZ FERNÁNDEZ received the Ph.D. degree in nuclear safety from the Universidad Politécnica de Madrid (UPM), in 1993. She is currently an Associate Professor with the Statistics Department, UPM. She has published in highly ranked journals. Her current research interests include data mining techniques and pattern recognition methods applied to energy, transport, health, and sustainability.