
Does Dataset Complexity Matter for Model Explainers?
1st José Ribeiro
Federal University of Pará - UFPA
Federal Institute of Pará - IFPA
Belém, Brazil
jose.ribeiro@ifpa.edu.br

2nd Raíssa Silva
IRMB, Montpellier University
La Ligue Contre le Cancer
Montpellier, France
r.lorenna@gmail.com

3rd Lucas Cardoso
ICEN, Federal University of Pará
Belém, Brazil
lucas.cardoso@icen.ufpa.br

4th Ronnie Alves
Federal University of Pará - UFPA
Vale Institute of Technology - ITV DS
Belém, Brazil
ronnie.alves@itv.org

Abstract—Strategies based on Explainable Artificial Intelligence - XAI have emerged in computing to promote a better understanding of predictions made by black box models. Most XAI measures used today explain these types of models by generating attribute rankings aimed at explaining the model, that is, the analysis of the Attribute Importance of the Model. There is no consensus on which XAI measure generates an overall explainability rank. For this reason, several proposals for tools have emerged (Ciu, Dalex, Eli5, Lofo, Shap and Skater). An experimental benchmark of explainable AI techniques capable of producing global explainability ranks based on tabular data related to different problems and ensemble models is presented herein, seeking to answer questions such as "Are the explanations generated by the different measures the same, similar or different?" and "How does data complexity play along with model explainability?" The results from the construction of 82 computational models and 492 ranks shed some light on the other side of the problem of explainability: dataset complexity!

Index Terms—Explainable Artificial Intelligence - XAI, Black box model, Dataset complexity.

I. INTRODUCTION

Recently, technology has increasingly evolved and allowed intelligent algorithms to be present in our daily lives through solutions to the most diversified types of problems, thus prompting a need for machine learning models that solve increasingly complex problems and that justify their decision making, in addition to delivering high problem-solving performance [1] [2].

Computational models based on bagging and boosting algorithms, which provide high performance and high generalization capacity, are commonly used in computation to solve regression and classification problems based on tabular data. However, these models are not considered transparent algorithms¹, being considered black box algorithms² and, therefore, less used in problems related to sensitive contexts, such as health and safety [3] [4] [5].

¹Transparent Algorithms: algorithms that generate explanations for how a given output was produced. Examples include Decision Tree, Logistic Regression and K-nearest Neighbors.
²Black Box Algorithms: machine learning algorithms whose classification or regression decisions are hidden from the user.

The limited understanding of black box models requires a search for measures or tools that can provide information about local explanations — aiming to predict around an instance through various methods to obtain a local attribute importance ranking [6] — and global explanations — when it is possible to understand the logic of all instances of the model by generating a global attribute importance rank [6] [7] — as a means of making decisions interpretable and, thus, more reliable [8].

Efforts have been made in the Explainable Artificial Intelligence - XAI area regarding the development of different measures that explain black box models even after their training and testing process, called post-hoc analysis [3]. Thus, measures such as Ciu [9], Dalex [10], Eli5 [11], Lofo [12], Shap [13], and Skater [14] emerged to promote the creation of model-agnostic³ explanations. Each of these tools is capable of generating explanations using different techniques, but a fact they have in common is that they all generate global attribute importance rankings related to the explanation of a model.

³Model-Agnostic: does not depend on the type of model to be explained.

The proposal to analyze the ranking of global importance — and not the local one — allows a general analysis of how a given model treats and explains the problem to be solved and, for this reason, it was chosen for this study.

This research raises discussions for the current moment in the XAI area through two main questions, namely: — Considering the current measures geared at explaining black box machine learning models, can it be inferred that they generate global rankings of same, similar, or different explainabilities? — Following the same idea as in the previous question, are the generations of equal, similar, or different explainabilities related to specific properties of a dataset?

Seeking to answer the two hypotheses above, this research emerges as a comparative analysis of different XAI measures,
capable of producing global explainability ranks based on tabular data related to different problems and ensemble models.

The main contributions of this research to the machine learning area focused on XAI are as follows:

• Development of a benchmark that measures the correlations existing between the explanations generated by the main current XAI measures, comparing them with each other through different perspectives (properties of the datasets used);
• Generation of comparative results between explainability ranks that lead researchers in the XAI area to identify which explainability measures to use against different characteristics of a dataset.

II. BACKGROUND

A. Explainable Artificial Intelligence

Recently, there has been a growing need to explain black box machine learning models in an agnostic way. This includes not only more robust computational models like Deep Neural Networks but also simpler models like ensembles [15] [16] [17] [18].

In this way, this research identifies in the current literature the necessity of specific studies on model-agnostic XAI measures for tabular data, thus enabling explanations of computational models with wide applicability, such as ensemble trees [17] [3].

A bibliographic and practical survey (development) of the main existing XAI measures was performed through this research, specifically aimed at those that generate model-agnostic global explainability ranks and support tabular data. As a result, a total of six tools were selected, with libraries updated and compatible with current Python development platforms and Scikit-Learn algorithms [19]. These tools are: CIU [9], Dalex [20], Eli5 [11], Lofo [12], SHAP [13] and Skater [14].

Note that all six measures referenced herein generate explanatory ranks based on the same previously trained machine learning models, manipulating their inputs and/or producing new intermediate models (copies). Therefore, a comparison of the generated explanatory ranks is fair and feasible.
Due to incompatibilities between the current XAI measurement libraries and the versions of the scikit-learn [19] computational model libraries and dependencies, only the Random Forest and Gradient Boosting algorithms were used in this study.

During the initial analyses and executions, this research found XAI measures with incompatibilities at the level of libraries and dependencies, which made it impossible to create uniform tests with the other measures. For this reason, some XAI measures were not used in this research (for example: Alibi-ALE [21], Lime [22], Ethical ExplainableAI [23], IBM Explainable AI 360 [24], and InterpretML [25]).
Contextual Importance and Utility - CIU is an XAI measure based on Decision Theory [26] that focuses on serving as a unified measure of model-agnostic explainability based on the implementation of two different indexes, namely: Contextual Importance - CI and Contextual Utility - CU, which generally create equal attribute ranks, but with different values for each index, thus generating global explanations [9]. Only the CI is used in this research, as it obeys the same rank sequence as the CU.

Dalex is a set of XAI tools based on the LOCO (leave-one-covariate-out) approach and it can generate explainabilities from this approach. In general, the measure receives the model and the data to be explained; it calculates the performance of the model, performs new training processes with newly generated data sets, and inverts each attribute of the data in a unitary and iterative way, thereby measuring which attributes are most important to the model based on performance [27] [10].

Leave One Feature Out - Lofo is an XAI measure with a proposal similar to that of Dalex, but with a main difference regarding the iteration over attributes: in the Lofo measure, the iterative step is the removal of the attribute, finding its global importance to the model based on performance [12].
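The leave-one-feature-out principle behind Dalex and Lofo can be illustrated directly. Below is a minimal sketch of the idea (retrain without each attribute and take the performance drop as its global importance), using a stand-in dataset and model rather than either library's own implementation:

```python
# Minimal sketch of the leave-one-feature-out idea described above
# (an illustration of the principle, not the Lofo/Dalex library code).
import numpy as np
from sklearn.datasets import load_breast_cancer  # stand-in tabular dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=0)

baseline = cross_val_score(model, X, y, cv=3).mean()

importances = []
for j in range(X.shape[1]):
    X_drop = np.delete(X, j, axis=1)          # remove one attribute
    score = cross_val_score(model, X_drop, y, cv=3).mean()
    importances.append(baseline - score)      # performance drop = importance

rank = np.argsort(importances)[::-1]          # global importance rank
print(rank)
```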
Explain Like I'm Five - Eli5 is a tool that helps explore machine learning classifiers and explains their predictions by assigning weights to decisions, as well as by exporting decision trees and presenting the importance of the attributes of the model submitted to the tool [11].

SHapley Additive exPlanations - SHAP is proposed as a unified measure of attribute importance that explains the prediction of an instance X from the contribution of each attribute. The contribution of an attribute is calculated from the game-theoretic Shapley Value [28]. In this tool, it is possible to average the Shapley values per attribute to obtain the global importance [13] [29].

Skater is a set of tools capable of generating rankings of the importance of model attributes based on Information Theory [30], through measurements of the entropy in prediction changes caused by the perturbation of a given attribute. The central idea is that the more a model depends on an attribute, the greater the change in predictions [14].

A general comparison of the main characteristics of the measures that meet the requirements of this research is presented in Table I.

All the tools described above are capable of generating explainabilities that go beyond the generation of attribute importance ranks; therefore, the focus of this article is to compare this most basic unit of model explainability, i.e., the global importance rank.

With the explanatory ranks generated by the different measures, it is fair to compare the existing correlations between these ranks, even if each measure generates them based on different algorithms, since the central objective of each measure is to explain a black box model.
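To make this basic unit concrete, the following minimal sketch derives a global importance rank via SHAP's mean absolute Shapley values, as described above; the dataset and model are illustrative stand-ins, not the benchmark's configuration:

```python
# A minimal sketch of deriving a global attribute-importance rank from SHAP
# values (stand-in data/model, not the benchmark configuration).
import numpy as np
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# For a binary sklearn GBM, classic shap returns a single (samples x features)
# array here; other models/shap versions may return one array per class.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global importance: mean absolute Shapley value per attribute;
# sorting it yields the global explainability rank.
global_importance = np.abs(shap_values).mean(axis=0)
rank = np.argsort(global_importance)[::-1]
print(rank)
```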
TABLE I
MAIN XAI MEASURES SURVEYED

XAI measure | Author | Algorithm basis                | Global explanation (by rank) | Local explanation | API compatible
CIU         | [9]    | Decision Theory                | Yes                          | No                | Yes
Dalex       | [20]   | Leave-one-covariate-out        | Yes                          | Yes               | Yes
Eli5        | [11]   | Assigning weights to decisions | Yes                          | Yes               | Yes
Lofo        | [12]   | Leave One Feature Out          | Yes                          | No                | Yes
SHAP/Tree   | [13]   | Game Theory                    | Yes                          | Yes               | Yes
Skater      | [14]   | Information Theory             | Yes                          | Yes               | Yes

B. Dataset Properties

After being processed through attribute engineering, a dataset is expected to exhibit characteristics that involve the nature of the problem in a contextual (problem properties) and technical (dataset properties) manner and which are generalized by machine learning algorithms to carry out a prediction or a classification process [8].

Although the contextual side of the problem significantly helps in interpreting the explainabilities of the models, the researchers herein opted not to work with these characteristics of the analyzed data, since the datasets utilized refer to different problems, with contexts that often go beyond computing knowledge.

On the other hand, there are properties of datasets which are feasible to analyze, since they are common even though the datasets represent different problems. Properties such as dimensionality, number of numerical attributes, number of binary attributes, the balance between classes, and entropy are examples of properties that every tabular dataset has and which directly influence the generalizability of the proposed model [31].

By the time this article was published, research confronting the inherent characteristics (complexities) of different datasets with their explanations through machine learning models had not been identified in the literature. In this sense, this research seeks to fill in this gap in the framework of XAI measure studies.

III. MATERIALS AND METHODS

This research developed a benchmark, Figure 1, which uses tabular data already known and endorsed by the machine learning community (Figure 1 - 1), extracted properties from these datasets (Figure 1 - 2), clustered these datasets according to their properties (Figure 1 - 3), performed the construction of computational models (Figure 1 - 4 and 5), generated explainability ranks for all models (Figure 1 - 6 and 7), calculated the correlations obtained from all rank pairs and analyzed the different results (Figure 1 - 8 and 9). All these procedures are explained in detail in the following topics.

A. Datasets and Preprocess

A total of 41 datasets of binary problems, with no missing data, were used, selected from the OpenML [32] platform. The most used datasets were selected, so as to make the results
Fig. 1. Visual scheme of all steps and processes performed by the proposed benchmark.
of this research even better for use by the machine learning
community.
The datasets used were as follows: australian, phishing
websites, spec, satellite, analcatdata lawsuit, banknote authen-
tication, blood transfusion service center, churn, climate model
simulation crashes, credit-g, delta ailerons, diabetes, eeg-
eye-state, haberman, heart-statlog, ilpd, ionosphere, jEdit-4.0-
4.2, kc1, kc2, kc3, kr-vs-kp, mc1, monks-problems-1, monks-
problems-2, monks-problems-3, mozilla4, mw1, ozone-level-
8hr, pc1, pc2, pc3, pc4, phoneme, prnn crabs, qsar-biodeg,
sonar, spambase, steel-plates-fault, tic-tac-toe and wdbc.
All datasets went through the following pre-processing
steps, as required: converting categorical attributes to
frequency-based ordinals, converting boolean attributes to in-
teger (0 and 1), and min-max normalization to values between
0 and 1 [33].
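A minimal sketch of these pre-processing steps, assuming a pandas DataFrame df whose binary target column is named "class" (a hypothetical name), might look as follows:

```python
# A minimal sketch of the pre-processing described above, assuming a
# DataFrame `df` with a target column named "class" (hypothetical name).
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    features = [c for c in df.columns if c != "class"]
    for col in features:
        if df[col].dtype == bool:            # boolean -> integer (0 and 1)
            df[col] = df[col].astype(int)
        elif df[col].dtype == object:        # categorical -> frequency-based ordinal
            freq_order = df[col].value_counts().index
            df[col] = df[col].map({cat: i for i, cat in enumerate(freq_order)})
    # min-max normalization to values between 0 and 1 [33]
    df[features] = MinMaxScaler().fit_transform(df[features])
    return df
```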
B. Clustering and Multiple Correspondence Analysis
Analyzing different datasets based on their properties allows
for comparing similarities and differences between them, even
if these datasets are from different contexts. For example, we
can group a set of datasets based on their class entropy values
and then identify which datasets have more and less informa-
tion, thus enabling a dataset complexity impact relationship
for future model explanations.
Based on the proposal laid out above, this research used 15 different properties (provided by OpenML) extracted from each of the 41 selected datasets and then used the clustering algorithm k-means [34] to identify groups of datasets based on their similarities.

The interpretation and validation algorithm for data clusters, called Silhouettes [35], was used, with the K value (number of clusters) ranging between 2 and 20; K = 3, with an Average Silhouette Score of 0.28, was identified, Figure 2.

Fig. 2. Silhouette coefficients for clustering, using the K-means algorithm, for k = 3. Distance means (x axis) and labels of clusters 0, 1 and 2 (y axis).

In Figure 2, it can be seen that for k = 3 the distances between each cluster (0, 1 and 2) are above the average (red line), so this is an adequate value of k in relation to the study presented hereby.
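A minimal sketch of this selection of K by Average Silhouette Score, with random data standing in for the real 41 x 15 properties table, might look as follows:

```python
# A minimal sketch of the cluster-count selection described above; a random
# placeholder stands in for the real (41 datasets x 15 properties) table.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
properties = rng.random((41, 15))  # placeholder for the OpenML properties

best_k, best_score = None, -1.0
for k in range(2, 21):             # K ranging between 2 and 20
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(properties)
    score = silhouette_score(properties, labels)
    if score > best_score:
        best_k, best_score = k, score
print(best_k, best_score)          # the paper reports K = 3, score = 0.28
```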
The 15 properties used were: Number of Features, Number of Instances, Dimensionality, Percentage of Binary Features, Standard Deviation of Nominal Attribute Distinct Values, Mean of Nominal Attribute Distinct Values, Class Entropy, Autocorrelation, Number of Numeric Features, Number of Symbolic Features, Number of Binary Features, Percentage of Symbolic Features, Percentage of Numeric Features, Majority Class Percentage, and Minority Class Percentage. Further information about the values of each property for each dataset can be found in: https://github.com/josesousaribeiro/XAI-Benchmark/blob/main/Openml/df_dataset_properties.csv.

In order to identify the different correlations between each identified dataset cluster and the 15 properties listed above, this research used Multiple Correspondence Analysis - MCA [36] on the properties dataset, where the rows of this table are the observations or individuals (n) concerned — here, the datasets — and the columns are the different categories of nominal variables (p) — here, the properties of each dataset.

This research performed the process of binarization on the same table to which clustering was applied, with each value replaced by h (equal to or above the mean of the attribute's values) or s (below the mean of the attribute's values). For more information about the new properties dataset, refer to: https://github.com/josesousaribeiro/XAI-Benchmark/blob/main/Openml/df_properties_binarized.csv.
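A minimal sketch of this h/s binarization, assuming df_props is the (datasets x properties) DataFrame (a hypothetical name), might look as follows:

```python
# A minimal sketch of the h/s binarization described above, assuming
# `df_props` is the (datasets x properties) DataFrame (hypothetical name).
import pandas as pd

def binarize_properties(df_props: pd.DataFrame) -> pd.DataFrame:
    # h: equal to or above the column mean; s: below the column mean
    return df_props.apply(lambda col: col.ge(col.mean()).map({True: "h", False: "s"}))
```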
This analysis allows for identifying different relationships between datasets (or even dataset groups) and properties (or even property value ranges), as advocated by the literature [36].

C. Construction of Models

The algorithms used in building each dataset's model were Random Forest - RF and Gradient Boosting - GB, both being tree-based, low-explainability ensembles [3].

This research tested newer ensemble algorithms such as Light Gradient Boost [37], CatBoost [38], and Extreme Gradient Boosting [39]; however, not all of the XAI measures used supported the output encodings of these newer models. Thus, an option was made to adhere to the Gradient Boosting and Random Forest algorithms only.
Notably, the tuning step was performed through Grid Search [40] with cross-validation [41] of fold = 3, so that the algorithms had a better calibration on the analyzed data, with a total of 192 candidates over the parameters: max_depth: (1, 10), bootstrap: (True, False), n_estimators: (100, 200), min_samples_leaf: (1, 10), min_samples_split: (2, 10), max_features: (sqrt, log2), and loss: (deviance, exponential) (the latter only valid for the GB algorithm).

Finally, each model was trained and tested based on a 70%-30% split of each dataset. Then, each model was measured as to its performance and stability regarding accuracy, precision, recall, and the Friedman test [42], so as to characterize each model that was created.
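A minimal sketch of this tuning and evaluation flow, shown here for the Random Forest side of the grid with a stand-in dataset (the GB-only loss parameter is omitted), might look as follows:

```python
# A minimal sketch of the tuning and evaluation described above (an
# illustration under assumed defaults, not the authors' exact configuration).
from sklearn.datasets import load_breast_cancer  # stand-in tabular dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)         # 70%-30% split

param_grid = {                                   # RF subset of the listed grid
    "max_depth": [1, 10],
    "bootstrap": [True, False],
    "n_estimators": [100, 200],
    "min_samples_leaf": [1, 10],
    "min_samples_split": [2, 10],
    "max_features": ["sqrt", "log2"],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3).fit(X_train, y_train)  # fold = 3

y_pred = search.best_estimator_.predict(X_test)
print(accuracy_score(y_test, y_pred),
      precision_score(y_test, y_pred),
      recall_score(y_test, y_pred))
```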
D. XAI Ranks and Correlations

For each of the 41 datasets (divided into 3 different clusters), 2 models were created, one based on the Random Forest algorithm and the other on Gradient Boosting, thus generating a total of 82 models. Global explainability ranks were produced through the 6 different XAI measures for each model, resulting in a total of 492 ranks.

The Spearman Rank Correlation [43] was used to calculate the correlations between the ranks that were created. The reason for using this algorithm in particular is that it measures the correlation between rank pairs considering the idea of ranks (positions) in which different values (in this case, dataset attributes) may appear.

In this step, two comparison matrices of rank correlation pairs are generated for each dataset (one matrix for each algorithm). Figure 3 shows an example of this matrix created from the Gradient Boosting - GB model.
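A minimal sketch of one cell of such a matrix, i.e., the Spearman correlation between the ranks produced by two XAI measures (hypothetical rank values), might look as follows:

```python
# A minimal sketch of the rank comparison described above: Spearman
# correlation between two global importance ranks (hypothetical values).
from scipy.stats import spearmanr

# Positions assigned to the same 5 attributes by two XAI measures
rank_shap = [1, 2, 3, 4, 5]
rank_lofo = [2, 1, 3, 5, 4]

rho, p_value = spearmanr(rank_shap, rank_lofo)
print(rho, p_value)  # rho close to 1 means the two explanations agree
```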
Fig. 3. Example Spearman correlation matrix (represented as a value and a color in each cell) based on explainability ranks generated from the Gradient Boosting model for the dataset wdbc. Note that, for this dataset, chosen at random, the correlations between the rank pairs are low or non-existent (close to zero).

Correlations such as those shown in Figure 3 are calculated for each dataset and show how the correlations between different pairs of XAI measures behave for each problem and for each computational model being developed.

The results shown in Figure 3 refer to only one dataset and are, therefore, intermediate results, since they will be summarized together with the results of the other datasets.

E. Benchmark of XAI Measures

All the methodology described so far was consolidated and implemented in a single benchmark. More details about the implementation and execution of the procedures presented hereby are available in the repository: https://github.com/josesousaribeiro/XAI-Benchmark.

IV. RESULTS AND DISCUSSION

The results obtained were computed by groups (2 different computational models) and clusters (3 different dataset clusters).

The principle for understanding the results of this article is to acknowledge the different dataset cluster profiles identified in the clustering process and the MCA.

In this sense, according to the clustering based on the 15 different properties of the 41 analyzed datasets, it is possible to assess the existence of at least 3 different dataset clusters. In quantitative terms, the clusters have 21 (cluster 0), 17 (cluster 1), and 3 (cluster 2) datasets.

To identify each different dataset profile, an inspection of the value ranges of each cluster for each property was carried out. This analysis verified the possible existence of specific property values for datasets of specific clusters.

At this stage, it was identified that the quantity of datasets in cluster 2 was a small n and, therefore, this cluster was disregarded from these results (the datasets for this cluster are: kr-vs-kp, Phishing Websites, and SPECT), keeping only clusters 0 and 1.

Intending to consolidate the profile analysis of the clusters, Multiple Correspondence Analysis - MCA was applied, and the relation between the datasets and the value ranges of their properties was verified, thus providing for a better understanding of the complexity of the datasets existing in both clusters — this practice is already known in the literature [36].

By inspecting the MCA shown in Figure 4, a relation (shorter distances) between a high quantity of datasets in cluster 0 and above-average values (symbol h) for properties (bigger circles) such as MajorityClassPercentage, Dimensionality, NumberOfInstances, NumberOfFeatures, PercentageOfNumericFeatures, and AutoCorrelation was identified. Also, a relation between a high quantity of datasets of cluster 0 and below-average values (symbol s) for properties such as ClassEntropy, PercentageOfSymbolicFeatures, and PercentageOfBinaryFeatures was observed.

Also with regard to the MCA shown in Figure 4, it is possible, by inspection, to assess the existence of a relation (shorter distances) between a high quantity of datasets in cluster 1 and above-average values (symbol h) for properties (bigger circles) such as ClassEntropy, PercentageOfSymbolicFeatures, PercentageOfBinaryFeatures, NumberOfBinaryFeatures, and MinorityClassPercentage. It is also possible to identify a relation between a high quantity of datasets in cluster 1 and below-average values (symbol s) for properties such as AutoCorrelation, NumberOfFeatures, PercentageOfNumericFeatures, NumberOfNumericFeatures, NumberOfInstances, Dimensionality, and MajorityClassPercentage.
Fig. 4. Multiple Correspondence Analysis - MCA with rows (datasets) and columns (properties) as a function of the first and second principal components.

It is worth noting that some properties were not mentioned above because they appear at very close distances to the two dataset clusters.

However, based on the inspection of the MCA results, it can be concluded that cluster 0 is formed by a significant amount of datasets with greater complexities, whereas cluster 1 is formed by a significant amount of datasets with minor complexities. The entire cluster profile identification process can be found in: https://github.com/josesousaribeiro/XAI-Benchmark/blob/main/Cluster_Profile.md.

Based on the profile characterization of the two main clusters that were identified, this research executed the benchmark for these two sets of datasets separately.

The benchmark execution for the datasets belonging to cluster 0, Figure 5, exhibits low or negligible correlations in most tests, thus showing that the six XAI measures used hereby generate different explainability ranks for most of the datasets in this group, regardless of the algorithm (RF or GB) the computational model is based upon.
Fig. 5. Boxplot graphs that summarize all rank pair comparisons (x axis) and their respective Spearman correlations (y axis) calculated for the datasets of cluster 0. Results of the model based on Random Forest (left) and Gradient Boosting (right). The dashed blue lines show the different levels of correlations. The points identify the positions of the correlation values of each comparison, and the colors of the points refer to the attribute quantities of each dataset. Note the low variance of most boxplots, along with the negligible or low levels of correlation found.

Fig. 6. Boxplot graphs that summarize all rank pair comparisons (x axis) and their respective Spearman correlations (y axis) calculated for the datasets of
cluster 1. Results of the model based on Random Forest (left) and Gradient Boosting (right). The dashed blue lines show the different levels of correlations.
The points identify the positions of the correlation values of each comparison, and the color of the points refer to the attribute quantities of each dataset. Note
the high variances of most boxplots along with the high levels of correlations found.
The results presented for cluster 0, Figure 5, show that datasets with greater complexity generate more complex models and, in turn, different explanations. This could justify the low correlations found in the results.

The benchmark execution for the datasets of cluster 1, Figure 6, shows different correlations from those shown in Figure 5, since there is high variance in the boxplots as well as high positive and negative correlations between rank pairs in cluster 1.

The results presented in Figures 5 and 6 provide evidence on how to answer the first hypothesis launched herein: "— Considering the current measures aimed at explaining black box machine learning models, can it be inferred that they generate global rankings of same, similar, or different explainabilities?". It depends, since the different correlations found demonstrate that the properties of the datasets directly influence the explainabilities of the models created from them. Thus, the tools may generate explainabilities with higher correlations with one another (more similar explainabilities) or with lower correlations with one another (less similar explainabilities).

Noticeably, despite the similar results of the models based on the RF and GB algorithms, there are important differences in the correlations found, as shown in Figure 5 — for example, the larger variance of some comparisons in the results of the GB algorithm (shap vs lofo, eli5 vs ci, eli5 vs skater, ci vs skater and shap vs ci). This indicates that the complexity of the algorithm used in the model (in this case, bagging and boosting tree-based) does have an influence on explainability generation.

Importantly, the outliers existing in the boxplots of Figures 5 and 6 show the possibility of other dataset clusters existing amongst those being analyzed. The colors, used to distinguish the number of attributes of the different datasets in each boxplot, show that the results obtained in the correlations are not a function of the number of attributes that a dataset has.

The comparison of the results presented in Figures 5 and 6 demonstrates the answer to the second hypothesis of this research, namely: "— Following the same idea as in the previous question, are the generations of equal, similar, or different explainabilities related to specific properties of a dataset?". Yes, since the results of the rank correlations for the two dataset clusters presented in the benchmark output were different, thus showing that the complexity properties of the dataset interfere with the explainability ranks.

Based on the experiments performed with the 6 different XAI measures and the results obtained therefrom, the following logical reasoning could be constructed: 1 - If an ensemble model (algorithm and dataset) solves a classification problem for a low-complexity dataset, there must be few ranks (or even only one rank) of explainabilities referring to this model, thus allowing to infer that the different XAI measures have higher correlations with each other (as seen in Figure 6); 2 - If an ensemble model (algorithm and dataset) solves a classification problem for a high-complexity dataset, there must be many explainability ranks for this model, leading the ranks of the different XAI measures to show lower correlations with each other (as seen in Figure 5).

Based on the logical reasoning 1 and 2 above, evidence was identified of Ensemble-Tree Models with Complex Explainabilities - EMCX (created from the data of cluster 0) and Ensemble-Tree Models with Simple Explainabilities - EMSX (created from the data of cluster 1), both terms proposed hereby. Furthermore, it was identified that the divergence between the ranks stems primarily from the complexity of the dataset and, consequently, from the generated model — thus answering the article's title question.

The datasets listed below, which served as the basis for the creation of EMSX and EMCX, are intended to comply with one of the contributions hereto:

• EMSX Datasets: Ozone level 8hr, Sonar, Spambase, Qsar biodeg, Kc3, Mc1, Pc3, Mw1, Pc4, Satellite, Pc2, Steel plates fault, Kc2, Pc1, Kc1, Climate model simulation crashes, and Analcatdata lawsuit;
• EMCX Datasets: Ionosphere, Wdbc, Credit-g, Churn, Australian, Eeg eye state, Heart statlog, Ilpd, Tic tac toe, JEdit 4.0 4.2, Diabetes, Prnn crabs, Monks problems 1, Monks problems 3, Monks problems 2, Delta ailerons, Mozilla4, Phoneme, Blood transfusion service center, Banknote authentication, and Haberman.

It is clear that evidence of the existence of EMSX and EMCX was observed from experiments with ensemble algorithms. For these terms to become generally applicable (to other algorithms), experiments with algorithms of other natures are necessary.
V. FINAL CONSIDERATIONS

Given all the analyses that were carried out, this research achieves its objective of observing the impacts of dataset complexity on the explanation of black box machine learning models, through the different properties of the data under analysis. This allowed for finding that the properties of a dataset — facing the creation of models based on the Random Forest and Gradient Boosting algorithms — show evidence that EMSX and EMCX do exist. However, many other properties of the data and of other algorithms still need to be further explored.

REFERENCES
[1] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[2] Z. Ghahramani, "Probabilistic machine learning and artificial intelligence," Nature, vol. 521, no. 7553, pp. 452–459, 2015.
[3] A. Barredo Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. Garcia, S. Gil-Lopez, D. Molina, R. Benjamins, R. Chatila, and F. Herrera, "Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI," Information Fusion, vol. 58, pp. 82–115, Jun. 2020.
[4] R. Nambiar, R. Bhardwaj, A. Sethi, and R. Vargheese, "A look at challenges and opportunities of big data analytics in healthcare," in 2013 IEEE International Conference on Big Data, 2013, pp. 17–22.
[5] S. Lee, L. Chen, S. Duan, S. Chinthavali, M. Shankar, and B. A. Prakash, "Urban-net: A network-based infrastructure monitoring and analysis system for emergency management and public safety," in 2016 IEEE International Conference on Big Data (Big Data), 2016, pp. 2600–2609.
[6] S. M. Lundberg, G. Erion, H. Chen, A. DeGrave, J. M. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, and S.-I. Lee, "From local explanations to global understanding with explainable AI for trees," Nature Machine Intelligence, vol. 2, no. 1, pp. 56–67, Jan. 2020. [Online]. Available: https://www.nature.com/articles/s42256-019-0138-9
[7] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi, "A survey of methods for explaining black box models," ACM Computing Surveys (CSUR), vol. 51, no. 5, pp. 1–42, 2018.
[8] D. Gunning and D. Aha, "DARPA's Explainable Artificial Intelligence (XAI) Program," AI Magazine, vol. 40, no. 2, pp. 44–58, Jun. 2019. [Online]. Available: https://ojs.aaai.org/index.php/aimagazine/article/view/2850
[9] K. Främling, "Decision theory meets explainable AI," in International Workshop on Explainable, Transparent Autonomous Agents and Multi-Agent Systems. Springer, 2020, pp. 57–74.
[10] P. Biecek and T. Burzykowski, Explanatory Model Analysis. [Online]. Available: https://ema.drwhy.ai/
[11] M. Korobov and K. Lopuhin, "Eli5," https://eli5.readthedocs.io/en/latest/index.html, Last accessed 21 Jan 2021.
[12] U. Çayır, I. Yenidoğan, and H. Dağ, "Use case study: Data science application for Microsoft malware prediction competition on Kaggle," Proceedings Book, p. 98, 2019.
[13] S. M. Lundberg, G. Erion, H. Chen, A. DeGrave, J. M. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, and S.-I. Lee, "From local explanations to global understanding with explainable AI for trees," Nature Machine Intelligence, vol. 2, no. 1, pp. 2522–5839, 2020.
[14] "Skater," https://oracle.github.io/Skater/overview.html#skater, Last accessed 21 Jan 2021.
[15] Z. Qi, S. Khorram, and F. Li, "Visualizing deep networks by optimizing with integrated gradients," in CVPR Workshops, vol. 2, 2019.
[16] P.-J. Kindermans, K. T. Schütt, M. Alber, K.-R. Müller, D. Erhan, B. Kim, and S. Dähne, "Learning how to explain neural networks: PatternNet and PatternAttribution," in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=Hkn7CBaTW
[17] P. Linardatos, V. Papastefanopoulos, and S. Kotsiantis, "Explainable AI: A review of machine learning interpretability methods," Entropy, vol. 23, no. 1, 2021. [Online]. Available: https://www.mdpi.com/1099-4300/23/1/18
[18] "Ethical ML," https://github.com/EthicalML/awesome-production-machine-learning#explaining-black-box-models-and-datasets, Last accessed 20 Apr 2021.
[19] "Scikit-learn," https://scikit-learn.org/0.22/, Last accessed 21 Nov 2020.
[20] P. Biecek, "Dalex: Explainers for complex predictive models in R," Journal of Machine Learning Research, vol. 19, no. 84, pp. 1–5, 2018. [Online]. Available: https://jmlr.org/papers/v19/18-416.html
[21] D. W. Apley and J. Zhu, "Visualizing the effects of predictor variables in black box supervised learning models," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 82, no. 4, pp. 1059–1086, 2020.
[22] M. T. Ribeiro, S. Singh, and C. Guestrin, ""Why should I trust you?": Explaining the predictions of any classifier," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, 2016, pp. 1135–1144.
[23] The Institute for Ethical AI and Machine Learning, "Ethical XAI," 2021, Last accessed 22 Jul 2021. [Online]. Available: https://ethical.institute/xai.html
[24] V. Arya, R. K. Bellamy, P.-Y. Chen, A. Dhurandhar, M. Hind, S. C. Hoffman, S. Houde, Q. V. Liao, R. Luss, A. Mojsilovic et al., "AI Explainability 360: An extensible toolkit for understanding data and machine learning models," Journal of Machine Learning Research, vol. 21, no. 130, pp. 1–6, 2020.
[25] H. Nori, S. Jenkins, P. Koch, and R. Caruana, "InterpretML: A unified framework for machine learning interpretability," arXiv preprint arXiv:1909.09223, 2019.
[26] R. L. Keeney and H. Raiffa, Decisions with Multiple Objectives: Preferences and Value Trade-Offs, revised ed. Cambridge, England; New York, NY, USA: Cambridge University Press, Aug. 1993.
[27] H. Baniecki, W. Kretowicz, P. Piatyszek, J. Wisniewski, and P. Biecek, "dalex: Responsible Machine Learning with Interactive Explainability and Fairness in Python," arXiv:2012.14406, 2020. [Online]. Available: https://arxiv.org/abs/2012.14406
[28] A. E. Roth, The Shapley Value: Essays in Honor of Lloyd S. Shapley. Cambridge University Press, 1988.
[29] S. M. Lundberg and S.-I. Lee, "A unified approach to interpreting model predictions," in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 4768–4777.
[30] F. M. Reza, An Introduction to Information Theory. Courier Corporation, Jan. 1994.
[31] D. Oreski, S. Oreski, and B. Klicek, "Effects of dataset characteristics on the performance of feature selection techniques," Applied Soft Computing, vol. 52, pp. 109–119, 2017.
[32] OpenML, https://www.openml.org/search?q=qualities.NumberOfClasses%3A2%2520qualities.NumberOfMissingValues%3A0&type=data&sort=runs&order=desc, Last accessed 01 Mar 2021.
[33] "MinMaxScaler," https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html, Last accessed 15 Nov 2020.
[34] "KMeans," https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html, Last accessed 2 Mar 2021.
[35] P. J. Rousseeuw, "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[36] H. Abdi and D. Valentin, "Multiple correspondence analysis," Encyclopedia of Measurement and Statistics, vol. 2, no. 4, pp. 651–657, 2007.
[37] Microsoft, "LightGBM," https://lightgbm.readthedocs.io/en/latest/, Last accessed 2 Mar 2021.
[38] Yandex, "CatBoost," https://catboost.ai, Last accessed 5 Mar 2021.
[39] "XGBoost," https://github.com/dmlc/xgboost, Last accessed 5 Mar 2021.
[40] "Grid-search-CV," https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html, Last accessed 10 Mar 2021.
[41] "Cross Validation," https://scikit-learn.org/stable/modules/cross_validation.html, Last accessed 01 Mar 2021.
[42] Friedman Test, https://scikit-posthocs.readthedocs.io/en/latest/generated/scikit_posthocs.posthoc_nemenyi_friedman/, Last accessed 02 Jul 2020.
[43] R. Artusi, P. Verderio, and E. Marubini, "Bravais-Pearson and Spearman correlation coefficients: Meaning, test of hypothesis and confidence interval," The International Journal of Biological Markers, vol. 17, no. 2, pp. 148–151, Apr. 2002. [Online]. Available: https://doi.org/10.1177/172460080201700213
