Yoseph Hailemariam∗, Abbas Yazdinejad†, Reza M. Parizi∗, Gautam Srivastava‡, Ali Dehghantanha†
∗College of Computing and Software Engineering, Kennesaw State University, GA, USA
yhailema@students.kennesaw.edu, rparizi1@kennesaw.edu
†Cyber Science Lab, School of Computer Science, University of Guelph, Ontario, Canada
abbas@cybersciencelab.org, adehghan@uoguelph.ca
‡Department of Mathematics and Computer Science, Brandon University, Manitoba, Canada
srivastavag@brandonu.ca
2020 IEEE Globecom Workshops (GC Wkshps) | 978-1-7281-7307-8/20/$31.00 ©2020 IEEE | DOI: 10.1109/GCWkshps50303.2020.9367541
Abstract—Success in machine learning has led to a wealth of Artificial Intelligence (AI) systems. A great deal of attention is currently being paid to the development of advanced Machine Learning (ML)-based solutions for a variety of automated prediction and classification tasks across a wide array of industries. However, such automated applications may introduce bias into their results, making it risky to use these ML models in security- and privacy-sensitive domains. Predictions should be accurate, and models have to be interpretable/explainable so that we can understand how they work. In this research, we conduct an empirical evaluation of two major explainer/interpretability methods, LIME and SHAP, on two datasets using deep learning models, namely an Artificial Neural Network (ANN) and a Convolutional Neural Network (CNN). The results demonstrate that SHAP performs slightly better than LIME in terms of Identity, Stability, and Separability on the two datasets we used (Breast Cancer Wisconsin (Diagnostic) and NIH Chest X-Ray).

Index Terms—AI, Explainable AI, Interpretable Techniques, ANN, CNN, XAI

I. INTRODUCTION

In recent years, Artificial Intelligence (AI) [1] has been considered one of the world-transforming research fields. AI is being utilized in different application domains such as data security [2], financial trading, advertising, marketing, healthcare, and blockchain [3]–[5]. AI helps solve very complex problems that would not have been tractable using traditional algorithmic methods. Machine learning (ML) [6], and particularly deep learning, as the enabling arm of AI, has gained a great deal of traction in recent years due to its capability for building intelligent systems [7], [8].

Designing and building an AI system requires four steps: identifying the problem, preparing the data, choosing an algorithm, and training the algorithm on the data. About 80 percent of data scientists' time goes into cleaning, moving, checking, and organizing data before a single algorithm is even written, so it is important that the collected data be properly prepared. Once we have identified the problem and prepared the data, we can select the algorithm. Knowing the distinctions among algorithms is important for obtaining good predictions.

Despite the great use of and demand for AI systems, there are concerns about the lack of transparency in their underlying decision-making processes. Some application domains are even reluctant to use AI systems, owing to the degree of criticality involved in security-sensitive domains, where predictions must be accurate; otherwise, the cost could be fatal. Security-sensitive domains include health-related domains, monetary domains, mission-critical infrastructures, and so on. The majority of previous research focuses on the predictive performance of ML models rather than on explaining the rationale behind their predictions. Lately, the community has seen the emergence of a new field, commonly referred to as Explainable AI (XAI) [9], which provides tools and systems to help developers and researchers create interpretable and comprehensible ML models and deploy them with confidence. As a result, there are interpretability techniques that help interpret models so that we can better understand how they work. There is a trade-off between performance and interpretability: the more interpretable a model is (e.g., linear models and decision trees), the lower its performance tends to be [10]. It is nevertheless important that even a complex model be interpretable in a security-sensitive domain. In May 2018, a European Parliament regulation came into force making it mandatory for companies to 'explain' any decision taken using machine learning and deep learning [11]: "a right of explanation for all individuals to obtain meaningful explanations of the logic involved".

As explained by Doshi-Velez et al. [12], in the context of ML systems, interpretability is defined as the ability to explain or to present results in terms understandable to a human. There is still a question as to what constitutes a good explanation, since one type of explanation may be good for one audience but bad for another. In this research, we focus on interpretability as it relates to safety, which encompasses robustness and security. More specifically, we conduct an empirical evaluation of two model-agnostic explainer/interpretability methods, SHAP1 and LIME2 (they are called model-agnostic because they can explain any machine learning model). We evaluate

1 https://github.com/slundberg/shap
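Both explainers are model-agnostic in the sense that they only query the model's predictions, never its internals. As a rough sketch of the local-surrogate idea behind LIME (not the library's actual API; the toy model, perturbation scale, and kernel width below are illustrative assumptions), one can perturb an instance, weight the perturbed samples by proximity, and fit a weighted linear model whose coefficients serve as the local explanation:

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(X):
    # Stand-in for any trained model: a smooth nonlinear scorer.
    return 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] - 3.0 * X[:, 1] ** 2)))

def lime_style_explanation(x, n_samples=5000, scale=0.1, kernel_width=0.5):
    """Fit a proximity-weighted linear surrogate to the model around x.

    Returns per-feature coefficients: the local 'explanation'.
    """
    X = x + rng.normal(0.0, scale, size=(n_samples, x.size))  # perturb around x
    y = black_box(X)
    d = np.linalg.norm(X - x, axis=1)
    w = np.exp(-(d ** 2) / kernel_width ** 2)                 # proximity kernel
    A = np.hstack([X - x, np.ones((n_samples, 1))])           # centred + bias
    # Weighted least squares: solve (A^T W A) beta = (A^T W) y
    Aw = A * w[:, None]
    beta, *_ = np.linalg.lstsq(Aw.T @ A, Aw.T @ y, rcond=None)
    return beta[:-1]  # drop the bias term

x0 = np.array([0.5, 1.0])
coefs = lime_style_explanation(x0)
# Near x0 the model increases with feature 0 and decreases with feature 1,
# so coefs[0] should be positive and coefs[1] negative.
print(coefs)
```

The surrogate only claims local fidelity: its coefficients describe the model's behaviour in the neighbourhood controlled by `scale` and `kernel_width`, which is exactly the trade-off LIME exposes.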
Authorized licensed use limited to: Carleton University. Downloaded on May 29,2021 at 08:34:41 UTC from IEEE Xplore. Restrictions apply.
Fig. 1. Experiment procedure for Diagnostic dataset with ANN
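The procedure in Fig. 1 begins with tabular preparation. A minimal sketch of the two preprocessing steps used for the Diagnostic dataset, one-hot encoding of categorical values followed by feature scaling, is shown below; the column values and the min-max scaler choice are illustrative assumptions, not the paper's exact code:

```python
import numpy as np

def one_hot(column):
    """Map a categorical column to 0/1 indicator columns, one per category."""
    cats = sorted(set(column))
    return np.array([[1.0 if v == c else 0.0 for c in cats] for v in column]), cats

def min_max_scale(X):
    """Scale each feature to [0, 1]; constant columns map to 0."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span

diagnosis = ["M", "B", "B", "M"]                     # categorical column
radius = np.array([[17.99], [13.54], [12.45], [20.57]])  # numeric column

encoded, cats = one_hot(diagnosis)   # indicator columns for ['B', 'M']
scaled = min_max_scale(radius)       # each feature squeezed into [0, 1]
X = np.hstack([encoded, scaled])
print(cats)     # ['B', 'M']
print(X.shape)  # (4, 3)
```

Normalizing feature ranges this way keeps no single raw-scale feature from dominating the ANN's weighted sums during training.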
virtually certain that the computation time is one of the most critical issues in the Shapley value method. In particular, the exact computation of Shapley values must consider all feasible subsets of features, even when no particular subset is of interest. The exact calculation becomes intractable when there are many features, since the number of subsets grows exponentially with the number of features. One viable way to tackle this issue is to apply sampling techniques with a fixed number of samples.

D. Measures

The properties measured in this paper were inspired by the Explainability Fact Sheets [18]. Specifically, we focused on the safety (security) property to systematically assess an explainable system. Sokol et al. [18] proposed that when assessing an explainable system for safety, it is recommended to focus on four criteria: Information Leakage, Explanation Misuse, Explanation Invariance, and Explanation Quality. The first two criteria consider how much information an explanation reveals about the underlying model and its training data: if the explanation discloses sensitive information about the model, attackers can use it to exploit the model. Explanation invariance is concerned with measuring explanation similarity on the same dataset across different models and different explainers. Lastly, explanation quality is concerned with evaluating the quality and correctness of an explanation with respect to the underlying model and dataset; the explainer should explain only the underlying model, otherwise the explanation would be ambiguous. Carvalho et al. [26] proposed that explanation quality can be measured by correctness. The goal of correctness is to establish trust and explain what the model is doing, without ambiguity. Explanation invariance is measured using separability, stability, and identity. The goal of identity is that identical instances should have identical explanations: if an interpretability technique is run on the same instance several times, it is expected to produce the same explanation; otherwise, the explainer is unreliable. Stability is similar to identity, but it additionally requires that similar instances (instances with small differences) have similar explanations. Lastly, separability is concerned with ensuring that different instances have different explanations; they cannot have identical explanations.

E. Experimental Design and Execution

We carried out the experiments using online machine-learning services: Kaggle and Google Colab. We used Kaggle for the image dataset and the CNN, and Google Colab for the tabular dataset and the ANN. The code for the whole experimental package for both datasets is hosted on GitHub3,4.

Fig. 1 shows the experimental design for the first dataset and the ANN model. In preparing the Diagnostic dataset, we made sure all unnecessary features and raw data were removed. Then, we applied one-hot encoding to convert our categorical variables to numerical values, and finally, we applied feature scaling to normalize the ranges of the independent variables (features). The information that flows through the network affects the ANN structure, because a neural network changes (learns, in a sense) based on its input and output. We therefore constructed three layers: one input layer, one hidden layer, and one output layer. The first two layers used the ReLU activation function, and the last used the Sigmoid function.

Fig. 2 shows the experimental design for the NIH Chest X-ray dataset and the CNN. For this second experiment, we used the NIH Chest X-ray dataset. Before using the dataset, we performed image preprocessing: resizing the images, converting them to grayscale, and generating new images. The deep model for this dataset is a CNN, since CNNs are known to work well with image datasets.

Fig. 3 shows a sample of the actual explanation output when we executed the experimental design for LIME with the ANN model and the Diagnostic dataset, and Fig. 4 shows the same dataset being explained by SHAP. As can be seen in these figures, the output values for diagnosis detection with the ANN on the Diagnostic dataset are 1 and 0.9902 for the SHAP and LIME explanations, respectively; SHAP thus performs better than LIME here, LIME incurring a 0.0428 loss during diagnosis detection.

Similarly, Fig. 5 shows a sample of the actual explanation output when we executed the experimental design for LIME with the CNN model and the NIH dataset, and Fig. 6 shows the same dataset being explained by SHAP. As can be seen in these figures, SHAP is exact and precise in interpreting and explaining the methods on the X-ray dataset, whereas LIME is vague and its output is hard to match to human intuition on the NIH Chest X-ray dataset.

IV. RESULTS AND DISCUSSION

After executing the experiments specified in the previous section, we collected all the results, which are presented in this section. Table I shows the experimental results on the Diagnostic dataset. The numbers in the tables represent the percentage of instances that satisfy the defined security metrics. Table II shows the experimental results on the NIH Chest X-ray dataset.

TABLE I
EVALUATION OF INTERPRETABILITY METHODS ON DIAGNOSTIC DATASET AND ANN

Metrics        LIME     SHAP
Identity       100%     100%
Stability      94.7%    100%
Separability   100%     100%

As provided in the Measures section, explanation invariance is measured through three metrics: Identity, Separability, and Stability. In Table I, both LIME and SHAP did well on the majority of the metrics on the Diagnostic dataset; the only difference is in Stability, where SHAP scored 100% and LIME scored slightly lower at 94.7% (it got only 3 samples wrong out of 57).

3 https://github.com/yosepppph/BreastCancerANN
4 https://github.com/yosepppph/NIHXrayCNN
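The exponential cost of exact Shapley-value computation, and the fixed-size sampling workaround, can be illustrated with a toy coalition game; the value function and feature weights below are invented for illustration, not taken from the paper's models:

```python
import itertools
import math
import random

WEIGHTS = [1.0, 2.0, 3.0, 4.0]  # illustrative per-feature payoffs
N = len(WEIGHTS)

def value(S):
    """Toy coalition payoff: additive weights plus one interaction term."""
    v = sum(WEIGHTS[i] for i in S)
    if 0 in S and 1 in S:
        v += 5.0  # interaction only paid when features 0 and 1 are both present
    return v

def shapley_exact(i):
    """Exact Shapley value of feature i: enumerates all 2^(N-1) coalitions."""
    others = [j for j in range(N) if j != i]
    phi = 0.0
    for r in range(len(others) + 1):
        for S in itertools.combinations(others, r):
            S = set(S)
            w = math.factorial(len(S)) * math.factorial(N - len(S) - 1) / math.factorial(N)
            phi += w * (value(S | {i}) - value(S))
    return phi

def shapley_sampled(i, n_perms=20000, seed=0):
    """Monte Carlo estimate: average marginal contribution over random permutations."""
    rng = random.Random(seed)
    order = list(range(N))
    total = 0.0
    for _ in range(n_perms):
        rng.shuffle(order)
        S = set(order[:order.index(i)])
        total += value(S | {i}) - value(S)
    return total / n_perms

exact = [shapley_exact(i) for i in range(N)]
print(exact)  # ~[3.5, 4.5, 3.0, 4.0]: the 5.0 interaction splits between 0 and 1
approx = [shapley_sampled(i) for i in range(N)]
```

With N features the exact loop visits 2^(N-1) coalitions per feature, whereas the sampled estimate does a fixed amount of work regardless of N, at the price of Monte Carlo error.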
Fig. 3. LIME Execution Explanation (ANN, Breast Cancer Dataset)
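The invariance metrics used in this evaluation (identity, stability, separability) can be scored mechanically once each explainer's outputs are collected. A minimal sketch, assuming explanations are numeric attribution vectors and using an illustrative distance tolerance:

```python
import numpy as np

def identity_score(expl_a, expl_b):
    """Percent of instances whose repeated explanations are identical."""
    same = [np.allclose(a, b) for a, b in zip(expl_a, expl_b)]
    return 100.0 * sum(same) / len(same)

def stability_score(expl, expl_perturbed, tol=0.1):
    """Percent of near-identical instances with near-identical explanations."""
    close = [np.linalg.norm(a - b) <= tol for a, b in zip(expl, expl_perturbed)]
    return 100.0 * sum(close) / len(close)

def separability_score(expl):
    """Percent of distinct instances whose explanations differ from all others."""
    n, ok = len(expl), 0
    for i in range(n):
        if all(not np.allclose(expl[i], expl[j]) for j in range(n) if j != i):
            ok += 1
    return 100.0 * ok / n

# Made-up attribution vectors for three instances, explained twice.
run1 = [np.array([0.2, 0.8]), np.array([0.5, 0.5]), np.array([0.9, 0.1])]
run2 = [np.array([0.2, 0.8]), np.array([0.5, 0.5]), np.array([0.9, 0.1])]
print(identity_score(run1, run2))  # 100.0
print(separability_score(run1))    # 100.0
```

The numbers reported in the tables are exactly this kind of percentage: the fraction of test instances for which the explainer satisfied each property.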
TABLE II
EVALUATION OF INTERPRETABILITY METHODS ON NIH CHEST XRAY DATASET AND CNN

Metrics        LIME    SHAP
Identity       51%     100%
Stability      0%      0%
Separability   100%    100%

Based on Table I, we can conclude that SHAP performs better than LIME. In Table II, neither LIME nor SHAP performed as expected on the Stability metric. One possible reason is that, since an image's pixels account for the number of features, that many pixels makes it harder to group similar samples. The clearest difference between the two explainers was on Identity, where SHAP scored 100% and LIME 51%. Based on that factor alone, SHAP performs better than LIME.

A. Limitations

For this experiment, we could not find meaningful metrics with which to estimate the last two criteria, Information Leakage and Explanation Misuse, since finding a threshold for what constitutes sensitive information is subjective for both. We believe more research should be performed to define quantifiable, objective metrics for information leakage and explanation misuse. In addition, we were not able to measure explanation correctness for the image dataset, because expert validation is needed to assert whether an explanation is valid. This was one of the challenges we faced during the conduct of this research.

V. CONCLUSIONS AND FUTURE WORK

With the popularity of ML applications, and especially deep learning, many companies and organizations have created intelligent systems for many aspects of modern life. Despite
the advantages such ML-based systems bring, they are subject to bias, and their complex internal workings are black-boxed and not widely understood in most industries. Explainable AI (XAI) is a relatively new research field that works towards making ML models more transparent by creating techniques that enable developers and adopters to deploy them with more confidence. As a result, some tools have emerged to put these techniques into practice. While there is research on this topic, sufficient progress has not been made in understanding such tools' performance at a deeper, technical level. This paper experimented with evaluating two major explainers, LIME and SHAP, for deep learning models on image and non-image datasets. The results from this research suggest that SHAP can perform better than LIME in a security-sensitive domain for both tabular and image datasets. Future research should consider the potential effects of empirical approaches in XAI more thoroughly. We intend to continue our effort to assess the performance of various explainability and interpretability frameworks on other medical image datasets, such as COVID-19 datasets. Furthermore, it would be worthwhile to use other local interpretability tools, such as layer-wise relevance propagation (LRP) or Anchors, and compare the results with our study.

Fig. 6. SHAP Execution Explanation (CNN, NIH Dataset)

REFERENCES

[1] S. Makridakis, "The forthcoming artificial intelligence (AI) revolution: Its impact on society and firms," Futures, vol. 90, pp. 46–60, 2017.
[2] M. Kantarcioglu and F. Shaon, "Securing big data in the age of AI," in 2019 First IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA), 2019, pp. 218–220.
[3] A. Ekramifard, H. Amintoosi, A. H. Seno, A. Dehghantanha, and R. M. Parizi, A Systematic Literature Review of Integration of Blockchain and Artificial Intelligence. Springer International Publishing, 2020, pp. 147–160.
[4] A. Yazdinejad, G. Srivastava, R. M. Parizi, A. Dehghantanha, K.-K. R. Choo, and M. Aledhari, "Decentralized authentication of distributed patients in hospital networks using blockchain," IEEE Journal of Biomedical and Health Informatics, 2020.
[5] D. Połap, G. Srivastava, A. Jolfaei, and R. M. Parizi, "Blockchain technology and neural networks for the internet of medical things," in IEEE INFOCOM 2020 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2020, pp. 1–6.
[6] P. Louridas and C. Ebert, "Machine learning," IEEE Software, vol. 33, no. 5, pp. 110–115, 2016.
[7] M. Saharkhizan, A. Azmoodeh, A. Dehghantanha, K. R. Choo, and R. M. Parizi, "An ensemble of deep recurrent neural networks for detecting IoT cyber attacks using network traffic," IEEE Internet of Things Journal, pp. 1–1, 2020.
[8] A. Yazdinejad, H. HaddadPajouh, A. Dehghantanha, R. M. Parizi, G. Srivastava, and M.-Y. Chen, "Cryptocurrency malware hunting: A deep recurrent neural network approach," Applied Soft Computing, vol. 96, p. 106630, 2020.
[9] D. Gunning, "Explainable artificial intelligence (XAI)," Defense Advanced Research Projects Agency (DARPA), nd Web, vol. 2, p. 2, 2017.
[10] R. El Shawi, Y. Sherif, M. Al-Mallah, and S. Sakr, "Interpretability in healthcare: a comparative study of local machine learning interpretability techniques," in 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS), 2019, pp. 275–280.
[11] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi, "A survey of methods for explaining black box models," ACM Computing Surveys, vol. 51, no. 5, 2018.
[12] F. Doshi-Velez and B. Kim, "Towards a rigorous science of interpretable machine learning," arXiv preprint arXiv:1702.08608, 2017.
[13] A. Yazdinejad, R. M. Parizi, A. Dehghantanha, H. Karimipour, G. Srivastava, and M. Aledhari, "Enabling drones in the internet of things with decentralized blockchain-based security," IEEE Internet of Things Journal, pp. 1–1, 2020.
[14] A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. Garcia, S. Gil-Lopez, D. Molina, R. Benjamins, R. Chatila, and F. Herrera, "Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI," Information Fusion, vol. 58, pp. 82–115, 2020.
[15] M. T. Ribeiro, S. Singh, and C. Guestrin, ""Why should I trust you?": Explaining the predictions of any classifier," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '16, 2016, pp. 1135–1144.
[16] M. Robnik-Šikonja and M. Bohanec, Perturbation-Based Explanations of Prediction Models. Springer International Publishing, 2018, pp. 159–175.
[17] T. Miller, "Explanation in artificial intelligence: Insights from the social sciences," Artificial Intelligence, vol. 267, pp. 1–38, 2019.
[18] K. Sokol and P. Flach, "Explainability fact sheets: a framework for systematic assessment of explainable approaches," in Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 2020, pp. 56–67.
[19] R. El Shawi, Y. Sherif, M. Al-Mallah, and S. Sakr, "Interpretability in healthcare: a comparative study of local machine learning interpretability techniques," in 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS), 2019, pp. 275–280.
[20] A. Adadi and M. Berrada, "Explainable AI for healthcare: From black box to interpretable models," in Embedded Systems and Artificial Intelligence, V. Bhateja, S. C. Satapathy, and H. Satori, Eds. Singapore: Springer Singapore, 2020, pp. 327–337.
[21] C. Panigutti, A. Perotti, and D. Pedreschi, "Doctor XAI: an ontology-based approach to black-box sequential data classification explanations," in Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 2020, pp. 629–639.
[22] "Breast cancer wisconsin (diagnostic) data set." [Online]. Available: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
[23] "NIH chest x-rays data set." [Online]. Available: https://www.kaggle.com/nih-chest-xrays/data
[24] A. Khan, A. Sohail, U. Zahoora, and A. S. Qureshi, "A survey of the recent architectures of deep convolutional neural networks," Artificial Intelligence Review, pp. 1–62, 2019.
[25] D. Butnariu and T. Kroupa, "Shapley mappings and the cumulative value for n-person games with fuzzy coalitions," European Journal of Operational Research, vol. 186, no. 1, pp. 288–299, 2008.
[26] D. V. Carvalho, E. M. Pereira, and J. S. Cardoso, "Machine learning interpretability: A survey on methods and metrics," Electronics, vol. 8, no. 8, p. 832, 2019.