
Interpretable Classifications

Confirmation Seminar
Candidate: Abdul Karim

Institute of Integrated and Intelligent Systems, Griffith University

Supervisors: Abdul Sattar, MA Hakim Newton
Agenda
 Introduction:
• What, Why and How of Interpretability
• Model-agnostic methods for extracting interpretations
• Interpretability levels

 Research Challenges:
• A framework that is highly accurate and capable of answering interpretability questions
 Research Goals:
• Stage 1: Searching for the optimum feature subset
• Stage 2: Formulating rules to extract statistical interpretability
• Stage 3: Formulating rules to extract causal interventional and counterfactual interpretability

 Preliminary Work: Under Review, ACS Omega


• “Toxicity prediction using hybridization of shallow neural networks and decision trees”

 Conclusion:
Low and High Risk Classifications

Application | Risk Factor | Criticality | Potential of Knowledge Discovery
Movie Recommendation System | Low Risk | Less Critical | Less
Image Classification (Cat and Dog) | Low Risk | Less Critical | Less
Image Classification (Cancer and Non-Cancer) | High Risk | Highly Critical | Medium
Drug Discovery | High Risk | Highly Critical | High

 Interpretability is crucial

3/45
Introduction 4/45

Classification with a Black-Box Algorithm

[Diagram: a drug discovery company supplies chemical features to a machine learning engineer, who trains a black-box classifier (deep neural network) on those features; the model labels each compound Toxic/Non-Toxic.]

Evaluation Metrics

AUROC: Area Under the Receiver Operating Characteristic curve


True Positive Rate = Sensitivity = TP / (TP + FN): measures the proportion of actual positives that are correctly identified
True Negative Rate = Specificity = TN / (FP + TN): measures the proportion of actual negatives that are correctly identified
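Both metrics and the AUROC can be computed directly; a minimal sketch (assuming scikit-learn; the labels, scores and the 0.5 threshold are illustrative placeholders, not values from this work):

# Minimal sketch: sensitivity, specificity, and AUC-ROC with scikit-learn.
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]                   # illustrative labels (1 = toxic)
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]  # illustrative model scores
y_pred  = [int(s >= 0.5) for s in y_score]           # threshold at 0.5 (illustrative)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)            # true positive rate
specificity = tn / (tn + fp)            # true negative rate
auroc = roc_auc_score(y_true, y_score)  # area under the ROC curve (threshold-free)
print(sensitivity, specificity, auroc)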
Introduction

Analysis

 Interpretability is the degree to which a human can understand the cause of a machine learning decision. (Miller, 2017)

The drug discovery company asks:
1. How did the machine decide that this specific compound is toxic?
2. What makes the compound toxic?
3. How can I change the toxic compound to non-toxic?

With only the evaluation metrics in hand, the machine learning engineer's answer is “I have no clue.” The underlying issues:

 Incompleteness in the problem formulation (Doshi-Velez and Kim 2017)
 Model opaqueness
 Local Comprehension
 Global Comprehension

5/45
Introduction
 Incompleteness in the problem formulation (Doshi-Velez and Kim 2017).

DNN

Chemical Features

Input

1. How did the machine decide that this specific compound is toxic?
2. What makes the compound toxic?
3. How can I change the toxic compound to non-toxic?
6/45
Introduction
 Model opaqueness

o Big pile of incomprehensible noise


o A very big parameter matrix

7/45
 Local Comprehension
Introduction
Chemical Features of a single compound → Weight Matrix → Toxic

 Why did the model decide that this specific compound is toxic?
 Local region of the conditional distribution

8/45
 Global Comprehension
Introduction
Chemical Features of ALL compounds → Weight Matrix → Toxic / Non-Toxic / Toxic / …

 How does the trained model make predictions for all compounds?
 The complete conditional distribution

9/45
Big Research Question

 How to incorporate more details in problem formulation?


 How to tap into additional knowledge captured by the model?

 Using observed data and statistical modelling


 Causal graphical modelling
 Expert knowledge

10/45
Accuracy vs Interpretability

[Chart: Learning Techniques (today), plotted by prediction accuracy vs interpretability. Neural nets / deep learning and ensemble methods (random forests) sit toward high accuracy; SVMs, graphical models (Bayesian belief nets, CRFs, HBNs, MLNs, AOGs, SRL), statistical and Markov models, and decision trees span the rest, with the general trend that the more accurate model families are the less interpretable ones.]

11/45
Model-agnostic methods

Proxy Models – Knowledge Distillation (Ivan Sanchez et al., 2015)
Local Interpretable Model-agnostic Explanations (LIME) (MT Ribeiro et al., 2016)

 Possible, but it is itself a research gap to be filled:

1. How did the machine decide that this specific compound is toxic?
2. What makes the compound toxic?
3. How can I change the toxic compound to non-toxic?

12/45
Model-agnostic methods
Proxy Models – Knowledge Distillation
(Ivan Sanchez et al., 2015)

[Diagram: 1) train the black-box model on the features and target; 2) use the black-box's predicted target values, together with the features, to train an interpretable decision tree; the proxy's prediction accuracy is near to that of the black-box model. The idea is sketched below.]
13/45
Model-agnostic methods
Local Interpretable Model-agnostic Explanations (LIME)

14/45
Model-agnostic methods
Local Interpretable Model-agnostic Explanations (LIME)

Working of LIME:

 Permute data.

 Make predictions on the new data using the complex model.

 Calculate distance between permutations and original observations.

 Pick m features best describing the complex model outcome from the permuted data.

 Fit a simple model to the permuted data with m features and similarity scores as weights.

 Feature weights from the simple model form the explanation of the complex model's local behavior (sketched below).
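A simplified sketch of these steps for tabular data (in the spirit of LIME, not the reference implementation; assumes numpy/scikit-learn, black_box is any fitted classifier exposing predict_proba, and the Gaussian perturbation and exponential kernel are illustrative choices):

# Simplified LIME-style local explanation for one instance x0 (tabular data).
import numpy as np
from sklearn.linear_model import Ridge

def explain_instance(black_box, x0, num_samples=5000, m=5, kernel_width=0.75):
    rng = np.random.default_rng(0)
    # 1) permute: perturb the instance with Gaussian noise
    Z = x0 + rng.normal(scale=1.0, size=(num_samples, x0.shape[0]))
    # 2) predict on the perturbations with the complex model
    y = black_box.predict_proba(Z)[:, 1]
    # 3) similarity of each perturbation to the original observation
    d = np.linalg.norm(Z - x0, axis=1)
    w = np.exp(-(d ** 2) / (kernel_width ** 2))
    # 4) pick the m features most associated with the complex-model outcome
    corr = np.abs([np.corrcoef(Z[:, j], y)[0, 1] for j in range(Z.shape[1])])
    top = np.argsort(corr)[-m:]
    # 5) fit a simple (weighted linear) model on those features
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(Z[:, top], y, sample_weight=w)
    # 6) its coefficients are the local explanation
    return dict(zip(top.tolist(), surrogate.coef_.tolist()))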

17/45
Model-agnostic methods

Drawbacks:

 Mostly applied to text and image datasets (low risk, with less potential for knowledge discovery).

 Not driven by interpretability questions.

 Lacking intelligent feature selection.

 Too much wiggle-room in the optimization.

18/45
Interpretability Levels

Drug Discovery Company

• What if I do x? (Causal Intervention)

 Causal Intervention  p(y | do(x), z) (Judea Pearl, 2018): can be made possible
 Association  p(y | x): impossible (“I have no clue.”)

19/45
Interpretability Levels 20/45

Statistical Modelling
 Association  p(y | x)

Activity: Seeing, Observing

Questions: What is the probability of “y” if we observe “x”?


(How are the variables related, How would seeing “x” change my belief in “y”)

Examples: What does the symptom tell me about a certain disease?

Causal Interventional Modelling (Judea Pearl, 2018)

 Causal Intervention  p(y | do(x), z)

Activity: Doing, Intervening

Questions: What is the probability of “y” if we do “x”? (What if I do?)

Examples: What will happen to my headache if I take aspirin?

Counterfactual Modelling (Judea Pearl, 2018)

 Counterfactuals  p(y_x | x′, y′)

Activity: Imagining, Retrospection

Questions: What if I had done something else?

Examples: What if I had not smoked for the last two years?
Research Challenge

 A framework with the following attributes.

 Classification Performance

 Specifications of interpretability

 Transparency

21/45
Research Proposal

My research focus!

[Framework block diagram: defining interpretability questions; optimum feature subset; statistical modelling → statistical interpretability; causal graphical modelling → causal intervention and counterfactual interpretability; evaluating interpretability ties the framework together.]
22/45
Framework Block diagram
Research Plan

Key points of the proposed method:

 A Shallow Neural Network (SNN) can be used to achieve state-of-the-art performance.

 An optimum feature subset will help us with both interpretability and performance.

 Joint optimization of a neural network fitting function and of the number of features selected.

 Graphical causal modelling for causal interpretations.

23/45
24/45
Preliminary Work
Under Technical Review: ACS Omega (American Chemical Society)
Submission Date: 30th April, 2018
Preliminary Work

Key Attributes:

1) Feature ranking using the decision-tree Gini index.
2) Shallow Neural Network (SNN) (one hidden layer with only 10 neurons) for training.
3) Joint optimization of the SNN and the number of features selected.
4) Answers a few of the statistical interpretability questions.

Key Results:

1) Highest average accuracy on 12 toxicity tasks.
2) Lowest training time with minimum compute resources.
3) Fewer features (~512 out of ~1442) on average across all 12 toxicity tasks.
4) Simple/transparent training model.
5) Class discrimination plots based on the important features.

25/45
Datasets Preliminary Work

Tox21 dataset for the 12 Nuclear Receptor and Stress Response panel assays. It includes train, cross-validation (CV) and test sets.

Task | Train | Toxic/Non-Toxic | CV | Toxic/Non-Toxic | Test | Toxic/Non-Toxic
AHR | 7863 | 937/6926 | 268 | 30/238 | 594 | 73/521
AR | 9036 | 374/7950 | 288 | 3/285 | 573 | 12/559
AR-LBD | 8234 | 284/7950 | 249 | 4/245 | 567 | 8/559
Aromatase | 6959 | 352/6607 | 211 | 18/193 | 515 | 37/478
ER | 7421 | 916/6505 | 261 | 27/234 | 505 | 50/455
ER-LBD | 8431 | 415/8016 | 283 | 10/273 | 585 | 20/565
PPARG | 7883 | 193/7690 | 263 | 15/248 | 590 | 30/560
ARE | 6915 | 1040/5875 | 230 | 47/183 | 540 | 90/450
HSE | 7879 | 386/7493 | 263 | 10/253 | 594 | 19/575
MMP | 7071 | 1117/5954 | 234 | 38/196 | 530 | 58/472
p53 | 8349 | 509/7840 | 265 | 28/237 | 601 | 40/561
ATAD5 | 8775 | 317/8458 | 268 | 25/243 | 606 | 36/570

(Nuclear receptor assays: AHR, AR, AR-LBD, Aromatase, ER, ER-LBD, PPARG. Stress response assays: ARE, HSE, MMP, p53, ATAD5.)

26/45
Framework Preliminary Work

[Framework diagram: Pre-processing (2D feature extraction, ~1442 features; mixing of cross-validation and train sets; minority-class up-sampling; normalization; Train ~8000, Cross-Valid ~296, Final Test ~647), Hybrid Network Optimization (x4: feature selection via decision trees, SNN training, cross-valid ensemble AUC-ROC), and Classification (final test AUC-ROC with the optimized parameters).]
27/45
Hybrid approach: Optimization-I Preliminary Work
Total available features: x1, x2, x3, x4, x5, x6, x7, x8, x9, x10
Feature Selection

1) Decision Tree Classifier


2) Gini-Importance of each feature
3) Mean Gini-Importance

Joint optimization
Selected feature subset: x1, x2, x3, x4, x5, x6, x7, x8, x9, x10

Shallow Neural Network → CV AUC-ROC

28/45
Hybrid approach: Optimization-I Preliminary Work
Total available features: x1, x2, x3, x4, x5, x6, x7, x8, x9, x10
Feature Selection

1) Decision Tree Classifier


2) Gini-Importance of each feature
3) Mean Gini-Importance

Joint optimization
Selected feature subset: x1, x2, x3, x4, x5, x6

Shallow Neural Network → CV AUC-ROC

29/45
Hybrid approach: Optimization-I Preliminary Work
Total available features: x1, x2, x3, x4, x5, x6, x7, x8, x9, x10
Feature Selection

1) Decision Tree Classifier


2) Gini-Importance of each feature
3) Mean Gini-Importance

Joint optimization
Selected feature subset: x1, x2

Shallow Neural Network → CV AUC-ROC

 Optimum feature subset: x1, x2, x6, x9, …

30/45
Hybrid approach: Optimization-II Preliminary Work

 Optimum feature subset (selected by CV AUC-ROC): x1, x2, x6, x9, …

 Optimized parameters for the Neural Network: epochs, initialization function, drop-out, activation, mini-batch size (the overall loop is sketched below).
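A minimal sketch of this two-stage search (assuming scikit-learn; the candidate subset sizes, the single 10-neuron hidden layer, and the data splits are illustrative, not the exact settings of the submitted paper):

# Sketch: Gini-based feature ranking + shallow neural network, selected by CV AUC-ROC.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

def hybrid_select(X_tr, y_tr, X_cv, y_cv, candidate_sizes=(64, 128, 256, 512, 1024)):
    # X_* are numpy arrays; y_* are binary labels.
    # 1) rank all features by Gini importance from a decision tree
    ranking = np.argsort(DecisionTreeClassifier(random_state=0)
                         .fit(X_tr, y_tr).feature_importances_)[::-1]
    best = (None, -1.0, None)
    # 2) jointly search over feature-subset size and a shallow (one hidden layer) network
    for k in candidate_sizes:
        cols = ranking[:k]
        snn = MLPClassifier(hidden_layer_sizes=(10,), max_iter=300, random_state=0)
        snn.fit(X_tr[:, cols], y_tr)
        auc = roc_auc_score(y_cv, snn.predict_proba(X_cv[:, cols])[:, 1])
        if auc > best[1]:
            best = (cols, auc, snn)
    return best  # (selected feature indices, CV AUC-ROC, trained SNN)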
31/45
Final Test Results Preliminary Work

Performance on the Tox21 dataset for the 12 Nuclear Receptor and Stress Response panel assays.

Task | Features Selected | Hybrid Model AUC-ROC (Test Set) | RF AUC-ROC (Test Set) | SVM AUC-ROC (Test Set)
AHR | 270 | 0.921 | 0.907 | 0.889
AR | 284 | 0.743 | 0.638 | 0.730
AR-LBD | 365 | 0.881 | 0.800 | 0.702
Aromatase | 815 | 0.794 | 0.792 | 0.782
ER | 292 | 0.822 | 0.778 | 0.791
ER-LBD | 755 | 0.836 | 0.768 | 0.786
PPARG | 528 | 0.858 | 0.789 | 0.744
ARE | 615 | 0.828 | 0.774 | 0.779
HSE | 1028 | 0.832 | 0.859 | 0.798
MMP | 685 | 0.958 | 0.978 | 0.916
p53 | 223 | 0.875 | 0.847 | 0.810
ATAD5 | 390 | 0.820 | 0.812 | 0.765
Average | 521 | 0.847 | 0.812 | 0.791

32/45
Performance Comparison Preliminary Work
Comparative analysis of different methods used for NR and SR toxicity prediction on the Tox21 data set.

Name | NR Average AUC-ROC | SR Average AUC-ROC | Total Average AUC-ROC
Our Method | 0.836 | 0.862 | 0.847
DeepTox | 0.826 | 0.858 | 0.846
AMAZIZ | 0.816 | 0.854 | 0.838
Capuzziet | 0.831 | 0.848 | 0.840
dmlab | 0.811 | 0.85 | 0.824
T | 0.798 | 0.842 | 0.823
SMILES2VEC | NA | NA | 0.810
microsomes | 0.785 | 0.814 | 0.810
filipsPL | 0.765 | 0.817 | 0.798
Charite | 0.75 | 0.811 | 0.785
RCC | 0.751 | 0.781 | 0.772
frozenarm | 0.759 | 0.768 | 0.771
ToxFit | 0.753 | 0.756 | 0.763
CGL | 0.72 | 0.791 | 0.759
SuperTox | 0.682 | 0.768 | 0.743
kibutz | 0.731 | 0.731 | 0.741
MML | 0.7 | 0.753 | 0.734
NCI | 0.651 | 0.791 | 0.717
VIF | 0.702 | 0.692 | 0.708
Chemception | 0.787 | 0.739 | 0.773
Toxic Avg | 0.659 | 0.607 | 0.644
Swamidass | 0.596 | 0.593 | 0.576

Competitive Landscape: training time and model complexity of the top 5 toxicity prediction models

Models | Method | Features | Training Time | Average AUC-ROC
Our | DT + SNN | 521 | ~1 min CPU | 0.847
DeepTox | DNN | 273577 | ~10 min GPU | 0.846
AMAZIZ | ASNN | NA | NA | 0.838
Capuzziet | DNN | 2489 | NA | 0.840
dmlab | RF + ET | 681 | ~13 sec CPU | 0.824

• RF + ET: random forest and extra trees classifier
• DNN: deep neural network
• DT: decision tree
• SNN: shallow neural network
• ASNN: associative neural network

33/45
Feature Importance Preliminary Work

(a) Cumulative Gini index score of 1422 features across 12 toxicity tasks
(b) Average ranking of 1422 features against cumulative Gini index score in all 12 tasks
(c) Ranking of the top 29 features

Knowledge Extraction

34/45
Class Discrimination Screen Preliminary Work

Cut-off

Knowledge Extraction

35/45
Class Discrimination Screen Preliminary Work

36/45
Research Plan

Feature Selection Module of the Interpretability Framework:

 Defining a better and more efficient way of feature ranking, and optimizing the number of features selected for training.

Statistical Interpretation Module of the Interpretability Framework:

 Creating statistical interpretations from the trained statistical model in the context of the already defined interpretability questions.

Causal Interpretation Module of the Interpretability Framework:

 Creating causal interpretations from the graphical causal model in the context of the already defined interpretability questions.
37/45
Research Plan (Feature Selection)

Search for optimum feature subset: Searching the optimum subset of features from the total available feature space.

Research Challenge: Defining criteria for selecting the feature subset, adding or eliminating features

Gains: 1) Reduced number of features


2) Relatively less complex training model is required
3) More comprehensible

Total available features: x1, x2, x3, x4, x5, x6, x7, x8, x9, x10

Joint optimization
Feature
Selection/Ranking

Selected feature subset: x1, x2, x3, x4

Statistical Model training

Objective Function 38/45


Research Plan (Feature Selection)

Search for optimum feature subset: Searching the optimum subset of features from the total available feature space.

Research Challenge: Defining criteria for selecting the feature subset, adding or eliminating features

Gains: 1) Reduced number of features


2) Relatively less complex training model is required
3) More comprehensible

Total available features: x1, x2, x3, x4, x5, x6, x7, x8, x9, x10

Joint optimization
Feature
Selection/Ranking

Selected feature subset: x1, x2, x3, x5, x7, x9

Statistical Model training

Objective Function 39/45


Research Plan (Statistical Interpretability)

Guided by the predefined statistical interpretability questions:

 Applying “LIME”

 Applying “Proxy Models”

 We want to make use of the “Universal Approximation Theorem”

 Devising our own algorithm (inspired by shortcomings in LIME, proxy models and decision trees)

40/45
Research Plan (Causal Interpretability)
Inspired by Judea Pearl's work: causal interventional and counterfactual questions can be addressed using “Causal Bayesian Networks and Structural Equation Models”. (Pearl, 2000, Chapter 3) and (Judea Pearl, 2018)
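As a concrete illustration of the kind of identity these models licence (the standard back-door adjustment from Pearl's framework, a textbook result rather than a contribution of this work): if a set of covariates Z blocks all back-door paths from X to Y, then

p(y | do(x)) = Σ_z p(y | x, z) · p(z)

so the interventional distribution can be estimated from purely observational quantities.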

[Diagram: observed data, statistical interpretations and expert knowledge feed the causal graphical modelling, which addresses the predefined causal interpretability questions.]
41/45
Research Plan

42/45
Research Papers Progress
Submitted to ACS Omega :
“Toxicity prediction using hybridization of shallow neural networks and decision trees.”

To be completed by 30th July:


“Roadmap towards machine learning complete interpretability”

Yet to begin:
“An interpretable framework to classify chemical toxicity”

43/45
Conclusion

[Conclusion diagram: interpretability questions (statistical and causal) for a specific high-risk domain, high accuracy, answering the questions, and helping the user to understand the problem.]

44/45
45/45
Forward pass of a Reasonable Deep Neural Network
(example input: the Places scene-centric database, http://places.csail.mit.edu/)

Activation: g(wx) = 1 / (1 + e^(-wx))

A[1] = g(W[1] X)    A[2] = g(W[2] A[1])    A[3] = g(W[3] A[2])    A[4] = g(W[4] A[3])

X: 12288 x 10 million    A[1], A[2], A[3]: 10 x 10 million    A[4]: 1 x 10 million
W[1]: 10 x 12288    W[2]: 10 x 10    W[3]: 10 x 10    W[4]: 1 x 10

The error E depends on (W[1], W[2], W[3], W[4]).
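A minimal numpy sketch of this forward pass (random weights and a small batch as placeholders; the layer shapes follow the slide):

# Forward pass sketch for the 4-layer network above (numpy; random weights, small batch).
import numpy as np

def g(z):                      # logistic activation g(wx) = 1 / (1 + e^(-wx))
    return 1.0 / (1.0 + np.exp(-z))

m = 1000                       # batch of examples (the slide uses ~10 million)
X  = np.random.randn(12288, m)           # input: 12288 features per example
W1 = np.random.randn(10, 12288) * 0.01   # W[1]: 10 x 12288
W2 = np.random.randn(10, 10)   * 0.01    # W[2]: 10 x 10
W3 = np.random.randn(10, 10)   * 0.01    # W[3]: 10 x 10
W4 = np.random.randn(1, 10)    * 0.01    # W[4]: 1 x 10

A1 = g(W1 @ X)     # 10 x m
A2 = g(W2 @ A1)    # 10 x m
A3 = g(W3 @ A2)    # 10 x m
A4 = g(W4 @ A3)    # 1 x m, the network's output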
46
Introduction

Today's Task: Training Data → Machine Learning Process → Learned Function → Decision or Recommendation → User
• Why did you do that?
• Why not something else?
• When do you succeed?
• When do you fail?
• When can I trust you?
• How do I correct an error?

XAI Task: New Training Data → Machine Learning Process → Explainable Model + Explanation Interface → User
• I understand why
• I understand why not
• I know when you succeed
• I know when you fail
• I know when to trust you
• I know why you erred

Credit: DARPA XAI

 Interpretability is the degree to which a human can understand the cause of a machine learning decision. (Miller, 2017)
47
Introduction

Why not just trust the model and ignore why it made a certain decision?

We should not, because:

 A “single evaluation metric” is an incomplete description of most real-world tasks.
 Incompleteness in the problem formulation (Doshi-Velez and Kim 2017).
 Goal of science
 Safety measures
 Detect bias
 Manage social interactions

48
Scope of Interpretability

 Algorithm transparency
How does the algorithm create the model?

 Global, Holistic Model Interpretability


How does the trained model make predictions?
o The complete conditional distribution.

 Local Interpretability for a Single Prediction


Why did the model make a specific decision for an instance?
o Local region of the conditional distribution.

49
Evaluating Interpretability
Doshi-Velez and Kim (2017) propose three major levels of evaluating interpretability:

 Application level evaluation (real task)


How would an expert explain the decision?

 Human level evaluation (simple task)


How would a lay human explain the decision?

 Function level evaluation (proxy task)


Does not require any humans.

More dimensions to interpretability evaluations

• Model sparsity: How many features are being used by the explanation?
• Monotonicity: Is there a monotonicity constraint?
• Interactions: Is the explanation able to include interactions of features?
• Cognitive processing time: How long does it take to understand the explanation?
• Feature complexity: What features were used for the explanation?
• Description length of explanation.
50
Human-style Explanations
 Research from the humanities can help us to figure that out (Miller, 2017)

 Short explanations
 Abnormal causes

 What is an explanation?

An explanation is the answer to a why-question ( Miller 2017).

• Why did the treatment not work on the patient?


• Why was my loan rejected?

 Questions starting with “how” can usually be turned into “why” questions: “How was my loan rejected?” can be turned into “Why was my loan rejected?”.

51
Human-style Explanations
 What is a good explanation?

“Many artificial intelligence and machine learning publications and methods claim to be about ‘interpretable’ or
‘explainable’ methods, yet often this claim is only based on the authors intuition instead of hard facts and
research.” - Miller (2017)

Explanations are contrastive (Lipton 2016).
Explanations are selected.
Explanations are social.
Explanations focus on the abnormal (Kahneman 1981).
Good explanations are general and probable.

52
Interpretable Models
 Linear regression model

y = a0 + a1 x1 + a2 x2 + a3 x3 + …

 Assumptions (often not satisfied in real-world problems): linearity, normality, homoscedasticity, independence, fixed features, absence of multicollinearity.

 Logistic regression model

p(y = 1) = 1 / (1 + e^(-(a0 + a1 x1 + a2 x2 + a3 x3)))

 The interpretations always come with the clause that “all other features stay the same”.

Drawbacks (linear and logistic regression):
 Interactions have to be hand-crafted
 Poor performance in real-world tasks
 Fail when the relation between target and features is non-linear
 Fail when features interact with each other

 Decision Tree Model

Drawbacks (decision trees):
 Handling of linear relationships
 Lack of smoothness
 Unstable
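A minimal sketch of how interpretations are read directly off these two model classes (assuming scikit-learn; the data set is a placeholder):

# Sketch: reading interpretations from a logistic regression and a decision tree.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=5, random_state=0)  # placeholder data
names = [f"x{i+1}" for i in range(X.shape[1])]

logit = LogisticRegression(max_iter=1000).fit(X, y)
# each coefficient a_i is the change in log-odds per unit of x_i, all other features held equal
print(dict(zip(names, logit.coef_[0].round(3))))

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=names))   # human-readable if-then splits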
53
Model-agnostic methods

Partial Dependence Plot (PDP)


 Interpreting complex machine learning algorithms
 Graphical visualizations of the marginal effect of a given
variable (or multiple variables) on an outcome.
 Restricted to only one or two variables due to limit of
human perception.
 May mislead due to higher-order interactions.
 Assumption of independence is a big issue.

Individual Conditional Expectations (ICE)


 For a chosen feature, an individual conditional expectation plot draws one line per instance, representing how that instance's prediction changes when the feature changes (see the sketch below).

Credit: Scikit-learn
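A minimal sketch of the computation behind both plots (assuming numpy; model is any fitted estimator exposing predict_proba, and the feature index and value grid are illustrative):

# Minimal sketch of ICE curves and a partial dependence curve for one feature.
import numpy as np

def ice_and_pdp(model, X, feature, grid):
    # ICE: one curve per instance, varying `feature` over `grid` with everything else fixed
    ice = np.empty((X.shape[0], len(grid)))
    for j, v in enumerate(grid):
        Xv = X.copy()
        Xv[:, feature] = v
        ice[:, j] = model.predict_proba(Xv)[:, 1]
    # PDP: the average of the ICE curves (marginal effect of the feature)
    pdp = ice.mean(axis=0)
    return ice, pdp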

54
Research Challenges
Research Goal-I:
 Need for generic interpretability questions

What is Science?
 Sciences are primarily defined by their questions rather than by tools.

o Astrophysics: the discipline that learns the composition of stars, not the discipline that uses spectroscopes.

o Machine Learning Interpretability: should be a discipline that answers generic questions related to interpretability.

 The questions should fulfil the criteria of “Human style good explanations”

55
Research Challenges

Research Goal-III:

 Evaluating the interpretability

• Model sparsity: How many features are being used by the explanation?
• Monotonicity: Is there a monotonicity constraint?
• Interactions: Is the explanation able to include interactions of features?
• Cognitive processing time: How long does it take to understand the explanation?
• Feature complexity: What features were used for the explanation?
• Description length of explanation.

56
Research Plan

Formulation of general interpretability questions

• Which feature(s) are the most important ones in the context of classification or prediction? (Statistical)
• What is the range of values for the selected important features that discriminates a specific class? (Statistical)
• How did the model arrive at a certain decision? (Statistical and Transparency)
• How sensitive is the class output to a specific feature? (Statistical and Causal Intervention)
• How does the final class output change if we force some specific feature to take a value that is not observed in the data? (Causal Intervention)
• How effective is a specific feature in producing a specific class? (Causal Effectiveness)
• Which features (or their specific values), had they not occurred, would have prevented the specific class from occurring? (Counterfactual)

…….. ongoing

57
Tox21 dataset Chemical Feature

NAME | ACTIVITY | piPC8 | piPC9 | piPC10
NCGC00261443 | 0 | 6.752065 | 6.714265 | 6.822368
NCGC00261266 | 0 | 3.944006 | 3.446011 | 0
NCGC00261559 | 0 | 6.243256 | 6.376833 | 6.534198
NCGC00261121 | 0 | 6.210537 | 6.451753 | 6.655762
NCGC00261374 | 0 | 5.182836 | 0 | 0
NCGC00261612 | 0 | 5.153653 | 4.928611 | 4.878531
NCGC00261002 | 1 | 7.0973 | 7.192502 | 7.409491
NCGC00261311 | 0 | 5.281934 | 4.447053 | 3.66276

58
59
60
Post-Hoc interpretations

[Diagram: post-hoc interpretations: capture, learn, extract, inform.]

61
Literature Review

 Classification using k-nearest neighbours (KNN) and Support Vector Machines (SVM)
(Chavan, Friedman, & Nicholls, 2015; Kauffman & Jurs, 2001) (Ajmani, Jadhav, & Kulkarni, 2006) (Keiser et al., 2009)

(Chavan et al., 2015) KNN (Kauffman & Jurs, 2001) KNN (Ajmani et al., 2006) KNN
Train Test Features Train Test Features Train Test Features
94 24 8 314 74 30 3

Downsides of KNN:
 Scalability and computational cost issues (Deng & Zhao, 2017)
 Gives optimum accuracy only for a small data set with few features
 KNN can be used for relatively smaller data sets with few features

 A non-linear SVM can handle high-dimensional data but is not robust enough to handle the diversity of chemical descriptors (Svetnik et al., 2003).
 Mostly not state-of-the-art classification accuracy

A better classification model (Svetnik et al., 2003):
 Classification accuracy is high
 Can handle molecular diversity (large and diverse descriptors)
 Ease in training
 Interpretability

 Wolpert's No Free Lunch theorem (The Mathematics of Generalization; Wolpert, D. H., Ed.; Addison-Wesley: Reading, 1995)
Literature Review

 Classification using Random Forest (Svetnik et al., 2003) (Polishchuk et al., 2009) (Tong, Hong, Fang, Xie, & Perkins, 2003)
 Capable of handling large data sets with diverse features
 Not state-of-the-art accuracy on many cheminformatics data sets
 Less control over granular-level optimization

A better classification model (Svetnik et al., 2003): classification accuracy is high, can handle molecular diversity, ease in training, interpretability.
 Wolpert's No Free Lunch theorem

A Deep Neural Network is an Artificial Neural Network with more than one hidden layer between the input and output. (Bengio, 2009) (Schmidhuber, 2015) (Goodfellow, Bengio, Courville, & Bengio, 2016)

Conditions for a DNN (Nasrabadi, 2007) (Trunk, 1979) (Schmidhuber, 2015) (Goodfellow et al., 2016):
 Huge training data with many, many features (relevant and irrelevant)
 The network should contain more than one hidden layer with many neurons

Curse of dimensionality (Nasrabadi, 2007) (Trunk, 1979)


Increasing the accuracy of a shallow neural network

[Chart (axes: performance vs amount of data): curves for deep, medium and shallow neural networks and traditional learning algorithms. Annotation: “We can reduce this gap if we train the shallow neural network using an optimum number of relevant features.”]

(https://blogs.nvidia.com/wp-content/uploads/2016/07/Deep_Learning_Icons_R5_PNG.jpg.png)
Relevant Features Selection using Decision Tree

• The relative rank (i.e. depth) of a feature used as a decision node in a tree.

• Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples.

• The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features; on average, the feature with the highest such contribution is the most important one.

• Decision trees use the Gini index to determine which variables go at the top (see the sketch below).
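A minimal sketch of reading these Gini-based importances off a fitted tree and keeping, say, the above-average features (assuming scikit-learn; the data set and the mean-importance cut-off are illustrative):

# Sketch: Gini-based feature importance from a decision tree (scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)  # placeholder data
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

importances = tree.feature_importances_        # Gini importance of each feature
ranked = np.argsort(importances)[::-1]         # most important features first
keep = ranked[importances[ranked] > importances.mean()]  # e.g. keep above-average features
print("top features:", ranked[:10], "kept:", len(keep))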
66
