
03/12/2019

Robustness of
Evaluation Metrics
For Probabilistic Predictions of
Binary Outcomes

Erine de Leeuw
Supervisor: Dr. Mikhail Zhelonkin
Second Assessor: Dr. Andreas Alfons
Motivating examples of forecasts going wrong:
https://www.cnbc.com/2019/11/04/twilio-says-it-got-the-math-wrong-on-its-full-year-earnings-forecast-last-week.html
https://www.express.co.uk/news/science/1181185/Climate-change-news-climate-forecast-wrong-global-warming-rising-temperatures
http://www.spectator.co.uk/features/8959941/whats-wrong-with-the-met-office/
Introduction
Research Gap

• Croux et al. (2008) investigated robustness properties of the error rate
• This has not yet been done for probabilistic forecasts
• Add more recent data sets
Agenda
Introduction

Methodology

Robustness Properties

Results

Conclusions and Discussion

Questions
2 Methodology
Methodology
Generalized Linear Model (GLM)

• Logistic regression is solved using maximum likelihood estimation (MLE)
• The MLE is an M-estimator
• Influence function of an M-estimator:

  \mathrm{IF}(z; T, F) = M(\psi, F)^{-1}\, \psi(z, T(F)), \qquad M(\psi, F) = -\int \frac{\partial \psi(z, \theta)}{\partial \theta}\Big|_{\theta = T(F)} \, dF(z)

• Score function of classic logistic regression:

  \psi(x, y; \beta) = \big(y - \mu(x)\big)\, x, \qquad \mu(x) = \frac{\exp(x^\top \beta)}{1 + \exp(x^\top \beta)}

• Looking at \psi, the IF is unbounded w.r.t. x (see the sketch below)
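A minimal numerical illustration in Python of this unboundedness; the coefficients and the leverage point are hypothetical, chosen only to show the effect:

```python
import numpy as np

def logistic_score(x, y, beta):
    """Score contribution psi(x, y; beta) = (y - mu) * x of one observation,
    where mu = expit(x' beta) is the fitted success probability."""
    mu = 1.0 / (1.0 + np.exp(-x @ beta))
    return (y - mu) * x

beta = np.array([0.5, -1.0])          # hypothetical coefficients
for scale in [1, 10, 100, 1000]:
    x = scale * np.array([1.0, 1.0])  # leverage point pushed outward
    print(scale, logistic_score(x, 1, beta))
# y - mu stays in [-1, 1], but the factor x is unbounded,
# so the influence function is unbounded in the design space.
```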
Methodology
Generalized Linear Model

• Cantoni and Ronchetti (2001) introduced a robust version by changing the score function \psi to:

  \psi(x, y; \beta) = \psi_c(r)\, w(x)\, \frac{1}{\sqrt{V(\mu)}}\, \frac{\partial \mu}{\partial \beta} - a(\beta)

  1. \psi_c (a Huber function) and the weights w(x) ensure robustness
  2. where r = (y - \mu)/\sqrt{V(\mu)} is the Pearson residual, V(\mu) = \mu(1 - \mu), and \partial\mu/\partial\beta = f(x^\top \beta)\, x with f the pdf of the logistic distribution
  3. the correction term a(\beta) ensures Fisher consistency

The resulting estimator is:
• Consistent
• Asymptotically normal
• Less efficient in parameter estimation
(a sketch of the estimating equations follows below)
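A minimal Python sketch of these estimating equations for the binomial GLM with logit link. It assumes w(x) ≡ 1 (the Huber-type variant; the Mallows variant adds leverage weights) and hypothetical data; it is not the thesis implementation:

```python
import numpy as np
from scipy.optimize import root

def huber_psi(r, c=1.345):
    """Huber psi function: clips Pearson residuals at +/- c."""
    return np.clip(r, -c, c)

def robust_logit_equations(beta, X, y, c=1.345):
    """Estimating equations of Cantoni and Ronchetti (2001) for the
    binomial GLM with logit link, with w(x) = 1 for simplicity."""
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    v = np.sqrt(mu * (1.0 - mu))   # sqrt of the variance function V(mu)
    r = (y - mu) / v               # Pearson residuals
    # bias-correction term a(beta): expectation of psi over y in {0, 1}
    e_psi = mu * huber_psi((1 - mu) / v, c) + (1 - mu) * huber_psi(-mu / v, c)
    return X.T @ ((huber_psi(r, c) - e_psi) * v)

# usage with hypothetical data
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
beta_true = np.array([0.0, 1.0, -1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))
fit = root(robust_logit_equations, x0=np.zeros(3), args=(X, y))
print(fit.x)
```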
Methodology
Machine Learning Methods and Calibration Methods

Machine Learning Methods


 Random Forest (RF)
 Support Vector Machine (SVM)

Calibration Methods
 Isotonic Regression (iso) by Zadrozny and Elkan (2002)
 Platt Scaling (Platt) by Platt (1999)

Why were these methods chosen?


Methodology
Evaluation Metrics

Strictly Proper Scoring Rules


• Brier score (BS): \mathrm{BS}(p, y) = (p - y)^2
• Logarithmic score (LS): \mathrm{LS}(p, y) = -\big[\, y \log p + (1 - y) \log(1 - p) \,\big]
• Spherical score (SP): \mathrm{SP}(p, y) = \dfrac{p y + (1 - p)(1 - y)}{\sqrt{p^2 + (1 - p)^2}}
• For a set of forecasts, the score is the average of the individual scores, viewed as a population quantity
• The functional of an evaluation metric is denoted \tau (implementations sketched below)
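A minimal Python sketch of the three scores and their sample means for hypothetical forecasts; the orientation conventions follow the standard definitions and may differ in sign from the thesis:

```python
import numpy as np

def brier(p, y):
    """Brier score: squared error of the forecast (lower is better)."""
    return (p - y) ** 2

def log_score(p, y):
    """Logarithmic score: negative log-likelihood (lower is better)."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def spherical(p, y):
    """Spherical score: probability placed on the realized outcome,
    normalized by the Euclidean norm of the forecast (higher is better)."""
    return (y * p + (1 - y) * (1 - p)) / np.sqrt(p ** 2 + (1 - p) ** 2)

# mean scores over a set of forecasts: the sample analogue of tau(F)
p = np.array([0.9, 0.2, 0.6]); y = np.array([1, 0, 1])
print(brier(p, y).mean(), log_score(p, y).mean(), spherical(p, y).mean())
```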
3 Robustness Properties
Robustness Properties
Theoretical Background

Population version of the mean forecast score:

  \tau(F) = \int s\big(p(x), y\big)\, dF(z), \qquad z = (x, y)

(\tau: functional calculating the mean of the forecasts; s: representation of the score)

Von Mises expansion:

  \tau(G) = \tau(F) + \int \mathrm{IF}(z; \tau, F)\, d(G - F)(z) + \text{remainder}

where \mathrm{IF}(z; \tau, F) is the influence function of the evaluation metric:

  \mathrm{IF}(z; \tau, F) = \lim_{\varepsilon \downarrow 0} \frac{\tau\big((1 - \varepsilon) F + \varepsilon \Delta_z\big) - \tau(F)}{\varepsilon}
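A finite-difference Python sketch of this influence function for the empirical mean-score functional; the contamination point and the Brier scores are hypothetical:

```python
import numpy as np

def tau(scores):
    """Empirical version of the functional: the mean forecast score."""
    return scores.mean()

def influence(score_z, scores, eps=1e-6):
    """Finite-difference IF: [tau((1-eps)F + eps*Delta_z) - tau(F)] / eps."""
    tau_f = tau(scores)
    tau_eps = (1 - eps) * tau_f + eps * score_z
    return (tau_eps - tau_f) / eps

# hypothetical Brier scores of three forecast-outcome pairs
scores = (np.array([0.9, 0.2, 0.6]) - np.array([1, 0, 1])) ** 2
print(influence((0.01 - 1) ** 2, scores))   # equals score(z) - mean(scores)
# The mean-score functional is linear, so its IF is score(z) - tau(F):
# the same form as the IF of the expectation.
```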
Robustness Properties
Analysis of Mean Brier Score

1. Estimated value is close to the true value for a correctly specified assumed model
   - Integrating this term over F makes it small
2. Same form as the IF of the expectation (Hampel et al. 1986)


Robustness Properties
Analysis of Mean Logarithmic Score

1. Estimating equations of the MLE of logistic regression
   - Integrating this term over F makes it small
2. Same form as the IF of the expectation (Hampel et al. 1986)

Under suitable regularity conditions (Fernholz 1983), the loss of efficiency vanishes.
4 Results
Results
Numerical Examples in Thesis

Simulations:
• Correctly specified model
• Heavy-tailed error
• Misspecified link function

Applications:
• Vaso constriction
• Banknote authentication
• Breast cancer
• Credit card
• Food stamp
• Leukemia

Simulation set-up:
• Binomial distribution (canonical link)
• Three explanatory variables with zero mean and a Toeplitz matrix as covariances
• Noisy linear predictor (LP1) or discriminatory linear predictor (LP2) used
• R = 500 replications
• Moderate or heavy outliers can be added (a sketch of this set-up follows below)
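A Python sketch of this set-up; the sample size, the Toeplitz correlation parameter, and the coefficient vector are assumptions for illustration, since the slides do not specify LP1/LP2 exactly:

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(2019)
n, R = 200, 500                        # n per replication is an assumption
Sigma = toeplitz(0.5 ** np.arange(3))  # Toeplitz covariances; rho = 0.5 assumed

for _ in range(R):
    X = rng.multivariate_normal(np.zeros(3), Sigma, size=n)  # zero-mean regressors
    eta = X @ np.array([1.0, 1.0, 1.0])   # linear predictor; coefficients assumed
    p = 1.0 / (1.0 + np.exp(-eta))        # canonical (logit) link
    y = rng.binomial(1, p)                # binomial outcomes
    # ... fit classic and robust GLM, optionally contaminate with outliers,
    # ... then evaluate scores on a clean test set
```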
Results
Numerical Examples Explained Here

Simulations:
• Correctly specified model
• Heavy-tailed error
• Misspecified link function

Applications:
• Vaso constriction
• Banknote authentication
• Breast cancer
• Credit card
• Food stamp
• Leukemia
Results Simulations
Sensitivity Analysis at the Model

• Standardized bias due to one outlier in the training set
• Bias is negligible
• Similar results obtained by Croux et al. (2008)
Results Simulations
Scores of Simulation

• For LP1 and moderate outliers: no clear discrimination
• For LP2 and moderate outliers: better discrimination
Results Simulations
Asymptotic Relative Efficiencies
At the (correctly specified) model:
Results Simulations
Reliability Curves of Simulation of LP1
Results Simulations
Reliability Curves of Simulation of LP1
Results Application
Scores and Reliability Curves of Vaso Constriction Application
When one influential observation is in training set:
5 Conclusions and Discussion
Conclusions

1. Predictions from robust GLM are at least as good as those from classic GLM
Supported by:
• Equal mean scores in the uncontaminated setting
• Better mean scores under training contamination
• In all numerical examples the robust GLM obtained reliable scores

2. Calibration improves reliability
Supported by:
• Reliability plots have straighter lines after calibration
• Scores improve after calibration

3. Out of all machine learning methods considered, SVM gives the most stable forecasts
Supported by:
• Reliability curves most stable
• Effect of contamination in the testing set much less extreme than for GLM
• In some applications better than GLM
6 Questions
Appendix
Methodology
Machine Learning Methods

• Random Forest (RF):
  - The probabilistic estimate is based on a collection of classification rules (trees)
  - The goal of one tree is to predict the dependent variable by estimating its expected value given the explanatory variables, using the training set
  (a sketch follows below)
• Support Vector Machine (SVM):
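A minimal scikit-learn sketch with hypothetical data: the forest's probability estimate is the average of the per-tree estimates.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))                          # hypothetical regressors
y = (X[:, 0] + rng.normal(size=300) > 0).astype(int)   # hypothetical outcomes

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
# Each tree estimates E[y | x] on its bootstrap sample; the forest's
# probability estimate averages these per-tree estimates.
print(rf.predict_proba(X[:5])[:, 1])
```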
Methodology
Machine Learning Methods

• Random Forest (RF):
• Support Vector Machine (SVM):
  - SVM uses kernels as a nonlinear higher-dimensional mapping tool, so that nonlinear classification becomes linear
  - SVM solely results in a decision boundary
  - A calibration method is needed to obtain the resulting probability estimates (a sketch follows below)
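A minimal scikit-learn sketch of this pipeline with hypothetical data: an RBF-kernel SVM produces only decision values, and Platt scaling (method="sigmoid") maps them to probabilities.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))                                   # hypothetical data
y = (X[:, 0] - X[:, 1] + rng.normal(size=300) > 0).astype(int)

svm = SVC(kernel="rbf")                   # alone, yields only a decision boundary
platt = CalibratedClassifierCV(svm, method="sigmoid", cv=5).fit(X, y)
print(platt.predict_proba(X[:5])[:, 1])   # calibrated probability estimates
```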
Methodology
Calibration Methods

• Isotonic Regression (iso) by Zadrozny and Elkan (2002):
  - Assumes monotonicity of the uncalibrated posterior probabilities
  - Pool Adjacent Violators Algorithm (PAVA) generally used:
    1. Sort all observations based on the uncalibrated probabilities
    2. Set the calibrated probabilities equal to the true outcomes
    3. If the ordered value of a pair is greater than the true outcome → placed in a new list
    4. New calibrated values equal the average of the actual and the next true outcome in the new list
    5. Repeat until there is no violation of the ordering
• Platt Scaling (Platt) by Platt (1999):
  1. The uncalibrated estimates (for SVM: the distance to the decision boundary) are transformed
  2. A posterior predictive cumulative distribution function is fit using logistic regression
Methodology
Calibration Methods

• Isotonic Regression (iso) by Zadrozny and Elkan (2002):
  - Assumes monotonicity of the uncalibrated posterior probabilities
  - Pool Adjacent Violators Algorithm (PAVA) generally used:
    1. All observations are sorted based on the uncalibrated probabilities of the calibration training set
    2. The calibrated probabilities are first taken equal to the true outcomes of the calibration training set
    3. Whenever for a consecutive pair the ordered value is greater than the true outcome, the pair is placed in a new list
    4. Their new calibrated values equal the average of the previous true outcome and the next true outcome in the sorted list
    5. The whole process is repeated until no violation of the ordering occurs
• Platt Scaling (Platt) by Platt (1999):
  - Assumes the reliability curve can be improved by parametric rescaling
  1. The uncalibrated estimates (for SVM: the distance to the decision boundary) are transformed
  2. A posterior predictive cumulative distribution function is fit using logistic regression
(a sketch of both calibrators follows below)
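A minimal scikit-learn sketch of both calibrators; the raw probabilities and outcomes are hypothetical. sklearn's IsotonicRegression implements PAVA, and a one-dimensional logistic regression on the raw estimates mimics Platt scaling:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

p_raw = np.array([0.1, 0.40, 0.35, 0.8, 0.9])  # uncalibrated estimates (hypothetical)
y = np.array([0, 1, 0, 1, 1])                  # outcomes on the calibration set

# Isotonic regression: PAVA fits a monotone step function to the outcomes
iso = IsotonicRegression(out_of_bounds="clip").fit(p_raw, y)
print(iso.predict(p_raw))

# Platt scaling: a one-dimensional logistic regression on the raw estimates
platt = LogisticRegression().fit(p_raw.reshape(-1, 1), y)
print(platt.predict_proba(p_raw.reshape(-1, 1))[:, 1])
```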
Appendix Robustness
Proof of the functionals
Results
Reliability Curves of Simulation of LP1
Appendix Outlier Contamination
Average of Brier scores and logarithmic scores
Appendix
Reliability
Appendix
Heavy-tailed Error
Appendix
Reliability
Appendix
Reliability
Appendix
Reliability
Appendix
Reliability
Appendix
Misspecified Link Function
Appendix
Reliability
Appendix
Reliability
Appendix
Reliability
Appendix
Additional Simulations for Average Score
Appendix
Additional Results of Heavy-tailed Error Contamination
Results
Scores and Reliability Curves of Vaso Constriction Application
When both influential observations in training set:
Appendix Applications
Banknote Data
Appendix Applications
Breast Cancer Data
Appendix Applications
Credit Card Data
Appendix Applications
Food Stamp Data
Appendix Applications
Leukemia Data
Appendix Applications
Types of data points
Appendix Applications
Fisher consistency

Let F_\theta denote a distribution with parameter \theta, and let g_\theta be the parametric function one wants to estimate. Fisher consistency of a functional T occurs when:

  T(F_\theta) = g_\theta \quad \text{for all } \theta
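A standard textbook instance (not specific to the thesis): the mean functional is Fisher consistent for the location parameter of a normal model.

```latex
T(F) = \int y \, \mathrm{d}F(y), \qquad F_\theta = N(\theta, 1)
\;\Longrightarrow\; T(F_\theta) = \theta \quad \text{for all } \theta .
```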
Appendix Applications
Densities of Distributions Used
Appendix Applications
Influence Function and Breakdown Point

The IF can be used to approximate the bias, while the neighborhood in which this approximation is useful is determined by the breakdown point.

For non-linear models in an i.i.d. setting, the breakdown point is the minimum fraction of contamination needed to drive the estimator to the edge of the parameter space.
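Written out, the approximation the slide refers to is the standard first-order robust-statistics expansion (cf. Hampel et al. 1986), useful only for contamination fractions \varepsilon below the breakdown point:

```latex
T\big((1 - \varepsilon)F + \varepsilon \Delta_z\big) - T(F)
\;\approx\; \varepsilon \, \mathrm{IF}(z; T, F),
\qquad
\sup_z |\text{bias}| \;\approx\; \varepsilon \, \sup_z \big\| \mathrm{IF}(z; T, F) \big\| .
```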
