
03/12/2019

Robustness of
Evaluation Metrics
For Probabilistic Predictions of
Binary Outcomes

Erine de Leeuw
Supervisor: Dr. Mikhail Zhelonkin
Second Assessor: Dr. Andreas Alfons
Motivating examples of forecasts going wrong:
https://www.cnbc.com/2019/11/04/twilio-says-it-got-the-math-wrong-on-its-full-year-earnings-forecast-last-week.html
https://www.express.co.uk/news/science/1181185/Climate-change-news-climate-forecast-wrong-global-warming-rising-temperatures
http://www.spectator.co.uk/features/8959941/whats-wrong-with-the-met-office/
Introduction
Research Gap

• Croux et al. (2008) investigated robustness properties of the error rate
• This has not yet been done for probabilistic forecasts
• Add more recent data sets
Agenda
Introduction

Methodology

Robustness Properties

Results

Conclusions and Discussion

Questions
2 Methodology
Methodology
Generalized Linear Model (GLM)

• Logistic regression is solved using maximum likelihood estimation (MLE)
• The MLE is an M-estimator
• Influence function of an M-estimator:

  \mathrm{IF}(z; T, F) = M(\psi, F)^{-1}\, \psi(z, T(F)), \qquad M(\psi, F) = -\int \frac{\partial \psi(z, \theta)}{\partial \theta}\Big|_{\theta = T(F)} \, dF(z)

• Score function of classic logistic regression:

  \psi(x, y; \beta) = \big(y - \mu(x)\big)\, x, \qquad \mu(x) = \frac{\exp(x^\top \beta)}{1 + \exp(x^\top \beta)}

• Looking at \psi, the IF is unbounded w.r.t. x (see the sketch below)
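A minimal numerical illustration in Python of this unboundedness; the coefficients and the leverage point are hypothetical, chosen only to show the effect:

```python
import numpy as np

def logistic_score(x, y, beta):
    """Score contribution psi(x, y; beta) = (y - mu) * x of one observation,
    where mu = expit(x' beta) is the fitted success probability."""
    mu = 1.0 / (1.0 + np.exp(-x @ beta))
    return (y - mu) * x

beta = np.array([0.5, -1.0])          # hypothetical coefficients
for scale in [1, 10, 100, 1000]:
    x = scale * np.array([1.0, 1.0])  # leverage point pushed outward
    print(scale, logistic_score(x, 1, beta))
# y - mu stays in [-1, 1], but the factor x is unbounded,
# so the influence function is unbounded in the design space.
```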
Methodology
Generalized Linear Model

• Cantoni and Ronchetti (2001) introduced a robust version by changing the score function \psi to:

  \psi(x, y; \beta) = \psi_c(r)\, w(x)\, \frac{1}{\sqrt{V(\mu)}}\, \frac{\partial \mu}{\partial \beta} - a(\beta)

  1. \psi_c (a Huber function) and the weights w(x) ensure robustness
  2. where r = (y - \mu)/\sqrt{V(\mu)} is the Pearson residual, V(\mu) = \mu(1 - \mu), and \partial\mu/\partial\beta = f(x^\top \beta)\, x with f the pdf of the logistic distribution
  3. the correction term a(\beta) ensures Fisher consistency

The resulting estimator is:
• Consistent
• Asymptotically normal
• Less efficient in parameter estimation
(a sketch of the estimating equations follows below)
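A minimal Python sketch of these estimating equations for the binomial GLM with logit link. It assumes w(x) ≡ 1 (the Huber-type variant; the Mallows variant adds leverage weights) and hypothetical data; it is not the thesis implementation:

```python
import numpy as np
from scipy.optimize import root

def huber_psi(r, c=1.345):
    """Huber psi function: clips Pearson residuals at +/- c."""
    return np.clip(r, -c, c)

def robust_logit_equations(beta, X, y, c=1.345):
    """Estimating equations of Cantoni and Ronchetti (2001) for the
    binomial GLM with logit link, with w(x) = 1 for simplicity."""
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    v = np.sqrt(mu * (1.0 - mu))   # sqrt of the variance function V(mu)
    r = (y - mu) / v               # Pearson residuals
    # bias-correction term a(beta): expectation of psi over y in {0, 1}
    e_psi = mu * huber_psi((1 - mu) / v, c) + (1 - mu) * huber_psi(-mu / v, c)
    return X.T @ ((huber_psi(r, c) - e_psi) * v)

# usage with hypothetical data
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
beta_true = np.array([0.0, 1.0, -1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))
fit = root(robust_logit_equations, x0=np.zeros(3), args=(X, y))
print(fit.x)
```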
Methodology
Machine Learning Methods and Calibration Methods

Machine Learning Methods


 Random Forest (RF)
 Support Vector Machine (SVM)

Calibration Methods
 Isotonic Regression (iso) by Zadrozny and Elkan (2002)
 Platt Scaling (Platt) by Platt (1999)

Why were these methods chosen?


Methodology
Evaluation Metrics

Strictly Proper Scoring Rules


• Brier score (BS): \mathrm{BS}(p, y) = (p - y)^2
• Logarithmic score (LS): \mathrm{LS}(p, y) = -\big[\, y \log p + (1 - y) \log(1 - p) \,\big]
• Spherical score (SP): \mathrm{SP}(p, y) = \dfrac{p y + (1 - p)(1 - y)}{\sqrt{p^2 + (1 - p)^2}}
• For a set of forecasts, the score is the average of the individual scores, viewed as a population quantity
• The functional of an evaluation metric is denoted \tau (implementations sketched below)
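A minimal Python sketch of the three scores and their sample means for hypothetical forecasts; the orientation conventions follow the standard definitions and may differ in sign from the thesis:

```python
import numpy as np

def brier(p, y):
    """Brier score: squared error of the forecast (lower is better)."""
    return (p - y) ** 2

def log_score(p, y):
    """Logarithmic score: negative log-likelihood (lower is better)."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def spherical(p, y):
    """Spherical score: probability placed on the realized outcome,
    normalized by the Euclidean norm of the forecast (higher is better)."""
    return (y * p + (1 - y) * (1 - p)) / np.sqrt(p ** 2 + (1 - p) ** 2)

# mean scores over a set of forecasts: the sample analogue of tau(F)
p = np.array([0.9, 0.2, 0.6]); y = np.array([1, 0, 1])
print(brier(p, y).mean(), log_score(p, y).mean(), spherical(p, y).mean())
```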
3 Robustness Properties
Robustness Properties
Theoretical Background

Population version of the mean forecast score:

  \tau(F) = \int s\big(p(x), y\big)\, dF(z), \qquad z = (x, y)

(\tau: functional calculating the mean of the forecasts; s: representation of the score)

Von Mises expansion:

  \tau(G) = \tau(F) + \int \mathrm{IF}(z; \tau, F)\, d(G - F)(z) + \text{remainder}

where \mathrm{IF}(z; \tau, F) is the influence function of the evaluation metric:

  \mathrm{IF}(z; \tau, F) = \lim_{\varepsilon \downarrow 0} \frac{\tau\big((1 - \varepsilon) F + \varepsilon \Delta_z\big) - \tau(F)}{\varepsilon}
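A finite-difference Python sketch of this influence function for the empirical mean-score functional; the contamination point and the Brier scores are hypothetical:

```python
import numpy as np

def tau(scores):
    """Empirical version of the functional: the mean forecast score."""
    return scores.mean()

def influence(score_z, scores, eps=1e-6):
    """Finite-difference IF: [tau((1-eps)F + eps*Delta_z) - tau(F)] / eps."""
    tau_f = tau(scores)
    tau_eps = (1 - eps) * tau_f + eps * score_z
    return (tau_eps - tau_f) / eps

# hypothetical Brier scores of three forecast-outcome pairs
scores = (np.array([0.9, 0.2, 0.6]) - np.array([1, 0, 1])) ** 2
print(influence((0.01 - 1) ** 2, scores))   # equals score(z) - mean(scores)
# The mean-score functional is linear, so its IF is score(z) - tau(F):
# the same form as the IF of the expectation.
```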
Robustness Properties
Analysis of Mean Brier Score

1. Estimated value is close to the true value for a correctly specified assumed model
   - Integrating this term over F makes it small
2. Same form as the IF of the expectation (Hampel et al. 1986)


Robustness Properties
Analysis of Mean Logarithmic Score

1. Estimating equations of the MLE of logistic regression
   - Integrating this term over F makes it small
2. Same form as the IF of the expectation (Hampel et al. 1986)

Under suitable regularity conditions (Fernholz 1983), the loss of efficiency vanishes.
4 Results
Results
Numerical Examples in Thesis

Simulations:
• Correctly specified model
• Heavy-tailed error
• Misspecified link function

Applications:
• Vaso constriction
• Banknote authentication
• Breast cancer
• Credit card
• Food stamp
• Leukemia

Simulation set-up:
• Binomial distribution (canonical link)
• Three explanatory variables with zero mean and a Toeplitz matrix as covariances
• Noisy linear predictor (LP1) or discriminatory linear predictor (LP2) used
• R = 500 replications
• Moderate or heavy outliers can be added (a sketch of this set-up follows below)
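A Python sketch of this set-up; the sample size, the Toeplitz correlation parameter, and the coefficient vector are assumptions for illustration, since the slides do not specify LP1/LP2 exactly:

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(2019)
n, R = 200, 500                        # n per replication is an assumption
Sigma = toeplitz(0.5 ** np.arange(3))  # Toeplitz covariances; rho = 0.5 assumed

for _ in range(R):
    X = rng.multivariate_normal(np.zeros(3), Sigma, size=n)  # zero-mean regressors
    eta = X @ np.array([1.0, 1.0, 1.0])   # linear predictor; coefficients assumed
    p = 1.0 / (1.0 + np.exp(-eta))        # canonical (logit) link
    y = rng.binomial(1, p)                # binomial outcomes
    # ... fit classic and robust GLM, optionally contaminate with outliers,
    # ... then evaluate scores on a clean test set
```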
Results
Numerical Examples Explained Here

Simulations:
• Correctly specified model
• Heavy-tailed error
• Misspecified link function

Applications:
• Vaso constriction
• Banknote authentication
• Breast cancer
• Credit card
• Food stamp
• Leukemia
Results Simulations
Sensitivity Analysis at the Model

• Standardized bias due to one outlier in the training set
• Bias is negligible
• Similar results obtained by Croux et al. (2008)
Results Simulations
Scores of Simulation

• For LP1 and moderate outliers: no clear discrimination
• For LP2 and moderate outliers: better discrimination
Results Simulations
Asymptotic Relative Efficiencies
At the (correctly specified) model:
Results Simulations
Reliability Curves of Simulation of LP1
Results Simulations
Reliability Curves of Simulation of LP1
Results Application
Scores and Reliability Curves of Vaso Constriction Application
When one influential observation is in training set:
5 Conclusions and Discussion
Conclusions

1. Predictions from robust GLM are at least as good as those from classic GLM
Supported by:
• Equal mean scores in the uncontaminated setting
• Better mean scores under training contamination
• In all numerical examples the robust GLM obtained reliable scores

2. Calibration improves reliability
Supported by:
• Reliability plots have straighter lines after calibration
• Scores improve after calibration

3. Out of all machine learning methods considered, SVM gives the most stable forecasts
Supported by:
• Reliability curves most stable
• Effect of contamination in the testing set much less extreme than for GLM
• In some applications better than GLM
6 Questions
Appendix
Methodology
Machine Learning Methods

• Random Forest (RF):
  - The probabilistic estimate is based on a collection of classification rules (trees)
  - The goal of one tree is to predict the dependent variable by estimating its expected value given the explanatory variables, using the training set
  (a sketch follows below)
• Support Vector Machine (SVM):
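A minimal scikit-learn sketch with hypothetical data: the forest's probability estimate is the average of the per-tree estimates.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))                          # hypothetical regressors
y = (X[:, 0] + rng.normal(size=300) > 0).astype(int)   # hypothetical outcomes

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
# Each tree estimates E[y | x] on its bootstrap sample; the forest's
# probability estimate averages these per-tree estimates.
print(rf.predict_proba(X[:5])[:, 1])
```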
Methodology
Machine Learning Methods

• Random Forest (RF):
• Support Vector Machine (SVM):
  - SVM uses kernels as a nonlinear higher-dimensional mapping tool, so that nonlinear classification becomes linear
  - SVM solely results in a decision boundary
  - A calibration method is needed to obtain the resulting probability estimates (a sketch follows below)
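A minimal scikit-learn sketch of this pipeline with hypothetical data: an RBF-kernel SVM produces only decision values, and Platt scaling (method="sigmoid") maps them to probabilities.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))                                   # hypothetical data
y = (X[:, 0] - X[:, 1] + rng.normal(size=300) > 0).astype(int)

svm = SVC(kernel="rbf")                   # alone, yields only a decision boundary
platt = CalibratedClassifierCV(svm, method="sigmoid", cv=5).fit(X, y)
print(platt.predict_proba(X[:5])[:, 1])   # calibrated probability estimates
```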
Methodology
Calibration Methods

• Isotonic Regression (iso) by Zadrozny and Elkan (2002):
  - Assumes monotonicity of the uncalibrated posterior probabilities
  - Pool Adjacent Violators Algorithm (PAVA) generally used:
    1. Sort all observations based on the uncalibrated probabilities
    2. Set the calibrated probabilities equal to the true outcomes
    3. If the ordered value of a pair is greater than the true outcome → placed in a new list
    4. New calibrated values equal the average of the actual and the next true outcome in the new list
    5. Repeat until there is no violation of the ordering
• Platt Scaling (Platt) by Platt (1999):
  1. The uncalibrated estimates (for SVM: the distance to the decision boundary) are transformed
  2. A posterior predictive cumulative distribution function is fit using logistic regression
Methodology
Calibration Methods

• Isotonic Regression (iso) by Zadrozny and Elkan (2002):
  - Assumes monotonicity of the uncalibrated posterior probabilities
  - Pool Adjacent Violators Algorithm (PAVA) generally used:
    1. All observations are sorted based on the uncalibrated probabilities of the calibration training set
    2. The calibrated probabilities are first taken equal to the true outcomes of the calibration training set
    3. Whenever for a consecutive pair the ordered value is greater than the true outcome, the pair is placed in a new list
    4. Their new calibrated values equal the average of the previous true outcome and the next true outcome in the sorted list
    5. The whole process is repeated until no violation of the ordering occurs
• Platt Scaling (Platt) by Platt (1999):
  - Assumes the reliability curve can be improved by parametric rescaling
  1. The uncalibrated estimates (for SVM: the distance to the decision boundary) are transformed
  2. A posterior predictive cumulative distribution function is fit using logistic regression
(a sketch of both calibrators follows below)
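A minimal scikit-learn sketch of both calibrators; the raw probabilities and outcomes are hypothetical. sklearn's IsotonicRegression implements PAVA, and a one-dimensional logistic regression on the raw estimates mimics Platt scaling:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

p_raw = np.array([0.1, 0.40, 0.35, 0.8, 0.9])  # uncalibrated estimates (hypothetical)
y = np.array([0, 1, 0, 1, 1])                  # outcomes on the calibration set

# Isotonic regression: PAVA fits a monotone step function to the outcomes
iso = IsotonicRegression(out_of_bounds="clip").fit(p_raw, y)
print(iso.predict(p_raw))

# Platt scaling: a one-dimensional logistic regression on the raw estimates
platt = LogisticRegression().fit(p_raw.reshape(-1, 1), y)
print(platt.predict_proba(p_raw.reshape(-1, 1))[:, 1])
```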
Appendix Robustness
Proof of the functionals
Results
Reliability Curves of Simulation of LP1
Appendix Outlier Contamination
Average of Brier scores and logarithmic scores
Appendix
Reliability
Appendix
Heavy-tailed Error
Appendix
Reliability
Appendix
Reliability
Appendix
Reliability
Appendix
Reliability
Appendix
Misspecified Link Function
Appendix
Reliability
Appendix
Reliability
Appendix
Reliability
Appendix
Additional Simulations for Average Score
Appendix
Additional Results of Heavy-tailed Error Contamination
Results
Scores and Reliability Curves of Vaso Constriction Application
When both influential observations in training set:
Appendix Applications
Banknote Data
Appendix Applications
Breast Cancer Data
Appendix Applications
Credit Card Data
Appendix Applications
Food Stamp Data
Appendix Applications
Leukemia Data
Appendix Applications
Types of data points
Appendix Applications
Fisher consistency

Let F_\theta denote a distribution with parameter \theta, and let g_\theta be the parametric function one wants to estimate. Fisher consistency of a functional T occurs when:

  T(F_\theta) = g_\theta \quad \text{for all } \theta
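A standard textbook instance (not specific to the thesis): the mean functional is Fisher consistent for the location parameter of a normal model.

```latex
T(F) = \int y \, \mathrm{d}F(y), \qquad F_\theta = N(\theta, 1)
\;\Longrightarrow\; T(F_\theta) = \theta \quad \text{for all } \theta .
```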
Appendix Applications
Densities of Distributions Used
Appendix Applications
Influence Function and Breakdown Point

The IF can be used to approximate the bias, while the neighborhood in which this approximation is useful is determined by the breakdown point.

For non-linear models in an i.i.d. setting, the breakdown point is the minimum fraction of contamination needed to drive the estimator to the edge of the parameter space.
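Written out, the approximation the slide refers to is the standard first-order robust-statistics expansion (cf. Hampel et al. 1986), useful only for contamination fractions \varepsilon below the breakdown point:

```latex
T\big((1 - \varepsilon)F + \varepsilon \Delta_z\big) - T(F)
\;\approx\; \varepsilon \, \mathrm{IF}(z; T, F),
\qquad
\sup_z |\text{bias}| \;\approx\; \varepsilon \, \sup_z \big\| \mathrm{IF}(z; T, F) \big\| .
```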
