Robustness of Evaluation Metrics for Probabilistic Predictions of Binary Outcomes
Erine de Leeuw
Supervisor: Dr. Mikhail Zhelonkin
Second Assessor: Dr. Andreas Alfons
https://www.cnbc.com/2019/11/04/twilio-says-it-got-the-math-wrong-on-its-full-year-earnings-forecast-last-week.html
https://www.express.co.uk/news/science/1181185/Climate-change-news-climate-forecast-wrong-global-warming-rising-temperatures
http://www.spectator.co.uk/features/8959941/whats-wrong-with-the-met-office/
Introduction
Research Gap
Methodology
Robustness Properties
Results
Questions
2 Methodology
Methodology
Generalized Linear Model (GLM)
1. Weight functions to ensure robustness
2. Where [...] and [...] is the pdf of the logit distribution
3. Ensures Fisher consistency
Consistent
Asymptotically normal
Parameter estimation less efficient
Methodology
Machine Learning Methods and Calibration Methods
Calibration Methods
Isotonic Regression (iso) by Zadrozny and Elkan (2002)
Platt Scaling (Platt) by Platt (1999)
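To make the two calibration methods concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the implementation used in the thesis: `isotonic_fit` uses the pool-adjacent-violators algorithm, and `platt_fit` is a simplified sigmoid fit by gradient descent (Platt's original procedure also smooths the targets and uses a different optimizer). All function names, the step size, and the number of steps are hypothetical.

```python
import math


def isotonic_fit(scores, labels):
    """Pool-adjacent-violators: non-decreasing fit of labels against scores.

    Returns the sorted scores and the fitted (calibrated) probabilities.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    xs = [scores[i] for i in order]
    blocks = []  # each block is [sum of labels, number of points]
    for i in order:
        blocks.append([float(labels[i]), 1])
        # Merge backwards while adjacent block means violate monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return xs, fitted


def platt_predict(a, b, score):
    """Sigmoid map from a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


def platt_fit(scores, labels, steps=2000, lr=0.1):
    """Simplified Platt scaling: fit (a, b) by gradient descent on log loss.

    Assumption: no target smoothing, fixed step size (hypothetical defaults).
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = platt_predict(a, b, s) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b
```

In practice both maps would be fitted on held-out scores from the machine learning method and then applied to new scores.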
For a set of forecasts, the score is the average of the individual scores, viewed as a population quantity
The functional of the evaluation metric is denoted τ
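As a small illustration of "the average of the scores", the sketch below computes a mean score for a set of probabilistic forecasts of binary outcomes. The Brier score appears later in the slides; the logarithmic score is included here only as a second example and is an assumption, as are the function names and the clipping constant.

```python
import math


def brier_score(p, y):
    """Squared error between forecast probability p and outcome y in {0, 1}."""
    return (p - y) ** 2


def log_score(p, y, eps=1e-12):
    """Negative log-likelihood of outcome y; p is clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def mean_score(score, forecasts, outcomes):
    """Sample version of the evaluation functional: the average score."""
    total = sum(score(p, y) for p, y in zip(forecasts, outcomes))
    return total / len(forecasts)
```

For example, `mean_score(brier_score, [0.9, 0.2, 0.5], [1, 0, 1])` averages the three squared errors 0.01, 0.04 and 0.25.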
3 Robustness Properties
Robustness Properties
Theoretical Background
Population version of mean forecast score:
Functional calculating mean of the forecasts
Representation of score
Von Mises expansion:
Influence function of evaluation metric
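The formulas on this slide did not survive extraction. As a reminder of the standard textbook forms, and not necessarily the exact notation of the thesis, the quantities named above can be written as follows. Assumed notation: F is the data-generating distribution, z = (p, y) is a forecast-outcome pair, S(z) is its score, G is a nearby distribution, and Δ_z is a point mass at z.

```latex
\begin{align*}
\tau(F) &= \mathbb{E}_F[S(Z)] = \int S(z)\, dF(z), \\
\mathrm{IF}(z; \tau, F) &= \lim_{\varepsilon \downarrow 0}
  \frac{\tau\big((1-\varepsilon)F + \varepsilon \Delta_z\big) - \tau(F)}{\varepsilon}, \\
\tau(G) &= \tau(F) + \int \mathrm{IF}(z; \tau, F)\, d(G - F)(z) + \text{remainder}.
\end{align*}
```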
Robustness Properties
Analysis of Mean Brier Score
Loss of efficiency vanishes
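One way to see why the mean Brier score is a natural candidate for a robustness analysis is through its sensitivity curve: the rescaled change in the mean score when a single forecast-outcome pair is added. The sketch below illustrates that general idea and is not the analysis shown on the slide. For a plain average the sensitivity curve equals the score of the added point minus the current mean, so for the Brier score (which lies in [0, 1]) it can never exceed 1 in absolute value. Function names are hypothetical.

```python
def mean_brier(forecasts, outcomes):
    """Average squared error between forecasts and binary outcomes."""
    total = sum((p - y) ** 2 for p, y in zip(forecasts, outcomes))
    return total / len(forecasts)


def sensitivity_curve(forecasts, outcomes, p_new, y_new):
    """(n + 1) * (mean score with the added pair - mean score without it)."""
    n = len(forecasts)
    before = mean_brier(forecasts, outcomes)
    after = mean_brier(list(forecasts) + [p_new], list(outcomes) + [y_new])
    return (n + 1) * (after - before)
```

The worst case is a fully confident wrong forecast, for instance `sensitivity_curve(forecasts, outcomes, 1.0, 0)`, whose added score is exactly 1.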
4 Results
Results
Numerical Examples in Thesis
Simulations:
Correctly specified model
Heavy-tailed error
Misspecified link function
Applications:
Vasoconstriction
Banknote Authentication
Breast cancer
Credit card
Food stamp
Leukemia
Simulation set-up:
Binomial distribution (canonical link)
Three explanatory variables with zero mean and a Toeplitz covariance matrix
Noisy linear predictor (LP1) or discriminatory linear predictor (LP2) used
R = 500
Moderate outliers and heavy outliers can be added
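The set-up above can be sketched in a few lines of plain Python. This is an illustration under loud assumptions, not the thesis code: the explanatory variables are taken to be Gaussian, the Toeplitz covariance is taken to have entries rho^|i-j|, and the sample size, rho, and coefficient vector are hypothetical placeholders, since the slides do not give them. Outlier contamination is not included.

```python
import math
import random


def toeplitz_cov(dim, rho):
    """Toeplitz covariance with entries rho ** |i - j| (assumed form)."""
    return [[rho ** abs(i - j) for j in range(dim)] for i in range(dim)]


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def simulate(n, beta, rho, rng):
    """Draw n observations (x, y, p) from a logistic model.

    beta[0] is the intercept; beta[1:] are slopes (hypothetical values).
    """
    dim = len(beta) - 1
    low = cholesky(toeplitz_cov(dim, rho))
    data = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Correlated zero-mean covariates: x = L z.
        x = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(dim)]
        eta = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
        p = 1.0 / (1.0 + math.exp(-eta))  # canonical (logit) link
        y = 1 if rng.random() < p else 0
        data.append((x, y, p))
    return data
```

A call such as `simulate(200, [0.0, 1.0, -1.0, 0.5], 0.5, random.Random(1))` would produce one simulated data set; repeating it R times gives the replications.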
Results
Numerical Examples Explained Here
Simulations:
Correctly specified model
Heavy-tailed error
Misspecified link function
Applications:
Vasoconstriction
Banknote Authentication
Breast cancer
Credit card
Food stamp
Leukemia
Results Simulations
Sensitivity Analysis at the Model
Figure panels: no clear discrimination; better discrimination
Results Simulations
Asymptotic Relative Efficiencies
At the (correctly specified) model:
Results Simulations
Reliability Curves of Simulation of LP1
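For readers unfamiliar with reliability curves, the sketch below shows one common way to compute the points of such a curve: forecasts are grouped into equal-width bins, and the mean forecast in each bin is compared with the observed frequency of the outcome. The number of bins and the function name are assumptions; the slides do not say how the curves in the thesis were binned.

```python
def reliability_curve(forecasts, outcomes, n_bins=10):
    """Return (mean forecast, observed frequency, count) per non-empty bin."""
    sums_p = [0.0] * n_bins
    sums_y = [0.0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(forecasts, outcomes):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        sums_p[k] += p
        sums_y[k] += y
        counts[k] += 1
    return [
        (sums_p[k] / counts[k], sums_y[k] / counts[k], counts[k])
        for k in range(n_bins)
        if counts[k] > 0
    ]
```

A perfectly reliable set of forecasts gives points on the diagonal, where the mean forecast equals the observed frequency.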
Results Application
Scores and Reliability Curves of Vasoconstriction Application
When one influential observation is in training set:
5 Conclusions
Discussion
Conclusions
1. Predictions from the robust GLM are at least as good as those from the classic GLM
Supported by:
Equal mean scores in the uncontaminated setting
Better mean scores under training contamination
Robust GLM reliable in all numerical examples
2. Calibration improves reliability
Supported by:
Reliability plots have straighter lines after calibration
Scores improve after calibration
3. Out of all machine learning methods considered, SVM gives the most stable forecasts
Supported by:
Reliability curves are the most stable
Effect of contamination in the testing set is much less extreme than for GLM
In some applications, the scores obtained are better than for GLM
6 Questions
Appendix
Methodology
Machine Learning Methods
Let F_θ denote a distribution with parameter θ. Let g(θ) be the
parametric function one wants to estimate. An estimator with functional T is Fisher consistent when: T(F_θ) = g(θ) for all θ
Appendix Applications
Densities of Distributions Used
Appendix Applications
Influence Function and Breakdown Point
The IF can be used to approximate the bias, while the neighborhood in which this
approximation is useful is determined by the breakdown point.
For non-linear models in the i.i.d. setting, the breakdown point is the minimum
fraction of contamination needed to drive the estimator to the edge of the
parameter space.
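The first statement above can be made explicit with the standard first-order approximation from the robust statistics literature. The notation is assumed rather than taken from the slides: T is the functional of the estimator, F the model distribution, G an arbitrary contaminating distribution, and ε the contamination fraction.

```latex
\begin{align*}
F_\varepsilon &= (1-\varepsilon) F + \varepsilon G, \\
T(F_\varepsilon) - T(F) &\approx \varepsilon \int \mathrm{IF}(z; T, F)\, dG(z), \\
\sup_G \lvert T(F_\varepsilon) - T(F) \rvert &\approx \varepsilon\, \gamma^*,
\qquad \gamma^* = \sup_z \lvert \mathrm{IF}(z; T, F) \rvert .
\end{align*}
```

The approximation is only trustworthy for ε well below the breakdown point, which is the sense in which the breakdown point delimits the useful neighborhood.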