

Data Science Competition

1
Real life vs. competitive data science

2
Kaggle and the holdout set + wacky boosting algorithm #1

▪ Ideal world scenario: the idea behind the holdout method is that the holdout data serve as a fresh sample, providing an unbiased and well-concentrated estimate of the true loss of the classifier on the underlying distribution (a minimal sketch follows this slide).
▪ Why then didn’t the holdout method detect that our wacky boosting
algorithm was overfitting? The short answer is that the holdout
method is simply not valid in the way it’s used in a competition.

https://blog.mrtz.org/2015/03/09/competition.html
3
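A minimal sketch of the ideal-world scenario above, assuming a scikit-learn style workflow (the estimator and the synthetic dataset are illustrative, not from the slides): the holdout set is used exactly once, so its score is an unbiased estimate of the true loss.

```python
# Minimal sketch of the classic (static) holdout method.
# Assumptions: scikit-learn is available; dataset and model are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluated once, on data the model never saw, this is an unbiased and
# well-concentrated estimate of the true accuracy.
print("holdout accuracy:", accuracy_score(y_holdout, clf.predict(X_holdout)))
```

The validity argument rests on the model being chosen without ever looking at the holdout set; the next slides describe how a leaderboard breaks that assumption.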
Kaggle and the holdout set + wacky boosting algorithm #2

▪ First part of the problem: one point of departure from the classic method is that the participants actually do see the data points corresponding to the holdout labels, which can lead to some problems. But that is not the issue here: even if participants did not look at the holdout data points at all, there is a fundamental reason why the validity of the classic holdout method breaks down.

▪ Second part of the problem: a submission in general incorporates information about the holdout labels previously released through the leaderboard mechanism. As a result, there is a statistical dependence between the holdout data and the submission. Due to this feedback loop, the public score is in general no longer an unbiased estimate of the true score, and we should expect submissions to eventually overfit to the holdout set.

https://blog.mrtz.org/2015/03/09/competition.html
4
Kaggle and the holdout set + wacky boosting algorithm #3

▪ Practical Kaggle solution: the problem of overfitting to the holdout set is well known.
▪ Kaggle's forums are full of anecdotal evidence reported by various competitors.
▪ The primary way Kaggle deals with this problem is by limiting the rate of re-submission and (to some extent) the bit precision of the answers. Of course, this is also the reason why the winners are determined on a separate test set.

https://blog.mrtz.org/2015/03/09/competition.html
5
Kaggle and the holdout set + wacky boosting algorithm #4

▪ The holdout method is a static method, in that it assumes the model to be independent of the holdout data on which it is evaluated.
▪ However, machine learning competitions are interactive, because submissions generally incorporate information from the holdout set.

https://blog.mrtz.org/2015/03/09/competition.html
6
Kaggle and the holdout set + wacky boosting algorithm #5

▪ I try out a bunch of random vectors and keep all those that give me a slightly better than expected score. If we're talking about misclassification rate, the expected score of a random binary vector is 0.5.
▪ So, I'm keeping all the vectors with score less than 0.5. Then I recall something about boosting. It tells me that I can boost my accuracy by aggregating all predictors into a single predictor using the majority function (a minimal simulation follows this slide).
▪ First, wacky boosting required the domain to be Boolean.
▪ Second, the algorithm only gave an advantage over random guessing, which might be too far from the top of the leaderboard to start out with.

https://blog.mrtz.org/2015/03/09/competition.html
7
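A minimal simulation of the wacky boosting attack described above. This is a sketch under assumptions: the holdout labels, set size, and number of submissions are made up, and the leaderboard is played by a local scoring function rather than Kaggle's actual pipeline.

```python
# Wacky boosting, simulated. The only "feedback" the competitor gets is the
# public misclassification rate of each submission, yet that is enough to
# overfit the public holdout set.
import numpy as np

rng = np.random.default_rng(0)
n_holdout = 4000                                  # hypothetical public holdout size
true_labels = rng.integers(0, 2, n_holdout)       # hidden from the competitor

def public_score(submission):
    """Leaderboard feedback: misclassification rate on the public holdout."""
    return float(np.mean(submission != true_labels))

# Submit random binary vectors and keep the "lucky" ones (score below 0.5).
kept = []
for _ in range(500):
    guess = rng.integers(0, 2, n_holdout)
    if public_score(guess) < 0.5:
        kept.append(guess)

# Aggregate the kept vectors with the majority function.
majority = (np.mean(kept, axis=0) > 0.5).astype(int)
print("final public score:", public_score(majority))   # noticeably below 0.5
```

On a separate private test set drawn from the same distribution, the same majority vector would score around 0.5 again, which is exactly the overfitting the slides describe and the reason winners are determined on a separate test set.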
Histograms and how they can be used in competitions to see trends

▪ As a general rule, always try several different bin sizes when plotting a histogram.
▪ Furthermore, look at the peak: what we see here is that the organiser used the mean value to fill in all the missing values. As there were many of them, this explains the peak (a minimal sketch of this effect follows this slide).
8
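A minimal sketch of the two points above, on made-up data (the feature distribution, the 20% missing rate, and the bin counts are illustrative assumptions): with coarse binning the distribution looks smooth, while fine binning reveals the sharp spike created by mean imputation.

```python
# Same feature, two bin sizes: mean-imputed missing values appear as a
# single tall spike at the feature's mean once the bins are fine enough.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
feature = rng.lognormal(mean=3.0, sigma=0.5, size=10_000)
missing = rng.random(feature.size) < 0.2                              # 20% missing
feature[missing] = np.nan
feature = np.where(np.isnan(feature), np.nanmean(feature), feature)  # mean imputation

fig, axes = plt.subplots(1, 2, figsize=(10, 3))
for ax, bins in zip(axes, (10, 200)):
    ax.hist(feature, bins=bins)
    ax.set_title(f"{bins} bins")
plt.show()
```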
Overfitting in general is different from
overfitting in competitions

9
Validation problems:
Validation stage
Submission stage

Solution
10
11
Conclusions

12
The two metrics can be used interchangeably.
The only difference is in the gradient, which means we may have to use a different learning rate (the gradients are written out below).

13
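The slide does not name the two metrics, but the observation is the standard one about MSE and RMSE; assuming that pairing, the gradients are related as follows:

```latex
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - \hat{y}_i\bigr)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}},
\qquad
\frac{\partial\,\mathrm{RMSE}}{\partial \hat{y}_i}
  = \frac{1}{2\sqrt{\mathrm{MSE}}}\,
    \frac{\partial\,\mathrm{MSE}}{\partial \hat{y}_i}
```

The two gradients point in the same direction and differ only by a scaling factor that depends on the current error, which is why the same model can be trained on either metric but may need a different learning rate.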
Relationship between MSE and R²

14
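Written out, the relationship in the slide title (a standard identity; MSE_baseline denotes the MSE of the constant prediction ȳ):

```latex
R^2
  = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}
  = 1 - \frac{\mathrm{MSE}}{\mathrm{MSE}_{\text{baseline}}}
```

Since MSE_baseline is a constant of the dataset, optimising MSE and optimising R² are equivalent.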
MAE
Generally used in finance because it is easier to explain.
Look at the gradient: the derivative is not defined where the error is zero, i.e., where the prediction equals the target (written out below).

15
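The gradient the slide refers to, written out:

```latex
\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\lvert y_i - \hat{y}_i\rvert,
\qquad
\frac{\partial\,\mathrm{MAE}}{\partial \hat{y}_i}
  = -\frac{1}{N}\,\operatorname{sign}(y_i - \hat{y}_i)
\quad\text{(undefined where } \hat{y}_i = y_i\text{)}
```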
MAE vs. MSE: which one shall I use when I suspect there are outliers?

16
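A minimal sketch of why the choice matters, on made-up numbers: the constant prediction that minimises MSE is the mean, which a single outlier drags away, while the MAE-optimal constant is the median, which barely moves.

```python
# Toy illustration: MSE-optimal vs MAE-optimal constant prediction with an outlier.
import numpy as np

y = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 1000.0])  # last value is an outlier

mse_optimal = y.mean()       # minimises mean squared error
mae_optimal = np.median(y)   # minimises mean absolute error

print(f"MSE-optimal constant: {mse_optimal:.2f}")    # 175.00, dragged by the outlier
print(f"MAE-optimal constant: {mae_optimal:.2f}")    # 10.25, barely affected
```

A common rule of thumb: if the suspected outliers are mistakes the model should not chase, MAE is the more robust choice; if the extreme values are real and must be predicted well, MSE keeps the model sensitive to them.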
RMSLE: Root Mean Square Logarithmic Error

17
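For reference, the standard definition behind the slide title:

```latex
\mathrm{RMSLE}
  = \sqrt{\frac{1}{N}\sum_{i=1}^{N}
      \bigl(\log(\hat{y}_i + 1) - \log(y_i + 1)\bigr)^2}
```

It is simply RMSE computed in log(1 + x) space, so it cares about relative rather than absolute errors.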
Precision and variance

18
