

Data Science Competition

1
Real life vs. competitive data science

2
Kaggle and the holdout set + wacky boosting algorithm #1

▪ Ideal world scenario: the idea behind the holdout method is that the holdout data serve as a fresh sample, providing an unbiased and well-concentrated estimate of the true loss of the classifier on the underlying distribution (a minimal sketch follows this slide).
▪ Why then didn’t the holdout method detect that our wacky boosting
algorithm was overfitting? The short answer is that the holdout
method is simply not valid in the way it’s used in a competition.

https://blog.mrtz.org/2015/03/09/competition.html
3
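A minimal sketch of the ideal-world scenario above, assuming a scikit-learn style workflow (the estimator and the synthetic dataset are illustrative, not from the slides): the holdout set is used exactly once, so its score is an unbiased estimate of the true loss.

```python
# Minimal sketch of the classic (static) holdout method.
# Assumptions: scikit-learn is available; dataset and model are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluated once, on data the model never saw, this is an unbiased and
# well-concentrated estimate of the true accuracy.
print("holdout accuracy:", accuracy_score(y_holdout, clf.predict(X_holdout)))
```

The validity argument rests on the model being chosen without ever looking at the holdout set; the next slides describe how a leaderboard breaks that assumption.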
Kaggle and the holdout set + wacky boosting algorithm #2

▪ First part of the problem: one point of departure from the classic method is that the participants actually do see the data points corresponding to the holdout labels, which can lead to some problems. But that is not the issue here: even if participants did not look at the holdout data points at all, there is a fundamental reason why the validity of the classic holdout method breaks down.

▪ Second part of the problem: a submission in general incorporates information about the holdout labels previously released through the leaderboard mechanism. As a result, there is a statistical dependence between the holdout data and the submission. Due to this feedback loop, the public score is in general no longer an unbiased estimate of the true score, and we should expect submissions to eventually overfit to the holdout set.

https://blog.mrtz.org/2015/03/09/competition.html
4
Kaggle and the holdout set + wacky boosting algorithm #3

▪ Practical Kaggle solution: the problem of overfitting to the holdout set is well known.
▪ Kaggle's forums are full of anecdotal evidence reported by various competitors.
▪ The primary way Kaggle deals with this problem is by limiting the rate of re-submission and (to some extent) the bit precision of the answers. Of course, this is also the reason why the winners are determined on a separate test set.

https://blog.mrtz.org/2015/03/09/competition.html
5
Kaggle and the holdout set + wacky boosting algorithm #4

▪ The holdout method is a static method, in that it assumes the model to be independent of the holdout data on which it is evaluated.
▪ However, machine learning competitions are interactive, because submissions generally incorporate information from the holdout set.

https://blog.mrtz.org/2015/03/09/competition.html
6
Kaggle and the holdout set + wacky boosting algorithm #5

▪ I try out a bunch of random vectors and keep all those that give me a slightly better than expected score. If we're talking about misclassification rate, the expected score of a random binary vector is 0.5.
▪ So, I'm keeping all the vectors with score less than 0.5. Then I recall something about boosting. It tells me that I can boost my accuracy by aggregating all predictors into a single predictor using the majority function (a minimal simulation follows this slide).
▪ First, wacky boosting required the domain to be Boolean.
▪ Second, the algorithm only gave an advantage over random guessing, which might be too far from the top of the leaderboard to start out with.

https://blog.mrtz.org/2015/03/09/competition.html
7
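A minimal simulation of the wacky boosting attack described above. This is a sketch under assumptions: the holdout labels, set size, and number of submissions are made up, and the leaderboard is played by a local scoring function rather than Kaggle's actual pipeline.

```python
# Wacky boosting, simulated. The only "feedback" the competitor gets is the
# public misclassification rate of each submission, yet that is enough to
# overfit the public holdout set.
import numpy as np

rng = np.random.default_rng(0)
n_holdout = 4000                                  # hypothetical public holdout size
true_labels = rng.integers(0, 2, n_holdout)       # hidden from the competitor

def public_score(submission):
    """Leaderboard feedback: misclassification rate on the public holdout."""
    return float(np.mean(submission != true_labels))

# Submit random binary vectors and keep the "lucky" ones (score below 0.5).
kept = []
for _ in range(500):
    guess = rng.integers(0, 2, n_holdout)
    if public_score(guess) < 0.5:
        kept.append(guess)

# Aggregate the kept vectors with the majority function.
majority = (np.mean(kept, axis=0) > 0.5).astype(int)
print("final public score:", public_score(majority))   # noticeably below 0.5
```

On a separate private test set drawn from the same distribution, the same majority vector would score around 0.5 again, which is exactly the overfitting the slides describe and the reason winners are determined on a separate test set.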
Histograms and how they can be used in competitions to see trends

▪ As a general rule, always try several different bin sizes when plotting a histogram.
▪ Furthermore, look at the peak: what we see here is that the organiser used the mean value to fill in all the missing values. As there were many of them, this explains the peak (a minimal sketch of this effect follows this slide).
8
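A minimal sketch of the two points above, on made-up data (the feature distribution, the 20% missing rate, and the bin counts are illustrative assumptions): with coarse binning the distribution looks smooth, while fine binning reveals the sharp spike created by mean imputation.

```python
# Same feature, two bin sizes: mean-imputed missing values appear as a
# single tall spike at the feature's mean once the bins are fine enough.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
feature = rng.lognormal(mean=3.0, sigma=0.5, size=10_000)
missing = rng.random(feature.size) < 0.2                              # 20% missing
feature[missing] = np.nan
feature = np.where(np.isnan(feature), np.nanmean(feature), feature)  # mean imputation

fig, axes = plt.subplots(1, 2, figsize=(10, 3))
for ax, bins in zip(axes, (10, 200)):
    ax.hist(feature, bins=bins)
    ax.set_title(f"{bins} bins")
plt.show()
```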
Overfitting in general is different from
overfitting in competitions

9
Validation problems:
Validation stage
Submission stage

Solution
10
11
Conclusions

12
The two metrics can be used interchangeably.
The only difference is in the gradient, which means we may have to use a different learning rate (the gradients are written out below).

13
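The slide does not name the two metrics, but the observation is the standard one about MSE and RMSE; assuming that pairing, the gradients are related as follows:

```latex
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - \hat{y}_i\bigr)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}},
\qquad
\frac{\partial\,\mathrm{RMSE}}{\partial \hat{y}_i}
  = \frac{1}{2\sqrt{\mathrm{MSE}}}\,
    \frac{\partial\,\mathrm{MSE}}{\partial \hat{y}_i}
```

The two gradients point in the same direction and differ only by a scaling factor that depends on the current error, which is why the same model can be trained on either metric but may need a different learning rate.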
Relationship between MSE and R²

14
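Written out, the relationship in the slide title (a standard identity; MSE_baseline denotes the MSE of the constant prediction ȳ):

```latex
R^2
  = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}
  = 1 - \frac{\mathrm{MSE}}{\mathrm{MSE}_{\text{baseline}}}
```

Since MSE_baseline is a constant of the dataset, optimising MSE and optimising R² are equivalent.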
MAE
Generally used in finance because it is easier to explain.
Look at the gradient: the derivative is not defined where the error is zero, i.e., where the prediction equals the target (written out below).

15
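The gradient the slide refers to, written out:

```latex
\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\lvert y_i - \hat{y}_i\rvert,
\qquad
\frac{\partial\,\mathrm{MAE}}{\partial \hat{y}_i}
  = -\frac{1}{N}\,\operatorname{sign}(y_i - \hat{y}_i)
\quad\text{(undefined where } \hat{y}_i = y_i\text{)}
```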
MAE vs. MSE: which one shall I use when I suspect there are outliers?

16
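A minimal sketch of why the choice matters, on made-up numbers: the constant prediction that minimises MSE is the mean, which a single outlier drags away, while the MAE-optimal constant is the median, which barely moves.

```python
# Toy illustration: MSE-optimal vs MAE-optimal constant prediction with an outlier.
import numpy as np

y = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 1000.0])  # last value is an outlier

mse_optimal = y.mean()       # minimises mean squared error
mae_optimal = np.median(y)   # minimises mean absolute error

print(f"MSE-optimal constant: {mse_optimal:.2f}")    # 175.00, dragged by the outlier
print(f"MAE-optimal constant: {mae_optimal:.2f}")    # 10.25, barely affected
```

A common rule of thumb: if the suspected outliers are mistakes the model should not chase, MAE is the more robust choice; if the extreme values are real and must be predicted well, MSE keeps the model sensitive to them.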
RMSLE: Root Mean Square Logarithmic Error

17
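For reference, the standard definition behind the slide title:

```latex
\mathrm{RMSLE}
  = \sqrt{\frac{1}{N}\sum_{i=1}^{N}
      \bigl(\log(\hat{y}_i + 1) - \log(y_i + 1)\bigr)^2}
```

It is simply RMSE computed in log(1 + x) space, so it cares about relative rather than absolute errors.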
Precision and variance

18
