Machine Learning in Practice (at Yahoo!)
Common pitfalls and debugging tricks
Kilian Weinberger

(thanks to Rob Schapire, Andrew Ng)

Overview
Machine Learning Setup

Algorithm Debugging

Data Debugging

Machine Learning Setup

Goal + Data + Idea → Miracle Learning Algorithm → Amazing results!!! Fame, Glory, Rock 'n' Roll!

1. Learning Problem

What is my data? What am I trying to learn? What would be ideal conditions?

QUIZ: What would be some answers for email spam filtering?

Example (spam filtering):
What is my data? Email content / meta data; users' spam/ham labels.
What am I trying to learn? Whether an incoming email is spam or ham.
What would be ideal conditions? Only Y! employees.

2. Train / Test Split

[Diagram: Train Data | Test Data | Real World Data ??, ordered by time]

1. How much data do I need? (More is more; see John's talk.)
2. How do you split into train / test? (Always by time! A small sketch follows below.)
3. Training data should be just like test data!!
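Item 2 above as a minimal Python sketch: split strictly by time, oldest rows for training, most recent rows for testing. The DataFrame and column names are illustrative, not from the slides.

```python
import pandas as pd

def time_based_split(df, time_col="sent_time", test_frac=0.2):
    """Oldest rows become training data, the most recent rows become test data."""
    df = df.sort_values(time_col)
    cutoff = int(len(df) * (1 - test_frac))
    return df.iloc[:cutoff], df.iloc[cutoff:]

# toy data standing in for an email corpus
emails = pd.DataFrame({
    "sent_time": pd.to_datetime(["2010-01-03", "2010-01-01", "2010-01-05",
                                 "2010-01-04", "2010-01-02"]),
    "is_spam":   [1, 0, 1, 0, 0],
})
train_df, test_df = time_based_split(emails, test_frac=0.2)
print(train_df)   # the four oldest emails
print(test_df)    # the most recent email, standing in for "real world" data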

Data set overfitting

[Diagram: many runs over the same Train Data / Test Data split vs. one run on Real World Data]

By evaluating on the same data set over and over, you will overfit. Overfitting is bounded by

$O\!\left(\sqrt{\log(\#\mathrm{trials}) \,/\, \#\mathrm{examples}}\right)$

(see John's talk).

Kishore's rule of thumb: subtract 1% accuracy for every time you have tested on a data set.
Ideally: create a second train / test split!

Data Representation: feature vector

[Diagram: an email mapped to a long, partly sparse feature vector]

• Bag-of-word features (sparse): "viagra", "hello", "cheap", "$", "Microsoft", ...
• Meta features (sparse / dense): From a YID? IP known? Sent time in s since 1/1/1970, email size, attachment size, ...
• Aggregate statistics (dense, real-valued): percentile in email length, percentile in token likelihood, ...

Pitfall #1: Aggregate statistics should not be computed over test data!
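Pitfall #1 as a minimal sketch: the percentile mapping (an aggregate statistic) is estimated on training data only and then applied unchanged to the test data. All names and numbers are illustrative.

```python
import numpy as np

def fit_percentile_map(train_values):
    """Learn the empirical percentile mapping from TRAINING data only."""
    sorted_train = np.sort(np.asarray(train_values))
    def to_percentile(values):
        # fraction of *training* examples that are <= each value
        return np.searchsorted(sorted_train, np.asarray(values), side="right") / len(sorted_train)
    return to_percentile

train_email_lengths = [120, 340, 55, 980, 410, 3000, 75, 260]   # toy training data
email_length_pct = fit_percentile_map(train_email_lengths)
print(email_length_pct(train_email_lengths))   # training feature column
print(email_length_pct([90, 5000]))            # test emails: same mapping, no peeking at test data
```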

Pitfall #2: Feature scaling

• With linear classifiers / kernels, features should have a similar scale (e.g. range [0, 1]).
• Dense features should be down-weighted when combined with sparse features.
• (Scale does not matter for decision trees.)
• Scaling: f_i → (f_i + a_i) ∗ b_i
• Must use the same scaling constants for the test data!!! (Most likely the test data will not end up in a clean [0, 1] interval.)
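A minimal sketch of the f_i → (f_i + a_i) ∗ b_i scaling: the constants a_i, b_i are fit on the training data and reused for the test data. Toy numbers, illustrative only; a library scaler fit on the training split would do the same job.

```python
import numpy as np

def fit_minmax(train):
    """Learn per-feature shift (a_i) and scale (b_i) constants on TRAINING data only."""
    train = np.asarray(train, dtype=float)
    a = -train.min(axis=0)                                               # shift
    b = 1.0 / np.maximum(train.max(axis=0) - train.min(axis=0), 1e-12)   # scale
    return lambda X: (np.asarray(X, dtype=float) + a) * b

X_train = [[10.0, 0.1], [50.0, 0.5], [30.0, 0.9]]
X_test  = [[70.0, 0.2]]            # outside the training range: values may leave [0, 1]
scale = fit_minmax(X_train)
print(scale(X_train))
print(scale(X_test))               # uses the SAME constants; that is the point
```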

Pitfall #3: Over-condensing of features

[Diagram: raw feature vector vs. a hand-condensed feature vector]

• Features do not need to be semantically meaningful.
• Just add them: redundancy is (generally) not a problem.
• Let the learning algorithm decide what's useful!
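A small sketch of "just add them": stack the sparse bag-of-words block next to a few dense meta/aggregate features and hand the whole wide matrix to the learner. The toy emails and meta columns are invented for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical raw data: email bodies plus two dense meta features
emails = ["cheap viagra $$$", "hello, meeting at 10", "microsoft invoice attached"]
dense_meta = np.array([[0.91, 1], [0.12, 0], [0.55, 1]])   # e.g. length percentile, IP known?

bow = CountVectorizer().fit_transform(emails)              # sparse bag-of-words block
X = hstack([bow, csr_matrix(dense_meta)]).tocsr()          # stack all blocks side by side
print(X.shape)   # one wide, partly sparse matrix; let the learner pick what matters
```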

Example: Thought reading (fMRI scan). Nobody knows what the features are, but it works!!! [Mitchell et al., 2008]

3. Training Signal

1. How reliable is my labeling source? (Even editors only agree 33% of the time.)
2. Does the signal have high coverage?
3. Is the signal derived independently of the features?!
4. Could the signal shift after deployment?

Quiz: Spam filtering. Which training signal would you use?

• The spammer with a known IP has sent 10M spam emails over the last 10 days: label all emails with this IP as spam examples.
• Use users' spam / not-spam votes as signal.
• Use Yahoo! employees' votes.

[Rotated slide annotations flag the trade-offs, including noisy individual labels and low coverage.]

Example: Spam filtering

[Diagram: incoming email → spam filter → Inbox or Junk; the user provides feedback: SPAM / NOT-SPAM]

Example: Spam filtering

[Diagram: incoming email → old spam filter annotates the email → new ML spam filter → Inbox or Junk; the user provides feedback: SPAM / NOT-SPAM]

QUIZ: What is wrong with this setup?

Example: Spam filtering

Problem: Users only vote when the classifier is wrong.
The new filter learns to exactly invert the old classifier.
Possible solution: Occasionally let emails through the filter to avoid bias.
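A minimal sketch (assumed names and rate, not the production system) of "occasionally let emails through": with a small probability, deliver the email regardless of the old filter's verdict, so user feedback is not collected only where the old filter was wrong.

```python
import random

EXPLORE_RATE = 0.01   # assumed fraction of emails delivered regardless of the verdict

def route_email(old_filter_says_spam):
    """Return (folder, unbiased_feedback_flag)."""
    if random.random() < EXPLORE_RATE:
        return "inbox", True          # exploration sample: feedback usable as clean signal
    return ("junk" if old_filter_says_spam else "inbox"), False

print(route_email(old_filter_says_spam=True))
```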

Example: Trusted votes

Goal: Classify email votes as trusted / untrusted.

Signal conjecture: [Diagram: votes over time, voted "good" early on, then voted "bad", attributed to an evil spammer community]

Searching for signal

The good news: We found that exact pattern A LOT!!
The bad news: We found other patterns (voted "bad" then "good", "good" then "bad" then "good", ...) just as often.

Moral: Given enough data you'll find anything! You need to be very, very careful that you learn the right thing!

4. Learning Method

• Classification / Regression / Ranking?
• Do you want probabilities?
• Do you have skewed classes / weighted examples?
• Best off-the-shelf: SVMs or boosted decision trees
• Generally: Try out several algorithms (a small sketch follows below)
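A minimal "try out several algorithms" sketch in scikit-learn; the synthetic data stands in for the spam features and is not from the slides.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# fit a few off-the-shelf learners on the same split and compare
for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("linear SVM", LinearSVC()),
                  ("boosted trees", GradientBoostingClassifier())]:
    acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name:20s} test accuracy: {acc:.3f}")
```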

Method Complexity (KISS)

Common pitfall: using a too-complicated learning algorithm.
• ALWAYS try the simplest algorithm first!!!
• Move to more complex systems after the simple one works.
• Rule of diminishing returns!! (Scientific papers exaggerate the benefit of complex theory.)

QUIZ: What would you use for spam?

Ready-Made Packages

• Weka 3: http://www.cs.waikato.ac.nz/~ml/index.html
• LIBSVM (powerful SVM implementation): http://www.csie.ntu.edu.tw/~cjlin/libsvm/
• SVM Light / SVM struct: http://svmlight.joachims.org/
• SVM Lin (very fast linear SVM): http://people.cs.uchicago.edu/~vikass/svmlin.html
• Vowpal Wabbit (very large scale): http://hunch.net/~vw/
• Machine Learning Open Software Project: http://mloss.org/software
• MALLET (MAchine Learning for LanguagE Toolkit): http://mallet.cs.umass.edu/index.php/Main_Page
• Internal: Boosted Decision Tree implementation: http://twiki.corp.yahoo.com/view/Ysti18n/MLRModelingPackage
• Internal: Alex Smola's LDA implementation: smola@yahoo-inc.com
• Internal: Data Mining Platform: http://twiki.corp.yahoo.com/view/Yst/Clue

Model Selection (parameter setting with cross validation)

Split Train into Train' / Val.
• Do not trust default parameters!!!!
• Grid search over parameters; most importantly: the learning rate!! (a small grid-search sketch follows below)
• Pick the parameters that do best on Val.
• Often easy to parallelize with hod-farm on Hadoop (Jerry Ye).
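A minimal grid-search sketch over regularization strength and learning rate on a held-out validation split, using scikit-learn's SGDClassifier on synthetic stand-in data; parameter grids are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=3000, n_features=30, random_state=0)
cut = int(0.8 * len(y))                                   # Train -> Train' / Val
X_tr, y_tr, X_val, y_val = X[:cut], y[:cut], X[cut:], y[cut:]

best = None
for alpha in [1e-6, 1e-5, 1e-4, 1e-3]:                    # regularization strength
    for eta0 in [0.01, 0.1, 1.0]:                         # learning rate
        clf = SGDClassifier(alpha=alpha, learning_rate="constant", eta0=eta0,
                            random_state=0).fit(X_tr, y_tr)
        acc = clf.score(X_val, y_val)
        if best is None or acc > best[0]:
            best = (acc, alpha, eta0)

print("best (val accuracy, alpha, eta0):", best)          # pick these for the final model
```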

5. Experimental Setup

1. Automate everything (one-button setup)
   • pre-processing / training / testing / evaluation
   • lets you reproduce results easily
   • fewer errors!!
2. Parallelize your experiments (use hod-farm).

Quiz

T/F: Condensing features with domain expertise improves learning? FALSE
T/F: Feature scaling is irrelevant for boosted decision trees. TRUE
T/F: Always compute aggregate statistics over the entire corpus. FALSE
T/F: You cannot create a train/test split when your data changes over time. FALSE
T/F: To avoid data set overfitting, benchmark on a second train/test data set. TRUE
T/F: Ideally, derive your signal directly from the features. FALSE

Debugging ML algorithms .

Problem: Spam filtering

You implemented logistic regression with regularization.
Problem: Your test error is too high (12%)!

QUIZ: What can you do to fix it?

Fixing attempts:
• Get more training data
• Get more features
• Select fewer features
• Feature engineering (e.g. meta features, header information)
• Run gradient descent longer
• Use Newton's Method for optimization
• Change regularization
• Use SVMs instead of logistic regression

But: which one should we try out?

Possible problems / Diagnostics:
• Underfitting: training error almost as high as test error
• Overfitting: training error much lower than test error
• Wrong algorithm: other methods do better
• Optimizer: loss function is not minimized

Underfitting / Overfitting .

Diagnostics: overfitting

[Learning-curve plot: error vs. training set size, showing testing error, training error, and desired error]

• test error still decreasing with more data
• large gap between train and test error

Diagnostics: underfitting

[Learning-curve plot: error vs. training set size, showing testing error, training error, and desired error]

• even training error is too high
• small gap between train and test error
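A minimal learning-curve sketch in the spirit of the two plots above: train a model on growing fractions of the training data and plot training vs. test error. Uses scikit-learn and matplotlib on synthetic stand-in data, not the spam task.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=4000, n_features=40, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

fracs = np.linspace(0.1, 1.0, 8)
train_err, test_err = [], []
for f in fracs:
    n = int(f * len(y_tr))
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:n], y_tr[:n])
    train_err.append(1 - clf.score(X_tr[:n], y_tr[:n]))   # error on the data actually used
    test_err.append(1 - clf.score(X_te, y_te))            # error on the fixed test set

plt.plot(fracs, train_err, label="training error")
plt.plot(fracs, test_err, label="testing error")
plt.xlabel("fraction of training data used"); plt.ylabel("error"); plt.legend(); plt.show()
```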

Convergence problem vs. wrong loss function? .

Loss function, where are you?

[Plot: the loss L(w) over the parameters w, with only a global minimum]

Diagnostics

Your loss function is L(w) [w = parameters].
• Train with various other loss functions / methods (e.g. use WEKA).
• If all algorithms perform worse on the test set, you might need a more powerful function class (kernels, decision trees?).
• Otherwise: compute your training loss L(w') [w' = parameters obtained by the other loss]. (A small sketch follows below.)
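A sketch of that comparison, feeding into the two cases on the next slides: evaluate your loss (here the logistic loss) at w, your optimizer's solution, and at w', a solution obtained with a different loss (here a linear SVM). Synthetic stand-in data with labels in {0, 1}; regularization is left at library defaults.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def logistic_loss(w, b, X, y):
    margins = (2 * y - 1) * (X @ w + b)            # labels {0,1} mapped to {-1,+1}
    return np.mean(np.logaddexp(0.0, -margins))    # average log(1 + e^{-margin})

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
lr  = LogisticRegression(max_iter=1000).fit(X, y)  # minimizes the logistic loss directly -> w
svm = LinearSVC(max_iter=10000).fit(X, y)          # different (hinge) loss -> w'

print("L(w)  =", logistic_loss(lr.coef_.ravel(),  lr.intercept_[0],  X, y))
print("L(w') =", logistic_loss(svm.coef_.ravel(), svm.intercept_[0], X, y))
# Case 1: w' has lower test error but HIGHER L  -> your loss function is the problem.
# Case 2: w' has lower test error and LOWER  L  -> your optimizer is the problem.
```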

Diagnostics

Case 1: w' has lower test error and higher loss. Your loss function is bad.

[Plot: loss vs. iterations, with L(w') above L(w)]

Diagnostics

Case 2: w' has lower test error and lower loss. Your optimizer is bad.

[Plot: loss vs. iterations, with L(w') below L(w)]

Quiz: Which problem does each fix help against? (overfitting / underfitting / bad optimizer / bad loss)

• Get more training data → overfitting
• Get more features → underfitting
• Select fewer features → overfitting
• Feature engineering → underfitting
• Run gradient descent longer → bad optimizer
• Use Newton's Method for optimization → bad optimizer
• Change regularization → underfitting / overfitting
• Use SVMs instead → bad loss function

Debugging Data .

Problem: Online error > test error

[Learning-curve plot: error vs. training set size; the online error sits well above the testing error, training error, and desired error]

Analytics: train/test vs. online

Suspicion: The online data is distributed differently.
Construct a new binary classification problem: online vs. train+test.
If you can learn this (error < 50%), you have a distribution problem!!
You do not need any labels for this!!
(Alex Smola at YR is a world expert in covariate shift.)
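A sketch of that "origin" discriminator: merge the two unlabeled pools, label each example only by where it came from, and check whether a classifier beats chance. The synthetic data below has a deliberate shift; no task labels are used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_offline = rng.normal(0.0, 1.0, size=(1000, 10))   # stand-in for train+test features
X_online  = rng.normal(0.3, 1.0, size=(1000, 10))   # stand-in for online traffic (shifted)

X = np.vstack([X_offline, X_online])
y = np.r_[np.zeros(len(X_offline)), np.ones(len(X_online))]   # origin labels only

acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
print("origin-classifier accuracy:", acc)
# ~0.5 (chance, for equally sized pools): no detectable shift; well above 0.5: distribution problem
```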

Suspicion: Temporal distribution drift

[Diagram: time-ordered Train | Test split → 12% error; shuffled Train | Test split → 1% error]

If E(shuffled) < E(train/test by time), then you have temporal distribution drift.
Cures: retrain frequently / online learning.
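A sketch of that check: compare the error of a time-ordered split against the error of a shuffled split of the same data. The synthetic data below has a class boundary that drifts over time; the construction is illustrative, not from the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 4000
t = np.linspace(0, 1, n)                           # "time" of each example, already sorted
X = rng.normal(size=(n, 5))
y = (X[:, 0] + 3 * t * X[:, 1] + 0.3 * rng.normal(size=n) > 0).astype(int)  # drifting boundary

def split_error(X_tr, X_te, y_tr, y_te):
    return 1 - LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

cut = int(0.8 * n)
err_time = split_error(X[:cut], X[cut:], y[:cut], y[cut:])                      # split by time
err_shuf = split_error(*train_test_split(X, y, test_size=0.2, random_state=0))  # shuffled split
print(f"time split error: {err_time:.3f}   shuffled split error: {err_shuf:.3f}")
# A clearly lower shuffled error points at temporal distribution drift.
```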

Problem: You are “too good” on your setup

[Plot: error vs. iterations; training and testing error drop below the desired error while the online error stays high]

Possible Problems

• Is the label included in the data set?
• Does the training set contain test data? (a simple overlap check is sketched below, after the Caltech example)

Famous example in 2007: Caltech 101
[Bar chart: Caltech 101 test accuracy by year, 2005 to 2007; y-axis 0 to 90%]

[Slide: Caltech 101 results, 2007 vs. 2009]
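One sanity check for the "training set contains test data" problem, as a minimal sketch (assumed plain numpy feature matrices of the same dtype): count how many test rows appear verbatim in the training matrix.

```python
import numpy as np

def test_rows_in_train(X_train, X_test):
    """Fraction of test rows that also appear, byte-for-byte, in the training matrix."""
    train_rows = {row.tobytes() for row in np.ascontiguousarray(X_train)}
    hits = sum(row.tobytes() in train_rows for row in np.ascontiguousarray(X_test))
    return hits / len(X_test)

X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_test  = np.array([[3.0, 4.0], [7.0, 8.0]])       # one row leaked from the training set
print(test_rows_in_train(X_train, X_test))          # -> 0.5
```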

Final Quiz

T/F: Increasing your training set size increases the training error. True
T/F: Increasing your feature set size decreases the training error. True
T/F: Better and more features always decrease the test error? False
T/F: When an algorithm overfits, there is a big gap between train and test error. True
T/F: Underfitting can be cured with more powerful learners. True
T/F: The test error is (almost) never below the training error. True
T/F: Very low test error always indicates you are doing well. False
T/F: Temporal drift can be detected through shuffling the training/test sets. True

Summary

Marty: “Machine learning is only sexy when it works.”

ML algorithms deserve a careful setup. Debugging is just like any other code:
1. Carefully rule out possible causes
2. Apply the appropriate fix

Resources

• Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)
• Y. LeCun, L. Bottou, G. Orr and K. Muller: Efficient BackProp. In Orr, G. and Muller, K. (Eds.), Neural Networks: Tricks of the Trade, Springer, 1998.
• Pattern Recognition and Machine Learning by Christopher M. Bishop
• Andrew Ng's ML course: http://www.youtube.com/watch?v=UzxYlbK2c7E
