
Latent feature approach for Kaggle's GrockIt challenge

Rohan Anil
Advised by Prof. Charles Elkan, in collaboration with Aditya Menon
UC San Diego
March 19, 2012

Outline

Introduction

Kaggle.com


GrockIt.com

Dataset

Training Set: 4,851,476 outcomes of students answering various questions.
Outcomes: four types: i) correct, ii) incorrect, iii) skipped, iv) timed-out.
Students: practicing for competitive exams: i) GMAT, ii) ACT and iii) SAT.

Dataset


Differences between the training set and the test set:
Bias: the test set is biased towards users who have answered more questions.
#Responses: only one response per student.
Temporal: outcomes are later in time than the training and validation responses of that student.
Outcomes: the test set distribution differs from the training set; it does not include timed-out or skipped outcomes.

Baseline

Rasch Baseline

A baseline was provided by Kaggle for the dataset.

...

Parameters: B_s, the ability of the student s, and a per-question parameter for the difficulty of the question q.

For a given student, the probability of answering correctly varies from question to question only through the difficulty of the question q.
As a consequence, the ranking of questions by probability of a correct answer is the same for every student.
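The ranking consequence can be checked with a small Python sketch. It assumes the standard Rasch form, in which the probability of a correct answer is a logistic function of ability minus difficulty; the student names, question names and parameter values are hypothetical.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct answer, assuming the standard Rasch form."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical parameters, for illustration only.
abilities = {"weak_student": -1.0, "strong_student": 2.0}
difficulties = {"q_easy": -0.5, "q_medium": 0.3, "q_hard": 1.5}

def rank_questions(ability):
    """Questions sorted from most to least likely to be answered correctly."""
    return sorted(
        difficulties,
        key=lambda q: rasch_probability(ability, difficulties[q]),
        reverse=True,
    )

# The probabilities differ per student, but the ordering of questions does not.
rankings = {student: rank_questions(a) for student, a in abilities.items()}
```

Because ability only shifts every question's score by the same amount, such a model cannot express that one student finds a particular question unusually easy or hard.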

Dataset

Validation set: Grockit created a validation set which contains responses of 80,075 students on different questions.
Test set: the test set was used for ranking the teams; it contains responses of 93,100 users on different questions.

Dyadic Prediction

A dyadic prediction task is a learning task which involves predicting a class label for a pair of items (Hofmann et al., 1999).

Side-Information

Sometimes there is more information in the dataset for a dyad (u,i):
1. side-information associated with u
2. side-information associated with i
3. interaction side-information for (u,i)

The dataset contains student responses for various questions. 179,107 students and 6,046 questions

Nominal Outcomes

Skipped, Timed out

Dyadic Prediction

Training Set

Dyadic Prediction

Query in Test

Associated with a student: Not Available
Associated with a question: Question Type, Group, Track, Subtrack, Tags
Associated with a (student, question) dyad: Game, Number of Players, Started at, Answered at, Deactivated at, Question set

Side Information

Question Type: Multiple Choice, Free Response

Group: ACT, GMAT, SAT

Subtrack: Critical Reasoning, Data Sufficiency, English, Identifying Sentence Errors, Improving Paragraphs, Improving Sentences, Math, Multiple Choice, Passage Based Reading, Problem Solving, Reading, Reading Comprehension, Science, Sentence Completion, Sentence Correction, Student Produced Response

Dataset


The dataset is similar to the typical dyadic dataset with a couple of key differences:

Duplicate Dyads

There can be duplicate dyads in the training set with different outcomes, since a student can answer a question many times.

In some game types, students can collaboratively answer questions.

Highly successful in the $1M Netflix Prize challenge (Töscher et al., 2009), where the problem was to predict ratings for movies.

Binomial Capped Deviance, similar to log-likelihood
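A minimal Python sketch of a capped binomial deviance follows. The cap at [0.01, 0.99] and the base-10 logarithm are assumptions here, not stated on the slide; the function name is hypothetical.

```python
import math

def capped_binomial_deviance(y_true, p_pred, low=0.01, high=0.99):
    """Mean negative log10-likelihood with predictions clipped to [low, high].

    The cap values and the log base are assumptions for this sketch.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, low), high)  # cap, so one confident miss cannot dominate
        total += y * math.log10(p) + (1 - y) * math.log10(1 - p)
    return -total / len(y_true)

# Hypothetical labels and predicted probabilities of a correct response.
score = capped_binomial_deviance([1, 0, 1], [0.9, 0.2, 0.6])
```

Lower is better. Without the cap, this is the average negative log-likelihood up to the choice of log base, which is why the metric rewards well calibrated probabilities.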

Leaderboard

Motivations for Latent Feature Log-Linear (LFL) (Menon & Elkan, 2010)

Well Calibrated Probabilities

We need to predict the probability of a correct outcome for the dyadic pairs in the test set.

Leverage Side-Information

Most collaborative filtering algorithms do not have any principled way of including side-information

Scale Well

To be used in the industry, the method has to scale well to large datasets

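Case |Y| = 3

Each outcome y in {1, 2, 3} has its own pair of latent vectors: U^y for the user and I^y for the item.

$$p(y = 3 \mid (\text{user}, \text{item})) = \frac{\exp(U^{3}_{\text{user}} \cdot I^{3}_{\text{item}})}{Z}$$

$$Z = \exp(U^{1}_{\text{user}} \cdot I^{1}_{\text{item}}) + \exp(U^{2}_{\text{user}} \cdot I^{2}_{\text{item}}) + \exp(U^{3}_{\text{user}} \cdot I^{3}_{\text{item}})$$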
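As a rough illustration, the three-outcome prediction can be computed as a softmax over per-outcome dot products. This is a sketch only: the function name and the example vectors are hypothetical, and outcomes are indexed from 0 in the code.

```python
import math

def lfl_probabilities(user_vectors, item_vectors):
    """Outcome probabilities for one (user, item) dyad.

    user_vectors[y] and item_vectors[y] are the latent vectors for outcome y.
    """
    scores = [
        sum(u * i for u, i in zip(u_y, i_y))
        for u_y, i_y in zip(user_vectors, item_vectors)
    ]
    top = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)  # the normalizer Z
    return [e / z for e in exps]

# Hypothetical 2-dimensional latent vectors for three outcomes.
user_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
item_vectors = [[0.5, 0.5], [0.2, 0.1], [1.0, 1.0]]
probs = lfl_probabilities(user_vectors, item_vectors)
```

Dividing by Z guarantees that the probabilities over the outcomes sum to one.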

Test Set contains only two types of outcomes i) correct ii) incorrect

y = 1 (Correct Response)
y = 0 (Incorrect Response)
The binary-LFL model has appeared in the literature before (Schein et al., 2003; Agarwal & Chen, 2009).

Training

We minimize the negative log-likelihood.

Regularization Terms

We can optimize this objective function using the stochastic gradient descent method.
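The training loop can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the binary prediction is a sigmoid of the dot product of a student vector and a question vector, uses L2 regularization, and all names, hyperparameters and the toy data are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(U, V, s, q):
    """Assumed binary form: p(correct) = sigmoid(U[s] . V[q])."""
    return sigmoid(sum(a * b for a, b in zip(U[s], V[q])))

def mean_nll(U, V, data):
    """Average negative log-likelihood over (student, question, label) triples."""
    total = 0.0
    for s, q, y in data:
        p = predict(U, V, s, q)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def train_binary_lfl(data, n_students, n_questions, k=2,
                     lr=0.1, reg=0.01, epochs=200, seed=0):
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_students)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_questions)]
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)  # visit examples in random order each epoch
        for s, q, y in data:
            err = predict(U, V, s, q) - y  # gradient of the NLL w.r.t. the score
            for f in range(k):
                u, v = U[s][f], V[q][f]
                U[s][f] -= lr * (err * v + reg * u)  # L2 term shrinks weights
                V[q][f] -= lr * (err * u + reg * v)
    return U, V

# Toy data: (student index, question index, 1 = correct / 0 = incorrect).
toy = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
U, V = train_binary_lfl(toy, n_students=2, n_questions=2)
```

Each update touches only the two vectors involved in the current dyad, which is what lets this style of training scale to millions of responses.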

LFL on GrockIt

Grid Search

parameters

This is us!!! =)

Parallelism

Side-Information

For a question q, let g = group(q). We can add a latent vector for each group, i.e. ACT, GMAT and SAT.
The prediction equation after adding side-information is

Categorical Features

Group: G
Track: T
Subtrack: ST
Game Type: GT
Question Type: QT

LFL Models

Training Set

The training set contains four types of outcomes: i) correct, ii) incorrect, iii) skipped and iv) timed-out.
The test set contains two types of outcomes: i) correct, ii) incorrect.
We create two training sets:

a) Training set with skipped and timed-out responses excluded
b) Training set with skipped and timed-out responses treated as an incorrect outcome
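The two variants can be built with a simple filter and relabel step. The record layout and outcome strings below are hypothetical; the real data files may encode outcomes differently.

```python
# Hypothetical records: (student_id, question_id, outcome).
responses = [
    ("s1", "q1", "correct"),
    ("s1", "q2", "skipped"),
    ("s2", "q1", "incorrect"),
    ("s2", "q3", "timed-out"),
]

def exclude_unanswered(rows):
    """Variant (a): drop skipped and timed-out responses."""
    return [
        (s, q, 1 if outcome == "correct" else 0)
        for s, q, outcome in rows
        if outcome in ("correct", "incorrect")
    ]

def unanswered_as_incorrect(rows):
    """Variant (b): keep every response; anything not correct becomes 0."""
    return [(s, q, 1 if outcome == "correct" else 0) for s, q, outcome in rows]

train_a = exclude_unanswered(responses)
train_b = unanswered_as_incorrect(responses)
```

Variant (a) is smaller but matches the outcome types that appear in the test set.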

Observation

Throwing away data helps!

Removing skipped and timed-out responses from training set improved the BCD (binomial capped deviance)

This motivates adapting the model to the test set distribution to win the competition.

Ensemble Learning

No Single Model works well on every dyad.

Combining predictions from multiple models can outperform each of the individual models (Takács et al., 2009).

True labels for four samples: (1,1,0,0)
Predictions from four different models:
(0,1,0,0) accuracy 75%
(1,0,0,0) accuracy 75%
(1,1,1,0) accuracy 75%
(1,1,0,1) accuracy 75%
Average of the different models: (.75,.75,.25,.25)
Threshold the average at 0.5: (1,1,0,0) accuracy = 100%
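The arithmetic of this example can be reproduced in a few lines of Python; the variable and function names are illustrative.

```python
true_labels = [1, 1, 0, 0]
model_predictions = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
]

def accuracy(predicted, actual):
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Average the four models sample by sample, then threshold at 0.5.
averaged = [sum(column) / len(column) for column in zip(*model_predictions)]
thresholded = [1 if p > 0.5 else 0 for p in averaged]

individual_accuracies = [accuracy(p, true_labels) for p in model_predictions]
ensemble_accuracy = accuracy(thresholded, true_labels)
```

Each model is wrong on a different sample, so the errors cancel in the average.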

For a set with known labels, {(s,q) → y(s,q)}, where y can take the value 0 or 1, p_i = p_i(y = 1 | (s,q)) is the estimated probability of a correct response from the i-th model.
Define a matrix P and a column matrix Y: each row of P contains the predictions from the n models, (p_1, ..., p_i, ..., p_n), for one dyad, and the corresponding row of Y contains the target value y(s,q).
Using the predictions for every dyad in the set, we build P and Y and solve

Pw = Y

....

To predict the probability of a correct response for an example in the test set, we combine the predictions from the n models using the weight vector w:

$$p_{\text{estimated}} = \sum_{j} w_j \, p_j$$

Step 1: for each of the n models: train on the training set, predict on the validation set, save the parameters.
Step 2: estimate w using linear regression on the validation set predictions.
Step 3: for each of the n models: train on the training set + validation set, predict on the test set.
Step 4: combine the predictions on the test set using w.
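Step 2 and Step 4 can be sketched in plain Python by solving the least-squares problem for Pw = Y through the normal equations. This is one possible way to do the regression, not necessarily the one used in the project; the helper names and the example predictions are hypothetical.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_blend_weights(P, Y):
    """Least-squares solution of P w = Y via (P^T P) w = P^T Y."""
    n = len(P[0])
    PtP = [[sum(row[i] * row[j] for row in P) for j in range(n)]
           for i in range(n)]
    PtY = [sum(row[i] * y for row, y in zip(P, Y)) for i in range(n)]
    return solve_linear_system(PtP, PtY)

def blend(w, predictions):
    """Step 4: weighted combination of the n model predictions for one dyad."""
    return sum(w_j * p_j for w_j, p_j in zip(w, predictions))

# Hypothetical validation predictions from two models, with true outcomes.
validation_P = [[0.9, 0.7], [0.2, 0.4], [0.6, 0.8], [0.3, 0.1]]
validation_Y = [1, 0, 1, 0]
w = fit_blend_weights(validation_P, validation_Y)
test_probability = blend(w, [0.8, 0.6])
```

Because plain linear regression is unconstrained, a blended value can fall outside [0, 1]; clipping it before scoring is a reasonable precaution.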

Results

2 weeks later

Leverage Side-Information in Ensemble Learning

The Gradient Boosted Decision Trees (GBDT) algorithm (Friedman, 1999) can be used to combine predictions and side-information together.
GBDT is a powerful learning algorithm that is widely used (see Li & Xu, 2009, chap. 6).
The core of the algorithm is a decision tree learner.

Decision Tree

Decision trees can handle both i) numeric and ii) categorical variables. They can also handle missing information.

Decision Tree

Decision function: each leaf predicts the average target value of the training examples that reach it, for example (Y1 + Y3) / 2 in one leaf and (Y6 + Y7 + Y9) / 3 in another.

Gradient Boosting

Select the base learner and the loss function.

We use a decision tree as the base learner and squared loss as the loss function.
Gradient boosting is an iterative procedure: at each iteration, fit a base learner to the (negative) gradient of the loss at the previous iteration's predictions.
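A toy Python sketch of this procedure follows. It simplifies heavily: inputs are one-dimensional, the base learner is a depth-one tree (a stump), and the learning rate is an assumption not mentioned on the slide. For squared loss, the negative gradient is simply the residual y minus the current prediction.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature under squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    # Each leaf predicts the mean residual of the examples that fall in it.
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.3):
    base = sum(ys) / len(ys)  # start from the constant prediction
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - p)^2 with respect to p.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Hypothetical data with a bump in the middle that one stump cannot fit.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0.1, 0.2, 0.1, 0.9, 0.8, 0.9, 0.3, 0.2]
model = gradient_boost(xs, ys)
```

In the ensemble setting described above, the input would be a feature vector holding the model predictions together with the side-information, and the trees would split on any of those features.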

Gradient Boosting

Meta-Features

Preprocessing Tags

Each question has a set of tags associated with it. Some are listed below, with their ids:

Statistics (incl. mean median mode), 259
Strengthen Hypothesis, 260
Student Produced Response, 261
System of Linear Equations, 262
Systems of Linear Equations, 263
Systems of linear equations and inequalities, 264

We manually merge the tags that we feel are very similar.
We cluster the tags into 40 clusters using spectral clustering (Ng et al., 2001), with the normalized co-occurrence of tags as the similarity measure used to generate the affinity matrix A.
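The affinity matrix A could be built along the following lines. The slide does not say how the co-occurrence counts are normalized, so the cosine-style normalization below is an assumption, as are the function name and the example tag sets; the spectral clustering step itself is not shown.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_affinity(question_tags):
    """Symmetric tag-by-tag affinity from tag co-occurrence on questions.

    Assumed normalization: count(a, b) / sqrt(count(a) * count(b)).
    """
    tags = sorted({t for ts in question_tags.values() for t in ts})
    index = {t: i for i, t in enumerate(tags)}
    tag_count = Counter(t for ts in question_tags.values() for t in ts)
    pair_count = Counter()
    for ts in question_tags.values():
        for a, b in combinations(sorted(ts), 2):
            pair_count[(a, b)] += 1
    n = len(tags)
    A = [[0.0] * n for _ in range(n)]
    for (a, b), count in pair_count.items():
        value = count / math.sqrt(tag_count[a] * tag_count[b])
        A[index[a]][index[b]] = value
        A[index[b]][index[a]] = value
    return tags, A

# Hypothetical tag sets for three questions.
question_tags = {
    "q1": {"Algebra", "System of Linear Equations"},
    "q2": {"Algebra", "System of Linear Equations"},
    "q3": {"Algebra", "Statistics"},
}
tags, A = cooccurrence_affinity(question_tags)
```

With this normalization, two tags that always appear together get an affinity of 1, and tags that never share a question get 0.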

...

Last day

Combining the predictions from the GBDT models using linear regression improved results slightly.

The latent feature approach is a good fit for this dataset; LFL performs really well on it.
Code will be available soon at http://code.google.com/p/latent-feature-log-linear/

Questions

References

Agarwal, Deepak and Chen, Bee-Chung. Regression based latent factor models. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '09, pp. 19-28, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-495-9.

Friedman, Jerome H. Stochastic gradient boosting. Computational Statistics and Data Analysis, 38:367-378, 1999.

Gemulla, Rainer, Nijkamp, Erik, Haas, Peter J., and Sismanis, Yannis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '11, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0813-7.

Hofmann, Thomas, Puzicha, Jan, and Jordan, Michael I. Learning from dyadic data. In Proceedings of the 1998 conference on Advances in neural information processing systems II, pp. 466-472, Cambridge, MA, USA, 1999. MIT Press. ISBN 0-262-11245-0.

Li, Xiaochun and Xu, Ronghui (eds.). High dimensional data analysis in cancer research. Springer, CA, U.S.A, 2009.

Menon, Aditya Krishna and Elkan, Charles. A log linear model with latent features for dyadic prediction. In ICDM '10, pp. 364-373, 2010.

Ng, Andrew Y., Jordan, Michael I., and Weiss, Yair. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pp. 849-856. MIT Press, 2001.

Rasch, Georg. Estimation of parameters and control of the model for two response categories, 1960.

Schein, Andrew I., Saul, Lawrence K., and Ungar, Lyle H. A generalized linear model for principal component analysis of binary data, 2003.

Takács, Gábor, Pilászy, István, Németh, Bottyán, and Tikk, Domonkos. Scalable collaborative filtering approaches for large recommender systems. J. Mach. Learn. Res., 10:623-656, June 2009. ISSN 1532-4435.

Töscher, Andreas, Jahrer, Michael, and Bell, Robert M. The BigChaos solution to the Netflix grand prize, 2009.
