
KDD Cup 2014 - Predicting Excitement at DonorsChoose.org

Winning Entry Documentation

Name: Jeremy Achin
Location: Boston MA, United States
Email: jeremy@datarobot.com

Name: Xavier Conort
Location: Singapore
Email: xavier@datarobot.com

Name: Lucas Eustáquio Gomes da Silva
Location: Belo Horizonte MG, Brazil
Email: lucas@datarobot.com

Summary

DonorsChoose.org is an online charity that makes it easy to help students in need through
school donations. At any time, thousands of teachers in K-12 schools propose projects
requesting materials to enhance the education of their students. When a project reaches its
funding goal, DonorsChoose.org ships the materials to the school.
The 2014 KDD Cup asked participants to help DonorsChoose.org identify projects that are
exceptionally exciting to donors at the time of posting.

To predict how exciting a project is, data was provided in a relational format and
split by dates. Any project posted prior to January 1, 2014 was in the training set (along
with its funding outcomes). Any project posted after January 1, 2014 was in the test set.

The test set used known outcomes from January 2014 to mid-May 2014. Kaggle ignored
live projects in the test set and did not disclose which projects were still live, to avoid
leakage regarding the funding status.

A data dictionary for the data provided is available here:
https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose/data

Our approach tried to extract the best features from the data and use them in 2 Gradient
Boosting Machine models:

- one based on the sklearn GradientBoostingRegressor
  (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)
- and one based on the R gradient boosting machine (gbm) package
  (http://cran.r-project.org/web/packages/gbm/)

Both used 2013 outcomes only as the response. This gave us a significant gain in computation
time without much loss in predictive accuracy.

We assumed that the mid-May cutoff in the test set produces a censoring effect
on the response. To reproduce this assumed time effect on the response in the training
set, we censored the response before training our models and created 2 types of censored
outcomes:
- random cutoff outcomes: exciting outcomes censored at 3 cutoffs drawn randomly
  from the first 131 days after the project was posted
- 20 weeks outcomes: exciting outcomes censored every week during the first 20
  weeks

Feature Extraction

Our feature extraction consists of:
1. Raw features from projects.csv. This file contains information about each project and
   was provided for both the training and test sets.
2. Lapses between projects posted by teachers
3. Proxies of text posted by teachers, such as number of characters, number of words, stats on
   length of words, number of sentences, number of words per sentence, stats on punctuation
   usage, misspellings, etc.
4. Stacked predictions of the exciting outcome and of the criteria required to
   qualify as exciting, based on information contained in the project title and essay
   posted by the teacher
5. Deviations from an expected project cost that was estimated by a stacked
   prediction of the cost. The model used to predict the cost was a Gradient Boosting Machine
   that used "primary_focus_subject", "grade_level" and
   "students_reached" as predictors
6. Vendor id of the most expensive item in the project
7. Stacked predictions of the final exciting outcome based on the name of the most
   expensive item
8. History features
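
As an illustration of the text proxies in item 3, here is a minimal Python sketch (Python 3, standard library only). It is not the authors' R code: the function name `text_proxies`, the tokenization rules and the exact set of statistics are assumptions, and the misspelling count is left out because it would need a dictionary that the write-up does not describe.

```python
import re
import statistics

def text_proxies(text):
    """Compute simple surface statistics for one title or essay (hypothetical helper)."""
    # Assumption: a "word" is a run of letters or apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    # Assumption: sentences end at '.', '!' or '?'; empty fragments are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    word_lengths = [len(w) for w in words]
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "mean_word_len": statistics.mean(word_lengths) if words else 0.0,
        "max_word_len": max(word_lengths, default=0),
        "n_sentences": len(sentences),
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "n_exclamations": text.count("!"),
        "n_questions": text.count("?"),
        "n_commas": text.count(","),
    }

proxies = text_proxies("My students need books! They love reading, writing and art.")
```

Each key of the returned dictionary would become one numeric column in the feature matrix.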

To build history features, we sliced time into chunks of 4 months, computed statistics for
each chunk and used as features the stats of the last 3 chunks prior to the time chunk of
the project.

The stats that we computed for each chunk include stats on:
1. Projects posted by teachers:
   a. number of projects
   b. for each criterion, number of projects that met the criterion
   c. criteria met by the last project
   d. mean project cost
2. Donations received by teachers:
   a. number of donations received
   b. sum and last amounts received
3. Donations made by teachers:
   a. number of donations made
   b. sum of amounts donated
   c. number of exciting projects to which the teachers donated money
   d. sum of distances between the teacher location and the location of the projects
      they sponsored
4. Donations made by the zip, city, state of the project:
   a. sum and mean amount donated
   b. sum and mean of exciting outcomes of the projects sponsored
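
The chunked history described above can be sketched as follows. This is a simplified Python illustration, not the authors' implementation: it covers only two of the listed statistics (number of projects and mean cost per teacher), the field names `teacher`, `posted` and `cost` are hypothetical, and aligning chunks to January-April, May-August and September-December is an assumption, since the write-up does not say where chunk boundaries fall.

```python
from collections import defaultdict
from datetime import date

def chunk_index(d):
    """Index of the 4-month chunk containing date d (assumed calendar-aligned)."""
    return (d.year * 12 + d.month - 1) // 4

def build_chunk_stats(projects):
    """Aggregate, per (teacher, chunk), the number of projects and their total cost."""
    stats = defaultdict(lambda: {"n_projects": 0, "sum_cost": 0.0})
    for p in projects:
        key = (p["teacher"], chunk_index(p["posted"]))
        stats[key]["n_projects"] += 1
        stats[key]["sum_cost"] += p["cost"]
    return stats

def history_features(teacher, posted, stats, n_lags=3):
    """Stats of the n_lags chunks strictly before the chunk of the project."""
    current = chunk_index(posted)
    feats = {}
    for lag in range(1, n_lags + 1):
        s = stats.get((teacher, current - lag))
        n = s["n_projects"] if s else 0
        feats[f"n_projects_lag{lag}"] = n
        feats[f"mean_cost_lag{lag}"] = s["sum_cost"] / n if n else None
    return feats

projects = [
    {"teacher": "t1", "posted": date(2013, 2, 10), "cost": 300.0},
    {"teacher": "t1", "posted": date(2013, 3, 5), "cost": 500.0},
    {"teacher": "t1", "posted": date(2013, 6, 1), "cost": 200.0},
]
stats = build_chunk_stats(projects)
feats = history_features("t1", date(2013, 10, 15), stats)
```

Because only chunks strictly before the project's own chunk are used, a project never sees statistics computed from its own period.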

To build stacked predictions of the exciting outcome and of the criteria met by a project, we
trained regularized regressions (from the R package glmnet) on word 2-gram
document-term matrices generated from the project title and the essay posted by the
teacher. Regressions were trained:
- By primary focus area: one model for each area
- For each area, we built:
  - one logistic regression (L1 penalty) to predict is_exciting using the title
    document-term matrix
  - one logistic regression (L2 penalty) to predict is_exciting using the essay
    document-term matrix
  - regressions (L2 penalty) to predict each criterion required to qualify (fully funded,
    at_least_1_teacher_referred_donor, ...) using the essay document-term matrix
    only
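
To make the input of these regressions concrete, the sketch below builds a small word 2-gram document-term matrix in Python. It is only an illustration: the helper names `word_ngrams` and `document_term_matrix` are hypothetical, the tokenization is an assumption, and the write-up does not say whether single words were kept alongside 2-grams, so `n` is left as a parameter.

```python
import re
from collections import Counter

def word_ngrams(text, n=2):
    """Lowercase the text, split it into word tokens and return its word n-grams."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def document_term_matrix(docs, n=2):
    """Return (vocabulary, matrix) where matrix[d][t] counts term t in document d."""
    counts = [Counter(word_ngrams(d, n)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

vocab, matrix = document_term_matrix([
    "Books for my classroom",
    "Art supplies for my classroom",
])
```

A real matrix over all titles or essays would be very sparse, which is one reason penalized regressions such as those in glmnet are a natural fit.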

Stacked predictions of the final exciting outcome based on the name of the most expensive
item used an elastic net logistic regression (from glmnet) trained on a word 2-gram
document-term matrix.

All text stacked predictions were generated via 5-fold cross-validation.
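
The idea behind cross-validated stacked predictions is that each training row is scored by a model that never saw that row. A minimal Python sketch, under stated assumptions: the function names are hypothetical, the fold assignment is a simple shuffled split, and the learner is a stand-in that predicts the training-fold mean (the authors used glmnet regressions).

```python
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Shuffle the row indices and deal them into k folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def out_of_fold_predictions(X, y, fit, predict, k=5, seed=0):
    """For each fold, train on the other folds and predict the held-out rows."""
    oof = [None] * len(X)
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in fold:
            oof[i] = predict(model, X[i])
    return oof

# Stand-in learner (assumption): predicts the mean response of its training rows.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)

def predict_mean(model, x):
    return model

X = list(range(10))
y = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
oof = out_of_fold_predictions(X, y, fit_mean, predict_mean)
```

In this toy run the only positive row (index 3) receives a prediction of 0, because its own label is excluded from the model that scores it. That is the leakage protection that makes the stacked prediction usable as a feature.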

From this, we created 2 sets of features:
- FG1: mostly describes the projects to predict, the teachers' project history and the
  teachers' donations history. The feature set includes:
  - Raw features from projects.csv
  - Lapses between projects posted by teachers
  - Stats on past projects posted by teachers
  - Stats on past donations received by teachers
  - Stats on past donations made by teachers
  - Text proxies of text posted by teachers
- FG2: includes more features on the projects and uses only the history of donations
  made by the teachers or the project locations (zip, city and state). This feature set
  can be seen as designed for teachers with low/no project history, while the first set
  relies more on the past performance of teachers' projects. The feature set includes:
  - Raw features from projects.csv
  - Lapses between projects posted by teachers
  - Stats on past donations made by teachers
  - Stats on past donations made by the zip, city, state of the project
  - Stacked predictions of the exciting outcome and criteria required for a project
    to be qualified as exciting, based on the project titles and essays posted by
    teachers
  - Deviations from the project expected cost
  - Vendor id of the project's most expensive item
  - Stacked predictions of the exciting outcome based on the name of the most
    expensive item
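
The "deviations from the project expected cost" feature can be illustrated with a deliberately simplified Python sketch. The authors estimated the expected cost with a stacked Gradient Boosting Machine using "primary_focus_subject", "grade_level" and "students_reached"; here, as an assumption made only to keep the example short, the expected cost is a leave-one-out mean within each (primary_focus_subject, grade_level) group, and the field name `cost` is hypothetical.

```python
from collections import defaultdict

def cost_deviations(projects):
    """Actual cost minus a leave-one-out group mean (stand-in for the stacked GBM)."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of costs, number of projects]
    for p in projects:
        group = (p["primary_focus_subject"], p["grade_level"])
        totals[group][0] += p["cost"]
        totals[group][1] += 1
    deviations = []
    for p in projects:
        total, count = totals[(p["primary_focus_subject"], p["grade_level"])]
        if count > 1:
            # Exclude the project itself so its own cost does not leak into the estimate.
            expected = (total - p["cost"]) / (count - 1)
            deviations.append(p["cost"] - expected)
        else:
            deviations.append(None)  # no peer projects to compare against
    return deviations

deviations = cost_deviations([
    {"primary_focus_subject": "Literacy", "grade_level": "Grades 3-5", "cost": 200.0},
    {"primary_focus_subject": "Literacy", "grade_level": "Grades 3-5", "cost": 400.0},
    {"primary_focus_subject": "Literacy", "grade_level": "Grades 3-5", "cost": 600.0},
    {"primary_focus_subject": "Music", "grade_level": "Grades 6-8", "cost": 300.0},
])
```

A positive deviation marks a project that asks for more than comparable projects, a negative one a project that asks for less.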

Modeling Techniques and Training

We assumed that the mid-May cutoff in the test set produces a censoring effect
on the response, and we expected this effect to be much stronger for the most recent
months. To reproduce the assumed time effect on the response in the training set, we
censored the response before training our models and created 2 types of censored
outcomes:
- random cutoff outcomes: exciting outcomes censored at 3 cutoffs drawn randomly
  from the first 131 days after the project was posted
- 20 weeks outcomes: exciting outcomes censored every week during the first 20
  weeks
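
A minimal Python sketch of the censoring rule, shown for the 20 weeks outcomes. It rests on an assumption that the write-up does not spell out: that for each training project one can derive `days_to_exciting`, the number of days after posting at which the project had met all the criteria (`None` if it never did). Both that name and the helper names are hypothetical.

```python
def censored_outcome(days_to_exciting, cutoff_days):
    """1 if the project was already exciting cutoff_days after posting, else 0.

    days_to_exciting is None for projects that never became exciting.
    """
    return int(days_to_exciting is not None and days_to_exciting <= cutoff_days)

def weekly_outcomes(days_to_exciting, n_weeks=20):
    """One censored label per week: label w answers 'exciting within w weeks?'."""
    return [censored_outcome(days_to_exciting, 7 * w) for w in range(1, n_weeks + 1)]

labels_fast = weekly_outcomes(30)     # became exciting 30 days after posting
labels_never = weekly_outcomes(None)  # never became exciting
```

A project that became exciting after 30 days is labelled 0 for weeks 1 to 4 and 1 from week 5 onward, which mimics what the test set would show for a project posted less than 5 weeks before the cutoff.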

We trained:
- One sklearn gradient boosted trees model to predict the random cutoff outcomes.
  The model used the FG1 feature set and the random cutoff as predictors. The
  training set size was multiplied by 3, as each record of the training set had 3
  censored responses.
- Twenty R gradient boosting machine (gbm) models: one model for each week of the
  20 weeks outcomes. All models used the FG2 feature set as predictors.
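
The tripling of the training set for the random cutoff model can be sketched in Python as below. The record layout (`features`, `days_to_exciting`), the function name and the uniform draw of cutoffs between 1 and 131 days are assumptions; the write-up only states that 3 cutoffs were drawn randomly from the first 131 days and that the cutoff itself was used as a predictor.

```python
import random

def expand_with_random_cutoffs(records, n_cutoffs=3, max_days=131, seed=0):
    """Turn each record into n_cutoffs rows, each with its own cutoff and censored label."""
    rng = random.Random(seed)
    rows = []
    for rec in records:
        days = rec["days_to_exciting"]  # None if the project never became exciting
        for _ in range(n_cutoffs):
            cutoff = rng.randint(1, max_days)  # assumption: uniform over 1..131 days
            rows.append({
                "features": rec["features"],
                "cutoff_days": cutoff,  # fed to the model as an extra predictor
                "label": int(days is not None and days <= cutoff),
            })
    return rows

records = [
    {"features": [1.0, 0.0], "days_to_exciting": 12},
    {"features": [0.5, 2.0], "days_to_exciting": None},
]
rows = expand_with_random_cutoffs(records)
```

Because the cutoff is a predictor, a single model can learn how the probability of being exciting grows with the time a project has been observed.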

All models were trained with 2013 outcomes only.

The sklearn gradient boosted trees model used the following hyperparameters:
  n_estimators: 2000
  learning_rate: 0.01
  max_features: 12
  max_depth: 7
  subsample: 1

The 20 R gradient boosting machine models used the following hyperparameters:
  distribution = "bernoulli"
  n.trees = 2500 + week_n * 100
  n.minobsinnode = 10
  interaction.depth = 5
  shrinkage = 0.01
  bag.fraction = 0.75
where week_n is the number of weeks used to censor the response.
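
For reference, the two configurations above can be written down as plain Python dictionaries. This is only a restatement of the listed values: the names `SKLEARN_PARAMS` and `r_gbm_params` are hypothetical, and the R settings are shown as a dictionary purely to make the week-dependent number of trees explicit.

```python
# Hyperparameters of the sklearn model, keyed by GradientBoostingRegressor argument names.
SKLEARN_PARAMS = {
    "n_estimators": 2000,
    "learning_rate": 0.01,
    "max_features": 12,
    "max_depth": 7,
    "subsample": 1.0,
}

def r_gbm_params(week_n):
    """Settings of the R gbm model trained on outcomes censored at week_n weeks."""
    return {
        "distribution": "bernoulli",
        "n.trees": 2500 + week_n * 100,  # later weeks get more trees
        "n.minobsinnode": 10,
        "interaction.depth": 5,
        "shrinkage": 0.01,
        "bag.fraction": 0.75,
    }

trees_per_week = [r_gbm_params(w)["n.trees"] for w in range(1, 21)]
```

The number of trees therefore ranges from 2600 for the week-1 model to 4500 for the week-20 model. With scikit-learn installed, `SKLEARN_PARAMS` could be passed as `GradientBoostingRegressor(**SKLEARN_PARAMS)`.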

To predict outcomes in the test set:

- We first computed the number of days n between the project posted date and the test
  set cutoff (May 12, 2014)
- When using the sklearn random cutoff gbm, the number of days n was used as a
  predictor
- When using the R 20 weeks gbms, we selected the gbms that were trained with a
  number of weeks close to n/7
- We averaged the 2 solutions
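
The prediction steps above can be sketched in Python as follows. The helper names are hypothetical, and two details are assumptions: a single weekly model is picked by rounding n/7 and clamping to the range 1 to 20 (the write-up only says the selected models were "close to n/7"), and the average is an unweighted mean of the two predictions.

```python
from datetime import date

TEST_CUTOFF = date(2014, 5, 12)  # test set cutoff stated in the write-up

def days_to_cutoff(posted):
    """Number of days n between the posting date and the test set cutoff."""
    return (TEST_CUTOFF - posted).days

def closest_week(n_days, n_weeks=20):
    """Week index of the weekly model to use (assumption: nearest week, clamped)."""
    return min(max(round(n_days / 7), 1), n_weeks)

def ensemble(pred_random_cutoff, pred_weekly):
    """Unweighted average of the two models' predictions."""
    return (pred_random_cutoff + pred_weekly) / 2

n = days_to_cutoff(date(2014, 1, 1))
week = closest_week(n)
```

Under this computation a project posted on January 1, 2014 gets n = 131, which matches the 131-day window used for the random cutoffs.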

Code Description

Code to generate FG1 features

Script               Folder        Description
FG1_functions.R      main folder   Support functions.
RUN_FG1.R            main folder   Run solution for FG1 features
FG1_read_files.R     main folder   Read competition files and do simple feature
                                   transformation
FG1_cost.R           main folder   Build history of cost of teachers' past projects
FG1_outcomes.R       main folder   Build history of outcomes of teachers' past
                                   projects
FG1_received.R       main folder   Build history of donations received by
                                   teachers
FG1_donated.R        main folder   Build history of donations made by teachers
FG1_txt_proxies.R,   main folder   Build text proxies of text posted by teachers
FG1_vocab.R,
FG1_proxies.R
FG1_lapse.R          main folder   Compute lapse between projects of the same
                                   teacher
FG1_subset.R         main folder   List of FG1 features
FG1_Conso.R          main folder   Consolidates features and saves them to disk

Code to generate stacked text features

Script                Folder        Description
FG2_essay_NLP.R       main folder   Save to disk text posted by teachers
FG2_resources.R       main folder   Extract item name and vendor id of the
                                    most expensive item of a project and
                                    save to disk.
RUNNLP.R              NLP           Run stacked predictions solution for text
                                    posted by teachers and item name
GLMNETsFITS.R         NLP           Train stacked predictions solution for
                                    text posted by teachers and save model
                                    and stacked predictions to disk
GLMNETFITSitem.R      NLP           Train stacked predictions solution for
                                    item name and save model and stacked
                                    predictions to disk
_DTM_WORDS.R          NLP           Convert text into word n-grams
                                    document-term matrix
_NUMBERS.R            NLP           Convert number into text
_KFolds.R             NLP           Partition
_METRICS.R            NLP           Function to compute evaluation metrics
CV_GLMNET.R           NLP           Function to train glmnet on K folds
GLMNETsPREDICT.R      NLP           Predictions based on text posted by
                                    teachers and save predictions to disk
GLMNETPREDICTitem.R   NLP           Predictions based on item name and
                                    save predictions to disk

Code to generate FG2 features

Script                                  Folder        Description
FG2_functions.R                         main folder   Support functions.
RUN_FG2.R                               main folder   Run solution for FG2 features
FG2_essay_NLP.R                         main folder   Save to disk text posted by
                                                      teachers
FG2_donations_distance.R                main folder   Build features of relative location
                                                      of donations (received and
                                                      made by teachers)
FG2_cost_deviation.R                    main folder   Estimate a normal cost for a
                                                      project
FG2_donation_history_per_location.R     main folder   Build history of donations made
                                                      by the zip, city and state of the
                                                      project
FG2_subset.R                            main folder   List of FG2 features
FG2_Conso.R                             main folder   Consolidates features and
                                                      saves them to disk

Code to generate censored outcomes

Script               Folder         Description
fn.base.R            kddcup2014r    Support functions.
data.build.R         kddcup2014r    Build the features and saves them to
                                    disk

Code to predict

Script               Folder         Description
sci_learn_train.py   kddcup2014py   Python script to train gradient boosted
                                    trees
train.FG1.rc.R       kddcup2014r    Train random cutoff outcomes model
train.FG2.20W.R      kddcup2014r    Train 20 weeks outcomes model
train.ens.R          kddcup2014r    Average the 2 solutions and save the
                                    submission file in data/submission

How to Run the Code

1. Unzip KDD2014_DATAROBOT.zip
2. Put the competition files into the "data/input" folder.
3. Open an R session with folder "KDD2014_DATAROBOT" set as working dir.
4. Run RUN_FG1.R
5. Open an R session with folder "KDD2014_DATAROBOT/NLP" set as working dir.
6. Run RUNNLP.R
7. Open an R session with folder "KDD2014_DATAROBOT" set as working dir.
8. Run RUN_FG2.R
9. Open an R session with folder "KDD2014_DATAROBOT/kddcup2014r" set as
   working dir.
10. Run data.build.R
11. Run train.FG1.rc.R
12. Run train.FG2.20W.R
13. Run train.ens.R

The predictions will be saved in KDD2014_DATAROBOT/data/submission/ens.csv


Dependencies

To build the solution, R and Python were used. The R version used was 3.0.2, and the
Python version was 2.7.3.
The packages were:
R: SOAR 0.99-11, doSNOW 1.0.9, foreach 1.4.1, cvTools 0.3.2, data.table 1.8.10,
Matrix 1.1-4, tau 0.0-15, RTextTools 1.4.1, glmnet 1.9-5, gbm 2.1
Python: pandas 0.13.1, numpy 1.8.1, scikit-learn 0.15.0

All the listed versions are the ones we used. The code will probably work with newer
versions, but this was not tested.

Additional Comments

The time bias present in the test set made predictions very challenging.
We chose to trust our solutions based on censored outcomes rather than solutions using
the raw response and a linear decay to adjust the submission.
Based on other competitors' feedback and our own (unselected) submissions, models with
a more aggressive time decay performed better on the Private Leaderboard (our highest
score among unselected submissions reached 0.685).
This could be explained by either a seasonality that we didn't capture in our models or
some additional censoring done by Kaggle in the test set.

References

J. Friedman, Greedy Function Approximation: A Gradient Boosting Machine, The Annals
of Statistics, Vol. 29, No. 5, 2001.

J. Friedman, Stochastic Gradient Boosting, 1999.

T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning, 2nd ed.,
Springer, 2009.