1 File List
1. sample construction.py (author Keith Herring):
   A Python script for initializing the training samples from the data set of raw user edit histories. It initializes each sample as a time-limited record of a single user's edit history.
2. feature construction.py (author Keith Herring):
   A Python script for converting the raw edit history of a time-limited user sample into the 206-element feature vector that the edits predictor operates on.
3. random forest (GPL licensed):
   Matlab CMEX implementation of the Breiman-Cutler Random Forest Regression Algorithm.
4. edits learner.m (author Keith Herring):
   Matlab-implemented algorithm for learning a suite of weak-to-strong future-edit models.
5. ensemble optimizer.m (author Keith Herring):
   Matlab-implemented algorithm for finding the optimal model weights for the ensemble future edits predictor.
6. edits predictor.m (author Keith Herring):
   Matlab-implemented algorithm that predicts the future edits for a user as a function of its associated edit-history-derived 206-element feature vector (a sketch of this prediction step follows the list).
7. models.mat (author Keith Herring):
   Matlab data file containing the 34 decision tree models in the final ensemble.
8. training statistics.mat (author Keith Herring):
   Matlab data file containing the training population means and standard deviations for the 206 features.
9. ensemble weights.mat (author Keith Herring):
   Matlab data file containing the weights for the 34 models in the final ensemble.
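The prediction pipeline implied by files 6-9 is: standardize the 206-element feature vector with the training-population statistics, evaluate each of the 34 tree models, and combine their outputs with the learned ensemble weights. The following is a minimal Python sketch of that pipeline under those assumptions; the actual implementation is in Matlab, and the function name and the scikit-learn-style predict() interface here are hypothetical, not the original code.

    import numpy as np

    def predict_future_edits(features, models, means, stds, weights):
        # Hypothetical Python re-creation of edits_predictor.m.
        # features:    length-206 raw feature vector for one user
        # models:      the 34 decision-tree regressors (models.mat)
        # means, stds: training-population statistics (training_statistics.mat)
        # weights:     the 34 ensemble weights (ensemble_weights.mat)

        # Standardize with the *training* means/standard deviations
        z = (np.asarray(features, dtype=float) - means) / stds

        # Evaluate every tree model on the standardized vector; each
        # model is assumed to expose a predict() method
        preds = np.array([m.predict(z.reshape(1, -1))[0] for m in models])

        # Weighted combination of the 34 model outputs
        return float(np.dot(weights, preds))

Storing the training-set means and standard deviations in a separate file (training statistics.mat) ensures that test-time feature vectors are scaled exactly as they were during training.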
2 Raw Data
An interesting aspect of this challenge was that it involved public data. As such, there was an opportunity to improve one's learning capability by obtaining additional data not included in the base data set (training.tsv). Given this setup, the first step of my solution was to write a web scraper to obtain additional pre-Sept 1, 2010 data (denoted in timestamp format 2010-09-01 in subsequent text) for model training. More specifically, I wanted to obtain a larger, more representative sample of user edit histories, as well as additional fields not included in the original training set. In total I gathered pre-2010-09-01 edit histories for approximately 1 million Wikipedia editors. The following attributes were scraped for each user (one possible API-based re-creation of the scraper is sketched after the list):
1. The timestamp at which a user was blocked. NULL if not blocked or blocked after Aug 31, 2010.
2. For each pre-2010-09-01 user edit, the following attributes were scraped:
   (a) The title of the article edited.
   (d) The namespace of the edited article: 0-5. All other namespaces were discarded, although an interesting extension would be to include the >5 namespaces to test if they provide useful information on the editing volume over lower namespaces.
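The report does not describe how the scraper itself was implemented. For reference, the same per-edit attributes (title, timestamp, namespace) can be pulled from Wikipedia's public MediaWiki API via its usercontribs list, bounded to pre-2010-09-01 revisions. Below is a minimal sketch under that assumption; the function name and parameter choices are illustrative, not the original code.

    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def scrape_user_edits(username, cutoff="2010-09-01T00:00:00Z"):
        # Illustrative only: fetch a user's pre-cutoff edits (title,
        # timestamp, namespace) from the MediaWiki usercontribs API.
        params = {
            "action": "query",
            "list": "usercontribs",
            "ucuser": username,
            "ucdir": "older",             # walk from newest to oldest
            "ucstart": cutoff,            # begin at the cutoff, going back
            "ucprop": "title|timestamp",  # 'title' also returns the ns field
            "uclimit": "500",
            "format": "json",
        }
        edits = []
        while True:
            data = requests.get(API, params=params).json()
            edits.extend(data["query"]["usercontribs"])
            if "continue" not in data:       # no further pages of results
                break
            params.update(data["continue"])  # resume where the last page ended
        # Keep only namespaces 0-5, as in the report
        return [(e["title"], e["timestamp"], e["ns"])
                for e in edits if 0 <= e["ns"] <= 5]

The per-user block timestamp is similarly available from the public block log (action=query, list=logevents, letype=block).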