Published by: Yang on Nov 18, 2011
Copyright:Attribution Non-commercial







Wikipedia Participation Challenge Solution
Keith T. Herring
User Name: Ernest Shackleton
Team: Ernest Shackleton
October 1, 2011
This document describes my (Keith Herring) solution to the Wikipedia Participation Challenge. I can be reached at keith.herring@gmail.com or kherring@mit.edu. I appreciate the important contributions Wikipedia has made to the accessibility of information, so I hope this analysis will be useful to their cause. Thanks.
1 File List
2 Raw Data
2.1 Namespace Classifier Bug
3 Training Set Construction
4 Feature Construction
5 Sample Bias Correction
6 Model Learning/Training
6.1 Standard Random Forest Learning
6.2 Future Edits Learning Algorithm
7 Conclusions and Interpretation
1 File List
1. sample_construction.py (author Keith Herring): A Python script for initializing the training samples from the data set of raw user edit histories. It initializes each sample as a time-limited record of a single user’s edit history.
2. feature_construction.py (author Keith Herring): A Python script for converting the raw edit history of a time-limited user sample into the 206-element feature vector that the edits predictor operates on.
3. random forest (GPL licensed): Matlab CMEX implementation of the Breiman-Cutler Random Forest Regression Algorithm.
4. edits_learner.m (author Keith Herring): Matlab-implemented algorithm for learning a suite of weak-to-strong future-edit models.
5. ensemble_optimizer.m (author Keith Herring): Matlab-implemented algorithm for finding the optimal model weights for the ensemble future edits predictor.
6. edits_predictor.m (author Keith Herring): Matlab-implemented algorithm that predicts the future edits for a user as a function of the 206-element feature vector derived from that user’s edit history.
7. models.mat (author Keith Herring): Matlab data file containing the 34 decision tree models in the final ensemble.
8. training_statistics.mat (author Keith Herring): Matlab data file containing the training population means and standard deviations for the 206 features.
9. ensemble_weights.mat (author Keith Herring): Matlab data file containing the weights for the 34 models in the final ensemble.
2 Raw Data
An interesting aspect of this challenge was that it involved public data. As such there was opportunity to improve one’s learning capability by obtaining additional data not included in the base data set (training.tsv). Given this setup, the first step of my solution was to write a web scraper for obtaining additional pre-Sept 1, 2010 (denoted in timestamp format 2010-09-01 in subsequent text) data for model training. More specifically, I wanted to obtain a larger, more representative sample of user edit histories and also additional fields not included in the original training set. In total I gathered pre-2010-09-01 edit histories for approximately 1 million Wikipedia editors. The following attributes were scraped for each user:

1. Blocked Timestamp: The timestamp at which a user was blocked. NULL if not blocked or blocked after Aug 31, 2010.
2. Pre-2010-09-01 Edits: For each pre-2010-09-01 user edit the following attributes were scraped:
(a) Edit ID
(c) Article Title: The title of the article edited.
(d) Namespace: 0-5. All other namespaces were discarded, although an interesting extension would be to include the >5 namespaces to test if they provide useful information on the editing volume over lower namespaces.
(e) New Flag: Whether or not the edit created a new article.
(f) Minor Flag: Whether or not the edit was marked as “minor” by the user.
(g) Comment Length: The length of the edit comment left by the user.
(h) Comment Auto Flag: Whether the comment was automatically generated. I defined this as comments that contained particular tags associated with several automated services, e.g. mw-redirect.
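The two comment-derived attributes can be illustrated with a minimal Python sketch. This is not the scraper's actual code: the function name, the dictionary keys, and the substring test are assumptions, and the tag tuple contains only mw-redirect because that is the one tag named above.

```python
# Only "mw-redirect" is named in the text; any further tags would be assumptions.
AUTO_TAGS = ("mw-redirect",)


def comment_features(comment, auto_tags=AUTO_TAGS):
    """Return the comment length and a flag for automatically generated comments.

    A comment is flagged as automatic if it contains any of the given tags
    as a substring (an assumed matching rule, for illustration only).
    """
    text = comment or ""  # treat a missing comment as empty
    return {
        "comment_length": len(text),
        "comment_auto": any(tag in text for tag in auto_tags),
    }
```

Passing the tag list as a parameter keeps the flag definition easy to extend if further automated-service tags are identified.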
2.1 Namespace Classifier Bug
I noted during this scraping process a bug in the original training data. Specifically, articles whose title started with a namespace keyword were incorrectly classified as being in that namespace because the regexp wasn’t checking for the colon. As a result, some namespace 0-5 edits were being considered as namespace >5, and thus left out of the training set. I’m not sure if this bug was introduced during the construction of the archival dumps or the training set itself.
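The kind of mistake described here can be reproduced with a short Python sketch. The original regexp is not shown in this document, so both patterns below are hypothetical reconstructions, and the keyword list is an illustrative subset rather than the real namespace table.

```python
import re

# Hypothetical subset of namespace keywords, for illustration only.
NAMESPACE_KEYWORDS = ["Talk", "User", "User talk", "Wikipedia", "Wikipedia talk"]

# Try longer keywords first so "User talk" is preferred over "User".
_alternation = "|".join(
    re.escape(k) for k in sorted(NAMESPACE_KEYWORDS, key=len, reverse=True)
)

# Buggy variant: any title that merely starts with a keyword matches.
BUGGY = re.compile(rf"^({_alternation})")
# Fixed variant: the keyword must be immediately followed by a colon.
FIXED = re.compile(rf"^({_alternation}):")


def namespace_prefix(title, pattern):
    """Return the namespace keyword matched at the start of title, or None."""
    m = pattern.match(title)
    return m.group(1) if m else None
```

With these patterns, a main-namespace article titled "User interface" is assigned to the "User" namespace by the buggy pattern and to no namespace prefix by the fixed one.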
3 Training Set Construction
A single training sample can be constructed by considering a single user’s edit history ending at any point before 2010-09-01. I used the following strategy for assembling a set of training samples from the raw data described above:

1. Start at an initial end date of 153 days before 2010-09-01, i.e. April 1, 2010.
2. Create a sample from each user that has at least one edit during the year prior to the end date. This is to be consistent with the sampling strategy employed by Wikipedia in constructing the base training set.
3. Now move the end date back 30 days and repeat.

Repeating the above process for 32 offsets I obtained a training set with approximately 10 million samples, i.e. time-limited user edit histories.
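The three steps above can be sketched in Python as follows. This illustrates the procedure and is not the contents of the sample construction script. The function names and the input layout (a dict mapping each user to a list of edit dates) are assumptions, as are the 365-day reading of "the year prior" and the treatment of the end date as exclusive.

```python
from datetime import date, timedelta

CUTOFF = date(2010, 9, 1)  # no data on or after this date is used


def sample_end_dates(initial_gap_days=153, step_days=30, n_offsets=32):
    """End dates for offsets 0..n_offsets-1, stepping back 30 days each time."""
    first_end = CUTOFF - timedelta(days=initial_gap_days)  # April 1, 2010
    return [first_end - timedelta(days=step_days * k) for k in range(n_offsets)]


def build_samples(edit_dates_by_user, end_dates, active_window_days=365):
    """Create one (user, end_date, history) sample per active user and end date.

    A user is active for an end date if they have at least one edit in the
    active_window_days before it (assumed: 365 days, end date exclusive).
    """
    samples = []
    for end in end_dates:
        window_start = end - timedelta(days=active_window_days)
        for user, edit_dates in edit_dates_by_user.items():
            history = [d for d in edit_dates if d < end]  # time-limited history
            if any(d >= window_start for d in history):
                samples.append((user, end, history))
    return samples
```

Because the same user can qualify at many offsets, one raw edit history can contribute several time-limited samples, which is how roughly 1 million editors can yield roughly 10 million samples.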
4 Feature Construction
Given that a user’s edit history is a multi-dimensional time series over continuous time, it is necessary for tractability to project onto a lower-dimensional feature space, with the goal of retaining the relevant information in the time series with respect to future editing behavior. A priori my intuition was that many distinct features of a user’s edit time series may play a role in influencing future edit behavior. My strategy then was to construct a large number of features to feed into my learning algorithm to ensure most information would be available to the learning process. My final solution operated on the following features, which I constructed from the raw edit data described above:

1. The number of days between a user’s first edit and the end of the observation period (e.g. April 1, 2010 = X for offset 0, X - 30 days for offset 1, etc.). Both linear and log scale.
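A minimal Python sketch of this first feature follows. The function name and the input format are assumptions. The text does not specify the exact log transform, so log(1 + days) is used here only as one plausible choice that stays defined when the first edit falls on the end date.

```python
import math
from datetime import date


def days_since_first_edit(edit_dates, end_date):
    """Return (days, log-scaled days) between the first edit and end_date.

    The log-scaled value uses log(1 + days), an assumed transform.
    """
    first = min(edit_dates)
    days = (end_date - first).days
    return days, math.log1p(days)
```

For example, a user whose first edit is on March 2, 2010 has a linear value of 30 days at the offset-0 end date of April 1, 2010.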
