
What do you know?

- A latent-feature approach for Kaggle's GrockIt challenge

Rohan Anil ranil@cs.ucsd.edu University of California, San Diego, Department of Computer Science & Engineering, La Jolla, CA 92092 USA March 19, 2012

Abstract
We describe our efforts in solving the GrockIt competition in this report. The goal of the competition was to improve the state of the art in student evaluation; the task was to predict whether a student will answer the next test question correctly. The dataset was provided by GrockIt, an online platform where students practice questions for competitive exams. At the end of the competition there were a total of 252 teams, 581 individuals and 1803 submissions. We train latent feature log-linear (LFL) models for the prediction task, and exploit the rich meta-information associated with questions and the dyad to improve the performance of the models. Finally, we explore different techniques to blend the predictions from multiple models. The competition used binomial capped deviance (BCD) as the metric to rank teams. Our team, UCSD Triton, was placed at rank 4th on the public leaderboard with a BCD of 0.24665 and at rank 5th on the final private leaderboard with a BCD of 0.24792.

1. Introduction
Kaggle.com is a data-mining competition platform. We participated in one of its competitions, titled "What do you know?", aimed at improving the state of the art in student evaluation, and finished at rank 5th. The competition dataset was provided by GrockIt, an online learning platform. GrockIt provides tools to help students prepare for competitive exams such as the GMAT, ACT and SAT. The dataset mainly contains performance information of students on various questions.

Inspired by the success of latent-feature methods on the Netflix prize challenge (Töscher et al., 2009), we investigate whether latent-feature methods are competitive for this task. To improve our leaderboard rank, we explore two different ensemble learning techniques, namely linear regression and gradient boosted decision trees. The report is organized as follows: Section 2 introduces the task of dyadic prediction, the dataset available for the competition, the metric used to rank the teams, and the generic latent feature log-linear (LFL) model (Menon & Elkan, 2010). Section 3 formulates the student-evaluation task as a dyadic-prediction task and derives the stochastic gradient update rules for the LFL model. Section 4 describes the two techniques we used for ensembling and their results on the dataset, and finally we conclude in Section 5.

2. GrockIt-Kaggle dataset
The dataset contains student responses to various questions. There are a total of 179,107 users and 6,046 questions in the training set. The dataset is similar to a typical dyadic dataset, with a couple of key differences: i) duplicate dyad pairs can exist in the training set with different outcomes, since a student can answer a question many times; ii) in some game types, students can answer questions collaboratively. The value of the feature "number of players" is the number of students answering the question. GrockIt provides a chat-box for the students to discuss the answer. If the question is a multiple-choice question, students can leave comments on the choices. The next question is not displayed until everyone has answered the current one. We interpret each pair of student and question as a dyad:

s ∈ S: students, q ∈ Q: questions, y ∈ O: outcomes

(s, q) → y_(s,q)
MS Project: Competing on Kaggle in GrockIt's "What do you know?"

Training set T = {((s, q), y)}

The question side-information is listed in Table 1, and the student-question interaction side-information in Table 2.

Table 1. Side-information associated with a question.
Type          Description
QuestionType  i) Multiple Choice, ii) Free Response
Group         i) ACT, ii) GMAT, iii) SAT
Track         9 types, listed in the appendix
Subtrack      15 types
Tags          each question is tagged with the subjects it is from; finer granularity than Subtrack
Question set  questions which share a question-set id share similarity in presentation on the screen

Figure 1. GrockIt-Kaggle dataset

The training set contains a total of 4,851,476 dyad pairs, each with its corresponding label, which is the outcome. The outcome can be of four types: i) correct, ii) incorrect, iii) skipped and iv) timed-out. The validation set contains 80,075 dyads of 80,075 unique students, and the test set contains 93,100 dyads of 93,100 unique students. The validation-set students are a subset of the test-set students; the students which appear in the test set but not in the validation set are those with only a few dyads in the training set. The validation set has later outcomes relative to the training set, and the test set has later outcomes relative to the validation set. Our task is to predict the probability of a correct outcome for every dyad in the test set:

Pr(y = correct | (s, q) ∈ Test Set)

The validation and test sets do not contain any skipped or timed-out outcomes. The competition uses 40% of the test data to rank teams on the public leaderboard and 60% of the test data for the final ranking on the private leaderboard, which was only revealed at the end of the competition. This measure was used by the organizers to prevent overfitting.

Table 2. Side-information for a dyad.
Type               Description
Game               12 types of games
Number of players  number of players in the game
Started            date and time the question was seen by the student
Answered-at        date and time the question was answered by the student
Deactivated        date and time the question was cleared from the screen

2.3. Preprocessing the dataset
Our task is to predict the probability of a correct outcome for the dyads in the test set. The test set is guaranteed to contain only correct and incorrect outcomes, while the training set also contains dyads with skipped and timed-out outcomes, as illustrated in Figure 2. We create two training sets from the original set. In the first training set, we exclude all dyads with a skipped or timed-out outcome. In the second training set, we include all dyads but treat the skipped and timed-out outcomes as incorrect outcomes.

2.4. Binomial Capped Deviance
The competition uses binomial capped deviance (BCD) as the metric to rank the teams. The metric is similar to the negative log-likelihood. For a binary outcome, let Pr(y = 1 | (s, q) ∈ T) be the predicted probability of a correct (1) outcome for the dyad (s, q) observed in the training set T. The BCD of the training set is then

2.1. Baseline - Rasch Model
A baseline was provided by Kaggle for the dataset. The baseline uses the Rasch model (Rasch, 1960). Rasch models are widely used in educational psychology research. The prediction from the model is

Pr((s, q) → 1) = exp(β_s − δ_q) / (1 + exp(β_s − δ_q))

where β_s is interpreted as the ability of student s and δ_q is interpreted as the difficulty of question q.

2.2. Side Information
The dataset contains two types of side-information: question side-information, listed in Table 1, and student-question interaction side-information, listed in Table 2.
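As a concrete reference point, the Rasch baseline prediction above can be sketched in a few lines of Python (a minimal illustration; the ability and difficulty values below are hypothetical, not parameters fitted to the GrockIt data):

```python
import math

def rasch_prob(ability: float, difficulty: float) -> float:
    """Rasch model: probability that a student answers a question correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical values: a slightly strong student on an average question.
p = rasch_prob(ability=0.5, difficulty=0.0)
assert 0.5 < p < 1.0
```

When ability equals difficulty, the model predicts exactly 0.5, which matches the logistic form of the equation above.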

MS Project:Competing on Kaggle in GrockIts What do you know ?


BCD(T) = −(1/N_T) · Σ_{((s,q),y) ∈ T} [ y·log(p_c) + (1 − y)·log(1 − p_c) ]

where p_c = max(0.01, min(0.99, Pr(y = 1 | (s, q) ∈ T))) is the capped prediction and N_T is the number of dyads in T.

Figure 2. Histogram of outcomes in the training set (correct, incorrect, skipped, timed-out).

Figure 3. Division of questions by type (multiple choice vs. free response).

Figure 4. Number of dyads in the training set for each group (ACT, GMAT, SAT).

Figure 5. Number of dyads by number of players (# = 1 vs. # >= 2).

3. Dyadic Prediction
A dyadic prediction task is a learning task which involves predicting a class label for a pair of items (Hofmann et al., 1999):

(u, i) → y_(u,i), where u ∈ U, i ∈ I and y ∈ Y

The task involves a training set of pairs (u, i), each with its corresponding label y, i.e. T = {((u, i), y)}. Sometimes more information is available in the dataset, i.e. information associated with each item of the dyad and information associated with the dyadic pair. This information can be processed into an explicit feature vector, which is termed the side-information.

3.1. Latent feature log-linear model
For this competition, we need to predict the probability of a correct outcome for the dyadic pairs in the test set, and we want to leverage the side-information available in the dataset. This motivates the use of the latent feature log-linear (LFL) model for dyadic prediction (Menon & Elkan, 2010). LFL can predict well-calibrated probabilities and can incorporate side-information in the training process. Let ((u, i), y) with y ∈ Y, where |Y| > 2, i.e. a multi-class problem. The model is

Pr(y | (u, i)) ∝ exp(U_u^y · I_i^y)

where U^y is a (number of items of type u) × k matrix, I^y is a (number of items of type i) × k matrix, k is the number of latent features, U_u^y is the u-th row vector of matrix U^y, and I_i^y is the i-th row vector of matrix I^y. We predict the label which has the highest probability according to the model:

ŷ = arg max_y Pr(y | (u, i))

We train the LFL model using the negative log-likelihood as the objective function and use stochastic gradient descent (SGD) to learn the model parameters. The negative log-likelihood (LL) is

LL = − Σ_{((u,i),y) ∈ T} log Pr(y | (u, i)) = Σ_{((u,i),y) ∈ T} ll_(u,i)

In SGD, the contribution of each example ((u, i), y) ∈ T to the negative log-likelihood is ll_(u,i) = −log(p), where p = Pr(y | (u, i)). Its derivatives are

∂ll/∂U_u^y = −(1/p) · ∂p/∂U_u^y = −(1 − p) · I_i^y,  and similarly  ∂ll/∂I_i^y = −(1 − p) · U_u^y

The derivatives with respect to U_u^y' and I_i^y', where y' ∈ Y and y' ≠ y, are

∂ll/∂U_u^y' = p_(u,i)^y' · I_i^y',  and similarly  ∂ll/∂I_i^y' = p_(u,i)^y' · U_u^y'

3.2. LFL on the Kaggle-GrockIt dataset
The test set contains only two outcomes, i) correct and ii) incorrect, for which we use the binary LFL model. The binary case can be written as

Pr(y | (s, q)) = exp(S_s^y · Q_q^y) / Σ_{y'} exp(S_s^y' · Q_q^y')

where y = 1 for a correct response and y = 0 for an incorrect response. We can fix S^0 and Q^0 to be zero, i.e. keep class 0 as the base class, which gives

Pr(y = 1 | (s, q)) = 1 / (1 + exp(−S_s^1 · Q_q^1))

The binary LFL model has appeared in the literature before (Schein et al., 2003; Agarwal & Chen, 2009). We use stochastic gradient descent to train this model.

3.3. SGD Training Objective
After adding regularization terms to the objective function we get

Objective = Σ_{((u,i),y) ∈ T} ( −log(p_(u,i)^y) + (λ/2)·||U_u||² + (λ/2)·||I_i||² )

The update rules used in the SGD algorithm to minimize the objective function are

U_u^y ← U_u^y − η · ( ∂ll/∂U_u^y + λ·U_u^y )
I_i^y ← I_i^y − η · ( ∂ll/∂I_i^y + λ·I_i^y )

In the SGD algorithm, we do not randomize the dataset. We order the questions based on time and run stochastic gradient descent, so that the model adapts to recently answered questions.

Algorithm 1 Stochastic Gradient Descent
Input: dyads ((s, q), y) ∈ TrainingSet; EL: epoch limit; λ: regularization; η: learning rate; k: latent feature size
previous-bcd = DOUBLE_MAX
for epoch = 1 to EL do
  for each ((s, q), y) do
    // Update latent vectors
    S_s^1 = S_s^1 − η·((p_y − y)·Q_q^1 + λ·S_s^1)
    Q_q^1 = Q_q^1 − η·((p_y − y)·S_s^1 + λ·Q_q^1)
  end foreach
  current-bcd = BCD on the validation set
  if current-bcd > previous-bcd then
    break
  else
    previous-bcd = current-bcd
  end if
  η = 0.99 · η
end for

3.4. Parallelism
The updates of the stochastic gradient descent algorithm are independent for dyads (u, i) and (u', i') where u ≠ u' and i ≠ i'. We exploit this parallelism by splitting the input training set into non-overlapping blocks. We observed this parallelism while competing in the KDD Cup 2011, and later found that the same observation was made independently by another group (Gemulla et al., 2011). The threads process a set of blocks such that no two blocks share the same column or row index, as shown in Figure 6. Figure 7 shows the time taken for an epoch of LFL versus the number of cores, using a latent feature size of 5. The experiments were run on an Intel Core i5-450M processor.

Figure 6. Parallelism

3.5. LFL Models
In the GrockIt-Kaggle dataset, side-information is available for questions as group, track, sub-track, question-type and tags. All of the available side-information is categorical in nature. Although LFL is a powerful model that can leverage any type of side-information, we only experimented with the models in Table 3 during the competition. The following section describes how to add side-information for a categorical variable, group, to the LFL model.

3.6. Adding Side-Information to the LFL model
For a question q, let g = group(q). We add a latent vector for each group and let the prediction equation be

Pr(y = 1 | (s, q)) = 1 / (1 + exp(−S_s^1 · (Q_q^1 + G_g^1)))

The update rules are

S_s^1 = S_s^1 − η·((p_y − y)·(Q_q^1 + G_g^1) + λ·S_s^1)
Q_q^1 = Q_q^1 − η·((p_y − y)·S_s^1 + λ·Q_q^1)
G_g^1 = G_g^1 − η·((p_y − y)·S_s^1 + λ·G_g^1)

where S^1 is a (number of students) × k matrix, Q^1 is a (number of questions) × k matrix, and G^1 is a (number of groups) × k matrix.

3.7. Results
As discussed in the previous section, the training set contains four types of outcomes: i) correct, ii) incorrect, iii) skipped and iv) timed-out. We train the LFL models on the training set after excluding skipped and timed-out outcomes (results in Table 4), and on the entire training set by treating skipped and

Figure 7. Time per epoch of LFL vs. number of cores

timed-out responses as incorrect responses (results in Table 5). All model parameters were tuned by grid search. The test predictions were obtained after re-training each model on both the training and validation sets with the tuned parameters. The results on 60% of the test set became available only after the end of the competition.

Table 3. LFL models.
No.  Prediction p((s, q) → 1)                           Description
1.   1/(1 + exp(−S_s^1·Q_q^1))                          Basic LFL model
2.   1/(1 + exp(−S_s^1·(Q_q^1 + G_g^1)))                G: Group
3.   1/(1 + exp(−S_s^1·(Q_q^1 + T_t^1)))                T: Track
4.   1/(1 + exp(−S_s^1·(Q_q^1 + ST_st^1)))              ST: Subtrack
5.   1/(1 + exp(−S_s^1·(Q_q^1 + QT_qt^1)))              QT: Question Type
6.   1/(1 + exp(−S_s^1·(Q_q^1 + GT_gt^1)))              GT: Game Type
7.   1/(1 + exp(−S_s^1·(Q_q^1 + G_g^1 + ST_st^1)))      Group & Subtrack
8.   1/(1 + exp(−S_s^1·(Q_q^1 + G_g^1 + GT_gt^1)))      Group & Game Type
9.   1/(1 + exp(−S_s^1·(Q_q^1 + ST_st^1 + GT_gt^1)))    Subtrack & Game Type
10.  1/(1 + exp(−S_s^1·(Q_q^1 + ST_st^1 + QT_qt^1)))    Subtrack & Question Type

Table 4. Results (BCD) on the training set after excluding skipped and timed-out responses.
Model    Validation  Test(40%)  Test(60%)
Rasch    —           0.25663    0.25766
LFL-1    0.252465    0.25398    0.25483
LFL-2    0.252250    0.25343    0.25451
LFL-3    0.251941    0.25340    0.25450
LFL-4    0.251842    0.25288    0.25389
LFL-5    0.252021    0.25300    0.25426
LFL-6    0.251819    0.25296    0.25446
LFL-7    0.252014    0.25328    0.25433
LFL-8    0.251916    0.25310    0.25458
LFL-9    0.251561    0.25266    0.25423
LFL-10   0.251802    0.25291    0.25389
Average  0.251320    0.25250    0.25362

Table 5. Results (BCD) on the entire training set with skipped and timed-out responses treated as incorrect.
Model    Validation  Test(40%)  Test(60%)
LFL-1    0.259718    0.25784    0.25921
LFL-2    0.259337    0.25719    0.25869
LFL-3    0.258626    0.25668    0.25799
LFL-4    0.258597    0.25654    0.25795
LFL-5    0.259242    0.25692    0.25847
LFL-6    0.258606    0.25641    0.25810
LFL-7    0.258980    0.25683    0.25832
LFL-8    0.259162    0.25700    0.25874
LFL-9    0.258639    0.25646    0.25822
LFL-10   0.258897    0.25665    0.25821
Average  0.258363    0.25624    0.25778

4. Ensemble Learning
The main motivation for ensemble learning in this competition was that no single model performs well on every dyad (Takács et al., 2009). Combining the predictions from multiple models can outperform each of the individual models. In the following sections we describe different techniques to combine predictions. The simplest technique is linear regression. Define a matrix P where p_{i,j} is the prediction for dyad i using model j. Linear regression then learns a weight vector w such that P·w ≈ Y, where Y is a column vector and Y_i is the true label of dyad i. The prediction for dyad i is Σ_j p_{i,j}·w_j.

To avoid overfitting, we tune the parameters for each of the models using a held-out set. To achieve the best performance on the test set, it is advisable to create a held-out set which is similar to the test set. For the competition, we trained the LFL models on the training set and predicted on the validation set. We cross-validate, i.e. tune the learning rate η, the regularization λ and the number of epochs, by treating the validation set as the held-out set. The validation-set predictions are then used for linear regression (ensemble learning). We re-train each LFL model on both the training set and the validation set using the tuned parameters to generate the predictions for the test set. The predictions from all the models and the weights learned from linear regression are used for the final
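The linear-regression blend can be sketched with a least-squares solve (a minimal sketch; the three synthetic "model" prediction columns below are stand-ins for the LFL validation-set predictions, not real competition outputs):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 500, 3                          # dyads, models
y = (rng.random(n) < 0.6).astype(float)
# Synthetic model predictions: noisy functions of the truth, standing in for LFL outputs.
P = np.clip(y[:, None] * 0.6 + 0.2 + 0.1 * rng.normal(size=(n, m)), 0.01, 0.99)

w, *_ = np.linalg.lstsq(P, y, rcond=None)   # solve P w ~= y in the least-squares sense
blend = np.clip(P @ w, 0.01, 0.99)          # capped, matching the BCD metric

def bcd(p, y):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# The blend should not be worse than any single model on this data.
assert bcd(blend, y) <= min(bcd(P[:, j], y) for j in range(m))
```

Least squares optimizes squared error rather than BCD directly, but averaging out the independent errors of the individual models is what drives the improvement here.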


test set predictions. The results from linear regression are listed in Table 6.

Table 6. Results (BCD) from linear regression. i) WITHOUT: trained without skipped and timed-out responses; ii) WITH: trained with skipped and timed-out responses treated as incorrect.
Description  Validation  Test(40%)  Test(60%)
Without      0.25114     0.25195    0.25311
With         0.256500    0.25599    0.25770
Combined     0.251025    0.25179    0.25300

4.1. Gradient Boosted Decision Trees
The main motivation for using Gradient Boosted Decision Trees (GBDT) (Friedman, 1999) is to include side-information in the ensemble learning. GBDT is a powerful learning algorithm that is widely used (see Li & Xu, 2009, chap. 6). The core of the algorithm is a decision tree learner. A decision tree is illustrated in Figure 8. Internal nodes are decision functions that select the sub-tree to send an example through; the leaves are nodes holding a small set of training examples, which is used for the final prediction. Decision trees can capture interactions between features and can handle both numeric and categorical features: for a numeric feature we find a split value, and for a categorical feature a split based on a subset of categories, that minimizes a particular criterion. For the regression task, the sum of squared errors is generally used as the criterion. A decision function is learned at every node so that it partitions the dataset into two, and this process is recursively applied until a leaf node has only a few examples. The decision function at the root node is learned using the entire dataset. We can restrict the depth of the decision tree using a parameter d, which halts the recursive tree building at that depth. An example to be predicted is fed into the decision tree, and the decision function at every node sends it towards a particular sub-tree until it reaches a leaf node containing a set of training examples; for the regression task, the prediction is the average of the target regression values of that set.

Figure 8. Decision tree. X1 to X10 are training examples.

In gradient boosting, we select a base learner and a loss function. For this competition we used decision trees as the base learner and squared error as the loss function. Gradient boosting is an iterative procedure in which we fit a base learner to the gradients from the previous iteration:

f(x) = Σ_{j=0}^{J} ρ_j · T_j(x)    (1)

where T_j(·) is a decision tree.

Algorithm 2 Gradient Boosting
Initialize: f_0 = T_0 = Count(outcome == correct) / Total, ρ_0 = 1
Input: L(y_i, f_{j−1}(x_i)) = (y_i − f_{j−1}(x_i))², training data (x_i, y_i)
for j = 1 to J do
  (a) ỹ_i = −∂L(y_i, f_{j−1}(x_i)) / ∂f_{j−1}(x_i)  for i = 1..n
  (b) Fit tree T_j(·) to (x_i, ỹ_i) for i = 1..n
  (c) ρ_j = arg min_ρ Σ_i L(y_i, f_{j−1}(x_i) + ρ·T_j(x_i))
  (d) f_j = f_{j−1} + ρ_j·T_j(x)
end for

4.2. Application of GBDTs
The input feature vector of a training example contains the predictions of the LFL models, the side-information associated with the question, interaction features between users and questions, and meta-features that we compute from the statistics of the dataset. We can add a shrinkage parameter ν, 0 < ν ≤ 1, to step (d), which results in the update f_j = f_{j−1} + ν·ρ_j·T_j(x). The side-information and meta-information we add as features in ensemble learning are listed in Table 12 and Table 13 in the appendix. For each question, the
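Algorithm 2, with the shrinkage ν of Section 4.2, can be sketched with depth-1 regression trees (a minimal self-contained sketch on synthetic data; a real run would use a full decision-tree learner and tuned N, ν and d; for squared error the negative gradient is just the residual, and the stump's leaf means already solve the line search of step (c), so ρ_j is folded into the tree):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
X = rng.uniform(-1, 1, size=(n, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # synthetic binary target

def fit_stump(X, r):
    """Depth-1 regression tree: best single-feature threshold split on residuals r."""
    best = (np.inf, 0, 0.0, r.mean(), r.mean())
    for f in range(X.shape[1]):
        for t in np.quantile(X[:, f], np.linspace(0.1, 0.9, 9)):
            left = X[:, f] <= t
            if left.all() or (~left).all():
                continue
            lv, rv = r[left].mean(), r[~left].mean()
            sse = ((r[left] - lv) ** 2).sum() + ((r[~left] - rv) ** 2).sum()
            if sse < best[0]:
                best = (sse, f, t, lv, rv)
    _, f, t, lv, rv = best
    return lambda Z: np.where(Z[:, f] <= t, lv, rv)

nu = 0.3                        # shrinkage parameter from step (d)
F = np.full(n, y.mean())        # f_0: ratio of correct outcomes
trees = []
for _ in range(30):
    r = y - F                   # negative gradient of squared error = residual
    tree = fit_stump(X, r)
    trees.append(tree)
    F += nu * tree(X)           # f_j = f_{j-1} + nu * T_j(x)

mse = np.mean((y - F) ** 2)
assert mse < np.mean((y - y.mean()) ** 2)   # boosting improves on the constant model
```

Each round fits a weak tree to what the current ensemble still gets wrong, so the ensemble's squared error drops steadily with the number of rounds.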


dataset contains a set of tags. Each tag corresponds to a subject that the question is from. We first manually merge the duplicated tags. Then we cluster the tags using spectral clustering (Ng et al., 2001), with the normalized co-occurrence of tags as the similarity measure, to generate the affinity matrix A.

Algorithm 3 Spectral Clustering
(a) A(i, j) = (1/|Q|) · Σ_{k=1}^{|Q|} I(tag_i, tag_j ∈ Tags(q_k))
(b) D(i, i) = Σ_j A(i, j)
(c) NL = D^(−1/2) · A · D^(−1/2), NL ∈ R^(n×n)
(d) V ∈ R^(n×k) contains the first k eigenvectors of NL
(e) U(i, j) = V(i, j) / (Σ_k V(i, k)²)^(1/2)
(f) Cluster the rows of U using k-means; the cluster c_i of row i is the cluster for tag_i.

Table 8. Results (BCD) from GBDT with side-information, i) meta-features in Table 13, and ii) temporal meta-features in Table 14.
Param (N, ν, d)  Validation  Test(40%)  Test(60%)
200, 0.1, 3      0.24148     0.24787    0.24906
200, 0.05, 3     0.24404     0.24793    0.24913
100, 0.1, 5      0.23563     0.24812    0.24894
100, 0.1, 3      0.24401     0.24792    0.24922
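Algorithm 3 can be sketched on a toy affinity matrix (a minimal sketch: the 6-tag, block-structured affinity and the fixed seed rows for the cluster assignment are illustrative, whereas the competition clustered the 281-tag co-occurrence matrix into 40 clusters):

```python
import numpy as np

# Toy co-occurrence affinity for 6 hypothetical tags: two clear blocks.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

k = 2
d = A.sum(axis=1)
NL = A / np.sqrt(np.outer(d, d))                   # D^(-1/2) A D^(-1/2), step (c)
vals, vecs = np.linalg.eigh(NL)                    # eigh returns ascending eigenvalues
V = vecs[:, -k:]                                   # top-k eigenvectors, step (d)
U = V / np.linalg.norm(V, axis=1, keepdims=True)   # row-normalize, step (e)

# One k-means-style assignment pass from fixed seed rows (0 and 3) suffices here.
centers = U[[0, 3]]
labels = np.argmin(((U[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
assert labels[0] == labels[1] == labels[2]         # first block clusters together
assert labels[3] == labels[4] == labels[5]         # second block clusters together
assert labels[0] != labels[3]
```

On this block-diagonal affinity the top eigenvectors span the two block indicators, so the row-normalized embedding separates the two tag groups cleanly.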

There are a total of 281 tags in the dataset. After merging duplicated tags, we cluster the tags into 40 clusters. The resulting binary feature vector of length 40 is used as a meta-feature. We use the validation-set predictions from the LFL models, combined with side-information and meta-features, as the training set for GBDT. We observe a marginal improvement over linear regression with GBDT in Table 7. Adding the temporal meta-features listed in Table 14 to the training set further improves the BCD, as shown in Table 8. For the final submission, we randomly split the GBDT training set into two folds. We train GBDT with varying ν and depth parameters on fold one and predict on fold two, and vice versa, and save the predictions for performing linear regression. The test-set predictions are calculated using the weights from linear regression and the predictions of GBDT models trained on the entire training set. The training set here refers to the validation-set predictions from the LFL models, combined with side-information and meta-features.

Table 9. Results (BCD) from linear regression over 15 GBDT models, using the parameter combinations ν ∈ {0.1, 0.2, 0.3, 0.4, 0.5} and depth ∈ {3, 5, 7}.
Test(40%): 0.24665    Test(60%): 0.24792

5. Conclusions
We interpreted the competition task as a dyadic prediction task and experimented with different variations of LFL. The basic LFL model performs better than the baseline and would have placed us 26th in the competition. We explored different ways to encode categorical side-information features in the LFL model; the average prediction from all the LFL models would have placed us at rank 14th. We further explored different techniques to combine predictions, i.e. linear regression and gradient boosted decision trees.

Table 7. Results (BCD) from GBDT with the side-information listed in Table 12 and the meta-features listed in Table 13.
Param (N, ν, d)  Validation  Test(40%)  Test(60%)
100, 0.1, 3      0.24782     0.25138    0.25269

Table 10. Leaderboard ranks of different methods.
Method                                Public  Private
Rasch Model                           78      76
Basic LFL                             27      26
Average of LFL models                 17      14
Ensemble methods:
  Linear regression                   11      11
  GBDT                                11      11
  GBDT + LR with temporal features    4       5

Table 11. Private leaderboard ranks.
Rank  Team                      Location
1     Steffen                   University of Konstanz
2     Dyakonov Alexander        Moscow State University
3     Ekla                      —
4     Planet Thanet & Birutas   UK & Brazil
5     UCSD-Triton               UC San Diego
6     James Petterson           Australian National University
7     Indy Actuaries            Indianapolis
8     Yetiman                   Northpole
9     Gxav                      Singapore
10    Two Tacos                 UC Irvine

Using linear regression or GBDT to combine predictions would place us at rank 11th in the competition. The final improvement in BCD was achieved after adding temporal features and using linear regression over different parameter combinations of GBDT, placing us at rank 5th on the private leaderboard and 4th on the public leaderboard. Table 10 contains both the public and private leaderboard performance of the various methods, and Table 11 lists the top 10 finishers in the competition.

6. Acknowledgments
The author wishes to acknowledge the valuable inputs, insights and advice from Aditya Menon, PhD student, CSE, UCSD, and Charles Elkan, Professor, CSE, UCSD, during and after the competition, without which this project would not have been possible.

References
Agarwal, Deepak and Chen, Bee-Chung. Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pp. 19-28, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-495-9.

Friedman, Jerome H. Stochastic gradient boosting. Computational Statistics and Data Analysis, 38:367-378, 1999.

Gemulla, Rainer, Nijkamp, Erik, Haas, Peter J., and Sismanis, Yannis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0813-7.

Hofmann, Thomas, Puzicha, Jan, and Jordan, Michael I. Learning from dyadic data. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, pp. 466-472, Cambridge, MA, USA, 1999. MIT Press. ISBN 0-262-11245-0.

Li, Xiaochun and Xu, Ronghui (eds.). High-Dimensional Data Analysis in Cancer Research. Springer, CA, U.S.A., 2009.

Menon, Aditya Krishna and Elkan, Charles. A log-linear model with latent features for dyadic prediction. In ICDM '10, pp. 364-373, 2010.

Ng, Andrew Y., Jordan, Michael I., and Weiss, Yair. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pp. 849-856. MIT Press, 2001.

Rasch, Georg. Estimation of parameters and control of the model for two response categories, 1960.

Schein, Andrew I., Saul, Lawrence K., and Ungar, Lyle H. A generalized linear model for principal component analysis of binary data, 2003.

Takács, Gábor, Pilászy, István, Németh, Bottyán, and Tikk, Domonkos. Scalable collaborative filtering approaches for large recommender systems. J. Mach. Learn. Res., 10:623-656, June 2009. ISSN 1532-4435.

Töscher, Andreas, Jahrer, Michael, and Bell, Robert M. The BigChaos solution to the Netflix grand prize, 2009.

Table 12. Side-information used as features in ensemble learning.
Side-information for a question (all categorical): Question-Type, Group, Track, Subtrack, Tags.
Interaction side-information (all categorical): Game, Number of players (no. = 1 vs. no. > 1).

Table 13. Meta-features (numeric).
- Variance in outcomes for the user.
- Variance in outcomes for the question.
- Ratio of correct responses for each of: 1. user, 2. question, 3. group, 4. track, 5. sub-track, 6. game-type.
- log(number of responses of the given type) for each of: 1. user, 2. question, 3. group, 4. track, 5. sub-track, 6. game-type.
- Preprocessed tags (feature vector of length 40).

Table 14. Temporal meta-features (numeric).
1. Time to answer the question.
2. Average time (correct outcome) for the current: user, question, group, track, subtrack, game type.
3. Average time (incorrect outcome) for the current: user, question, group, track, subtrack, game type.
4. Difference between the time to answer the question and the average time (correct outcome) for the current: user, question, group, track, subtrack, game type.
5. Difference between the time to answer the question and the average time (incorrect outcome) for the current: user, question, group, track, subtrack, game type.

7. Appendix
7.1. Binomial Capped Deviance
Algorithm 4 Binomial Capped Deviance
Input: dyads ((s, q), y) in the validation set V; S: latent matrix for students; Q: latent matrix for questions
BCD = 0
for each ((s, q), y) in V do
  p = 1 / (1 + exp(−S_s^1 · Q_q^1))
  if p > 0.99 then p = 0.99
  if p < 0.01 then p = 0.01
  BCD = BCD − (y·log(p) + (1 − y)·log(1 − p))
end foreach
BCD = BCD / |V|
return BCD
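The capping and averaging of Algorithm 4 can be written directly in Python (a minimal sketch of the metric itself, taking already-computed predictions rather than latent matrices):

```python
import math

def bcd(predictions, labels):
    """Binomial capped deviance: mean negative log-likelihood with each
    prediction capped to [0.01, 0.99], as used to rank submissions."""
    total = 0.0
    for p, y in zip(predictions, labels):
        p = max(0.01, min(0.99, p))          # cap, as in Algorithm 4
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)

# A confident, correct prediction is capped at 0.99 and scores -log(0.99).
assert abs(bcd([0.999], [1]) - (-math.log(0.99))) < 1e-12
```

Lower is better; the cap bounds the penalty for a confidently wrong prediction at -log(0.01), which keeps a few bad dyads from dominating the score.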