Predicting Grades
Yannick Meier, Jie Xu, Onur Atan, and Mihaela van der Schaar
arXiv:1508.03865v1 [cs.LG] 16 Aug 2015

Abstract—To increase efficacy in traditional classroom courses as well as in Massive Open Online Courses (MOOCs), automated systems supporting the instructor are needed. One important problem is to automatically detect students that are going to do poorly in a course early enough to be able to take remedial actions. Existing grade prediction systems focus on maximizing the accuracy of the prediction while overlooking the importance of issuing timely and personalized predictions. This paper proposes an algorithm that predicts the final grade of each student in a class. It issues a prediction for each student individually, when the expected accuracy of the prediction is sufficient. The algorithm learns online the optimal prediction and the time to issue a prediction based on the past history of students' performance in a course. We derive a confidence estimate for the prediction accuracy and demonstrate the performance of our algorithm on a dataset obtained from approximately 700 UCLA undergraduate students who have taken an introductory digital signal processing course over the past 7 years. We demonstrate that for 85% of the students we can predict with 76% accuracy whether they are going to do well or poorly in the class after the 4th course week. Using data obtained from a pilot course, our methodology suggests that it is effective to perform early in-class assessments such as quizzes, which result in timely performance prediction for each student, thereby enabling timely interventions by the instructor (at the student or class level) when necessary.

Index Terms—Forecasting algorithms, online learning, grade prediction, data mining, digital signal processing education.

Y. Meier, J. Xu, O. Atan and M. van der Schaar are with the Department of Electrical Engineering, University of California, Los Angeles, CA, 90095 USA. e-mail: (see http://medianetlab.ee.ucla.edu/people.html). This research is supported by the US Air Force Office of Scientific Research under the DDDAS Program.

I. INTRODUCTION

EDUCATION is in a transformation phase; knowledge is increasingly becoming freely accessible to everyone (through Massive Open Online Courses, Wikipedia, etc.) and is developed by a large number of contributors rather than by a single author [1]. Furthermore, new technology allows for personalized education, enabling students to learn more efficiently and giving teachers the tools to support each student individually if needed, even if the class is large [2], [3].

Grades are supposed to summarize in a single number or letter how well a student was able to understand and apply the knowledge conveyed in a course. Thus it is crucial for students to obtain the necessary support to pass and do well in a class. However, with large class sizes at universities and even larger class sizes in Massive Open Online Courses (MOOCs), which have undergone a rapid development in the past few years, it has become impossible for the instructor and teaching assistants to keep track of the performance of each student individually. This can lead to students failing a class who could have passed if appropriate remedial actions had been taken early enough, or to excellent students not receiving the necessary promotion to benefit maximally from the course. Remedial or promotional actions could consist of additional online study material presented to the student in a personalized and/or automated manner [4]. Hence, in both offline and online education, it is of great importance to develop automated personalized systems that predict the performance of a student in a course before the course is over, and as soon as possible. While in online teaching systems a variety of data about a student, such as responses to quizzes, activity in the forum and study time, can be collected, the available data in a practical offline setting are limited to scores in early performance assessments such as homework assignments, quizzes and midterm exams.

In this paper we focus on predicting grades in traditional classroom teaching, where only the scores of students from past performance assessments are available. However, we believe that our methods can also be applied to online courses such as MOOCs. We design a grade prediction algorithm that finds for each student the best time to predict his/her grade such that, based on this prediction, a timely intervention can be made if necessary. Note that we analyze data from a digital signal processing course where no interventions were made; hence, we do not study the impact of interventions and consider only a single grade prediction for each student. However, our algorithm can be easily extended to multiple predictions per student.

A timely prediction based exclusively on the limited data from the course itself is challenging for various reasons. First, since at the beginning most students are motivated, the scores of students in early performance assessments (e.g. homework assignments) might have little correlation with their scores in later performance assessments, in-class exams and the overall score. Second, even if the same material is covered in each year of the course, the assignments and exams change every year. Therefore, the informativeness of particular assignments with regard to predicting the final grade may change over the years. Third, the predictability of students having a variety of different backgrounds is very diverse. For some students an accurate prediction can be made very early based on the first few performance assessments. If, for example, a student shows an excellent performance in the first three homework assignments and in the midterm exam, it is highly likely that he/she will pass the class. For other students it might take more time to make an equally accurate prediction. If a student, for example, performs below average but not terribly at the beginning, it is risky to predict whether he/she is going to pass or fail and, therefore, to decide whether or not to intervene. This third challenge illustrates the necessity of making the prediction for each student individually and not for all at the same time.

The main contributions of this paper can be summarized as
follows.

1) We propose an algorithm that makes a personalized and timely prediction of the grade of each student in a class. The algorithm can be used both in regression settings, where the overall score is predicted, and in classification settings, where the students are classified into two (e.g. do well/poorly) or more categories.
2) We accompany each prediction with a confidence estimate indicating the expected accuracy of the prediction.
3) We derive a bound for the probability that the prediction error is larger than a desired value ε.
4) We exclusively use the scores students achieve in early performance assessments, such as homework assignments and midterm exams, and do not use any other information such as age, gender or previous GPA. This makes our algorithm applicable in all practical traditional classroom and online teaching settings, where such information may not be available.
5) Since the algorithm is learning from past years, the predictions become more accurate when more data from previous years become available.
6) We demonstrate that the algorithm shows good robustness if different instructors have taught the course in past years.
7) We analyze real data from an introductory digital signal processing course taught at UCLA over 7 years and use the data to experimentally demonstrate the performance of our algorithm compared to benchmark prediction methods. As benchmark algorithms we use well-known algorithms such as linear/logistic regression and k-Nearest Neighbors, which are still a current research topic [5]–[7].
8) Based on our simulations, we suggest a preferred way of designing courses that enables early prediction and early intervention. Using data from a pilot course, we demonstrate the advantages of the suggested design.

The rest of the paper is organized as follows. Section II discusses related work in the field of grade and GPA prediction in education. In Section III we introduce notation, define data structures, formalize the problem and present the grade prediction algorithm. We analyze the data, describe benchmark methods and present simulation results for our and the benchmark algorithms in Section IV. Finally, we draw conclusions in Section V.

II. RELATED WORK

Various studies have investigated the value of standardized tests [8]–[10], admissions exams [11] and GPA in previous programs [9] in predicting the academic success of students in undergraduate or graduate schools. They agree on a positive correlation between these predictors and success measures such as GPA or degree completion. Besides standardized tests, the relevancy of other variables for predictions of a student's GPA has been investigated, usually resulting in the conclusion that GPA from prior education and past grades in certain subjects (e.g. math, chemistry) [12], [13] have a strongly positive correlation as well. Reference [12] observes that simple linear and more complex nonlinear (e.g. artificial neural network) models frequently lead to similar prediction accuracies and concludes that there is either no complex nonlinear pattern to be found in the underlying data or the pattern cannot be recognized by their approach. Our simulations support the statement that simple linear models show a similar accuracy in grade predictions as more complex methods. Reference [14] argues that the accuracy of GPA predictions is frequently mediocre due to different grading standards used in different classes and shows a higher validity for grade predictions in single classes. Consequently, many works focus on identifying relationships between a student's grade in a particular class and variables related to the student [15]–[19]. Relevant factors were found to include the student's prior GPA [15]–[17], [19], performance in related courses [16], [17], [19], performance in early assignments of the class [17], [19], class attendance [15], self-efficacy [18] and whether the student is repeating the class [17]. To make the predictions, various data mining models such as regression models, decision trees, support vector machines and nearest neighbor techniques are used.

A limitation of the algorithms in the previously discussed papers is that they are difficult to apply in many education scenarios. Frequently, variables related to the student, such as performance in related classes, GPA or self-efficacy, are not available to the instructor because the data has not been collected or is not accessible due to privacy reasons. However, the instructor always has access to the data he/she collects from his/her own course, such as the performance of each student in early homework assignments or midterm exams. This paper, therefore, focuses on predicting the final grade based on this easily accessible data, which is collected anyway by the instructor.

Other works [20]–[23], which also exclusively use data from the course itself, differ significantly from this paper in several aspects. First, they rely on logged data in online education or Massive Open Online Course (MOOC) systems, such as information about video-watching behavior or time spent on specific questions. In contrast, our results are applicable to both online and offline courses which include some kind of graded assignments or related feedback from the students during the course. Second, in order for the instructor to be able to take corrective actions it is of great importance to predict with a certain confidence the performance of students as early as possible. While our algorithm takes this into account by deciding for each student individually the best time to make the prediction using a confidence measure, related works do not provide a metric indicating the optimal time to predict. Third, while related works need training data from the course whose grades they want to predict, we show that we can use training data from past-year classes of the same or even from different courses. Finally, in contrast to algorithms from related work, which are only shown to be applicable to classification settings (e.g. pass/fail or letter grade), our algorithm can be used both in regression and classification settings.

Table I summarizes the comparison between our paper and related work investigating and predicting student performance in a course.
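Several of these works, like our benchmarks, rely on simple linear regression over assessment scores. As a minimal illustration of such a baseline — our own sketch with invented toy numbers, not code or data from any cited work — a one-variable least-squares fit from a midterm score to the final overall score can look as follows:

```python
def fit_simple_ols(xs, ys):
    """Closed-form ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b  # intercept a, slope b

# Toy training data: past students' normalized midterm scores vs. final scores.
midterm = [0.9, 0.8, 0.6, 0.4, 0.3]
final = [0.92, 0.85, 0.65, 0.45, 0.35]
a, b = fit_simple_ols(midterm, final)
prediction = a + b * 0.7  # predicted final score for a new midterm score of 0.7
```

A multivariate version over all scores available at a given time yields the kind of linear-regression benchmark mentioned among our contributions.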
TABLE I
COMPARISON WITH RELATED WORK
[Table comparing related work along three dimensions: the performance assessments used, grade prediction, and taking corrective actions in consequence of the predicted grade.]

[Fig. 1 block diagram: assessments 1, ..., K feed each score and feature vector into a database of feature vectors and grades from past years; after each assessment the grade prediction algorithm computes a prediction and confidence and decides whether to wait or to issue the final prediction for the current student.]
Fig. 1. System diagram for a single student.

III. FORMALISM, ALGORITHM AND ANALYSIS

In this section we mathematically formalize the problem and propose an algorithm that predicts the final score, or a classification according to the final grade, of a student with a given confidence.

A. Definitions and System Description

Consider a course which is taught for several years with only slight modifications. Students attending the course have to complete performance assessments such as graded homework assignments, course projects and in-class exams and quizzes throughout the entire course.¹ Our goal is to predict with a certain confidence the overall performance of a student before all performance assessments have been taken. See Fig. 1 for a depiction of the system.

¹The performance assessments are usually graded by teaching assistants, by the instructor or even by other students through peer review [24].

We consider a discrete time model with y = 1, 2, ..., Y and k = 1, 2, ..., K, where y denotes the year in which the course is taught and k the point in time in year y after the kth performance assessment has been graded. Y gives the total number of years during which the course is taught and K is the total number of performance assessments of each year. For a given year y we use the index i to denote the ith student of the year and I_y to denote the total number of students attending in year y. Except for the rare case that a student retakes the course, the students in each year are different. Let a_{i,y,k} ∈ [0, 1] denote the normalized score or grade of student i in performance assessment k of year y.

The feature vector of yth year student i after having taken performance assessment k is given by x_{i,y,k} = (a_{i,y,1}, ..., a_{i,y,k}). The normalized overall score z_{i,y} ∈ [0, 1] of yth year student i is the weighted sum of all performance assessments

    z_{i,y} = \sum_{k=1}^{K} w_k a_{i,y,k}    (1)

where w_k denotes the weight of performance assessment k, so that \sum_{k=1}^{K} w_k = 1. The weights are set by the instructor and we assume that in each year the number, sequence and weights of the performance assessments are the same. This assumption is reasonable since the content of a course usually does not change drastically over the years and frequently the same course material (e.g. course book) is used.² This is especially true in an introductory course such as the one we investigate in Section IV.

²This assumption is made for simplicity. As we show in Section IV-D2, we can apply our algorithm to settings where different instructors, using different weights for each performance assessment, teach the course. A prediction across courses with a different number of performance assessments is possible as well, for example by combining the scores of two or more performance assessments into a single score.

The residual (overall score) c_{i,y,k} of yth year student i after performance assessment k is defined as

    c_{i,y,k} = \begin{cases} \sum_{l=k+1}^{K} w_l a_{i,y,l} & k \in \{1, ..., K-1\} \\ 0 & k = K \end{cases}    (2)

Using this definition we can write the overall score of yth year student i as

    z_{i,y} = c_{i,y,k} + \sum_{l=1}^{k} w_l a_{i,y,l}.    (3)

Note that after performance assessment k has been taken, the instructor has access to all the scores up to assessment k, but the residual score c_{i,y,k} needs to be estimated. We denote the estimate of the residual score for yth year student i at time k by ĉ_{i,y,k} and the corresponding estimate of the overall score by ẑ_{i,y,k}. In binary classification settings, where the goal is to predict whether a student achieves a letter grade above or below a certain threshold, we denote the class of yth year student i by b_{i,y} ∈ {0, 1}.

For each student i we store the set of feature vectors X_{i,y} = {x_{i,y,k} | k ∈ {1, ..., K}}, the set of residuals C_{i,y} = {c_{i,y,1}, ..., c_{i,y,K−1}} and the student's overall score z_{i,y}. All
feature vectors from all students of year y are given by X_y = \bigcup_{i=1}^{I_y} X_{i,y}, and X = \bigcup_{y=1}^{Y} X_y denotes all feature vectors of all completed years. Similarly, C = \bigcup_{y=1}^{Y} \bigcup_{i=1}^{I_y} C_{i,y} and Z = \bigcup_{y=1}^{Y} \bigcup_{i=1}^{I_y} z_{i,y} denote all residuals and overall scores of all completed years. Let X_{k'} = {x_{i,y,k} | k = k', ∀i, y} denote the set of feature vectors and C_{k'} = {c_{i,y,k} | k = k', ∀i, y} denote the set of residuals saved after performance assessment k'.

B. Problem Formulation

Having introduced notations, definitions and data structures, we now formalize the grade prediction problem. We will investigate two different types of predictions. The objective of the first type, which we refer to as the regression setting, is to accurately predict the overall score of each student individually in a timely manner. The second problem, referred to as the classification setting, aims at making a binary prediction of whether the student will do well or poorly, or whether he/she will necessitate additional help or not. Again, the prediction is personalized and takes timeliness into account. For both types of predictions, the same algorithm can be used with only slight modifications, which we discuss in Section III-D. We will also show that the binary prediction problem can easily be generalized to a classification into three or more classes.

Irrespective of the type of the prediction, the decision for a yth year student i consists of two parts. First, we decide after which performance assessment k*_{i,y} to predict for the given student, and second we determine his/her estimated overall score ẑ_{i,y} or his/her estimated binary classification b̂_{i,y}. At a point in time k of year y, all scores, including the overall scores, of all students of past years 1, ..., y−1 are known. Thus all feature vectors x ∈ X, residuals c ∈ C and overall scores z ∈ Z of all completed years are known. Furthermore, the scores a_{i,y,1}, ..., a_{i,y,k} of yth year student i up to assessment k are known as well and do not have to be estimated. However, to determine the overall score of the student we need to predict his/her residual score c_{i,y,k}, consisting of performance assessments k+1, ..., K, since they lie in the future and are unknown. At time k we have to decide for each student of the current year whether this is the optimal time k*_{i,y} = k to predict or whether it is better to wait for the next performance assessment. If we decide to predict, we determine the optimal prediction of the overall score ẑ_{i,y} = ẑ_{i,y,k*_{i,y}}. Both decisions are made based on the feature vector x_{i,y,k} of the given student and the feature vectors x ∈ X_k and residuals c ∈ C_k of past students. To determine the optimal time to predict, we calculate a confidence q_{i,y}(k) indicating the expected accuracy of the prediction for each student after each performance assessment. The prediction for a particular student is made as soon as the confidence exceeds a user-defined threshold, q_{i,y}(k) > q_{th}. The problem of finding the optimal prediction time for yth year student i is formalized as follows:

    minimize_k  k
    subject to  q_{i,y}(k) > q_{th}    (4)

C. Grade Prediction Algorithm, Regression Setting

In this section we propose an algorithm that learns to predict a student's overall performance based on data from classes held in past years and on the student's results in already graded performance assessments. We describe the algorithm for the regression setting and explain the changes needed to use the algorithm in the classification setting in Section III-D. Since at time k we know the scores a_{i,y,1}, ..., a_{i,y,k} of the considered student from past performance assessments, as well as the corresponding weights w_1, ..., w_k, we only predict the residual c_{i,y,k} and calculate the prediction of the overall score with (3). To make its prediction for the current residual of a student with feature vector x_{i,y,k}, the algorithm finds all feature vectors from similar students of past years and their corresponding residuals c_{i,y,k}. We define the similarity of students through their feature vectors. Two feature vectors x_i, x_j ∈ X_k are similar if ⟨x_i, x_j⟩_k ≤ r, where ⟨·,·⟩_k is a distance metric defined on the feature space X_k and r is a parameter. For two feature vectors x ∈ X_{k_1} and x' ∈ X_{k_2} from different feature spaces (i.e. k_1 ≠ k_2) the distance metric is not defined, since we only need to determine distances within a single feature space. Different feature spaces can have different definitions of the distance metric; we define the distance metrics we use in Section IV-B. We define a neighborhood B(x_c, r) with radius r of feature vector x_c ∈ X_k as all feature vectors x ∈ X_k with ⟨x_c, x⟩_k ≤ r.

Let C^k denote the random variable representing the residual score after performance assessment k. v^k(C^k | x) denotes the probability distribution over the residual score for a student with feature vector x at time k, and μ^k(x) denotes the student's expected residual score. Let p^k(x) denote the probability distribution of the students over the feature space X_k. Intuitively, p^k(x) is the fraction of students with feature vector x at time k. Note that the distributions v^k(C^k | x) and p^k(x) are not sampling distributions but unknown underlying distributions. We assume that the distributions do not change over the years. We define the probability distribution of the students in a neighborhood B(x_c, r) with center x_c and radius r as

    p^k_{x_c,r}(x) := \frac{p^k(x)}{\int_{x \in B(x_c,r)} dp^k(x)} 1_{B(x_c,r)}(x),

where 1 is the indicator function. Intuitively, p^k_{x_c,r}(x) is the fraction of students in neighborhood B(x_c, r) with feature vector x. Let C^k(B(x_c, r)) be the random variable representing the residual score of students in neighborhood B(x_c, r) after having taken performance assessment k. The distribution of C^k(B(x_c, r)) is given by

    f^k_{x_c,r}(C^k) := \int_{x \in X_k} v^k(C^k | x) dp^k_{x_c,r}(x).

We denote the true expected value of the residual scores after assignment k of students in a particular neighborhood B(x_c, r) by μ^k(x_c, r) := E(C^k(B(x_c, r))). Note that

    μ^k(x_c, r) = E_{x \sim p^k_{x_c,r}}[E(C^k | x)] = E_{x \sim p^k_{x_c,r}}[μ^k(x)].
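To make the preceding definitions concrete, the following sketch (our own simplified illustration) computes residuals as in (2), forms a neighborhood B(x, r) of past students, and estimates the current student's residual by the neighborhood average, which then gives the overall-score prediction via (3). The Euclidean distance and the plain averaging estimator are assumptions made here for illustration; the actual distance metrics are chosen in Section IV-B, and the algorithm's estimator (5) and confidence (8) refine this idea.

```python
from math import dist  # Euclidean distance, Python 3.8+ (our assumed metric)

def residual(scores, weights, k):
    """Residual c_{i,y,k}: weighted sum of the assessments still to come, eq. (2)."""
    return sum(w * a for w, a in zip(weights[k:], scores[k:]))

def estimate_residual(x_new, past_features, past_residuals, r):
    """Mean residual of past students whose feature vectors lie within
    distance r of the new student's, i.e. over the neighborhood B(x_new, r)."""
    nbrs = [c for x, c in zip(past_features, past_residuals) if dist(x, x_new) <= r]
    if not nbrs:
        return None  # empty neighborhood: wait for the next assessment
    return sum(nbrs) / len(nbrs)

# Toy data: 4 past students, K = 5 assessments, weights set by the instructor.
w = [0.1, 0.1, 0.2, 0.2, 0.4]
past = [[0.90, 0.80, 0.85, 0.90, 0.95],
        [0.40, 0.50, 0.45, 0.50, 0.40],
        [0.85, 0.90, 0.80, 0.85, 0.90],
        [0.30, 0.20, 0.35, 0.30, 0.25]]
k = 2  # two assessments graded so far
X_k = [s[:k] for s in past]              # stored feature vectors x_{i,y,k}
C_k = [residual(s, w, k) for s in past]  # stored residuals c_{i,y,k}

x = [0.88, 0.82]                         # current student's first two scores
c_hat = estimate_residual(x, X_k, C_k, r=0.15)
z_hat = sum(wl * al for wl, al in zip(w[:k], x)) + c_hat  # overall score, eq. (3)
```

Here only the two strong past students fall inside the neighborhood, so the predicted overall score is high; with a smaller radius the neighborhood can be empty, corresponding to the "wait" decision.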
Algorithm 1 Grade Prediction Algorithm, Regression Setting
Input: All x and z from past years, q_{th}, number M and radii r_1, ..., r_M of neighborhoods
Output: Predictions ẑ for the overall scores of the students
1: for all years y do
2:   for all performance assessments k do
3:     for all current-year students i for whom the final prediction has not been made do
4:       if k = K then
5:         Calculate z_{i,y} according to (1)
6:         Return z_{i,y} as final prediction for student i
7:       end if
8:       Create M neighborhoods with radii r_1, ..., r_M
9:       for all neighborhoods m do
10:        Estimate residual ĉ(B(x_{i,y,k}, r_m)) with (5)
11:        Compute \widehat{Var}(x, r_m) with (7)
12:        Compute q(B(x_{i,y,k}, r_m)) with (8)
13:      end for
14:      Find m̂_k = arg max_m q(B(x_{i,y,k}, r_m))
15:      if q(B(x_{i,y,k}, r_{m̂_k})) ≥ q_{th} then
16:        Compute ẑ_{i,y} with (3)
17:        Return ẑ_{i,y} as final prediction for student i
18:      end if
19:      Add x_{i,y,k} and a_{i,y,k} to database
20:    end for
21:  end for
22:  Calculate all c_{i,y,k} of year y according to (2)
23:  Add all c_{i,y,k} to database
24: end for

Similarly, m*_{k,2}(x) denotes the index of the neighborhood with the second highest confidence:

    m*_{k,2}(x) = arg min_{1 ≤ m ≤ M, m ≠ m*_k(x)} Var^k(x, r_m).

Let Δ_k(x) denote the difference between the standard deviations of the residual distributions of neighborhoods m*_k(x) and m*_{k,2}(x):

    Δ_k(x) = \sqrt{Var^k(x, r_{m*_{k,2}})} − \sqrt{Var^k(x, r_{m*_k})}.    (12)

Theorem. Without loss of generality we assume that all scores a are normalized to the range [0, 1]. Consider the prediction ẑ_{i,y,k} of the overall score of yth year student i with feature vector x made by algorithm 1. The probability that the absolute error of the prediction exceeds ε is bounded by

    P[|z_{i,y} − ẑ_{i,y,k}| ≥ ε] ≤ \frac{4 Var^k(x, r_{m*_k(x)})}{ε²} + 2 \exp\left(−2ε² \min_{1 ≤ m ≤ M} \frac{|B(x, r_m)|}{2}\right) + 2M \exp\left(−Δ_k(x)² \min_{1 ≤ m ≤ M} \frac{|B(x, r_m)| − 1}{8}\right).

Proof: See Appendix.

This theorem illustrates two important aspects of algorithm 1. First, we see that for a given neighborhood the accuracy of our predictions increases with an increasing number of neighbors. Hence, our algorithm learns the best predictions online as the knowledge base is expanded after each year, when the feature vectors and results from the past-year students are added to the database. In Section IV-D1 we show that this learning can be experimentally illustrated with our data from the digital signal processing course taught at UCLA. Second, the term Var^k(x, r_{m*_k})/ε² shows that the prediction accuracy will be higher if the variance of the residuals in a neighborhood is small. With increasing time k we expect this variance to decrease, since we have more information about the students and we expect the students in a neighborhood to be more similar and achieve similar (residual) scores.

Note that it is possible to restrict the data kept in the knowledge base to recent years, which allows the algorithm to adapt faster to slowly changing students and to changes in the course.

D. Grade Prediction Algorithm, Classification Setting

In the binary classification setting we predict the overall score analogously to the regression setting and then determine the class by comparing the predicted overall score ẑ_{i,y} with a threshold score z_{th}. To illustrate how we find z_{th}, let us assume that we want to predict whether a student does well (letter grades ≥ B−) or does poorly (letter grades ≤ C+). To determine z_{th}, we find the average z_{avg,B−} of all students from past years who received a B− and the average z_{avg,C+} of all students from past years who achieved a C+. Subsequently, we define z_{th} as z_{th} = (z_{avg,B−} + z_{avg,C+})/2. The predicted classification b̂_{i,y} of yth year student i is then given by

    b̂_{i,y} = \begin{cases} 0 & ẑ_{i,y} ≥ z_{th} \\ 1 & ẑ_{i,y} < z_{th}. \end{cases}    (13)

We are more confident in the classification not only if the variance of the neighbor scores is small, which is the metric we used for the confidence in the regression setting, but also if the distance d(B(x, r)) = |ẑ(B(x, r)) − z_{th}| between the predicted score and the threshold score is large. Note that ẑ(B(x, r)) is the estimate of the overall score based on neighborhood B(x, r). Because of this intuition we use a modified confidence

    q^{bin}(B(x, r)) = 1 − e^{−d(B(x,r))} \frac{\widehat{Var}(C^k(B(x, r)))}{2}    (14)

to decide whether to make the final prediction in binary classification settings. Since d(B(x, r)) should only influence whether the final prediction is made for a given neighborhood, but not the neighborhood selection process, we still use the unmodified confidence from (8) to select the optimal neighborhood.

In summary, four changes have to be made to algorithm 1 to make it applicable to binary classification settings. First, z_{th} has to be determined/updated at the beginning of each new year. Second, we calculate b̂_{i,y} after line 16 according to (13). Third, we return b̂_{i,y} at line 17 instead of ẑ_{i,y}. Fourth, we use the modified confidence q^{bin} according to (14) in line 15
instead of the unmodified confidence q. We use the unmodified confidence from (8) in line 14.

The described binary classification algorithm can easily be generalized to a larger number of categories. In a classification with L categories, we define L−1 threshold values z_{th,1} < z_{th,2} < ... < z_{th,L−1} and determine in which of the L score intervals {[0, z_{th,1}), [z_{th,1}, z_{th,2}), ..., [z_{th,L−1}, 1]} the predicted overall score ẑ_{i,y} of a student lies. The index of the interval corresponds to the classification of the student. In this general classification setting, the modified confidence from (14) can be used as well, by defining d as the distance of ẑ_{i,y} to the nearest threshold value.

We discuss the performance of the proposed algorithm 1 in both regression and classification settings in Section IV-D.

E. Confidence-Learning Prediction Algorithm

Besides the radii of the neighborhoods r_i, the only parameter to be chosen by the user in algorithm 1 is the desired confidence threshold q_{th}. Since for an instructor it is more natural and practical to specify a desired prediction accuracy or error directly rather than the confidence threshold, we show in this section how to automatically learn the appropriate confidence threshold to achieve a certain prediction performance, and what consequences this has on the average prediction time. We will discuss a possible way of choosing the radii r_i of the neighborhoods in Section IV-B.

Formally, we define the problem as follows. Let p(k, q_{th}) ∈ [0, 1] denote the proportion of current-year students for which the grade prediction algorithm working with confidence threshold q_{th} ≤ 1 has predicted the overall score by time (performance assessment) k ∈ [0, K]. p_{min} is the minimum percentage of current-year students whose grade the user wants to predict with a specified accuracy. E(k, q_{th}) ≥ 0 denotes the average absolute prediction error up to time k for the given confidence. E_{max} is the maximum error the user is willing to tolerate. k(p, q_{th}) is the time necessary to predict the grade for proportion p of all students of the class using confidence threshold q_{th}. Please note that since the variables p, E, q_{th} and k are dependent, we can only independently specify two of the four variables. If we for example specify to predict all (p = 1) students with zero error (E = 0), the algorithm will have to wait until the end of the course when the overall score is known (k = K) and will use maximum confidence (q_{th} = 1). Without making any assumptions on the dependence of the variables on each other, multiple pairs (k, q_{th}) might lead to the same specified pair (p, E).

Our goal is, therefore, to learn from past data the yth year estimate of the minimal time k_y and the corresponding confidence threshold q_{th,y} necessary to achieve the desired share p_{min} of students predicted and the desired maximum average prediction error E_{max}. This is formally defined as:

    minimize_{q_{th}}  k(p, q_{th})
    subject to  p(k, q_{th}) ≥ p_{min}
                E(k, q_{th}) ≤ E_{max}    (15)

Note that while the goal of optimization problem (4) is to find the minimum time to predict the overall score of a particular student with a desired confidence, this problem (15) aims at finding the minimum time by which the overall scores of a specific percentage of all students can be predicted with a desired maximum error.

Algorithm 2 Confidence-Learning Prediction Algorithm
Input: E_{max}, p_{min}, q_{th,0}, all x and z from past years, number M and radii r_1, ..., r_M of neighborhoods
Output: Predictions ẑ, k_y and q_{th,y} for all years
1: for all years y do
2:   if y > 1 then
3:     Find k_{y−1} and q_{th,y−1} according to (15) by running algorithm 1 with various q_{th}
4:     Return k_{y−1} and q_{th,y−1}
5:   end if
6:   Use algorithm 1 with q_{th} = q_{th,y−1} to predict and return the grades of current year y students
7: end for

At a given year y we solve this optimization problem using a brute-force approach over all available data from years 1, ..., y. For this purpose, we extract k, p and E from algorithm 2 for a large number of different confidence thresholds q_{th}. We then select the confidence threshold q_{th,y} which is optimal with respect to optimization problem (15) and determine the corresponding prediction time k_y. To make the grade predictions for year y+1 we use the learned confidence threshold q_{th,y} as input to prediction algorithm 1. Since there are no training data available yet at year y = 1, the algorithm uses a user-defined starting value q_{th,0} for the grade predictions of the first year. Algorithm 2 summarizes the learning algorithm in pseudocode.

IV. EXPERIMENTS

In this section, we present the data, discuss details of the application of algorithm 1 to our dataset, illustrate the functioning of the algorithm and evaluate its performance by comparing it against other prediction methods in both regression and binary classification settings. Due to space limitations, we will not show experimental results for classification settings with more than two categories.

A. Data Analysis

Our experiments are based on a dataset from an undergraduate digital signal processing course (EE113) taught at UCLA over the past 7 years. The dataset contains the scores from all performance assessments of all students and their final letter grades. The number of students enrolled in the course for a given year varied between 30 and 156; in total the dataset contains the scores of approximately 700 students. Each year the course consists of 7 homework assignments, one in-class midterm exam taking place after the third homework assignment, one course project that has to be handed in after homework 7, and the final exam. The duration of the course is 10 weeks and in each week one performance assessment takes place. The weights of the performance assessments are given by: 20% homework assignments with equal weight on
8
each assignment, 25% midterm exam, 15% course project and 40% final exam. Fig. 3a shows the distribution of the letter grades assigned over the 7 years. We observe that on average B is the grade the instructor assigned most frequently; A was assigned second most and C third most frequently. Surprisingly, however, the distribution varies drastically over the years; in year 1, for example, only 18.75% received a B, while in year 6 the frequency was 38.9%.

To understand the predictive power of the scores in different performance assessments, Fig. 3b shows the sample Pearson correlation coefficient between all performance assessments and the overall score. We make several important observations from this graph. First, on average the final exam has the strongest correlation with the overall score, followed by the midterm exam. This is not surprising, since the final contributes 40% and the midterm contributes 25% to the overall score. Second, the score from the course project on average does not have a higher correlation with the overall score than the homework assignments, despite the fact that it accounts for 15% of the overall score. Third, all homework assignments have similar correlation coefficients. Fourth, the correlation between the individual performance assessments and the overall score varies greatly over the years. This indicates that predicting student scores based on training data from past years might be difficult.

Since all performance assessments are part of the overall score and a high correlation is therefore expected, it is also informative to consider the correlation between the performance assessments and the final exam, shown in Fig. 3c. It is interesting to observe that the midterm exam still shows, besides the overall score, the highest correlation with the final exam. A possible explanation for this is that both the midterm and the final are in-class exams while the other performance assessments are take-home.

B. Our Algorithm

In this section we discuss three important details of the application of Algorithm 1 to the dataset from the undergraduate digital signal processing course.

First, the rule we use to normalize all scores ai,y,k in our dataset is given by

$$a_{i,y,k} = \frac{a^*_{i,y,k} - \hat{\mu}_{y,k}}{\hat{\sigma}_y}, \tag{16}$$

where a*i,y,k is the original score of the student, μ̂y,k is the sample mean of all yth-year students' original scores in performance assessment k and σ̂y is the standard deviation of all yth-year students' original overall scores. A normalization of the scores is needed for several reasons. First, the instructor-defined maximum score in a particular performance assessment may differ greatly across years, and since we use data from past years to predict the performance of students in a given year, we need to make the data comparable across years. Second, the difficulty of individual performance assessments might also differ across years: homework 2 might, for instance, be very easy in year 2, so that almost everyone achieves the maximum score, and very difficult in year 3, so that few achieve half of the maximum score. The normalization according to (16) eliminates this bias by transforming the absolute scores of a student into scores relative to his/her classmates of the same year. Note that Algorithm 1 does not require a specific normalization and it does not matter that the normalized scores according to (16) will not lie in the interval [0, 1] as assumed in Section III for simplicity.

Second, we use feature vectors that simply contain the scores of all performance assessments student i has taken up to time k, in the order they occurred: xi,y,k = (ai,y,1, . . . , ai,y,k). To incorporate the fact that students who have performed similarly in a performance assessment with a large weight should be nearer to each other in the feature space than students that have had similar scores in a performance assessment with a low weight (e.g. a homework assignment), we use a weighted metric to calculate the distance between two feature vectors. We define the distance of two feature vectors xi, xj ∈ Xk as

$$\langle x_i, x_j \rangle_k = \frac{\sum_{l=1}^{k} w_l |x_{i,l} - x_{j,l}|}{\sum_{l=1}^{k} w_l}, \tag{17}$$

where k is the length of the feature vectors, wl is the weight of performance assessment l and xi,l denotes entry l of feature vector xi.

Third, rather than specifying the radii of the neighborhoods to consider as an input, as suggested in the pseudocode of Algorithm 1, we automatically adapt the radii of the neighborhoods such that they contain a certain number of neighbors. Since the sample variance becomes more accurate with an increasing number of samples, we refrain from considering neighborhoods with only 2 neighbors. Therefore, the smallest radius considered, r1, is the minimal radius such that the neighborhood includes 3 neighbors. For subsequent neighborhoods the minimal radius is chosen such that the neighborhood includes at least one neighbor more than the previous neighborhood. Formally, we define the selection of the radii recursively as

$$r_1 = \min r \;\; \text{s.t.} \;\; |B(x_{i,y,k}, r)| \ge 3, \qquad r_{m+1} = \min r \;\; \text{s.t.} \;\; |B(x_{i,y,k}, r)| > |B(x_{i,y,k}, r_m)|. \tag{18}$$

C. Benchmarks

We compare the performance of our algorithm against five different prediction methods.
• We use the score ai,y,k student i has achieved in the most recent performance assessment k alone to predict the overall grade.
• A second simple benchmark makes the prediction based on the scores ai,y,1, . . . , ai,y,k student i has achieved up to performance assessment k, taking into account the corresponding weights of the performance assessments.
• Linear regression using the ordinary least squares (OLS)
Fig. 3. Data analysis: 3a shows the distribution of letter grades for all years. 3b and 3c present the sample Pearson correlation coefficient between individual performance assessments and the overall (3b) or final exam (3c) score. Note that we use the abbreviations Hi (homework assignment i), M (midterm exam), F (final exam) and O (overall score) in the figures.

Fig. 4. Performance comparison of different prediction methods (single performance assessment, past assessments and weights, linear regression, 7 nearest neighbors, and our Algorithm 1) as a function of the average prediction time.

Fig. 5. Illustration of learning from past data: Error of grade predictions for year 7 depending on training data (learning from each of years 1−5 individually and from years 1−5 combined).
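The threshold-learning procedure at the end of Section III (the brute-force search solving optimization problem (15) on past years' data) can be sketched as follows. This is an illustrative sketch only: `run_predictor` stands in for applying the prediction algorithm to the training years and reporting the resulting average prediction time k, coverage p and average error E for a given threshold, and `toy_predictor` is a made-up monotone example, not the paper's implementation.

```python
# Hypothetical sketch of the confidence-threshold learner: brute-force
# search over a grid of thresholds q_th, keeping the threshold with the
# minimal prediction time k among those satisfying the coverage and
# error constraints of optimization problem (15).

def learn_threshold(run_predictor, q_grid, p_min, e_max):
    """Return (q_th, k) minimizing k(p, q_th) subject to
    p(k, q_th) >= p_min and E(k, q_th) <= e_max, or None if infeasible."""
    best = None
    for q_th in q_grid:
        # run_predictor returns, for threshold q_th on the past data:
        #   k: average time (assessment index) at which predictions are issued
        #   p: share of students predicted by time k
        #   e: average absolute prediction error up to time k
        k, p, e = run_predictor(q_th)
        if p >= p_min and e <= e_max:
            if best is None or k < best[1]:
                best = (q_th, k)
    return best

# Toy stand-in: stricter thresholds delay predictions but increase
# coverage and reduce error (monotone, purely illustrative).
def toy_predictor(q_th):
    k = 1 + 9 * q_th      # later predictions for stricter thresholds
    p = q_th              # coverage grows with the threshold
    e = 1.0 - q_th        # error shrinks with the threshold
    return k, p, e

q_grid = [i / 100 for i in range(101)]
result = learn_threshold(toy_predictor, q_grid, p_min=0.8, e_max=0.3)
```

With the toy predictor, thresholds below 0.8 violate the coverage constraint, so the smallest feasible (and hence fastest) threshold 0.8 is selected.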
Fig. 7. Comparison of prediction time and accuracy between the UCLA course EE113, which contains a midterm exam, and the UCLA course EE103, which contains four in-class quizzes instead of a midterm exam (cumulative share of students predicted and cumulative average error; 1 tick = 1 week).

Fig. 8. Performance comparison between our algorithm and logistic regression using accuracy, precision and recall for binary do well/poorly classification.
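The three implementation details of Section IV-B, namely the normalization (16), the weighted distance (17) and the adaptive radius selection (18), can be sketched as follows. All function names are illustrative stand-ins rather than the paper's implementation, and ties between equal distances are ignored in the radius selection.

```python
import statistics

def normalize(scores_k, overall_scores, a_star):
    """Normalization (16): subtract the class mean of assessment k and
    divide by the standard deviation of the class's overall scores."""
    mu = statistics.mean(scores_k)
    sigma = statistics.stdev(overall_scores)
    return (a_star - mu) / sigma

def distance(x_i, x_j, weights):
    """Weighted distance (17) between two feature vectors of length k."""
    k = len(x_i)
    num = sum(weights[l] * abs(x_i[l] - x_j[l]) for l in range(k))
    return num / sum(weights[:k])

def adaptive_radii(dists_to_neighbors):
    """Radius selection (18): the smallest radius encloses 3 neighbors;
    each subsequent radius admits at least one additional neighbor.
    `dists_to_neighbors` holds the distances from the current student's
    feature vector to all past students' feature vectors (no ties assumed)."""
    d = sorted(dists_to_neighbors)
    # radius r_m equals the distance to the (m + 2)-th nearest neighbor
    return [d[m] for m in range(2, len(d))]
```

For example, with weights (2, 1) the feature vectors (1, 0) and (0, 1) are at distance (2 · 1 + 1 · 1)/3 = 1, so a disagreement on the heavily weighted assessment dominates the distance, as intended.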
3) Performance Comparison with Course Containing Early Quizzes in Regression Setting: The results in both the data analysis section (Fig. 3b) and Section IV-D1 (Fig. 4) indicate that scores in in-class exams are much better predictors of the overall score than homework assignments. To verify this, we consider two consecutive years of the UCLA course EE103,
the expected accuracy of all predictions remains more or less constant irrespective of the prediction time.

V. CONCLUSION

In this paper we develop an algorithm that allows for a timely and personalized prediction of the final grades of students based exclusively on their scores in early performance assessments such as homework assignments, quizzes or midterm exams. Using data from an undergraduate digital signal processing course taught at UCLA, we show that the algorithm is able to learn from past data, that it outperforms benchmark algorithms with regard to accuracy and timeliness in both classification and regression settings, and that the predictions are robust even when the course is taught by different instructors.

We show that in-class exams are better predictors of the overall performance of a student than homework assignments. Hence, designing courses to have early in-class evaluations enables timely identification of students who, with high probability, would do poorly without intervention, and enables remedial actions to be adopted at an early stage.

Our algorithm can easily be generalized to include context data from students such as their prior GPA or demographic data. If applied exclusively to MOOCs, the in-course data used for the predictions could be extended, for example, by the responses of students to multiple-choice questions, their forum activity, the course material they studied or the time they spent studying online. Another direction of future work is to apply our algorithm in practice and investigate to what extent the performance of students can be improved by a timely intervention based on the grade predictions. In this context, our algorithm could be extended to make multiple predictions for each student to monitor the trend in the predicted grade after an intervention.

APPENDIX

In this Appendix, we prove the theorem from Section III-C. Before we start with the proof, we discuss some preliminary results.

Fact 1. (Chernoff-Hoeffding Bound) Let X1, X2, . . . , Xn be independent and bounded random variables with range [0, 1] and expected value μ. Let μ̂n = (X1 + . . . + Xn)/n denote the sample mean of the random variables. Then, for all ε > 0,

can be derived.

Proof: See [26] for a proof of Fact 2.

Lemma 1. Let m̂k(x) denote the index of the neighborhood selected by our algorithm for the student with feature vector x at time k and let m*k(x) be given by (11). M denotes the total number of neighborhoods our algorithm considers and Δk(x) is given by (12). We can bound the probability that our algorithm chooses the wrong neighborhood by

$$P[\hat{m}_k(x) \ne m_k^*(x)] \le 2M \exp\Big(-\Delta_k(x)^2 \min_{1 \le m \le M} \frac{|B(x, r_m)| - 1}{8}\Big). \tag{19}$$

Proof: Consider

$$P[\hat{m}_k(x) \ne m_k^*(x)] = P\Big[\arg\min_{1 \le m \le M} \widehat{\mathrm{Var}}^k(x, r_m) \ne m_k^*(x)\Big] = P\Big[\arg\min_{1 \le m \le M} \sqrt{\widehat{\mathrm{Var}}^k(x, r_m)} \ne m_k^*(x)\Big].$$

If the estimation error of the standard deviation is smaller than Δk(x)/2 for all neighborhoods, i.e.,

$$\Big|\sqrt{\widehat{\mathrm{Var}}^k(x, r_m)} - \sqrt{\mathrm{Var}^k(x, r_m)}\Big| \le \frac{\Delta_k(x)}{2}, \quad 1 \le m \le M,$$

our algorithm chooses the optimal neighborhood m*k(x). Therefore, we get

$$P\Big[\arg\min_{1 \le m \le M} \sqrt{\widehat{\mathrm{Var}}^k(x, r_m)} \ne m_k^*(x)\Big] \le P\Big[\bigcup_{1 \le m \le M} \Big\{\Big|\sqrt{\widehat{\mathrm{Var}}^k(x, r_m)} - \sqrt{\mathrm{Var}^k(x, r_m)}\Big| \ge \frac{\Delta_k(x)}{2}\Big\}\Big]$$
$$\stackrel{(a)}{\le} \sum_{m=1}^{M} P\Big[\Big|\sqrt{\widehat{\mathrm{Var}}^k(x, r_m)} - \sqrt{\mathrm{Var}^k(x, r_m)}\Big| \ge \frac{\Delta_k(x)}{2}\Big] \stackrel{(b)}{\le} \sum_{m=1}^{M} 2\exp\Big(-\Delta_k(x)^2\,\frac{|B(x, r_m)| - 1}{8}\Big)$$
$$\le 2M \exp\Big(-\Delta_k(x)^2 \min_{1 \le m \le M} \frac{|B(x, r_m)| - 1}{8}\Big).$$
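The concentration behavior asserted by Fact 1 can be illustrated numerically. The sketch below assumes the standard Chernoff-Hoeffding form P(|μ̂n − μ| ≥ ε) ≤ 2 exp(−2nε²) for [0, 1]-valued samples and uses Bernoulli draws; it is an illustration, not part of the paper's analysis.

```python
import math
import random

# Monte Carlo sanity check of the standard Chernoff-Hoeffding bound
# P(|mu_hat_n - mu| >= eps) <= 2 * exp(-2 * n * eps^2)
# for i.i.d. [0, 1]-valued samples (here: Bernoulli(p)).

def deviation_probability(n, eps, trials=20000, p=0.5, seed=0):
    """Empirical frequency of |sample mean - p| >= eps over many trials."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mu_hat = sum(rng.random() < p for _ in range(n)) / n
        if abs(mu_hat - p) >= eps:
            hits += 1
    return hits / trials

n, eps = 50, 0.2
empirical = deviation_probability(n, eps)
bound = 2 * math.exp(-2 * n * eps ** 2)  # = 2 * exp(-4), roughly 0.037
```

With n = 50 and ε = 0.2 the bound evaluates to roughly 0.037, and the empirical deviation frequency stays well below it, as the bound guarantees.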