Article in IEEE Transactions on Signal Processing · August 2015
DOI: 10.1109/TSP.2015.2496278 · arXiv:1508.03865v1 [cs.LG] 16 Aug 2015



Predicting Grades
Yannick Meier, Jie Xu, Onur Atan, and Mihaela van der Schaar

(Y. Meier, J. Xu, O. Atan and M. van der Schaar are with the Department of Electrical Engineering, University of California, Los Angeles, CA 90095 USA; e-mail: see http://medianetlab.ee.ucla.edu/people.html. This research is supported by the US Air Force Office of Scientific Research under the DDDAS Program.)

Abstract—To increase efficacy in traditional classroom courses as well as in Massive Open Online Courses (MOOCs), automated systems supporting the instructor are needed. One important problem is to automatically detect students that are going to do poorly in a course early enough to be able to take remedial actions. Existing grade prediction systems focus on maximizing the accuracy of the prediction while overlooking the importance of issuing timely and personalized predictions. This paper proposes an algorithm that predicts the final grade of each student in a class. It issues a prediction for each student individually, when the expected accuracy of the prediction is sufficient. The algorithm learns online the optimal prediction and the optimal time to issue it based on the past history of students' performance in a course. We derive a confidence estimate for the prediction accuracy and demonstrate the performance of our algorithm on a dataset obtained from approximately 700 UCLA undergraduate students who have taken an introductory digital signal processing course over the past 7 years. We demonstrate that for 85% of the students we can predict with 76% accuracy whether they are going to do well or poorly in the class after the 4th course week. Using data obtained from a pilot course, our methodology suggests that it is effective to perform early in-class assessments such as quizzes, which result in timely performance prediction for each student, thereby enabling timely interventions by the instructor (at the student or class level) when necessary.

Index Terms—Forecasting algorithms, online learning, grade prediction, data mining, digital signal processing education.

I. INTRODUCTION

EDUCATION is in a transformation phase; knowledge is increasingly becoming freely accessible to everyone (through Massive Open Online Courses, Wikipedia, etc.) and is developed by a large number of contributors rather than by a single author [1]. Furthermore, new technology allows for personalized education, enabling students to learn more efficiently and giving teachers the tools to support each student individually if needed, even if the class is large [2], [3].

Grades are supposed to summarize in a single number or letter how well a student was able to understand and apply the knowledge conveyed in a course. Thus it is crucial for students to obtain the necessary support to pass and do well in a class. However, with large class sizes at universities and even larger class sizes in Massive Open Online Courses (MOOCs), which have undergone a rapid development in the past few years, it has become impossible for the instructor and teaching assistants to keep track of the performance of each student individually. This can lead to students failing in a class who could have passed if appropriate remedial actions had been taken early enough, or excellent students not receiving the necessary promotion to benefit maximally from the course. Remedial or promotional actions could consist of additional online study material presented to the student in a personalized and/or automated manner [4]. Hence, in both offline and online education, it is of great importance to develop automated personalized systems that predict the performance of a student in a course before the course is over and as soon as possible. While in online teaching systems a variety of data about a student, such as responses to quizzes, activity in the forum and study time, can be collected, the available data in a practical offline setting are limited to scores in early performance assessments such as homework assignments, quizzes and midterm exams.

In this paper we focus on predicting grades in traditional classroom teaching, where only the scores of students from past performance assessments are available. However, we believe that our methods can also be applied to online courses such as MOOCs. We design a grade prediction algorithm that finds for each student the best time to predict his/her grade such that, based on this prediction, a timely intervention can be made if necessary. Note that we analyze data from a digital signal processing course where no interventions were made; hence, we do not study the impact of interventions and consider only a single grade prediction for each student. However, our algorithm can be easily extended to multiple predictions per student.

A timely prediction exclusively based on the limited data from the course itself is challenging for various reasons. First, since at the beginning most students are motivated, the scores of students in early performance assessments (e.g. homework assignments) might have little correlation with their scores in later performance assessments, in-class exams and the overall score. Second, even if the same material is covered in each year of the course, the assignments and exams change every year. Therefore, the informativeness of particular assignments with regard to predicting the final grade may change over the years. Third, the predictability of students having a variety of different backgrounds is very diverse. For some students an accurate prediction can be made very early based on the first few performance assessments. If, for example, a student shows an excellent performance in the first three homework assignments and in the midterm exam, it is highly likely that he/she will pass the class. For other students it might take more time to make an equally accurate prediction. If a student, for example, performs below average but not terribly at the beginning, it is risky to predict whether he/she is going to pass or fail and, therefore, to decide whether or not to intervene. This third challenge illustrates the necessity of making the prediction for each student individually and not for all at the same time.

The main contributions of this paper can be summarized as follows.

1) We propose an algorithm that makes a personalized and timely prediction of the grade of each student in a class. The algorithm can be used both in regression settings, where the overall score is predicted, and in classification settings, where the students are classified into two (e.g. do well/poorly) or more categories.
2) We accompany each prediction with a confidence estimate indicating the expected accuracy of the prediction.
3) We derive a bound for the probability that the prediction error is larger than a desired value ε.
4) We exclusively use the scores students achieve in early performance assessments such as homework assignments and midterm exams and do not use any other information such as age, gender or previous GPA. This makes our algorithm applicable in all practical traditional classroom and online teaching settings, where such information may not be available.
5) Since the algorithm learns from past years, the predictions become more accurate as more data from previous years become available.
6) We demonstrate that the algorithm shows good robustness if different instructors have taught the course in past years.
7) We analyze real data from an introductory digital signal processing course taught at UCLA over 7 years and use the data to experimentally demonstrate the performance of our algorithm compared to benchmark prediction methods. As benchmarks we use well-known algorithms such as linear/logistic regression and k-Nearest Neighbors, which are still a current research topic [5]–[7].
8) Based on our simulations, we suggest a preferred way of designing courses that enables early prediction and early intervention. Using data from a pilot course, we demonstrate the advantages of the suggested design.

The rest of the paper is organized as follows. Section II discusses related work in the field of grade and GPA prediction in education. In Section III we introduce notation, define data structures, formalize the problem and present the grade prediction algorithm. We analyze the data, describe benchmark methods and present simulation results for our and the benchmark algorithms in Section IV. Finally, we draw conclusions in Section V.

II. RELATED WORK

Various studies have investigated the value of standardized tests [8]–[10], admissions exams [11] and GPA in previous programs [9] in predicting the academic success of students in undergraduate or graduate schools. They agree on a positive correlation between these predictors and success measures such as GPA or degree completion. Besides standardized tests, the relevance of other variables for predicting a student's GPA has been investigated, usually resulting in the conclusion that GPA from prior education and past grades in certain subjects (e.g. math, chemistry) [12], [13] have a strongly positive correlation as well. Reference [12] observes that simple linear and more complex nonlinear (e.g. artificial neural network) models frequently lead to similar prediction accuracies and concludes that there is either no complex nonlinear pattern to be found in the underlying data or the pattern cannot be recognized by their approach. Our simulations support the statement that simple linear models show a similar accuracy in grade predictions as more complex methods. Reference [14] argues that the accuracy of GPA predictions is frequently mediocre due to different grading standards used in different classes and shows a higher validity for grade predictions in single classes. Consequently, many works focus on identifying relationships between a student's grade in a particular class and variables related to the student [15]–[19]. Relevant factors were found to include the student's prior GPA [15]–[17], [19], performance in related courses [16], [17], [19], performance in early assignments of the class [17], [19], class attendance [15], self-efficacy [18] and whether the student is repeating the class [17]. To make the predictions, various data mining models such as regression models, decision trees, support vector machines and nearest neighbor techniques are used.

A limitation of the algorithms in the previously discussed papers is that they are difficult to apply in many education scenarios. Frequently, variables related to the student such as performance in related classes, GPA or self-efficacy are not available to the instructor because the data have not been collected or are not accessible due to privacy reasons. However, the instructor always has access to data he/she collects from his/her own course, such as the performance of each student in early homework assignments or midterm exams. This paper, therefore, focuses on predicting the final grade based on this easily accessible data, which is collected anyway by the instructor.

Other works [20]–[23], which also exclusively use data from the course itself, differ significantly from this paper in several aspects. First, they rely on logged data in online education or Massive Open Online Course (MOOC) systems, such as information about video-watching behavior or time spent on specific questions. In contrast, our results are applicable to both online and offline courses that include some kind of graded assignments or related feedback from the students during the course. Second, in order for the instructor to be able to take corrective actions, it is of great importance to predict with a certain confidence the performance of students as early as possible. While our algorithm takes this into account by deciding for each student individually the best time to make the prediction using a confidence measure, related works do not provide a metric indicating the optimal time to predict. Third, while related works need training data from the course whose grades they want to predict, we show that we can use training data from past-year classes of the same or even from different courses. Finally, in contrast to algorithms from related work, which are only shown to be applicable to classification settings (e.g. pass/fail or letter grade), our algorithm can be used both in regression and classification settings.

Table I summarizes the comparison between our paper and related work investigating and predicting student performance in a course.

TABLE I
COMPARISON WITH RELATED WORK

|                               | [16], [18]             | [19]                 | [15], [17]           | [20]                       | [21]–[23]            | Our Work             |
| Goal of Paper                 | Find Relevant Features | Predict Course Grade | Predict Course Grade | Predict Accuracy of Answer | Predict Course Grade | Predict Course Grade |
| Features                      | Other                  | Course & Other       | Course & Other       | From Course                | From Course          | From Course          |
| Learning from Past Years      | n/a                    | No                   | No                   | No                         | No                   | Yes                  |
| Accuracy-Timeliness Trade-Off | n/a                    | No                   | No                   | No                         | No                   | Yes                  |
| Regression / Classification   | n/a                    | Classification       | Both                 | Classification             | Classification       | Both                 |

III. FORMALISM, ALGORITHM AND ANALYSIS

In this section we mathematically formalize the problem and propose an algorithm that predicts the final score, or a classification according to the final grade, of a student with a given confidence.

A. Definitions and System Description

Consider a course which is taught for several years with only slight modifications. Students attending the course have to complete performance assessments such as graded homework assignments, course projects and in-class exams and quizzes throughout the entire course.^1 Our goal is to predict with a certain confidence the overall performance of a student before all performance assessments have been taken. See Fig. 1 for a depiction of the system.

Fig. 1. System diagram for a single student.

We consider a discrete time model with y = 1, 2, ..., Y and k = 1, 2, ..., K, where y denotes the year in which the course is taught and k the point in time in year y after the kth performance assessment has been graded. Y gives the total number of years during which the course is taught and K is the total number of performance assessments of each year. For a given year y we use index i as a representation of the ith student of the year and I_y to denote the total number of students attending in year y. Except for the rare case that a student retakes the course, the students in each year are different. Let a_{i,y,k} ∈ [0, 1] denote the normalized score or grade of student i in performance assessment k of year y.

The feature vector of yth year student i after having taken performance assessment k is given by x_{i,y,k} = (a_{i,y,1}, ..., a_{i,y,k}). The normalized overall score z_{i,y} ∈ [0, 1] of yth year student i is the weighted sum of all performance assessments

    z_{i,y} = \sum_{k=1}^{K} w_k a_{i,y,k}    (1)

where w_k denotes the weight of performance assessment k and \sum_{k=1}^{K} w_k = 1. The weights are set by the instructor and we assume that in each year the number, sequence and weights of the performance assessments are the same. This assumption is reasonable since the content of a course usually does not change drastically over the years and frequently the same course material (e.g. course book) is used.^2 This is especially true in an introductory course such as the one we investigate in Section IV. The residual (overall score) c_{i,y,k} of yth year student i after performance assessment k is defined as

    c_{i,y,k} = \begin{cases} \sum_{l=k+1}^{K} w_l a_{i,y,l} & k \in \{1, ..., K-1\} \\ 0 & k = K \end{cases}    (2)

Using this definition we can write the overall score of yth year student i as

    z_{i,y} = c_{i,y,k} + \sum_{l=1}^{k} w_l a_{i,y,l}.    (3)

Note that after performance assessment k has been taken, the instructor has access to all the scores up to assessment k, but the residual score c_{i,y,k} needs to be estimated. We denote the estimate of the residual score for yth year student i at time k by ĉ_{i,y,k} and the corresponding estimate of the overall score by ẑ_{i,y,k}. In binary classification settings, where the goal is to predict whether a student achieves a letter grade above or below a certain threshold, we denote the class of yth year student i by b_{i,y} ∈ {0, 1}.

For each student i we store the set of feature vectors X_{i,y} = {x_{i,y,k} | k ∈ {1, ..., K}}, the set of residuals C_{i,y} = {c_{i,y,1}, ..., c_{i,y,K−1}} and the student's overall score z_{i,y}.

^1 The performance assessments are usually graded by teaching assistants, by the instructor or even by other students through peer review [24].
^2 This assumption is made for simplicity. As we show in Section IV-D2, we can apply our algorithm to settings where different instructors using different weights for each performance assessment teach the course. A prediction across courses with a different number of performance assessments is possible as well, for example by combining the scores of two or more performance assessments into a single score.

All feature vectors from all students of year y are given by X_y = \bigcup_{i=1}^{I_y} X_{i,y}, and X = \bigcup_{y=1}^{Y} X_y denotes all feature vectors of all completed years. Similarly, C = \bigcup_{y=1}^{Y} \bigcup_{i=1}^{I_y} C_{i,y} and Z = \bigcup_{y=1}^{Y} \bigcup_{i=1}^{I_y} \{z_{i,y}\} denote all residuals and all overall scores of all completed years. Let X^{k'} = {x_{i,y,k} | k = k', ∀i, y} denote the set of feature vectors and C^{k'} = {c_{i,y,k} | k = k', ∀i, y} the set of residuals saved after performance assessment k'.

B. Problem Formulation

Having introduced notations, definitions and data structures, we now formalize the grade prediction problem. We investigate two different types of predictions. The objective of the first type, which we refer to as the regression setting, is to accurately predict the overall score of each student individually in a timely manner. The second problem, referred to as the classification setting, aims at making a binary prediction of whether the student will do well or poorly, or whether he/she will necessitate additional help or not. Again, the prediction is personalized and takes timeliness into account. For both types of predictions the same algorithm can be used with only slight modifications, which we discuss in Section III-D. We will also show that the binary prediction problem can easily be generalized to a classification into three or more classes.

Irrespective of the type of prediction, the decision for a yth year student i consists of two parts. First, we decide after which performance assessment k*_{i,y} to predict for the given student, and second we determine his/her estimated overall score ẑ_{i,y} or his/her estimated binary classification b̂_{i,y}. At a point in time k of year y, all scores including the overall scores of all students of past years 1, ..., y−1 are known. Thus all feature vectors x ∈ X, residuals c ∈ C and overall scores z ∈ Z of all completed years are known. Furthermore, the scores a_{i,y,1}, ..., a_{i,y,k} of yth year student i up to assessment k are known as well and do not have to be estimated. However, to determine the overall score of the student we need to predict his/her residual score c_{i,y,k}, consisting of performance assessments k+1, ..., K, since they lie in the future and are unknown. At time k we have to decide for each student of the current year whether this is the optimal time k*_{i,y} = k to predict or whether it is better to wait for the next performance assessment. If we decide to predict, we determine the optimal prediction of the overall score ẑ_{i,y} = ẑ_{i,y,k*_{i,y}}. Both decisions are made based on the feature vector x_{i,y,k} of the given student and the feature vectors x ∈ X^k and residuals c ∈ C^k of past students. To determine the optimal time to predict, we calculate a confidence q_{i,y}(k) indicating the expected accuracy of the prediction for each student after each performance assessment. The prediction for a particular student is made as soon as the confidence exceeds a user-defined threshold, q_{i,y}(k) > q_th. The problem of finding the optimal prediction time for yth year student i is formalized as follows:

    \text{minimize}_k \; k \quad \text{subject to} \; q_{i,y}(k) > q_{th}    (4)

The optimization problem results in the optimal prediction time k*_{i,y}.

C. Grade Prediction Algorithm, Regression Setting

In this section we propose an algorithm that learns to predict a student's overall performance based on data from classes held in past years and on the student's results in already graded performance assessments. We describe the algorithm for the regression setting and explain the changes needed to use the algorithm in the classification setting in Section III-D. Since at time k we know the scores a_{i,y,1}, ..., a_{i,y,k} of the considered student from past performance assessments, as well as the corresponding weights w_1, ..., w_k, we only predict the residual c_{i,y,k} and calculate the prediction of the overall score with (3). To make its prediction for the current residual of a student with feature vector x_{i,y,k}, the algorithm finds all feature vectors from similar students of past years and their corresponding residuals. We define the similarity of students through their feature vectors. Two feature vectors x_i, x_j ∈ X^k are similar if ⟨x_i, x_j⟩_k ≤ r, where ⟨·,·⟩_k is a distance metric defined on the feature space X^k and r is a parameter. For two feature vectors x ∈ X^{k_1} and x' ∈ X^{k_2} from different feature spaces (i.e. k_1 ≠ k_2) the distance metric is not defined, since we only need to determine distances within a single feature space. Different feature spaces can have different definitions of the distance metric; we define the distance metrics we use in Section IV-B. We define a neighborhood B(x_c, r) with radius r of feature vector x_c ∈ X^k as all feature vectors x ∈ X^k with ⟨x_c, x⟩_k ≤ r.

Let C^k denote the random variable representing the residual score after performance assessment k. v^k(C^k | x) denotes the probability distribution over the residual score for a student with feature vector x at time k, and µ^k(x) denotes the student's expected residual score. Let p^k(x) denote the probability distribution of the students over the feature space X^k. Intuitively, p^k(x) is the fraction of students with feature vector x at time k. Note that the distributions v^k(C^k | x) and p^k(x) are not sampling distributions but unknown underlying distributions. We assume that the distributions do not change over the years. We define the probability distribution of the students in a neighborhood B(x_c, r) with center x_c and radius r as

    p^k_{x_c,r}(x) := \frac{p^k(x)}{\int_{x' \in B(x_c,r)} dp^k(x')} \, 1_{B(x_c,r)}(x),

where 1 is the indicator function. Intuitively, p^k_{x_c,r}(x) is the fraction of students in neighborhood B(x_c, r) with feature vector x. Let C^k(B(x_c, r)) be the random variable representing the residual score of students in neighborhood B(x_c, r) after having taken performance assessment k. The distribution of C^k(B(x_c, r)) is given by

    f^k_{x_c,r}(C^k) := \int_{x \in X^k} v^k(C^k | x) \, dp^k_{x_c,r}(x).

We denote the true expected value of the residual scores after assessment k of students in a particular neighborhood by µ^k(x_c, r) := E[C^k(B(x_c, r))]. Note that

    µ^k(x_c, r) = E_{x \sim p^k_{x_c,r}}[E[C^k | x]] = E_{x \sim p^k_{x_c,r}}[µ^k(x)] = \int_{x \in X^k} µ^k(x) \, dp^k_{x_c,r}.
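The neighborhood construction described above can be sketched as follows. The feature vectors are invented for illustration, and the Euclidean choice for the distance metric ⟨·,·⟩_k is an assumption made here for concreteness (the paper defines its actual metrics in Section IV-B):

```python
import numpy as np

# Illustrative feature vectors x in X^k for k = 3 graded assessments, taken
# from hypothetical past-year students, plus the current student's vector.
past = np.array([[0.90, 0.80, 0.85],
                 [0.50, 0.40, 0.45],
                 [0.88, 0.82, 0.80],
                 [0.30, 0.50, 0.20]])
x_cur = np.array([0.90, 0.81, 0.83])

# One possible choice of the distance metric <., .>_k on X^k; here we use a
# plain Euclidean distance between score vectors.
def dist(x, y):
    return float(np.linalg.norm(x - y))

# Neighborhood B(x_c, r): all past feature vectors within radius r of x_c.
def neighborhood(x_c, r):
    return [i for i, x in enumerate(past) if dist(x_c, x) <= r]

print(neighborhood(x_cur, 0.1))  # indices of similar past students
```

The residuals of the students returned by `neighborhood` are then averaged to estimate the current student's residual, as formalized in (5).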

Our estimation of the true expected residual of students within a particular neighborhood B(x_{i,y,k}, r) is given by

    \hat{µ}(C^k(B(x_{i,y,k}, r))) = \frac{\sum_{x \in B(x_{i,y,k}, r)} c_{x,k}}{|B(x_{i,y,k}, r)|}    (5)

where c_{x,k} denotes the residual after time k of the student with feature vector x. For notational simplicity, we use µ̂^k(x_{i,y,k}, r) := µ̂(C^k(B(x_{i,y,k}, r))) to denote the estimated expectation. In the following we derive how confident we are in the estimation of the residual score based on a given neighborhood B(x, r) and how we use this confidence q(B(x, r)) both to select the optimal radius of the neighborhood and to decide when to predict.

Intuitively, if the feature vectors after performance assessment k in a neighborhood B(x, r) of x contain a lot of information about the residual c_{x,k}, past students with feature vectors in this neighborhood should have had similar residuals. Hence, the variance Var(C^k(B(x, r))) of the residuals of the students in the neighborhood should be small. To mathematically support this intuition, we consider the residuals c_{i,y,k} in a neighborhood B(x, r) of feature vector x with distribution f^k_{x_c,r}(C^k). For any confidence interval ε, the probability that the absolute difference between the unknown residual c_{x,k} of the student with feature vector x and the expected value µ^k(x, r) of the residual distribution in his/her neighborhood is smaller than ε can be bounded by

    P(|C^k(B(x, r)) − µ^k(x, r)| < ε) > 1 − \frac{Var(C^k(B(x, r)))}{ε^2}.    (6)

This statement directly follows from Chebyshev's inequality. We conclude that the lower the variance of the residual distribution in the neighborhood, the more confident we are that the true residual c_{x,k} will be close to µ^k(x, r). Since both the expected value µ^k(x, r) and the variance Var(C^k(B(x, r))) of the distribution are unknown, we estimate the two values through the sample mean from (5) and the sample variance \widehat{Var}(C^k(B(x, r))) given by

    \widehat{Var}(C^k(B(x, r))) = \frac{\sum_{x' \in B(x, r)} (c_{x',k} − \hat{µ}^k(x, r))^2}{|B(x, r)| − 1}.    (7)

In the following we use Var^k(x, r) := Var(C^k(B(x, r))) to denote the variance and \widehat{Var}^k(x, r) := \widehat{Var}(C^k(B(x, r))) to denote the sample variance of the residual distribution in neighborhood B(x, r). From the law of large numbers it follows that the sample mean and the sample variance converge to the true expected value and the true variance as |B(x, r)| → ∞. We will provide a bound for the probability that the prediction error is larger than a given value in the theorem below. Given a desired confidence interval ε, we define the confidence on the prediction of the residual as

    q(B(x, r)) = 1 − \frac{\widehat{Var}^k(x, r)}{ε^2}.    (8)

Using this confidence measure, the radius of the optimal neighborhood after performance assessment k is given by r* = arg max_r q(B(x_{i,y,k}, r)) = arg min_r \widehat{Var}^k(x, r).

Fig. 2. Illustration of the neighborhood selection process.

To estimate r* after each performance assessment k, our algorithm considers M different neighborhoods B(x_{i,y,k}, r_m), m = 1, ..., M, with user-defined radii r_m and chooses the best neighborhood according to our confidence measure, m̂_k(x_{i,y,k}) = arg max_m q(B(x_{i,y,k}, r_m)). In the following we use m̂_k := m̂_k(x_{i,y,k}) to denote the best neighborhood. Let

    \hat{c}_{i,y,k} := \hat{µ}^k(x_{i,y,k}, r_{\hat{m}_k})    (9)

denote the estimated residual of the best neighborhood at time k; ẑ_{i,y,k} denotes the corresponding estimated overall score

    \hat{z}_{i,y,k} = \hat{c}_{i,y,k} + \sum_{l=1}^{k} w_l a_{i,y,l}.    (10)

If the confidence for the best neighborhood, q_{i,y}(k) = q(B(x_{i,y,k}, r_{m̂_k})), is above a given threshold, q_{i,y}(k) ≥ q_th, the algorithm returns the final prediction of the overall score ẑ_{i,y} = ẑ_{i,y,k} for the considered student. If the confidence is below the threshold, we wait for the next performance assessment and start the next iteration. Fig. 2 illustrates the neighborhood selection process. Algorithm 1 provides a formal description of the grade prediction algorithm in pseudocode.

To conclude the discussion of the grade prediction algorithm in the regression setting, we derive a bound for the probability that the prediction error is larger than a value ε. Before we state the theorem, we introduce some further notation. Let m*_k(x) denote the index of the neighborhood with the smallest variance of residuals for the student with feature vector x at time k,

    m^*_k(x) = arg min_{1 \le m \le M} Var^k(x, r_m).    (11)

Note that m*_k(x) is not necessarily equal to m̂_k(x), the index of the neighborhood with the highest confidence chosen by our algorithm, since the confidence defined in (8) is calculated with the known sample variance of residuals \widehat{Var}^k(x, r) and not with the unknown true variance Var^k(x, r) used in (11).

Algorithm 1 Grade Prediction Algorithm, Regression Setting
Input: All x and z from past years, q_th, number M and radii r_1, ..., r_M of neighborhoods
Output: Predictions ẑ for the overall scores of the students
 1: for all years y do
 2:   for all performance assessments k do
 3:     for all current-year students i for whom the final prediction has not been made do
 4:       if k = K then
 5:         Calculate z_{i,y} according to (1)
 6:         Return z_{i,y} as final prediction for student i
 7:       end if
 8:       Create M neighborhoods with radii r_1, ..., r_M
 9:       for all neighborhoods m do
10:         Estimate residual ĉ(B(x_{i,y,k}, r_m)) with (5)
11:         Compute \widehat{Var}(x, r_m) with (7)
12:         Compute q(B(x_{i,y,k}, r_m)) with (8)
13:       end for
14:       Find m̂_k = arg max_m q(B(x_{i,y,k}, r_m))
15:       if q(B(x_{i,y,k}, r_{m̂_k})) ≥ q_th then
16:         Compute ẑ_{i,y} with (3)
17:         Return ẑ_{i,y} as final prediction for student i
18:       end if
19:       Add x_{i,y,k} and a_{i,y,k} to database
20:     end for
21:   end for
22:   Calculate all c_{i,y,k} of year y according to (2)
23:   Add all c_{i,y,k} to database
24: end for

Similarly, m*_{k,2}(x) denotes the index of the neighborhood with the second smallest variance of residuals,

    m^*_{k,2}(x) = arg min_{1 \le m \le M, m \ne m^*_k(x)} Var^k(x, r_m).

Let ∆_k(x) denote the difference between the standard deviations of the residual distributions of neighborhoods m*_{k,2}(x) and m*_k(x),

    ∆_k(x) = \sqrt{Var^k(x, r_{m^*_{k,2}})} − \sqrt{Var^k(x, r_{m^*_k})}.    (12)

Theorem. Without loss of generality we assume that all scores a are normalized to the range [0, 1]. Consider the prediction ẑ_{i,y,k} of the overall score of yth year student i with feature vector x made by Algorithm 1. The probability that the absolute error of the prediction exceeds ε is bounded by

    P[|z_{i,y} − \hat{z}_{i,y,k}| \ge ε] \le \frac{4 Var^k(x, r_{m^*_k(x)})}{ε^2}
        + 2 \exp\left(−\frac{ε^2}{2} \min_{1 \le m \le M} |B(x, r_m)|\right)
        + 2M \exp\left(−\frac{∆_k(x)^2}{8} \min_{1 \le m \le M} (|B(x, r_m)| − 1)\right).

Proof: See Appendix.

First, the accuracy of our predictions increases with an increasing number of neighbors. Hence, our algorithm learns the best predictions online as the knowledge base is expanded after each year, when the feature vectors and results from the past-year students are added to the database. In Section IV-D1 we show that this learning can be experimentally illustrated with our data from the digital signal processing course taught at UCLA. Second, the term Var^k(x, r_{m*_k})/ε² shows that the prediction accuracy will be higher if the variance of the residuals in a neighborhood is small. With increasing time k we expect this variance to decrease, since we have more information about the students and we expect the students in a neighborhood to be more similar and achieve similar (residual) scores.

Note that it is possible to restrict the data kept in the knowledge base to recent years, which allows the algorithm to adapt faster to slowly changing students and to changes in the course.

D. Grade Prediction Algorithm, Classification Setting

In the binary classification setting we predict the overall score analogously to the regression setting and then determine the class by comparing the predicted overall score ẑ_{i,y} with a threshold score z_th. To illustrate how we find z_th, let us assume that we want to predict whether a student does well (letter grades ≥ B−) or does poorly (letter grades ≤ C+). To determine z_th, we find the average z_{avg,B−} of all students from past years who received a B− and the average z_{avg,C+} of all students from past years who achieved a C+. Subsequently, we define z_th = (z_{avg,B−} + z_{avg,C+})/2. The predicted classification b̂_{i,y} of yth year student i is then given by

    \hat{b}_{i,y} = \begin{cases} 0 & \hat{z}_{i,y} \ge z_{th} \\ 1 & \hat{z}_{i,y} < z_{th} \end{cases}    (13)

We are more confident in the classification not only if the variance of the neighbor scores is small, which is the metric we used for the confidence in the regression setting, but also if the distance d(B(x, r)) = |ẑ(B(x, r)) − z_th| between the predicted score and the threshold score is large. Note that ẑ(B(x, r)) is the estimate of the overall score based on neighborhood B(x, r). Because of this intuition we use a modified confidence

    q^{bin}(B(x, r)) = 1 − e^{−d(B(x,r))} \frac{\widehat{Var}(C^k(B(x, r)))}{ε^2}    (14)

to decide whether to make the final prediction in binary classification settings. Since d(B(x, r)) should only influence whether the final prediction is made for a given neighborhood but not the neighborhood selection process, we still use the unmodified confidence from (8) to select the optimal neighborhood.

In summary, four changes have to be made to Algorithm 1 to make it applicable to binary classification settings. First, z_th has to be determined/updated at the beginning of each new year. Second, we calculate b̂_{i,y} after line 16 according to (13).
This theorem illustrates two important aspects of algorithm Third, we return b̂i,y at line 17 instead of ẑi,y . Fourth, we
1. First, we see that for a given neighborhood the accuracy use the modified confidence q bin according to (14) in line 15
7
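As an illustration, the thresholding step of the binary classification setting can be sketched in a few lines of Python. This is a sketch under our own naming, not the authors' code; `threshold_score` and `classify` are hypothetical helpers implementing the z_th rule and Eq. (13).

```python
import numpy as np

def threshold_score(past_scores, past_grades):
    """z_th halfway between the mean overall score of past B- students and
    the mean overall score of past C+ students, as described in Sec. III-D."""
    scores = np.asarray(past_scores, dtype=float)
    grades = np.asarray(past_grades)
    z_avg_bminus = scores[grades == "B-"].mean()
    z_avg_cplus = scores[grades == "C+"].mean()
    return (z_avg_bminus + z_avg_cplus) / 2.0

def classify(z_hat, z_th):
    """Eq. (13): 0 = does well (z_hat >= z_th), 1 = does poorly."""
    return 0 if z_hat >= z_th else 1

# toy example with hypothetical normalized overall scores
scores = [0.9, 0.7, 0.8, 0.1, -0.1, 0.0]
grades = ["B-", "B-", "B-", "C+", "C+", "C+"]
z_th = threshold_score(scores, grades)  # (0.8 + 0.0) / 2 = 0.4
print(classify(0.55, z_th), classify(0.2, z_th))  # prints: 0 1
```

The same helper generalizes to L categories by storing L − 1 thresholds and returning the index of the interval containing ẑ.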
We use the unmodified confidence from (8) in line 14.

The described binary classification algorithm can easily be generalized to a larger number of categories. In a classification with L categories, we define L − 1 threshold values z_{th,1} < z_{th,2} < ... < z_{th,L−1} and determine in which of the L score intervals {[0, z_{th,1}), [z_{th,1}, z_{th,2}), ..., [z_{th,L−1}, 1]} the predicted overall score ẑ_{i,y} of a student lies. The index of the interval corresponds to the classification of the student. In this general classification setting, the modified confidence from (14) can be used as well by defining d as the distance of ẑ_{i,y} to the nearest threshold value.

We discuss the performance of the proposed Algorithm 1 in both regression and classification settings in Section IV-D.

E. Confidence-Learning Prediction Algorithm

Besides the radii of the neighborhoods r_i, the only parameter to be chosen by the user in Algorithm 1 is the desired confidence threshold q_th. Since for an instructor it is more natural and practical to specify a desired prediction accuracy or error directly rather than the confidence threshold, we show in this section how to automatically learn the appropriate confidence threshold to achieve a certain prediction performance, and what consequences this has on the average prediction time. We will discuss a possible way of choosing the radii r_i of the neighborhoods in Section IV-B.

Formally, we define the problem as follows. Let p(k, q_th) ∈ [0, 1] denote the proportion of current-year students for which the grade prediction algorithm working with confidence threshold q_th ≤ 1 has predicted the overall score by time (performance assessment) k ∈ [0, K]. p_min is the minimum percentage of current-year students whose grade the user wants to predict with a specified accuracy. E(k, q_th) ≥ 0 denotes the average absolute prediction error up to time k for the given confidence. E_max is the maximum error the user is willing to tolerate. k(p, q_th) is the time necessary to predict the grade for proportion p of all students of the class using confidence threshold q_th. Please note that since the variables p, E, q_th and k are dependent, we can only independently specify two of the four variables. If we for example specify to predict all students (p = 1) with zero error (E = 0), the algorithm will have to wait until the end of the course when the overall score is known (k = K) and will use maximum confidence (q_th = 1). Without making any assumptions on the dependence of the variables on each other, multiple pairs (k, q_th) might lead to the same specified pair (p, E).

Our goal is, therefore, to learn from past data the yth-year estimate of the minimal time k_y and corresponding confidence threshold q_{th,y} necessary to achieve the desired share p_min of students predicted and the desired maximum average prediction error E_max. This is formally defined as:

    \underset{q_{th}}{\text{minimize}} \quad k(p, q_{th})
    \quad \text{subject to} \quad p(k, q_{th}) \ge p_{min}, \quad E(k, q_{th}) \le E_{max}.    (15)

Note that while the goal of optimization problem (4) is to find the minimum time to predict the overall score of a particular student with a desired confidence, this problem (15) aims at finding the minimum time by which the overall scores of a specific percentage of all students can be predicted with a desired maximum error.

At a given year y we solve this optimization problem using a brute-force approach on all available data from years 1, ..., y. For this purpose, we extract k, p and E from Algorithm 2 for a large number of different confidence thresholds q_th. We then select the confidence threshold q_{th,y} which is optimal with respect to optimization problem (15) and determine the corresponding prediction time k_y. To make the grade predictions for year y + 1 we use the learned confidence threshold q_{th,y} as input to prediction Algorithm 1. Since there are no training data available yet at year y = 1, the algorithm uses a user-defined starting value q_{th,0} for the grade predictions of the first year. Algorithm 2 summarizes the learning algorithm in pseudocode.

Algorithm 2 Confidence-Learning Prediction Algorithm
Input: E_max, p_min, q_{th,0}, all x and z from past years, number M and radii r_1, ..., r_M of neighborhoods
Output: Predictions ẑ, k_y and q_{th,y} for all years
1: for all years y do
2:   if y > 1 then
3:     Find k_{y−1} and q_{th,y−1} according to (15) by running Algorithm 1 with various q_th
4:     Return k_{y−1} and q_{th,y−1}
5:   end if
6:   Use Algorithm 1 with q_th = q_{th,y−1} to predict and return the grades of current-year y students
7: end for

IV. EXPERIMENTS

In this section, we present the data, discuss details of the application of Algorithm 1 to our dataset, illustrate the functioning of the algorithm, and evaluate its performance by comparing it against other prediction methods in both regression and binary classification settings. Due to space limitations, we will not show experimental results for classification settings with more than two categories.

A. Data Analysis

Our experiments are based on a dataset from an undergraduate digital signal processing course (EE113) taught at UCLA over the past 7 years. The dataset contains the scores from all performance assessments of all students and their final letter grades. The number of students enrolled in the course for a given year varied between 30 and 156; in total the dataset contains the scores of approximately 700 students. Each year the course consists of 7 homework assignments, one in-class midterm exam taking place after the third homework assignment, one course project that has to be handed in after homework 7, and the final exam. The duration of the course is 10 weeks and in each week one performance assessment takes place.
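The brute-force search over confidence thresholds behind (15) can be sketched as follows. This is a sketch, not the authors' implementation; `run_algorithm1` is a hypothetical callback that replays Algorithm 1 on past years' data with a given threshold and reports the resulting triple (average prediction time k, coverage p, average error E).

```python
def select_threshold(run_algorithm1, q_grid, p_min, e_max, k_default):
    """Brute-force search of Eq. (15): among candidate confidence thresholds
    q_th, keep those meeting the coverage (p >= p_min) and error (E <= e_max)
    constraints, and return the one with the smallest average prediction time k."""
    feasible = []
    for q_th in q_grid:
        k, p, err = run_algorithm1(q_th)
        if p >= p_min and err <= e_max:
            feasible.append((k, q_th))
    if not feasible:            # fall back to a default if no threshold qualifies
        return k_default, None
    return min(feasible)        # (smallest k, the q_th that achieves it)

# toy stand-in: higher thresholds predict later but with smaller error
table = {0.6: (4, 0.95, 0.40), 0.8: (6, 0.97, 0.25), 0.9: (8, 0.99, 0.15)}
k_y, q_y = select_threshold(lambda q: table[q], [0.6, 0.8, 0.9],
                            p_min=0.95, e_max=0.3, k_default=10)
print(k_y, q_y)  # prints: 6 0.8 (earliest time among feasible thresholds)
```

Here q_th = 0.6 is rejected because its error 0.40 exceeds e_max, and 0.8 beats 0.9 on prediction time.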
The weights of the performance assessments are given by: 20% homework assignments with equal weight on each assignment, 25% midterm exam, 15% course project and 40% final exam.³ Fig. 3a shows the distribution of the letter grades assigned over the 7 years. We observe that on average B is the grade the instructor assigned most frequently; A was assigned second most and C third most frequently. Surprisingly, however, the distribution varies drastically over the years; in year 1, for example, only 18.75% received a B while in year 6 the frequency was 38.9%.

To understand the predictive power of the scores in different performance assessments, Fig. 3b shows the sample Pearson correlation coefficient between all performance assessments and the overall score. We make several important observations from this graph. First, on average the final exam has the strongest correlation with the overall score, followed by the midterm exam. This is not surprising, since the final contributes 40% and the midterm contributes 25% to the overall score. Second, the score from the course project on average does not have a higher correlation with the overall score than the homework assignments, despite the fact that it accounts for 15% of the overall score. Third, all homework assignments have similar correlation coefficients. Fourth, the correlation between the individual performance assessments and the overall score varies greatly over the years. This indicates that predicting student scores based on training data from past years might be difficult.

Since all performance assessments are part of the overall score and, therefore, a high correlation is expected, it is also informative to consider the correlation between the performance assessments and the final exam shown in Fig. 3c. It is interesting to observe that the midterm exam still shows, besides the overall score, the highest correlation with the final exam. A possible explanation for this is that both the midterm and final are in-class exams while the other performance assessments are take-home.

B. Our Algorithm

In this section we discuss three important details of the application of Algorithm 1 to the dataset from the undergraduate digital signal processing course.

First, the rule we use to normalize all scores a_{i,y,k} in our dataset is given by

    a_{i,y,k} = \frac{a^*_{i,y,k} - \hat{\mu}_{y,k}}{\hat{\sigma}_y},    (16)

where a^*_{i,y,k} is the original score of the student, μ̂_{y,k} is the sample mean of all yth-year students' original scores in performance assessment k, and σ̂_y is the standard deviation of all yth-year students' original overall scores. A normalization of the scores is needed for several reasons. First, the instructor-defined maximum score in a particular performance assessment may differ greatly across years, and since we use data from past years to predict the performance of students in a given year, we need to make the data across years comparable. Second, the difficulty of individual performance assessments might also differ across years; homework 2 might for instance be very easy in year 2, so that almost everyone achieves the maximum score, and very difficult in year 3, so that few achieve half of the maximum score. The normalization according to (16) eliminates this bias by transforming the absolute scores of a student to scores relative to his/her classmates of the same year. Note that Algorithm 1 does not require a specific normalization, and it does not matter that the normalized scores according to (16) will not be in the interval [0, 1] as assumed in Section III for simplicity.

Second, we use feature vectors that simply contain the scores of all performance assessments student i has taken up to time k in the order they occurred, x_{i,y,k} = (a_{i,y,1}, ..., a_{i,y,k}). To incorporate the fact that students who have performed similarly in a performance assessment with a lot of weight should be nearer to each other in the feature space than students that have had similar scores in a performance assessment with low weight (e.g. a homework assignment), we use a weighted metric to calculate the distance between two feature vectors. We define the distance of two feature vectors x_i, x_j ∈ X_k as

    \langle x_i, x_j \rangle_k = \frac{\sum_{l=1}^{k} w_l\, |x_{i,l} - x_{j,l}|}{\sum_{l=1}^{k} w_l},    (17)

where k is the length of the feature vectors, w_l is the weight of performance assessment l, and x_{i,l} denotes entry l of feature vector x_i.

Third, rather than specifying the radii of the neighborhoods to consider as an input, as suggested in the pseudocode of Algorithm 1, we automatically adapt the radii of the neighborhoods such that they contain a certain number of neighbors. Since the sample variance gets more accurate with an increasing number of samples, we refrain from considering neighborhoods with only 2 neighbors. Therefore, the smallest radius considered, r_1, is the minimal radius such that the neighborhood includes 3 neighbors. For subsequent neighborhoods the minimal radius is chosen such that the neighborhood includes at least one neighbor more than the previous neighborhood. Formally, we define the selection of the radii recursively as

    r_1 = \min r \ \text{ s.t. } \ |B(x_{i,y,k}, r)| \ge 3,
    r_{m+1} = \min r \ \text{ s.t. } \ |B(x_{i,y,k}, r)| > |B(x_{i,y,k}, r_m)|.    (18)

C. Benchmarks

We compare the performance of our algorithm against five different prediction methods.
• We use the score a_{i,y,k} student i has achieved in the most recent performance assessment k alone to predict the overall grade.
• A second simple benchmark makes the prediction based on the scores a_{i,y,1}, ..., a_{i,y,k} student i has achieved up to performance assessment k, taking into account the corresponding weights of the performance assessments.
• Linear regression using the ordinary least squares (OLS) method finds the least-squares-optimal linear mapping between the scores of the first k performance assessments and the overall score.

³As we explain in footnote 2 and show in Section IV-D2, our algorithm can also be applied to settings where the number and weights of performance assessments change over the years.
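The three preprocessing choices of Section IV-B, namely the normalization (16), the weighted distance (17) and the adaptive radii (18), can be sketched as follows. This is a sketch under simplifying assumptions, not the authors' code: with distinct pairwise distances, the recursion in (18) reduces to taking the 3rd, 4th, ... nearest-neighbor distances as radii, which is what `radii` does; the example weights and scores are hypothetical.

```python
import numpy as np

def normalize(raw_score, mu_yk, sigma_y):
    """Eq. (16): express a raw score relative to the classmates of the same year."""
    return (raw_score - mu_yk) / sigma_y

def weighted_distance(x_i, x_j, w):
    """Eq. (17): weight-averaged L1 distance between two feature vectors."""
    w = np.asarray(w, dtype=float)
    diff = np.abs(np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float))
    return float(np.sum(w * diff) / np.sum(w))

def radii(dists_to_past_students, num_neighborhoods):
    """Eq. (18), assuming distinct distances: radius m is the distance to the
    (m + 2)-th nearest past student, so |B_1| = 3 and each B_{m+1} gains one."""
    order = np.sort(np.asarray(dists_to_past_students, dtype=float))
    return [float(order[m + 2]) for m in range(num_neighborhoods)]

# toy example; the 0.25 weight mirrors the midterm's share of the overall score
print(normalize(78.0, 70.0, 10.0))                            # prints: 0.8
print(round(weighted_distance([0.5, -1.0], [0.0, -1.2], w=[0.02857, 0.25]), 3))
print(radii([0.1, 0.4, 0.2, 0.3, 0.5], num_neighborhoods=2))  # prints: [0.3, 0.4]
```

Because of the weighting in (17), the 0.2 gap on the midterm contributes far more to the distance than the 0.5 gap on a single homework would on its own.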
[Fig. 3 (three panels): (a) Grade distribution, (b) correlation coefficients with the overall score, (c) correlation coefficients with the final exam score.]

Fig. 3. Data analysis: 3a shows the distribution of letter grades for all years. 3b and 3c present the sample Pearson correlation coefficient between individual performance assessments and the overall (3b) or final exam (3c) score. Note that we use the abbreviations Hi (homework assignment i), M (midterm exam), F (final exam) and O (overall score) in the figures.
• In classification settings we use logistic regression instead of linear regression.
• The k-Nearest Neighbors algorithm with 7 neighbors. This number provided the best results with training data from the first year.

The advantage of the method we use in our algorithm over linear and logistic regression is that, being a nearest-neighbor method, it is able to recognize certain patterns, such as trends in the data, that are missed in linear/logistic regression, where a single parameter per performance assessment has to fit all students. In contrast, our algorithm is able to detect such patterns if there have been students in the past who have shown similar patterns.

Table II illustrates this through a case study extracted from the UCLA undergraduate digital signal processing course data. We present cases from a simulation where we predicted whether students are going to do well (letter grade ≥ B−) or do poorly (letter grade ≤ C+) and consider the students for which our algorithm decided to predict after the midterm exam. The table shows 3 students whom logistic regression classified wrongly while our algorithm made the accurate prediction. In columns 2-4 we present the scores the students achieved up to the midterm exam, and the last column shows the true classification of the students.

TABLE II
Case Study: Illustrative Example

            HW 1    HW 2    HW 3    Midterm   Do Well/Poorly
Student 1   0.53    0.00    -0.37   -1.35     Poorly
Student 2   1.07    0.87    -0.30   -1.06     Poorly
Student 3   -1.39   -1.54   -2.15    0.50     Well

These cases are typical examples of settings where our algorithm outperforms logistic regression. Students 1 and 2 both showed a good performance in homework assignment 1. However, in later assignments and especially at the midterm exam their performance successively deteriorated, an indication that the students might do poorly in the class if they or the instructor and teaching assistants do not take corrective actions. Our algorithm is likely to have learned such patterns from past data and predicts the students to do poorly. On average, however, their performance in the first four performance assessments is still about average and, therefore, logistic regression predicts that the students will do well. For student 3 the situation is the other way around.

D. Results

In this section we evaluate the performance of our Algorithm 1 in different settings and compare it to benchmarks in both regression and classification tasks.

As a performance measure in the regression setting, we use the average of the absolute values of the prediction errors, E. Since we normalized the overall score to have zero mean and a standard deviation of 1, E directly corresponds to the number of standard deviations the predictions on average are away from the true values. The overall performance measure in classification settings is the accuracy of the classification. Furthermore, we use the quantities precision, recall and false positive/negative rate besides accuracy to measure performance. Please note that positive in our case means that the student does poorly.

1) Performance Comparison with Benchmarks in Regression Setting: Having discussed the various performance measures, we first address the regression setting. Fig. 4 visualizes the performance of the algorithm we presented in Section III-C and of benchmark methods. We generated Fig. 4 by predicting the overall scores of all students from years 2−7. To make the prediction for year y, we used the entire data from years 1 to y − 1 to learn from. Unlike our algorithms, the benchmark methods do not provide conditions to decide after which performance assessment the decision should be made. Therefore, for benchmark methods we specified the prediction time (performance assessment) k for an entire simulation and repeated the experiment for all k = 1, ..., 10; the results are plotted in Fig. 4. To generate the curve of our Algorithm 1, we ran simulations using different confidence thresholds q_th, and for each threshold we determined E and the performance assessment (time) k̄ after which the prediction was made on average.

Irrespective of the prediction method, Fig. 4 shows the trade-off between timeliness and accuracy; the later we predict, the more accurate our prediction gets.
Fig. 4. Performance comparison of different prediction methods.

Fig. 5. Illustration of learning from past data: Error of grade predictions for year 7 depending on training data.

From the curve for the prediction using a single performance assessment we infer that there is a low correlation between homework assignments/course project and the overall score, and a high correlation between the in-class assessments (midterm and final exam) and the overall score. This observation is congruent with the correlation analysis from Section IV-A. If the prediction is made early, before the midterm, all methods (except the prediction using a single performance assessment) lead to similar prediction errors. We observe that while the error decreases approximately linearly for our algorithm, the performance of benchmark methods steeply increases after the midterm and the final but stays approximately constant during the rest of the time. The reason for this is that we obtained the points of the curve for our algorithm by averaging the prediction time of all students. Therefore, the point of the curve above the midterm was not generated by predicting after the midterm for all students; some predictions were made earlier, some later. If on average the prediction is made after homework 4, our algorithm shows a significantly smaller error E than benchmark methods, outperforming linear regression by up to 65%.

2) Learning across Years and Instructors in Regression Setting: Consider Fig. 5, demonstrating the performance increase of our algorithm when more data to learn from become available. To generate the figure, we used our algorithm to predict the overall scores of all 7th-year students for different confidence thresholds. The curves in dashed lines stem from simulations using only one of the years 1-5 as training data, and the solid magenta curve uses all years 1-5 to learn from. We observe that the prediction performance strongly depends on the training data and differs if different years are used. Most importantly, the performance is highest, irrespective of the average prediction time, if the combination of the data from all 5 years is used. This shows that our algorithm is able to learn and improves its predictions over time.

The undergraduate digital signal processing course is taught twice a year by three different instructors at UCLA. While we used only the data from one instructor in the previous plots, Fig. 6 investigates the situation when we predict the grades for a class of instructor 1 based exclusively on past data from a different instructor 2. In practice this happens when a new instructor takes over a course previously taught by someone else. It is interesting to see whether our grade prediction still works well in this setting. A good performance is not self-evident for several reasons. Different instructors might set a different focus concerning the knowledge imparted, they might use a different textbook and they might prefer different styles of homework assignments and in-class exams. Furthermore, the structure of the course, e.g. the number and sequence of homework assignments, the time when the midterm exam takes place, the weights of performance assessments and whether a course project and quizzes exist, might change drastically.

Fig. 6. Illustration of learning across instructors: Error of grade predictions for year 7 depending on training data.

To generate Fig. 6, we predicted the overall score for the year 7 class of instructor 1 based on two different sets of previous data. The solid blue curve was generated by using the data from the classes in years 1-5 from the same instructor 1 as training data. To obtain the dashed red curve, we used the data from classes in years 1-5 from instructor 2 to learn from. While the predictions using training data from the same instructor are slightly more accurate, the performance with training data from a different instructor is still very satisfying, showing a good robustness of our algorithm with respect to different instructors. For the subsequent results we again exclusively use data from one instructor.
Fig. 7. Comparison of prediction time and accuracy between the UCLA course EE113, which contains a midterm exam, and the UCLA course EE103, which contains four in-class quizzes instead of a midterm exam. Note that the tick labels Qi/Hi above the plot stand for quiz/homework i and that for EE103 there are weeks in which both a homework and a quiz take place.

Fig. 8. Performance comparison between our algorithm and logistic regression using accuracy, precision and recall for binary do well/poorly classification.

Fig. 9. Cumulative prediction time, accuracy, false positive and false negative error rates for a binary do well/poorly classification with fixed q_th.

3) Performance Comparison with Course Containing Early Quizzes in Regression Setting: The results in both the data analysis section (Fig. 3b) and Section IV-D1 (Fig. 4) indicate that scores in in-class exams are much better predictors of the overall score than homework assignments. To verify this, we consider two consecutive years of the UCLA course EE103, which contains four in-class quizzes in course weeks 2, 4, 6 and 8 instead of a midterm. Fig. 7 visualizes that, starting from the first quiz in week 2, our algorithm is indeed able to predict the same percentage of the students with an up to 22% smaller cumulative average prediction error by a certain week. We generated Fig. 7 by using Algorithm 1 to predict, for both courses, the overall scores of the students in a particular year based on data from the previous year. Note that for the course with quizzes, the increase in the share of students predicted is larger in weeks that contain quizzes than in weeks without quizzes. This supports the thesis that quizzes are good predictors as well.

According to this result, it is desirable to design courses with early in-class exams. This enables a timely and accurate grade prediction based on which the instructor can intervene if necessary.

4) Performance Comparison with Benchmarks in Classification Setting: The performances in the binary classification settings are summarized in Fig. 8. Since logistic regression turns out to be the most challenging benchmark in the classification setting, we do not show the performance of the other benchmark algorithms for the sake of clarity. The goal was to predict whether a student is going to do well, still defined as letter grades equal to or above B−, or do poorly, defined as letter grades equal to or below C+. Again, to generate the curves for the benchmark method, logistic regression, we specified manually when to predict. For our algorithm we again averaged the prediction times of an entire simulation and varied q_th to obtain different points of the curve. Up to homework 4, the performance of the two algorithms is very similar, both showing a high prediction accuracy even with few performance assessments. Starting from homework 4, our algorithm performs significantly better, with an especially drastic improvement in recall. It is interesting to see that even with full information, the algorithms do not achieve a 100% prediction accuracy. The reason for this is that the instructor did not use a strict mapping between overall score and letter grade, and the range of overall scores that lead to a particular letter grade changed slightly over the years.

5) Decision Time and Accuracy in Classification Setting: To better understand when our algorithm makes decisions and with what accuracy, consider Fig. 9. We again investigate binary do well/poorly classifications as discussed above. The red curve (square markers) shows for what share of the total number of students the algorithm makes the prediction by a specific point in time. The remaining curves show different measures of cumulative performance. We can for example see that by the midterm exam we classify 85% of the students with an accuracy of 76%. These timely predictions are desirable since the earlier the prediction is made, the more time an instructor has to take corrective action. The cumulative accuracy stays almost constant around 80% irrespective of the prediction time. We believe that the reason for this is that, thanks to the confidence threshold, the easy decisions are made early and harder decisions are made later. Consequently, the expected accuracy of all predictions remains more or less constant irrespective of the prediction time.
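The cumulative performance measures used throughout Section IV-D can all be computed from a confusion matrix. A minimal sketch with our own hypothetical helper, using the paper's convention that label 1 ("positive") means the student does poorly:

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and false positive/negative rates for the
    binary setting of Section IV-D; 'positive' (label 1) = does poorly."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
    }

# toy example: 1 = does poorly, 0 = does well
m = classification_metrics([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
print(m["accuracy"], m["recall"])  # prints: 0.6 0.5
```

A high recall is the operationally important quantity here: it is the fraction of struggling students the system actually flags in time for remedial action.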
the expected accuracy of all predictions remains more or less can be derived.
constant irrespective of the prediction time.
Proof: See [26] for a proof of Fact 2.
V. C ONCLUSION Lemma 1. Let m̂k (x) denote the index of the neighborhood
In this paper we develop an algorithm that allows for selected by our algorithm for the student with feature vector
a timely and personalized prediction of the final grades of x at time k and m∗k (x) is given by (11). M denotes the
students exclusively based on their scores in early performance assessments such as homework assignments, quizzes or midterm exams. Using data from an undergraduate digital signal processing course taught at UCLA, we show that the algorithm is able to learn from past data, that it outperforms benchmark algorithms with regard to accuracy and timeliness both in classification and regression settings, and that the predictions are robust even when the course is taught by different instructors.

We show that in-class exams are better predictors of the overall performance of a student than homework assignments. Hence, designing courses to have early in-class evaluations enables timely identification of students who, with a high probability, would do poorly without intervention and enables remedial actions to be adopted at an early stage.

Our algorithm can easily be generalized to include context data from students such as their prior GPA or demographic data. If applied exclusively to MOOCs, the in-course data used for the predictions could be extended, for example, by the responses of students to multiple-choice questions, their forum activity, the course material they studied or the time they spent studying online. Another direction of future work is to apply our algorithm in practice and investigate to what extent the performance of students can be improved by a timely intervention based on the grade predictions. In this context, our algorithm could be extended to make multiple predictions for each student to monitor the trend in the predicted grade after an intervention.

APPENDIX

In this Appendix, we prove the theorem from Section III-C. Before we start with the proof, we discuss some preliminary results.

Fact 1 (Chernoff-Hoeffding Bound). Let $X_1, X_2, \ldots, X_n$ be independent and bounded random variables with range $[0, 1]$ and expected value $\mu$. Let $\hat{\mu}_n = (X_1 + \ldots + X_n)/n$ denote the sample mean of the random variables. Then, for all $\epsilon > 0$,
$$P\left[|\hat{\mu}_n - \mu| \geq \epsilon\right] \leq 2\exp\left(-2n\epsilon^2\right).$$

Proof: A proof of Fact 1 can be found in Hoeffding's paper [25].

Fact 2 (Empirical Bernstein Bound). Let $n \geq 2$ and $X_1, X_2, \ldots, X_n$ be independent and bounded random variables with range $[0, 1]$ and variance $Var$. $\hat{\mu}_n$ denotes the $n$-sample mean $\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\widehat{Var}_n$ denotes the $n$-sample variance $\widehat{Var}_n = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \hat{\mu}_n)^2$. Then, the following inequality bounds the probability that the error of the sample standard deviation, which is the square root of the sample variance, is larger than a given value $\epsilon$:
$$P\left[\left|\sqrt{\widehat{Var}_n} - \sqrt{Var}\right| \geq \epsilon\right] \leq 2\exp\left(-\frac{(n-1)\epsilon^2}{2}\right).$$

… total number of neighborhoods our algorithm considers and $\Delta_k(x)$ is given by (12). We can bound the probability that our algorithm chooses the wrong neighborhood by
$$P\left[\hat{m}_k(x) \neq m_k^*(x)\right] \leq 2M\exp\left(-\Delta_k(x)^2 \min_{1 \leq m \leq M} \frac{|B(x, r_m)| - 1}{8}\right). \tag{19}$$

Proof: Consider
$$P\left[\hat{m}_k(x) \neq m_k^*(x)\right] = P\left[\arg\min_{1 \leq m \leq M} \widehat{Var}^k(x, r_m) \neq m_k^*(x)\right] = P\left[\arg\min_{1 \leq m \leq M} \sqrt{\widehat{Var}^k(x, r_m)} \neq m_k^*(x)\right].$$

If the estimation error of the standard deviation is smaller than $\Delta_k(x)/2$ for all neighborhoods,
$$\left|\sqrt{\widehat{Var}^k(x, r_m)} - \sqrt{Var^k(x, r_m)}\right| \leq \frac{\Delta_k(x)}{2},$$
our algorithm chooses the optimal neighborhood $m_k^*(x)$. Therefore, we get
$$P\left[\arg\min_{1 \leq m \leq M} \sqrt{\widehat{Var}^k(x, r_m)} \neq m_k^*(x)\right] \leq P\left[\bigcup_{1 \leq m \leq M} \left\{\left|\sqrt{\widehat{Var}^k(x, r_m)} - \sqrt{Var^k(x, r_m)}\right| \geq \frac{\Delta_k(x)}{2}\right\}\right]$$
$$\overset{(a)}{\leq} \sum_{m=1}^{M} P\left[\left|\sqrt{\widehat{Var}^k(x, r_m)} - \sqrt{Var^k(x, r_m)}\right| \geq \frac{\Delta_k(x)}{2}\right] \overset{(b)}{\leq} \sum_{m=1}^{M} 2\exp\left(-\Delta_k(x)^2 \frac{|B(x, r_m)| - 1}{8}\right)$$
$$\leq 2M\exp\left(-\Delta_k(x)^2 \min_{1 \leq m \leq M} \frac{|B(x, r_m)| - 1}{8}\right),$$
where (a) is the union bound and (b) follows from Fact 2.

Proof of Theorem: Note that
$$\left|z_{i,y} - \hat{z}_{i,y,k}\right| \overset{(a)}{=} \left|c_{i,y,k} + \sum_{l=1}^{k} w_l a_{i,y,l} - \left(\hat{c}_{i,y,k} + \sum_{l=1}^{k} w_l a_{i,y,l}\right)\right| = \left|c_{i,y,k} - \hat{c}_{i,y,k}\right|,$$
where (a) follows from equations (3) and (10).

There are three sources of error in the prediction of an overall score of Algorithm 1:
1) The wrong neighborhood size may be selected due to inaccurate approximations of the true residual score variances of the neighborhoods through the sample variance.
2) If the optimal neighborhood is selected, the sample mean of the residual scores in the neighborhood may not be a good approximation of their true mean.
3) Even if the optimal neighborhood is selected and the sample mean equals the true mean, the residual score of the considered student may be different from the mean of the residual score distribution.

In the following we separate these three error sources and derive a bound for each one. We have
$$P\left[\left|z_{i,y} - \hat{z}_{i,y,k}\right| \geq \epsilon\right] = P\left[\left|c_{i,y,k} - \hat{c}_{i,y,k}\right| \geq \epsilon\right] \overset{(b)}{=} P\left[\left|c_{i,y,k} - \hat{\mu}^k\left(x, r_{\hat{m}_k(x)}\right)\right| \geq \epsilon\right]$$
$$\overset{(c)}{=} P\left[\left|c_{i,y,k} - \hat{\mu}^k\left(x, r_{\hat{m}_k(x)}\right)\right| \geq \epsilon,\ \hat{m}_k(x) = m_k^*(x)\right] + P\left[\left|c_{i,y,k} - \hat{\mu}^k\left(x, r_{\hat{m}_k(x)}\right)\right| \geq \epsilon,\ \hat{m}_k(x) \neq m_k^*(x)\right]$$
$$\overset{(d)}{\leq} P\left[\left|c_{i,y,k} - \hat{\mu}^k\left(x, r_{m_k^*(x)}\right)\right| \geq \epsilon,\ \hat{m}_k(x) = m_k^*(x)\right] + P\left[\hat{m}_k(x) \neq m_k^*(x)\right]$$
$$\overset{(e)}{\leq} P\left[\left|c_{i,y,k} - \hat{\mu}^k\left(x, r_{m_k^*(x)}\right)\right| \geq \epsilon\right] + P\left[\hat{m}_k(x) \neq m_k^*(x)\right],$$
where (b) follows from (9), (c) is the law of total probability and (d) and (e) both follow from the fact that $P[A, B] \leq P[A]$.

Lemma 1 provides a bound for the second term. Therefore, we focus on the first term:
$$P\left[\left|c_{i,y,k} - \hat{\mu}^k\left(x, r_{m_k^*(x)}\right)\right| \geq \epsilon\right] = P\left[\left|c_{i,y,k} - \mu^k\left(x, r_{m_k^*(x)}\right) + \mu^k\left(x, r_{m_k^*(x)}\right) - \hat{\mu}^k\left(x, r_{m_k^*(x)}\right)\right| \geq \epsilon\right]$$
$$\overset{(f)}{\leq} P\left[\left|c_{i,y,k} - \mu^k\left(x, r_{m_k^*(x)}\right)\right| \geq \frac{\epsilon}{2}\right] + P\left[\left|\mu^k\left(x, r_{m_k^*(x)}\right) - \hat{\mu}^k\left(x, r_{m_k^*(x)}\right)\right| \geq \frac{\epsilon}{2}\right]$$
$$\overset{(g)}{\leq} \frac{4\,Var^k\left(x, r_{m_k^*(x)}\right)}{\epsilon^2} + 2\exp\left(-\left|B\left(x, r_{m_k^*(x)}\right)\right| \frac{\epsilon^2}{2}\right) \leq \frac{4\,Var^k\left(x, r_{m_k^*(x)}\right)}{\epsilon^2} + 2\exp\left(-\min_{1 \leq m \leq M} |B(x, r_m)| \frac{\epsilon^2}{2}\right),$$
where (f) follows from the triangle inequality, the fact that $P[X + Y \geq x_0 + y_0] \leq P[\{X \geq x_0\} \cup \{Y \geq y_0\}]$ and the union bound. The bound for the first term in step (g) follows from Chebyshev's inequality and the bound for the second term follows from the Chernoff-Hoeffding Bound from Fact 1.

Including the second term again and using Lemma 1 we get
$$P\left[\left|z_{i,y} - \hat{z}_{i,y,k}\right| \geq \epsilon\right] = P\left[\left|c_{i,y,k} - \hat{c}_{i,y,k}\right| \geq \epsilon\right] \leq P\left[\left|c_{i,y,k} - \hat{\mu}^k\left(x, r_{m_k^*(x)}\right)\right| \geq \epsilon\right] + P\left[\hat{m}_k(x) \neq m_k^*(x)\right]$$
$$\leq \frac{4\,Var^k\left(x, r_{m_k^*(x)}\right)}{\epsilon^2} + 2\exp\left(-\min_{1 \leq m \leq M} |B(x, r_m)| \frac{\epsilon^2}{2}\right) + 2M\exp\left(-\Delta_k(x)^2 \min_{1 \leq m \leq M} \frac{|B(x, r_m)| - 1}{8}\right),$$
which concludes the proof. ∎

REFERENCES

[1] R. Baraniuk, "Open education: New opportunities for signal processing," Plenary Speech, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[2] "OpenStax College," http://openstaxcollege.org/, accessed: 2015-05-07.
[3] "OpenStax Tutor," http://openstaxtutor.org/, accessed: 2015-05-07.
[4] C. Tekin, J. Braun, and M. van der Schaar, "eTutor: Online learning for personalized education," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015.
[5] G. Marjanovic and V. Solo, "$l_q$ sparsity penalized linear regression with cyclic descent," IEEE Transactions on Signal Processing, vol. 62, no. 6, pp. 1464–1475, 2014.
[6] S. Marano, V. Matta, and P. Willett, "Nearest-neighbor distributed learning by ordered transmissions," IEEE Transactions on Signal Processing, vol. 61, no. 21, pp. 5217–5230, 2013.
[7] G. Mateos, J. A. Bazerque, and G. B. Giannakis, "Distributed sparse linear regression," IEEE Transactions on Signal Processing, vol. 58, no. 10, pp. 5262–5276, 2010.
[8] N. R. Kuncel and S. A. Hezlett, "Standardized tests predict graduate students' success," Science, vol. 315, pp. 1080–1081, 2007.
[9] E. Cohn, S. Cohn, D. C. Balch, and J. Bradley, "Determinants of undergraduate GPAs: SAT scores, high-school GPA and high-school rank," Economics of Education Review, vol. 23, no. 6, pp. 577–586, 2004.
[10] E. R. Julian, "Validity of the Medical College Admission Test for predicting medical school performance," Academic Medicine, vol. 80, no. 10, pp. 910–917, 2005.
[11] P. A. Gallagher, C. Bomba, and L. R. Crane, "Using an admissions exam to predict student success in an ADN program," Nurse Educator, vol. 26, no. 3, pp. 132–135, 2001.
[12] W. L. Gorr, D. Nagin, and J. Szczypula, "Comparative study of artificial neural network and statistical models for predicting student grade point averages," International Journal of Forecasting, vol. 10, no. 1, pp. 17–34, 1994.
[13] N. T. Nghe, P. Janecek, and P. Haddawy, "A comparative analysis of techniques for predicting academic performance," in Frontiers In Education Conference-Global Engineering: Knowledge Without Borders, Opportunities Without Passports, 2007. FIE'07. 37th Annual. IEEE, 2007, pp. T2G-7.
[14] R. D. Goldman and R. E. Slaughter, "Why college grade point average is difficult to predict," Journal of Educational Psychology, vol. 68, no. 1, p. 9, 1976.
[15] P. Cortez and A. M. G. Silva, "Using data mining to predict secondary school student performance," 2008.
[16] L. H. Werth, Predicting Student Performance in a Beginning Computer Science Class. ACM, 1986, vol. 18, no. 1.
[17] J. L. Turner, S. A. Holmes, and C. E. Wiggins, "Factors associated with grades in intermediate accounting," Journal of Accounting Education, vol. 15, no. 2, pp. 269–288, 1997.
[18] A. Y. Wang and M. H. Newlin, "Predictors of web-student performance: The role of self-efficacy and reasons for taking an on-line class," Computers in Human Behavior, vol. 18, no. 2, pp. 151–163, 2002.
[19] S. Kotsiantis, C. Pierrakeas, and P. Pintelas, "Predicting students' performance in distance learning using machine learning techniques," Applied Artificial Intelligence, vol. 18, no. 5, pp. 411–426, 2004.
[20] C. G. Brinton and M. Chiang, "MOOC performance prediction via clickstream data and social learning networks," in 34th IEEE INFOCOM, 2015, to appear.
[21] B. Minaei-Bidgoli, D. A. Kashy, G. Kortemeyer, and W. F. Punch, "Predicting student performance: an application of data mining methods with an educational web-based system," in Frontiers in Education, 2003. FIE 2003 33rd Annual, vol. 1. IEEE, 2003, pp. T2A-13.
[22] C. Romero, S. Ventura, P. G. Espejo, and C. Hervás, "Data mining algorithms to classify students," in EDM, 2008, pp. 8–17.
[23] D. García-Saiz and M. Zorrilla, "A promising classification method for predicting distance students' performance," EDM, pp. 206–207, 2012.
[24] Y. Xiao, F. Dörfler, and M. van der Schaar, "Incentive design in peer review: Rating and repeated endogenous matching," Allerton Conference, 2014.
[25] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, 1963.
[26] A. Maurer and M. Pontil, "Empirical Bernstein bounds and sample variance penalization," in Proceedings of the Int. Conference on Learning Theory, 2009.