arXiv:1810.11185v2 [stat.AP] 1 Feb 2019

Timothy NeCamp
Department of Statistics
University of Michigan
Ann Arbor, Michigan, USA
tnecamp@umich.edu

Josh Gardner
Paul G. Allen School of Computer Science & Engineering
University of Washington
Seattle, WA, USA

Christopher Brooks
School of Information
University of Michigan
Ann Arbor, Michigan, USA
1 Introduction

adapt based on learner performance, building upon work on theories of spaced repetition [40].

Due to the large number of users and the ease of content manipulation in SDLEs, randomized controlled trials, such as A/B tests, are often used to evaluate intervention options. Typically, these experiments are single randomized trials in which each subject is randomized once and assigned to a single intervention option for the entire trial. However, when the goal is to design a high-quality adaptive intervention in SDLEs, researchers may have important questions about the sequencing, timing, and personalization of intervention options which cannot be answered by A/B tests.

In this work, we discuss and demonstrate the advantages of an experimental design for developing high-quality adaptive interventions in SDLEs: the sequential randomized trial (SRT). In SRTs, a subject is randomized several times to different intervention options throughout the experiment. Sequential randomization is beneficial over one-time randomization for several reasons. First, by re-randomizing, subjects receive a variety of intervention sequences, and these sequences can be compared to discover the optimal intervention sequence. Second, instead of only being able to assess the overall effect of receiving one particular treatment, sequential randomization lets researchers discover effects at smaller time-scales (e.g., treatment A does better in week 2 of the course, but treatment B does better in week 3). These discoveries inform at what time points certain interventions are most effective. Third, re-randomization permits the discovery of important variables, measured throughout the course, for adapting and personalizing an intervention. As opposed to only being able to discover baseline personalization variables (e.g., treatment A works better for women), we can also discover mid-course personalization variables (e.g., treatment A works better for subjects who were active in the course during the previous day).

The paper is structured as follows. We provide an overview of related prior work on experimentation and personalized interventions in SDLEs in Section 2. In Section 3, we formally introduce SRTs and compare them to other common trial designs. In Section 4, to motivate the design, we describe a novel SRT performed in MOOCs: the Problem-based Email Reminder Coursera Study (PERCS). This was an experiment aiming to improve student retention in an online course by developing a high-quality adaptive email intervention which leverages aspects of cultural diversity and inclusion [1]. This case study both illustrates the advantages of SRTs and provides context regarding implementation and analysis. Section 5 details three specific advantages of running SRTs, exemplified with specific results from PERCS. We conclude with some practical recommendations for researchers designing their own SRTs in Section 6.

2 Prior Work

2.1 Adaptive Interventions

Adaptive interventions have been shown to be useful in SDLEs, and have been used extensively in web-based adaptive educational systems [6]. For instance, in [37], learners were provided personalized content recommendations based on their clickstream data, and in [14], learners were provided personalized feedback on their learning plans. Adaptive sequences of interventions have also been developed in MOOCs. For example, in [11], sequences of problems were adapted based on predictions of student knowledge acquisition. Similarly, [13] chose quiz questions based on course content the learner had previously accessed. While these are only a few examples of adaptive interventions in large-scale learning environments, they motivate our desire to improve the process by which such interventions are developed.

In the work discussed, adaptive interventions were developed using learning theory and intuition [13], or prediction and machine learning [37, 44, 11]. In all examples, experimentation was not used to develop the adaptive intervention. In some cases [13, 11], experimentation was used to evaluate the adaptive intervention; in these cases, an A/B test compared the designed intervention to a control.

Reinforcement learning techniques, such as multi-armed bandits and contextual bandits, are a type of adaptive intervention which combines exploration (often done through implicit experimentation) and optimization. They use real-time data to learn how to adapt and have also been shown to be useful in SDLEs [31, 9, 42, 38].

2.2 Experimental Designs

Experimentation in SDLEs is a common tool for evaluating interventions. Unlike quasi- or natural experimental settings [16, 33], randomly assigning interventions separates the effects of interventions from the effects of confounders (variables that relate both to the type of treatment received and to subsequent outcomes).

A/B tests are a valuable experimental design for improving content and evaluating interventions in MOOCs [41]. In [39], for example, A/B tests evaluated emails and course on-boarding intended to improve learner engagement and prevent dropout. In [12], A/B tests were used to test the effectiveness of self-regulated learning strategies. Kizilcec and Brooks [22] survey prior work utilizing A/B tests in MOOCs to evaluate nudge interventions and test theory-driven course content changes.

There has also been considerable work on other types of experimental designs beyond A/B testing in SDLEs. Factorial designs [32] are a common way to evaluate multiple experimental factors simultaneously. Automatic experimentation [30], where algorithms are used to search through different intervention options, is another alternative to A/B testing. Though automatic exploration of intervention options may be more efficient, intervention options are still evaluated with single randomized trials. Adaptive experimental designs change the randomization probabilities to favor efficacious treatment, with the goal of both evaluating treatment and helping learners currently in the trial [8]. Adaptive designs have been used in SDLEs [43]. We address the differences between SRTs and these other kinds of designs in Section 3.2.

3 Sequential Randomized Trials

3.1 What Is a SRT?

SRTs are trials in which an individual is randomized multiple times throughout the course of the trial. Suppose there are two intervention options, such as learners receiving videos taught by a female instructor (intervention A) or a male instructor (intervention B). The simplest example of a SRT would be: during week 1, users have a 50% chance of receiving intervention A and a 50% chance of receiving intervention B. During week 2, users are re-randomized to another treatment. They again have a 50% chance of receiving either intervention A or B, independent of their week 1 treatment or activity. Hence, about 25% of users will have received each of the sequences (A, A), (A, B), (B, A), and (B, B), where the parenthetical notation means (week 1 treatment, week 2 treatment).

This simple SRT can be modified for both practical and scientific reasons. Common modifications include using different time durations (e.g., re-randomize every month), increasing the number of time points (e.g., each person is randomized 10 times instead of twice), changing the number of treatments (e.g., A vs B vs C instead of A vs B, or A vs B in week 1 and C vs D in week 2), and altering the randomization scheme (e.g., not having uniform randomization probabilities each week).

SRTs [28, 27] have become increasingly common in clinical settings [29] but are less common in educational settings [2, 21] and are even rarer in SDLEs. Two of the most common types of SRTs are Sequential Multiple Assignment Randomized Trials (SMARTs) [34] and Micro-randomized Trials (MRTs) [24].

SMARTs are often used in settings where, for either practical or ethical reasons, future randomization and treatment assignment should depend on response to prior treatment. For example, in a prototypical SMART, individuals are first randomized to one of two different treatment options. Then, at a pre-specified decision point following initial intervention, users who did not respond to initial treatment are re-randomized, while users who did respond to initial treatment are not re-randomized and continue with the efficacious initial treatment [19].

MRTs are useful for interventions that can be delivered quickly and frequently (such as delivering text messages or notifications to a subject's phone). Typically, the goal of an MRT is to estimate the short-term effect of these interventions and understand how that effect depends on time and context. MRTs have mostly been used in the mobile health space [24]. Due to the high frequency of intervention delivery, users in an MRT are typically re-randomized hundreds or thousands of times.

3.2 How SRTs Compare to Other Experimental Designs

3.2.1 Single Randomized Trials

Single randomized trials, such as A/B tests and factorial designs, are trials where subjects are randomized one time. In A/B tests (often called a 2-arm randomized controlled trial in healthcare and education), each subject is randomized one time to either intervention A or intervention B. Factorial designs are an extension of A/B tests where each subject is randomized one time to several intervention components simultaneously. SRTs differ from single randomized trials because in SRTs, subjects are randomized several times, causing them to receive different treatments at different times throughout the trial.

Single randomized trials can still be used to evaluate adaptive interventions. For example, if treatment A is defined as an adaptive intervention (e.g., a fixed sequence of different intervention options) and treatment B is defined as a control, an A/B test can compare the adaptive intervention to the control with a single randomization. However, this A/B test is limited in answering questions about sequencing, timing, and personalizing.

A/B tests are often used as confirmatory trials, which are used to ensure strong evidence (or additional evidence) of a treatment's efficacy. In contrast, SRTs are useful as exploratory trials, which explore a large number of possible treatment sequences and learn how to adapt those sequences. After running a SRT and developing a high-quality adaptive intervention, the intervention can be confirmed in a simple A/B confirmatory trial [10].

SRTs can also be thought of as factorial designs in which users are sequentially randomized to each factor over time. In the simplified SRT example in Section 3.1, the design can be considered a 2 × 2 factorial design where factor 1 is the week 1 treatment and factor 2 is the week 2 treatment [3].
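The week-by-week re-randomization in the simple two-option example above can be sketched as a small simulation. This is a hypothetical illustration, not code from the study; the function and variable names are our own:

```python
import random
from collections import Counter

def run_simple_srt(n_users, p=0.5, weeks=2, seed=0):
    """Simulate the simplest SRT: each week, every user is independently
    re-randomized to intervention A (with probability p) or B."""
    rng = random.Random(seed)
    sequences = []
    for _ in range(n_users):
        seq = tuple("A" if rng.random() < p else "B" for _ in range(weeks))
        sequences.append(seq)
    return Counter(sequences)

counts = run_simple_srt(100_000)
# With 50/50 re-randomization each week, each of the four sequences
# (A, A), (A, B), (B, A), (B, B) is received by roughly 25% of users.
for seq, n in sorted(counts.items()):
    print(seq, round(n / 100_000, 3))
```

Because the week 2 draw is independent of the week 1 draw, the four sequence proportions each converge to p², p(1 − p), (1 − p)p, and (1 − p)², i.e., roughly one quarter each here.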
3.2.2 Online Optimization Designs

There are other designs (both experimental and not) which aim to optimize intervention delivery while simultaneously collecting data. In adaptive trial designs [8], randomization probabilities are changed throughout the trial in order to both provide efficacious treatment to users and still obtain good estimates of treatment effects. Online reinforcement learning methods (such as multi-armed bandit and contextual bandit algorithms) can also be used to optimize intervention delivery [31, 9, 42, 38].

SRTs are distinct from adaptive trial designs and online reinforcement learning methods since they do not use data collected during the trial to change randomization schemes. In reinforcement learning language, SRTs can be seen as pure exploration with no exploitation. There are advantages to not using earlier trial data to inform future treatment allocation:

1. Using online optimization techniques can cause bias in estimating treatment effects [35, 38]. These bias issues do not arise in SRTs.

2. SRTs provide rich exploratory data for discovering which variables are valuable for informing treatment decisions, making them useful when these variables are unknown (see Section 5.3). In contrast, many reinforcement learning algorithms, such as contextual bandits [26], require these variables to be pre-specified.

3. SRTs can still utilize reinforcement learning methods. Batch off-policy reinforcement learning algorithms (such as Q-learning) can be applied to SRT data to discover an optimal adaptive intervention, as in [45].

4 Applications of Sequential Randomized Trials

SRTs can inform a large variety of interventions, including course content sequencing, course material manipulations, and learner nudges (such as encouraging messages and problem feedback). We highlight three examples. The first two are hypothetical scenarios to demonstrate different types of possible SRTs. The third example, PERCS, is a SRT run in a data science MOOC and is our main working example.

4.1 Video Optimization

For each video in a MOOC, there are two versions: one which shows only slides, and one which shows the instructor's head occasionally interspersed with the slides [18]. Researchers are unsure about which sequence of videos is better. Learners might prefer only slides, some might prefer those with instructors, or some might prefer a variety of videos. Also, there may be learner characteristics that affect learner preference. For example, learners who initially performed poorly after watching an instructor video might be better off seeing a slide-only video. To provide insight into these questions and hypotheses, a researcher could run a SRT: when a learner enters a course, they are randomized initially to receive an instructor video or a slide-only video. Then, after watching the first video, they are re-randomized to either an instructor or slide-only video for the next video they watch. This continues through the entire course.

4.2 Content Spacing

Researchers are often unsure about the optimal sequencing of review problems to maximize knowledge retention while minimizing review time [40]. Learners may benefit from frequent review at the beginning, with less frequent review later. These benefits may also depend on certain learner characteristics. For example, poor-performing learners may benefit from more frequent review. A SRT can answer these questions. For a problem recommendation system, a learner starts the recommender for a time window (e.g., 50 problems). Then, after this time period, every learner is randomized to one of 3 groups: no review, minimal review, or large review. The grouping determines how often they receive previously-seen problems (for review) during the next 20 problems: learners in the no review, minimal review, and large review groups receive previously-seen problems 0%, 5%, and 20% of the time, respectively. After completing the next 20 problems, each user is re-randomized to one of the 3 groups, and this scheme continues, with re-randomization after every 20 problems completed.

4.3 Problem-based Email Reminder Coursera Study (PERCS)

4.3.1 Motivation

A well-known challenge in MOOCs is low completion rates. While there are many factors contributing to MOOC dropout [17], the goal of PERCS is to determine whether dropout can be ameliorated by using weekly email reminders to motivate learners to engage with course content. Our context of inquiry was the Applied Data Science with Python Coursera MOOC, taught by Christopher Brooks. Weekly emails were sent to learners and may have contained one or more of several factors intended to impact learner engagement (for an example, see Figure 1):

1. The email could have contained a motivating data science problem to challenge the user to learn the upcoming week's content. This factor was based on evidence from the problem-based learning literature suggesting that situating instruction in the context of problems is an effective way to engage learners [4].
RQ1 Sequencing: Which sequence of emails most improves course activity in the later weeks of the course?
Figure 2. PERCS trial design. A user begins the course, is randomized at the end of each of weeks 1, 2, and 3, and then continues through the course. "R" indicates stages where randomization is conducted according to the probabilities in Table 1.
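The weekly randomization in Figure 2 can be sketched in code. Since Table 1's probabilities are not reproduced here, the weights below (25% chance of no email, the remainder uniform over T2 through T7) are illustrative assumptions only, and the function names are our own:

```python
import random

EMAIL_TYPES = ["T1", "T2", "T3", "T4", "T5", "T6", "T7"]  # T1 taken as no email

def assign_week(rng, p_no_email=0.25):
    """One week's randomization: decide email vs no email, then choose
    uniformly among the six email conditions T2..T7. The probabilities
    here are illustrative, not the study's Table 1 values."""
    if rng.random() < p_no_email:
        return "T1"
    return rng.choice(EMAIL_TYPES[1:])

def assign_percs(rng, weeks=3):
    """Independent re-randomization at the end of each of the three weeks,
    giving 7**3 = 343 possible email sequences per learner."""
    return tuple(assign_week(rng) for _ in range(weeks))

rng = random.Random(42)
print(assign_percs(rng))  # one randomly drawn 3-week email sequence
```

The key property is that each week's draw ignores both prior assignments and prior activity, which is what makes every one of the 343 sequences reachable.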
no email at all. Those who received an email were uniformly randomly assigned to have the email framed with or without growth mindset.

The growth mindset factor crossed with the three email categories makes each week of PERCS a 2 × 3 factorial design with an additional control condition of no email. See Table 1 for each week's factorial design and randomization probabilities.

Figure 1 illustrates an example email. The emails were developed in a "cut and paste" format: when adding a condition to an email (e.g., growth mindset framing or adding a problem), we do not change other aspects of the email, and simply insert text from the relevant condition into a consistent email template. Using the cut and paste format across different conditions allows us to attribute treatment effects purely to the condition being added, and not to other aspects of the email. Emails are delivered directly using the Coursera platform's email capabilities for instructors.

The most novel aspect of PERCS is the sequential randomization. That is, a particular learner is not assigned to a single fixed email condition for all three weeks. Instead, as shown in Figure 2, a learner is re-assigned (with the same randomization probabilities) to a different email condition each week. The randomizations across weeks are independent; hence, in PERCS, there are 7³ = 343 different possible email sequences students may receive.

4.3.3 Notation

Throughout the rest of the paper, we will refer to specific treatments, groups of treatments, and sequences of treatments in PERCS. We introduce notation to ease our description. As shown in Table 1, T1, T2, ..., T7 refers to a particular email. For example, T5 is an email containing no problem, and growth mindset framing. To refer to groups of emails, we use the labeling contained in the column and row headers of Table 1. E is used to group together conditions where any email was sent (E1) vs no email being sent (E0); so users in E1 are any users receiving emails T2 through T7. G is used to group together growth mindset emails (G1) and non-growth mindset emails (G0); users in G0 are any users receiving emails T2, T3, or T4. P is used to group together no-problem emails (P0), global problem emails (P1), and cultural problem emails (P2); thus users in P1 are any users receiving emails T3 or T6. Lastly, we use parenthetical notation to describe the sequences of emails over the three weeks, i.e., (week 1 treatment, week 2 treatment, week 3 treatment). As an example, we would refer to users who received a global problem email in week 1, any email in week 2, and a no-growth mindset email in week 3 using (P1, E1, G0).

4.3.4 Experimental Population

We focused our trial on the two largest populations of learners enrolled in the course as determined by IP address: Indian and US-based learners. All Indian and US learners who signed up for the Applied Data Science with Python Coursera MOOC between April 1 and June 10, 2018 participated in PERCS. A total of 8,681 unique learners (3,455 Indian, 5,226 US) were sent 22,073 emails.

4.3.5 Single Randomized Version of PERCS: PERCS-AB

To highlight the advantages of sequential randomization, PERCS will be compared to the single randomized version of PERCS, PERCS-AB. PERCS-AB has the exact same randomization probabilities as PERCS; however, in PERCS-AB, learners are randomized only at the end of week 1 to one of the 7 email types. They are then sent that exact same email type in weeks 2 and 3. Hence, there are only 7 possible sequences: (T1, T1, T1), (T2, T2, T2), ..., (T7, T7, T7).

5 Sequencing, Timing, and Personalizing Interventions through SRTs

Next we highlight how SRTs can provide answers to questions regarding sequencing, timing, and personalizing interventions. For each type of question, we provide (1) an overview of the question, contextualized within PERCS, (2) evidence of why SRTs are beneficial for answering that question, as illustrated by comparing PERCS and PERCS-AB, (3) further discussion, and (4) answers to the question in the context of PERCS.

5.1 Sequencing

5.1.1 Overview

When many different intervention options can be delivered at different times, proper sequencing of interventions is critical. An intervention that worked at one time may not work at a later time. Also, receiving the same intervention multiple times may be less effective than receiving a variety of interventions. Researchers may not know which sequence of intervention options will lead to the best outcomes for learners. For PERCS, RQ1 refers to sequencing: are there sequences of emails which improve course activity in the later weeks?

5.1.2 Advantages of SRTs

SRTs provide data that allow experimenters to compare different sequences of interventions. By re-randomizing learners, learners receive a variety of treatment sequences. In PERCS, for example, with the large sample size, there are learners randomly assigned to each of the 7³ = 343 possible sequences of treatments (e.g., (T1, T1, T1), (T1, T1, T2), (T1, T2, T2), or (T2, T1, T2)).

By randomizing learners to each possible treatment sequence, researchers are now able to compare these treatment sequences. For example, researchers might hypothesize that learners only need a cultural problem email in the first week to believe the content is inclusive. They could then compare the course activity of users initially receiving a cultural problem email followed by two global problem emails, (P2, P1, P1), vs users receiving a cultural problem email all three weeks, (P2, P2, P2). When thinking of PERCS as a 7 × 7 × 7 factorial design (where each week's treatment is a different factor), comparing sequences is analogous to simple effects analysis in factorial designs.

In A/B tests, since learners are not re-randomized, only sequences where each learner received the same treatment every time can be compared. In PERCS-AB, learners are randomized to only 7 possible sequences, and thus comparisons can only be made among these 7 sequences. If there is any benefit to receiving different interventions (and/or in a different order), this could not be discovered by PERCS-AB. However, note that all of the comparisons of PERCS-AB can be done with data collected through PERCS, at the cost of a reduced sample size. This tradeoff demonstrates again why SRTs are especially well suited for MOOC experimentation, where there is a large number of diverse participants (and thus a broad exploration might be suitable).

5.1.3 Discussion

Oftentimes, researchers are not interested in such specific sequence comparisons. Instead, they are interested in questions about sequences of groups of interventions. In this case, one could perform similar comparisons, but combining users over all of these groups. For example, in PERCS, we can assess how often reminder emails should be sent. Is it better to space emails out weekly or bi-weekly?

To answer this question, we could compare the course activity of users receiving the sequence which spaces emails out by two weeks, (E1, E0, E1), to the sequence which sends emails every week, (E1, E1, E1). The sample size for this comparison will be much larger than for the individual sequence comparisons.

In addition to comparing groups of sequences, researchers might only be interested in comparing sequences for a small number of time points. For example, we could use data collected in PERCS to understand whether the effects of week two problem-based emails on course activity differed based on what type of problem-based email the user had received in week one. Specifically, does receiving no email vs a cultural problem-based email in week one change the benefits of receiving a cultural problem email in week two? To answer this question, we compare course activity in week two of users receiving sequences (E0, P2, any) vs (P2, P2, any). Notice we are not concerned with the week three assignment, so we include sequences with any treatment in week three (i.e., any includes T1, T2, ..., T7).

5.1.4 Results from PERCS

We now present results addressing RQ1. Since there are a large number of potential sequences, here we focus on the highest-level comparison: understanding the proper sequence of any email and no email. To do the comparison we look at the proportion of students returning to the course
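The group-based sequence comparisons above (e.g., (E1, E0, E1) vs (E1, E1, E1)) reduce to matching each user's email sequence against a pattern of group labels. A minimal sketch, where the mapping from T1 through T7 to groups is inferred from the notation in Section 4.3.3 (T1 taken as the no-email control; the P0, P2, and G1 memberships are deduced from the Table 1 layout rather than stated verbatim):

```python
# Group labels from the PERCS notation, with memberships inferred from the
# text (e.g., T5 is a no-problem, growth-mindset email; P1 = {T3, T6}).
GROUPS = {
    "E0": {"T1"},                                # no email
    "E1": {"T2", "T3", "T4", "T5", "T6", "T7"},  # any email
    "G0": {"T2", "T3", "T4"},                    # no growth mindset framing
    "G1": {"T5", "T6", "T7"},                    # growth mindset framing
    "P0": {"T2", "T5"},                          # no problem
    "P1": {"T3", "T6"},                          # global problem
    "P2": {"T4", "T7"},                          # cultural problem
    "any": {"T1", "T2", "T3", "T4", "T5", "T6", "T7"},
}

def matches(sequence, pattern):
    """True if a user's (week 1, week 2, week 3) email sequence falls
    within the group pattern, element by element."""
    return all(email in GROUPS[group] for email, group in zip(sequence, pattern))

# A user who got a global problem email, then a growth mindset email,
# then a non-growth-mindset email matches the pattern (P1, E1, G0).
print(matches(("T3", "T5", "T2"), ("P1", "E1", "G0")))  # True
```

Selecting the users whose observed sequences match each pattern, and comparing their course activity, is exactly the kind of grouped comparison described in Section 5.1.3.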
build adaptive interventions which deliver the optimal treatment at all times. For PERCS, RQ2 regards timing: which email type is most effective during each week?

(1) What is the effect of receiving a cultural problem email in week 3, after not receiving any email prior: (E0, E0, P2) vs (E0, E0, E0)?

(2) What is the average effect of receiving a cultural problem email in week 3: (any, any, P2) vs (any, any, E0)?

(3) What is the effect of receiving a cultural problem email every week up to and including week 3: (P2, P2, P2) vs (E0, E0, E0)?

Questions 1 and 3 compare sequences of treatments, while question 2 is about average treatment effects. Note that all three questions can be answered by PERCS, but only question 3 can be answered by PERCS-AB.

Also, one advantage of analyzing average treatment effects is that the sample size is typically larger than when comparing individual sequences of treatments, since the comparison does not restrict users based on what they received before or after the given week of interest.
5.3 Personalizing 5.3.3 Discussion
We can evaluate mid-course moderators that include pre-
5.3.1 Overview vious treatment. For example, we may think that respon-
siveness to previous treatment could inform how to person-
SDLEs are notable for the high degree of diversity both
alize future treatment. Those that were responsive should
across and within learners [23, 22]. Due to this diversity,
continue receiving the same treatment while those that
we might expect treatment effects to vary along with rele-
were non-responsive should receive different treatment. In
vant learner attributes. If an intervention works for a spe-
PERCS, for example, we might expect week two emails to
cific user at a given time, that intervention may not be ef-
benefit users who responded positively to emails in week
fective for a different user, or even the same user at a dif-
one. One could assess this by comparing two groups: Group
ferent time. Personalizing treatment, by discovering when
1 are users who received an email (E1 ) but did not click in
and for whom certain treatments are most effective, is crit-
the course afterwards (i.e., email non-responders). Group
ical. For PERCS, RQ3 regards personalization– are certain
2 are users who received an email (E1 ) but did click in the
emails more effective for active learners?
course afterwards (i.e., email responders). We could com-
In clinical trials, learning how to personalize treatment
pare the effect of emails in week two for Group 1 vs Group
is synonymous with discovering moderators [25]. Modera-
2.
tors are subject-specific variables which change the efficacy
Also, as learning content optimization starts to happen
of an intervention. For example, if a MOOC intervention
in real-time (i.e., reinforcement learning), knowing which
works better for older users than younger users, then age is
real-time variables to measure and use for online optimiza-
a moderator and can then be used to personalize; one may
tion is critical. SRTs permit this discovery.
only deliver the intervention to older learners. Answering
RQ3 in PERCS is equivalent to understanding if previous
course activity is a moderator of email effectiveness. 5.3.4 Results from PERCS
For PERCS, we were most interested in the previous week’s
5.3.2 Advantages of SRTs course activity as a mid-course moderator. We wanted to
see how different types of emails effect active and inactive
Both SRTs and A/B tests permit the discovery of baseline users differently. Active users are defined as users who had
moderators – variables measured prior to the first treat- one or more clicks in the course during the previous week.
ment randomization (e.g., gender, location, age, scores on Before the trial, we did not know if problem-based emails
early assignments) which moderate the effect of treatments. would encourage inactive users (because they need the mo-
Baseline moderators are important for personalizing treat- tivational reminder) or discourage inactive users (because
ment. However, understanding how to change treatment the problem may be too advanced). To assess this, in Fig-
based on variables measured throughout the course – mid- ure 5 we plot the log odds ratios of the probability of return-
course moderators – is critical. Discovering mid-course ing to the course in the subsequent week for different email
moderators tells practitioners how to personalize interven- conditions, compared to the control of no email (E0 ). A
tions throughout the course to account for heterogeneity both between users and within a user over time.

Unlike A/B tests, SRTs permit the discovery of mid-course moderators. Statistically, due to bias introduced when including post-randomization variables in analyses, one can only discover moderator variables measured prior to randomization [25]. If all learners were randomized only once, potential moderators measured after the first treatment cannot be discovered. By re-randomizing in SRTs, moderators measured before each of the randomizations (which now include mid-course data) can be discovered. Because users were randomized again at the end of week three, we can use PERCS data to answer RQ3 about week three emails. Specifically, we can assess how activity during week three moderates the effect of emails sent at the end of week three. In PERCS-AB, week three course activity cannot be assessed as a moderator since it is measured after the one and only randomization in week one.

A positive log odds ratio indicates that users had a higher chance of returning to the course after receiving the corresponding email type (compared to the no-email control). In week one, for both US and Indian learners, reminder emails of all types performed better for active users than for inactive users (higher log odds for blue than magenta, Figure 5). In week three, the sign of the moderation switched: emails performed better for inactive users than for active users (higher log odds for magenta than blue). This moderation was larger for Indian learners. Also, the log odds ratios for inactive users were positive, while the log odds ratios for active users were negative, suggesting that emails were beneficial for bringing back inactive users but potentially harmful to active learners. To encourage inactive users to return to the course while not discouraging active users, these results suggest that email sending should adapt based on course activity and the course week. Note that the confidence intervals are mostly overlapping, indicating that differences are not significant but only suggestive.
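As a concrete illustration of the quantity plotted in Figure 5, the log odds ratio of course return for an email arm versus the no-email control can be computed from a 2x2 table of (returned, did not return) counts, with a Wald confidence interval. The counts below are hypothetical for illustration, not PERCS data:

```python
import math

def log_odds_ratio(returned_t, total_t, returned_c, total_c):
    """Log odds ratio of returning to the course for a treatment (email)
    arm versus the control, with a 95% Wald confidence interval."""
    a = returned_t              # treated users who returned
    b = total_t - returned_t    # treated users who did not return
    c = returned_c              # control users who returned
    d = total_c - returned_c    # control users who did not return
    lor = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return lor, (lor - 1.96 * se, lor + 1.96 * se)

# Hypothetical counts: 120 of 400 emailed inactive users returned,
# versus 80 of 400 inactive controls.
lor, ci = log_odds_ratio(120, 400, 80, 400)
# lor ≈ 0.54 (positive: emails helped), 95% CI ≈ (0.21, 0.86)
```

A positive value with a confidence interval excluding zero would indicate a reliable benefit of the email; overlapping intervals across user groups, as in Figure 5, are only suggestive of moderation.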
7 Conclusions, Limitations, and Future Work

Adapting, sequencing, and personalizing interventions is critical to meeting the diverse needs of learners in SDLEs. When designing adaptive interventions, experimentation is useful for evaluating and comparing possible sequences. Unfortunately, single-randomized A/B tests cannot answer questions about ways to adapt and sequence interventions throughout a course. In this work, we demonstrated how a new type of experimental design, the SRT, is valuable for developing adaptive personalized interventions in SDLEs. SRTs provide answers to questions regarding treatment sequencing, timing, and personalization throughout a course. Answers to these questions will both improve outcomes for learners and deepen understanding of learning science in scalable environments.

In this work, we provided a few examples of different SRTs. There are more variations of SRTs that may be useful for SDLEs. In PERCS, all users have the same randomization for all three weeks. However, if a learner is responding well to treatment, it may not make sense to re-randomize them and change their current treatment, as is done in Sequential Multiple Assignment Randomized Trials (SMARTs) [34, 19]. Since many online courses are accessible to learners at any time, randomization timing could be based on when each user enters the course (randomizing one, two, and three weeks from the day the user enrolls). If treatment delivery timing is a research question, the delivery time can also be (re)randomized (e.g., each week of the course, re-randomize users to receive one weekly email or seven daily email reminders). Also, since many SDLEs provide learners with all course content at the beginning of the course, a trial aiming to re-randomize course content could use trigger-based re-randomizations, where users' future content is changed only after they have watched a certain video or completed a particular assignment, to prevent early exposure to future treatment.

Though useful, SRTs have limitations. SRTs are non-adaptive trial designs. If an experimenter wishes both to learn efficacious treatments and to provide benefit to learners in the current trial, adaptive designs such as [43] would be more appropriate. Combining adaptive designs with SRTs may also be useful for both developing high-quality adaptive interventions and improving benefits for current learners [7].

PERCS also has clear limitations. For one, since some users (10%) retook the online course multiple times, those users were repeated in multiple iterations of the trial. Secondly, it is important to note that PERCS was an exploratory trial, not a confirmatory trial. The goal of PERCS was to explore several sequences of treatments and evaluate their efficacy for different learners. The next step for PERCS is to narrow down the best treatment options based on current data (which may differ for Indian and US learners). Since the current evidence is not very strong, we would then run a second SRT with fewer treatments, acquiring more data on those treatments of interest. Once we have significant evidence indicating which sequence of emails is optimal, we can then compare this learned optimal sequence of emails to a control in an A/B test confirmatory trial: treatment A would be the hypothesized optimal adaptive sequence of emails, and treatment B would be no email.

Additional analyses of the PERCS data would also be interesting. Using outcomes other than course activity (such as course completion or assignment performance) could further the understanding of intervention efficacy. Also, email types had varying word lengths; assessing how word length changes email efficacy could further elucidate the treatment effects. Lastly, using additional learner demographic information such as age, gender, or previous education as potential moderators would be useful. Although we currently cannot collect that information through the course, other studies have demonstrated how these characteristics can be inferred from available data [5].

In the comparison of SRTs to A/B tests, we limited the comparison to one example design. There are many other possible comparators. For example, we could have compared PERCS to a trial that performs a single randomization in week 2 instead of week 1. This new design would allow one to discover mid-course moderators (measured prior to week 2); it would not, however, be able to assess treatment effects in week 1. An important characteristic of SRTs is that all of the questions mentioned in Section 5 can be answered from data collected in one trial.

8 Acknowledgements

This work was supported in part under the Holistic Modeling of Education (HOME) project funded by the Michigan Institute for Data Science (MIDAS). We would also like to acknowledge Dr. Joseph Jay Williams and Dr. Daniel Almirall for valuable feedback on this work.

References

[1] T. Aceves and M. Orosco. Culturally responsive teaching. U of Florida, 2014.

[2] D. Almirall, C. Kasari, D. McCaffrey, and I. Nahum-Shani. Developing optimized adaptive interventions in education. JREE, 11(1):27–34, 2018.

[3] D. Almirall, I. Nahum-Shani, N. Sherwood, and S. Murphy. Introduction to SMART designs for the development of adaptive interventions: with application to weight loss research. Translational Behavioral Medicine, 4(3):260–274, 2014.

[4] H. Barrows. How to design a problem-based curriculum for the preclinical years, volume 8. Springer Pub Co, 1985.

[5] C. Brooks, J. Gardner, and K. Chen. How gender cues in educational video impact participation and retention. In 13th International Conference of the Learning Sciences, 2018.

[6] P. Brusilovsky and C. Peylo. Adaptive and intelligent web-based educational systems. Int. J. Artif. Intell. Ed., 13(2-4):159–172, 2003.

[7] Y. Cheung, B. Chakraborty, and K. Davidson. Sequential multiple assignment randomized trial (SMART) with adaptive randomization for quality improvement in depression treatment program. Biometrics, 71(2):450–459, 2015.

[8] S. Chow and M. Chang. Adaptive design methods in clinical trials – a review. Orphanet Journal of Rare Diseases, 3(1):11, 2008.

[9] B. Clement, D. Roy, P. Oudeyer, and M. Lopes. Multi-armed bandits for intelligent tutoring systems. arXiv preprint arXiv:1310.3174, 2013.

[10] L. Collins, S. Murphy, and V. Strecher. The multiphase optimization strategy (MOST) and the sequential multiple assignment randomized trial (SMART). American Journal of Preventive Medicine, 32(5):S112–S118, 2007.

[11] Y. David, A. Segal, and Y. Gal. Sequencing educational content in classrooms using Bayesian knowledge tracing. In Proc. LAK '16, pages 354–363, 2016.

[12] D. Davis, G. Chen, T. van der Zee, C. Hauff, and G. Houben. Retrieval practice and study planning in MOOCs: Exploring classroom-based self-regulated learning strategies at scale. In K. Verbert, M. Sharples, and T. Klobučar, editors, Adaptive and Adaptable Learning, pages 57–71, Cham, 2016. Springer.

[13] D. Davis, R. Kizilcec, C. Hauff, and G. Houben. The half-life of MOOC knowledge: A randomized trial evaluating knowledge retention and retrieval practice in MOOCs. In Proc. LAK '18, pages 1–10. ACM, 2018.

[14] D. Davis, V. Triglianos, C. Hauff, and G. Houben. SRLx: A personalized learner interface for MOOCs. In Lifelong Technology-Enhanced Learning, pages 122–135, Cham, 2018. Springer.

[15] C. Dweck. Mindset: The new psychology of success. Random House Digital, Inc., 2008.

[16] J. González-Brenes and Y. Huang. "Your model is predictive – but is it useful?" Theoretical and empirical considerations of a new paradigm for adaptive tutoring evaluation. In Proc. EDM '15, pages 187–194. ERIC, 2015.

[17] J. Greene, C. Oswald, and J. Pomerantz. Predictors of retention and achievement in a massive open online course. American Educational Research Journal, 52(5):925–955, 2015.

[18] P. Guo, J. Kim, and R. Rubin. How video production affects student engagement: an empirical study of MOOC videos. In L@S, pages 41–50. ACM, 2014.

[19] W. Pelham Jr., G. Fabiano, J. Waxmonsky, A. Greiner, E. Gnagy, W. Pelham III, S. Coxe, J. Verley, I. Bhatia, K. Hart, K. Karch, E. Konijnendijk, K. Tresco, I. Nahum-Shani, and S. Murphy. Treatment sequencing for childhood ADHD: A multiple-randomization study of adaptive medication and behavioral interventions. Journal of Clinical Child & Adolescent Psychology, 45(4):396–415, 2016. PMID: 26882332.

[20] A. Kaijanaho and V. Tirronen. Fixed versus growth mindset does not seem to matter much: A prospective observational study in two late bachelor level computer science courses. In Proc. ICER, pages 11–20. ACM, 2018.

[21] C. Kasari, A. Kaiser, K. Goods, J. Nietfeld, P. Mathy, R. Landa, S. Murphy, and D. Almirall. Communication interventions for minimally verbal children with autism: A sequential multiple assignment randomized trial. Journal of the American Academy of Child & Adolescent Psychiatry, 53(6):635–646, 2014.

[22] R. Kizilcec and C. Brooks. Diverse big data and randomized field experiments in massive open online courses. In The Handbook of Learning Analytics, pages 211–222. SOLAR, 2017.

[23] R. Kizilcec, C. Piech, and E. Schneider. Deconstructing disengagement: Analyzing learner subpopulations in massive open online courses. In Proc. LAK '13, pages 170–179. ACM, 2013.

[24] P. Klasnja, E. Hekler, S. Shiffman, A. Boruvka, D. Almirall, A. Tewari, and S. Murphy. Microrandomized trials: An experimental design for developing just-in-time adaptive interventions. Health Psychology, 34(S):1220, 2015.

[25] H. Kraemer, G. Wilson, C. Fairburn, and W. Agras. Mediators and moderators of treatment effects in randomized clinical trials. Archives of General Psychiatry, 59(10):877–883, 2002.

[26] A. Lan and R. Baraniuk. A contextual bandits framework for personalized learning action selection. In EDM, pages 424–429, 2016.

[27] P. Lavori and R. Dawson. A design for testing clinical strategies: biased adaptive within-subject randomization. Journal of the Royal Statistical Society: Series A (Statistics in Society), 163(1):29–38, 2000.

[28] P. Lavori and R. Dawson. Dynamic treatment regimes: practical design considerations. Clinical Trials, 1(1):9–20, 2004.

[29] H. Lei, I. Nahum-Shani, K. Lynch, D. Oslin, and S. Murphy. A "SMART" design for building individualized treatment sequences. 8:21–48, 2011.

[30] Y. Liu, T. Mandel, E. Brunskill, and Z. Popović. Towards automatic experimentation of educational knowledge. In CHI '14, pages 3349–3358. ACM, 2014.

[31] Y. Liu, T. Mandel, E. Brunskill, and Z. Popović. Trading off scientific knowledge and user learning with multi-armed bandits. In EDM, pages 161–168, 2014.

[32] D. Lomas, K. Patel, J. Forlizzi, and K. Koedinger. Optimizing challenge in an educational game using large-scale design experiments. In CHI '13, pages 89–98, 2013.

[33] T. Mullaney and J. Reich. Staggered versus all-at-once content release in massive open online courses. In L@S, pages 185–194. ACM, 2015.

[34] S. Murphy. An experimental design for the development of adaptive treatment strategies. Statistics in Medicine, 24(10):1455–1481, 2005.

[35] X. Nie, X. Tian, J. Taylor, and J. Zou. Why adaptively collected data have negative bias and how to correct for it. arXiv preprint arXiv:1708.01977, 2017.

[36] A. Oetting, J. Levy, R. Weiss, and S. Murphy. Statistical methodology for a SMART design in the development of adaptive treatment strategies. Causality and Psychopathology: Finding the Determinants, pages 179–205, 2011.

[37] Z. Pardos, S. Tang, D. Davis, and C. Le. Enabling real-time adaptivity in MOOCs with a personalized next-step recommendation framework. In L@S, pages 23–32. ACM, 2017.

[38] A. Rafferty, H. Ying, and J. Williams. Bandit assignment for educational experiments: Benefits to students versus statistical power. In AIED, pages 286–290, 2018.

[39] J. Renz, D. Hoffmann, T. Staubitz, and C. Meinel. Using A/B testing in MOOC environments. In LAK '16, pages 304–313. ACM, 2016.

[40] J. Reynolds and R. Glaser. Effects of repetition and spaced review upon retention of a complex learning task. Journal of Educ. Psych., 55(5):297, 1964.

[41] A. Savi, J. Williams, G. Maris, and H. van der Maas. The role of A/B tests in the study of large-scale online learning, Feb 2017.

[42] A. Segal, Y. David, J. Williams, K. Gal, and Y. Shalom. Combining difficulty ranking with multi-armed bandits to sequence educational content. In AIED, pages 317–321, 2018.

[43] J. Williams, A. Rafferty, D. Tingley, A. Ang, W. Lasecki, and J. Kim. Enhancing online problems through instructor-centered tools for randomized experiments. In CHI '18, pages 207:1–207:12. ACM, 2018.

[44] H. Yu, C. Miao, C. Leung, and T. White. Towards AI-powered personalization in MOOC learning. npj Science of Learning, 2:1–5, 2017.

[45] Y. Zhao, M. R. Kosorok, and D. Zeng. Reinforcement learning design for cancer clinical trials. Statistics in Medicine, 28(26):3294–3315, 2009.