Lifetime Limited Memory Neural Networks
by Jeffrey Matthew Maierhofer
Master of Science
2019
This thesis entitled:
Lifetime Limited Memory Neural Networks
written by Jeffrey Matthew Maierhofer
has been approved for the Department of Applied Mathematics
Prof. Becker
Prof. Kleiber
Date
The final copy of this thesis has been examined by the signatories, and we find that both the
content and the form meet acceptable presentation standards of scholarly work in the above-mentioned discipline.
Maierhofer, Jeffrey Matthew (M.S., Applied Mathematics)
In the modern digital environment, many data sources can be characterized as event sequences. These event sequences describe a series of events and an associated time of occurrence. Examples of event sequences include: the call log from a cell phone, an online purchase history, or a trace of musical selections. The influx of data has led many researchers to develop deep architectures that are able to discover event sequence patterns and predict future sequences. Many of these have a tendency to discard temporal data and treat the sequence as if all events are spaced equally (e.g., LSTM [5], GRU [2]). There has also been previous work attempting to treat the temporal data as continuous (e.g., CT-GRU [7]), but this work was unable to show a benefit in prediction or classification over the LSTM or GRU networks with temporal data appended to the input [1]. We propose a Lifetime-Limited Memory (LLM) architecture that operates under the notion that all information within a sequence is relevant for only a finite time period. The age of the information, then, is used to determine how much of the memory should be retained via a hierarchy of leaky integrators with log-linear spaced time constants. As the network trains, each cell linearly mixes the information from the different timescales, and determines the most relevant timescales for each event. We believe that this architecture will be better equipped to handle this specific class of tasks than more traditional methods because it incorporates temporal dynamics into its neuron activation functions and permits the storage and utilization of information at multiple time scales.

In this paper, we perform experiments on the LLM network alongside the LSTM net with the appended time data to determine strengths and weaknesses of the LLM net. We find the LLM net to be better suited to tasks associated with two natural datasets we tested on, the LSTM net to perform better on two other datasets, and the networks to perform similarly on three other datasets. We find potential upside to using this architecture, but are unable to show better
To all of my friends and family who have given me the strength to complete this chapter of my life.
Acknowledgements
Funding provided by the National Science Foundation's EXTREEMS grant, DMS 1407340, through Anne Dougherty.
Contents
Chapter
1 Introduction 1
2 Background 4
3.2 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.1 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4 Experiments 18
4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.3.1 Synthetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.3.2 Natural . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5 Conclusion 41
Bibliography 43
Tables

Figures
4.1 Accumulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.13 Accuracy of LLM with overfitting prevented by validation halting, and without validation halting
4.14 Accuracy of LSTM with overfitting prevented by validation halting, and without validation halting
Chapter 1

Introduction
In this paper we present the Lifetime-Limited Memory (LLM) network to make specific use
of the temporal aspects of event sequence data. Different forms of recurrent neural networks have
been implemented to handle sequence data (the Long Short-Term Memory (LSTM)[5] network
is particularly prolific), but they do not impose specific structure on the network's treatment of
temporal data. The LLM is a recurrent neural network that explicitly decays memory at different
exponential rates based on the time between events. This bias applied to the network’s utilization
of temporal data should more closely model human behavior and memory than an unstructured
temporal dimension allows. Due to this, it should be more readily usable for tasks that involve a
network trying to learn something about human behavior. The different decay rates of memory
traces should allow the LLM to learn data patterns occurring at many different timescales. The
network uses this cell memory to either make a prediction at each step, or classify the sequence as
a whole.
Event sequence data, which the LLM takes as input, is formatted as an ordered list of paired
labels and timestamps. For instance, one could represent someone’s text message history as an
event sequence, where the recipient is the label for the event, and the time the message was sent
as the timestamp. It can also be beneficial to represent the timestamp as time since the last event,
as we use in the context of our model. There is a graphical representation of this data in figure 1.1
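The paired label-and-timestamp format above, and the conversion to time-since-last-event, can be sketched in a few lines of Python; the helper name and the example history are illustrative, not taken from the thesis code.

```python
# A minimal sketch of event-sequence data: each event is a (label, timestamp)
# pair, and timestamps are converted to time-since-previous-event (delta t).

def to_deltas(events):
    """Convert (label, timestamp) pairs to (label, time since previous event)."""
    out = []
    prev_t = events[0][1]
    for label, t in events:
        out.append((label, t - prev_t))
        prev_t = t
    return out

# e.g., a text-message history: recipient label and send time (in minutes)
history = [("alice", 0.0), ("bob", 2.5), ("alice", 3.0), ("carol", 10.0)]
print(to_deltas(history))  # [('alice', 0.0), ('bob', 2.5), ('alice', 0.5), ('carol', 7.0)]
```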
Figure 1.1: An example of an event sequence with 4 different event types (color coded by type).
One can easily see why a network that uses event sequence data could be applicable to a variety of
tasks. Consider, for instance, that an advertising company provides you with the shopping history of a customer, and wants to be able to best recommend a product to them at any given time.
Or perhaps, given a user’s browsing history, you wanted to classify some aspect of the user, like
whether they are a parent or not. Other examples include recommending song choices to music
listeners, predicting what number a user will call next given their phone history, or classifying a
user’s behavior as anomalous given a log of their activities. The network is structured with these
types of tasks in mind. This type of architecture could be very useful for many types of tasks
attempting to gain insight into human behavior, and the LLM will hopefully have a memory highly
We will seek to compare the LLM network against a similarly structured LSTM network in a
variety of trials in order to determine how the two networks compare in terms of performance on
Chapter 2

Background
We now seek to introduce the class of models that the LLM belongs to, Recurrent Neural Networks (RNNs). First, however, we will briefly introduce neural networks. A neural network, NN(x), takes as input x, and has some target output y. Supposing the network has n_l hidden layers, and a training set {x^{(i)}}, the vanilla neural network is defined

NN = argmin_f Σ_{(x^{(s)}, y^{(s)}) ∈ X_train} loss( f(x^{(s)}), y^{(s)} )        (2.1)

where f(x) = σ_{n_l}( L_{n_l}( σ_{n_l−1}( L_{n_l−1}( σ_{n_l−2}( ... L_1(x) ) ... ) ) ) ), with L_i(x) = W_i x + b_i, W_i ∈ R^{n_i × m_i}, called the weights of the network. A visualization of this can be seen in Figure 2.1. It is important to note that in practice, the neural net actually approximates the argmin, as the optimization
The actual training of a neural network is typically handled by whatever package one is
creating their network in (in this case, Tensorflow). As long as one uses standard mathematical
operations, the software is able to calculate the gradient of a function with respect to any relevant
variables. Nevertheless, we feel it is important to understand how these systems work beneath
the surface. We’ve previously introduced the idea that a neural network attempts to minimize the
difference between the target and the output. One can then formulate the loss of the system on
its training set as a function of the weights of the network. Then, one calculates the gradient of
the loss function with respect to the weights. Finally, the weights are altered in the direction of
greatest decrease of loss. This process is called gradient descent. In practice, frequently the loss
and gradients are calculated in batches, as opposed to on the entire set, in order to minimize the
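As a concrete sketch of the loop just described (forward evaluation, gradient of the loss with respect to the weights, and a step against the gradient, computed on minibatches), consider minimizing a least-squares loss; the setup is illustrative and not the thesis code.

```python
import numpy as np

# Hedged sketch of minibatch gradient descent on a least-squares loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(3)
lr = 0.05
for step in range(2000):
    idx = rng.integers(0, len(X), size=8)            # minibatch instead of full set
    Xb, yb = X[idx], y[idx]
    grad = 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)     # gradient of mean squared error
    w -= lr * grad                                   # step in direction of greatest decrease

print(np.round(w, 2))  # ≈ [ 1.  -2.   0.5]
```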
The actual calculation of a neural network’s gradient of the loss function is typically performed
automatically using some software package that the network is built in, such as Tensorflow. This
is done via repeated use of the chain rule. First, the network is run forward, the output is calculated, and intermediary values of the network are stored. In the case of our vanilla network, we track the values a_1 = W_1 x + b_1, and a_i = W_i σ(a_{i−1}) + b_i for i ∈ {2, ..., n_l}. Then, the cost is, for instance,

C = (1/2) ||σ(a_{n_l}) − y||²

Moving backwards through the network, repeated application of the chain rule gives

∂C/∂a_i^j = Σ_k (∂C/∂a_{i+1}^k) (∂a_{i+1}^k/∂a_i^j)

∂a_{i+1}^k/∂a_i^j = (W_{i+1})_{kj} σ′(a_i^j)

This allows the final gradient calculations

∂C/∂(W_i)_{kj} = (∂C/∂a_i^k) σ(a_{i−1}^j)

∂C/∂(b_i)_j = ∂C/∂a_i^j
The actual calculations of backwards propagation do vary if the network structure is different, but follow the same general pattern of a forward pass followed by use of the chain rule. In
this way, a package such as Tensorflow only needs to keep track of how to calculate the gradient
for each function, and track the interactions of the functions that the network uses.
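The forward pass, chain rule, and final gradient expressions above can be checked numerically; this is a hedged sketch for a two-layer network with sigmoid activations and the quadratic cost, with a finite-difference check on one weight.

```python
import numpy as np

# Sketch of the forward/backward pass for a two-layer network with sigmoid
# activations and the cost C = 0.5 ||sigma(a2) - y||^2.
def sigma(a):  return 1.0 / (1.0 + np.exp(-a))
def dsigma(a): return sigma(a) * (1.0 - sigma(a))

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x, y = rng.normal(size=3), rng.normal(size=2)

# forward pass: store intermediary values a1, a2
a1 = W1 @ x + b1
a2 = W2 @ sigma(a1) + b2
C = 0.5 * np.sum((sigma(a2) - y) ** 2)

# backward pass: repeated use of the chain rule
dC_da2 = (sigma(a2) - y) * dsigma(a2)
dC_da1 = (W2.T @ dC_da2) * dsigma(a1)
dC_dW2 = np.outer(dC_da2, sigma(a1))   # dC/d(W2)_kj = (dC/da2_k) * sigma(a1_j)
dC_dW1 = np.outer(dC_da1, x)

# finite-difference check on one weight
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
Cp = 0.5 * np.sum((sigma(W2 @ sigma(W1p @ x + b1) + b2) - y) ** 2)
print(abs((Cp - C) / eps - dC_dW1[0, 0]) < 1e-4)  # True
```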
The recurrent neural network introduces a twist on the neural network structure, in the form
of memory. An RNN processes an entire sequence of data, x, iteratively handling each element of the sequence, x_k. The network's input at each step is both the original input x_k and the previous output of the network, o_{k−1}. This allows context to be passed from element to element, so that the network
is better able to make use of sequences of inputs that are related. We present the vanilla RNN in
Figure 2.2.
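A minimal sketch of the vanilla RNN step just described, where the input at each step is the concatenation of x_k and the previous output o_{k−1}; weight shapes and names are illustrative.

```python
import numpy as np

# Vanilla RNN step: input is [x_k, o_{k-1}], output is the new o_k.
def rnn_step(x_k, o_prev, W, b):
    z = np.concatenate([x_k, o_prev])
    return np.tanh(W @ z + b)

rng = np.random.default_rng(2)
n_in, n_out = 3, 5
W = rng.normal(size=(n_out, n_in + n_out)) * 0.1
b = np.zeros(n_out)

o = np.zeros(n_out)
for x_k in rng.normal(size=(10, n_in)):   # process a sequence of 10 elements
    o = rnn_step(x_k, o, W, b)            # context passed from element to element
print(o.shape)  # (5,)
```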
2.1.3.1 LSTM
The LSTM is one of the most ubiquitous RNNs in practice. The LSTM is a variant of the RNN whose key difference from the vanilla RNN is its introduction of memory traces. While the vanilla RNN has a method of passing information to the next state, its memory is restricted to its output, which inherently restricts the network's ability to maintain long-term memory. The vanilla LSTM overcomes this by taking as input the current input x_k, the previous output o_{k−1}, and
the previous memory h_{k−1}. At the start of the cell calculation, we set

z_k = [x_k, o_{k−1}]

as a concatenation of the input and previous output. Next we calculate the forget gate,

f_k = sigmoid(W^f z_k + b^f)

This gate determines the amount of previous memory passing through to the next step. Because the gates are activated by a sigmoid function, all of their values are mapped between 0 and 1, such that when the elementwise multiplication (⊙) is calculated between the gate and memory, it "gates" the memory, maintaining anywhere between none and all of the previous values, but cannot negate or add to the value. We then calculate what our memory update will be, as well as gate it:

g_k = sigmoid(W^g z_k + b^g)

ĥ_k = tanh(W^h z_k + b^h)

h_k = h_{k−1} ⊙ f_k + ĥ_k ⊙ g_k

Finally, the output is calculated with our new memory trace, as well as the input values.
The LSTM’s introduction of memory traces allows for long term dependencies to develop. It allows
for much more robust pattern identification, and has become almost synonymous with recurrent
nets due in large part to the success it has achieved. Still, we believe that there is room for better
implementation of temporal data within the framework of a recurrent net than simply another input
fed into an LSTM. Memory decay should be more explicitly tied to the temporal data; otherwise, a burst of events in a short time has the potential to quickly decay the memory of cells. For an
excellent blog post describing the functionality and computation behind the LSTM network, see
http://colah.github.io/posts/2015-08-Understanding-LSTMs/ [9].
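The gate and memory-update equations above can be sketched directly; the output computation is omitted (formulations vary), and all names and sizes are illustrative.

```python
import numpy as np

# Sketch of the vanilla LSTM step defined above: forget gate f_k, update
# gate g_k, candidate memory h_hat, and the elementwise-gated memory update.
def sigmoid(a): return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_k, o_prev, h_prev, Wf, bf, Wg, bg, Wh, bh):
    z = np.concatenate([x_k, o_prev])          # z_k = [x_k, o_{k-1}]
    f = sigmoid(Wf @ z + bf)                   # forget gate, values in (0, 1)
    g = sigmoid(Wg @ z + bg)                   # gate on the memory update
    h_hat = np.tanh(Wh @ z + bh)               # candidate memory update
    return h_prev * f + h_hat * g              # h_k = h_{k-1} (.) f_k + h_hat (.) g_k

rng = np.random.default_rng(3)
n_in, n_hid = 3, 4
shape = (n_hid, n_in + n_hid)
Wf, Wg, Wh = (rng.normal(size=shape) * 0.1 for _ in range(3))
bf = bg = bh = np.zeros(n_hid)

h = np.zeros(n_hid)
o = np.zeros(n_hid)   # output computation omitted; held at zero for this sketch
for x_k in rng.normal(size=(5, n_in)):
    h = lstm_step(x_k, o, h, Wf, bf, Wg, bg, Wh, bh)
print(h.shape)  # (4,)
```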
Chapter 3
In this section, we will describe the structure of the Lifetime Limited Memory model developed through the course of the project. The LLM is meant to serve as an adaptation of the LSTM
previously described, with a focus on structuring temporal data. The LLM should be better able to
handle sequences with large varieties of time gaps between events. This is due to the fact that the
LLM memory only decays with time, whereas the LSTM forgets memory at every single element of
a sequence. It also only considers the memory traces for the output, restricting the dependency of
the prediction to the memory traces. This network differs from the previously explored CT-GRU[7]
in a couple of key ways. First, the LLM is structured as a variation of the LSTM as opposed to a
GRU unit. Second, while they both contain mutliple timescales, the CT-GRU attempts to mimic
assigning a single timescale for each event by determining the timescales of retrieval and storage for
an event, and masking signals accordingly when updating and retrieving memory, and also collapses
all memory traces within a cell to one value for prediction, whereas the LLM allows for the memory
traces within a cell to remain separate. Finally, the CT-GRU gates off memory at each element, as
well as decays memory over time, whereas the LLM only decays memory with time.
We consider a network with n_c output classes, n_x input classes, n_h hidden units, and n_s log-linear spaced timescales. The LLM takes as input a sequence of event labels, and produces outputs for a specified task as described in 3.2. Let X_train be the training set with N data points. Let x^{(s)} ∈ X be a sequence with n_e^{(s)} events in the sequence. For simplicity moving forward, we drop the (s) superscript and refer to a generic sequence x. Each event x_k in the sequence is a tuple (e_k, ∆t_k) ∈ (Z^{n_c} × R), where e_k is either a one-hot or signed one-hot vector depending on the task. The memory trace in hidden neuron l associated to the decay timescale j is initialized h_{0,j,l} = 0. The timescales γ_j are fixed at the initiation of the network. Then, for each event x_k, k > 0, the memory is updated as
h_{k,j} = h^0_{k,j} + f_k ⊙ g_k

o_k = activation( Σ_{j=1}^{n_s} e^{−γ_j ∆t_{k+1}} V_j h_{k,j} + b^o )
where the activation function, and the loss function, is determined by the task. In the above calculations, we can think of the vector f_k as the new memory to be learned, and the vector g_k as the gate imposed on the memory, similar to the LSTM. A key note to make is that while each hidden cell contains n_s memory traces, they are all updated by a single value f_{k,l} · g_{k,l}, so they really represent a single "memory", but represent it on multiple timescales. A visualization of these
Then

r = (M/L)^{1/(n_s − 3)}

d = 1 − log(L)/log(r)

where r determines the log-linear spacing, and d shifts the timescales to the correct starting point. This yields

γ_j = 1/r^{j−d−1} = 1/(L r^{j−2}),  for j ∈ {1, ..., n_s}

This gives full coverage of the potential ∆t, as well as one smaller and one larger timescale, as

1/γ_2 = L  and  1/γ_{n_s−1} = M
3.2 Tasks
There are three tasks that the model has been adapted to perform: prediction, classification,
3.2.1 Prediction
In this task, at every element in the sequence, the network attempts to predict which event is most likely to occur next in the sequence, given the time until that event occurs. An example of this task would be, given someone's phone call history, predicting who they are most likely to call next. The activation function for this task is the softmax,

softmax(z)_i = e^{z_i} / Σ_{i′=1}^{n_c} e^{z_{i′}}
3.2.2 Classification
This task attempts to classify entire event sequences with a single label. A trivial example
of this task would be to learn to classify a sequence by the most frequent event occurring within
it. The activation and loss function for this task only depend upon the final output in the series.
This is due to there being only one "prediction" to make for the entire sequence, and we only care about that final output. The activation function is again the softmax,

softmax(z)_i = e^{z_i} / Σ_{i′=1}^{n_c} e^{z_{i′}}
This task attempts to learn to assign binary labels to each event in an event sequence. The
example that we explore in this report involves predicting whether a student will get the correct
or incorrect answer on a sequence of question types. The target output here is instead a signed
one hot vector, so the activation and loss functions will need to be adjusted. Further, we only care
about predicting the polarity of the index associated with the upcoming event. Given that, the
activation function for this task is the tanh function. The loss function is

loss(o, y) = − Σ_{k=1}^{n_e} Σ_{i=1}^{n_c} ( abs(y_k) ⊙ log((y ⊙ o + 1)/2) )_{i,k}
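A sketch of this loss for a signed one-hot target, with illustrative values; `o` holds the tanh outputs and `abs(y)` masks the loss to the index of the upcoming event.

```python
import numpy as np

# Predicting-correctness loss: y is signed one-hot (+1 correct, -1 incorrect,
# 0 elsewhere), o is the tanh-activated output in (-1, 1).
def correctness_loss(o, y):
    # o, y have shape (n_c, n_e): output classes by sequence elements
    return -np.sum(np.abs(y) * np.log((y * o + 1.0) / 2.0))

y = np.array([[1.0, 0.0], [0.0, -1.0]])   # event 1 correct, then event 2 incorrect
o = np.array([[0.8, 0.1], [-0.2, -0.6]])  # illustrative tanh outputs
print(round(correctness_loss(o, y), 4))   # 0.3285
```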
In most packages used for training neural networks, the gradient for each weight is automat-
ically calculated using back propagation for most common functions. Despite this, we take as an
exercise the task of calculating the gradient of the cost function with respect to the weights of our
network for the prediction task. For any iteration of the network, all variables are first calculated. The derivative

∂C/∂o_{i,k} = −y_{i,k} / o_{i,k}
serves as the basis for our gradient calculation. For the following calculations, we let i correspond
to the output class, j correspond to the timescale of a memory trace, k correspond to the element
of the sequence, l correspond to the hidden cell label, and m corresponds to input class. We need
to calculate ∂C/∂V_{i,j,l}, ∂C/∂W^∗_{l,m}, and ∂C/∂U^∗_{j,l,l_2}, where ∗ ∈ {f, g}.
∂C/∂V_{i,j,l} = Σ_k (∂C/∂o_{i,k}) o_{i,k}(1 − o_{i,k}) e^{−∆t_{k+1} γ_j} h_{j,k,l} = Σ_k y_{i,k}(o_{i,k} − 1) e^{−∆t_{k+1} γ_j} h_{j,k,l}

∂C/∂W^∗_{l,m} = Σ_k Σ_i −y_{i,k}(1 − o_{i,k}) Σ_j Σ_{l_2} V_{i,j,l_2} e^{−∆t_{k+1} γ_j} ∂h_{j,k,l_2}/∂W^∗_{l,m}

∂C/∂U^∗_{j,l,l_2} = Σ_k Σ_i −y_{i,k}(1 − o_{i,k}) Σ_j Σ_{l_3} V_{i,j,l_3} e^{−∆t_{k+1} γ_j} ∂h_{j,k,l_3}/∂U^∗_{j,l,l_2}
Next, we need to calculate these derivatives of h incrementally, noting that ∂h_{j,k=0,l}/∂X = 0 for any weight X. Then, assuming we know ∂h_{j,k−1,l_2}/∂W^f_{l,m} = D_{j,l_2}, the incremental update can be written in terms of δ_{l l_2}, the Kronecker delta function (= 1 if l = l_2, 0 else). This fully gives us an algorithm for generating the derivatives of h incrementally with respect to W^f, which can then be substituted back into the previous derivative. Similarly, if we let ∂h_{j,k−1,l_2}/∂W^g_{l,m} = D_{j,l_2}, we find the corresponding update. Finally, we calculate our derivatives of h with respect to U^∗: letting ∂h_{j,k−1,l_3}/∂U^f_{j,l,l_2} = D_{j,l_3}, and letting ∂h_{j,k−1,l_3}/∂U^g_{j,l,l_2} = D_{j,l_3}, we obtain the analogous incremental updates.
So, it is certainly possible to compute the gradient in this manner, but one can easily see
why, in practice, gradients are handled by the package the network is built with.
Chapter 4
Experiments
In order to test the effectiveness of the network, we train both an LLM model and an LSTM model on several different datasets. We developed both synthetic datasets and natural datasets
4.1 Datasets
The synthetic datasets discussed below were created to attempt to mimic potential tasks that
the network will face in natural datasets. These datasets were adapted from tasks introduced in
4.1.1.1 Accumulator
The accumulator dataset has sequences of events generated via Poisson processes. Each Poisson process has a randomly generated rate, and events are labelled by which process they are generated by. When an event is generated, the accumulator is incremented by 1. Over time, the
accumulators for each label are exponentially decayed with some decay rate λ. At the end of the
sequence, the sequence is labelled by the accumulator with the greatest value. An example from
this dataset can be seen in Figure 4.1. In this figure, the accumulators are combined into a single
function, with one accumulator represented as positive, and the other as negative. Due to the linear
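The generation process just described can be sketched as follows; rates, decay, and sequence length are illustrative.

```python
import numpy as np

# Sketch of accumulator-dataset generation: two Poisson processes with random
# rates generate labelled events; each label's accumulator is incremented on
# its events and decays exponentially at rate lam between events.
rng = np.random.default_rng(4)

def make_sequence(n_events=20, lam=1.0):
    rates = rng.uniform(0.5, 2.0, size=2)
    # merge two Poisson processes: cumulative event times plus labels
    gaps = [rng.exponential(1.0 / rates[i], size=n_events) for i in range(2)]
    events = sorted([(t, i) for i in range(2) for t in np.cumsum(gaps[i])])[:n_events]
    acc = np.zeros(2)
    prev_t = 0.0
    for t, label in events:
        acc *= np.exp(-lam * (t - prev_t))   # exponential decay over the time gap
        acc[label] += 1.0                    # increment on event occurrence
        prev_t = t
    return events, int(np.argmax(acc))       # label: accumulator with greatest value

events, label = make_sequence()
print(len(events), label in (0, 1))  # 20 True
```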
4.1.1.2 Rhythm
The Rhythm dataset contains events followed by a set lag. Each event label corresponds
to a lag time (i.e. 1, 2, 4, 8...) between that event and when the next will occur. Each event
is chosen uniformly from the different event types. The task itself is presented as a classification, where a sequence is labelled with a 1 if the sequence is generated with the normal lag times, or a 2 if the sequence is generated with altered lag times. This can be thought of as similar to anomaly
detection, where an ideal classifier would identify the sequence as 1 up until it “sees” a changed
lag time. There is an example sequence from this dataset in Figure 4.2.
4.1.1.3 Hawkes
The Hawkes dataset contains events generated by a Hawkes process. In a Hawkes process, events are generated as a point process, where the intensity associated with an event type is increased every time that the event occurs, and then decays over time to a base value [4]. This is called a self-exciting process, and such processes are descriptive of processes that tend to "burst", where an event signifies a likely reoccurrence of that event. Examples of events that are frequently modeled by this process are earthquakes, which tend to cause aftershocks within a short time of the initial quake.

Figure 4.1: Visualization of a two-event-type accumulator sequence and the associated classifier function over time.
A visualization of this type of sequence can be seen in Figure 4.3. For this dataset, we attempt to
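A univariate self-exciting process of this kind can be sketched with Ogata-style thinning, where the intensity jumps at each event and decays back toward a base rate; the parameters and the function name are illustrative.

```python
import numpy as np

# Sketch of a univariate Hawkes process via thinning: intensity jumps by
# alpha at each event and decays back toward the base rate mu at rate beta.
rng = np.random.default_rng(5)

def hawkes(mu=0.5, alpha=0.8, beta=1.5, T=50.0):
    events, t, excite = [], 0.0, 0.0
    while True:
        lam_bar = mu + excite                         # intensity only decays until next event
        w = rng.exponential(1.0 / lam_bar)            # candidate waiting time
        excite *= np.exp(-beta * w)                   # decay toward the base value
        t += w
        if t > T:
            return events
        if rng.uniform() < (mu + excite) / lam_bar:   # thinning: accept candidate
            events.append(t)
            excite += alpha                           # self-excitation jump

ev = hawkes()
print(len(ev) > 0, ev == sorted(ev))  # True True
```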
4.1.2.1 Github
The Github dataset that we tested on was pulled from the Github BigQuery dataset, as
announced at https://github.blog/2017-01-19-github-data-ready-for-you-to-explore-with-bigquery/
[12]. The dataset contains lists of commits to different repositories by users, and associated times-
tamps. For the purposes of the model, each repository was considered a sequence, and each commit
was considered an event. The events were labelled by the user numerically, determined via order
of appearance for the specific repository. We also restricted our dataset to only include sequences
with at least two users that committed to it, and had at least 9 commits.
4.1.2.2 Dota
The Dota datasets are pulled from popular online game Dota 2. They contain chat logs
between two teams of five during a match. The logs contain information regarding who is chatting,
what is said, and when they are sent, as well as various metadata, including the label for which
team won. The events were labelled by the anonymized label of the user associated. Notably, labels
are always associated to the same team (i.e. 1-5 with team 1, 6-10 with team 2). Two data sets
were developed, one for predicting the next user to chat (prediction), and another for predicting
the outcome of the match (classification). The dataset was made available at kaggle.com by user
devinanzelmo[3].
4.1.2.3 Freecodecamp
The Freecodecamp dataset contains anonymized data regarding learners’ progress through
online coding courses.

Figure 4.3: Hawkes Process event sequence and the associated intensities for each event type.

Each student's progress was regarded as a sequence, and there were ≈ 60,000 students' data between the test and training sets with at least 9 events. Each completion
of a module was considered an event, with about 450 different modules. Labelling of events was
consistent for the same module across different sequences. This resulted in large (length 450) input vectors, so we truncate sequences to 50 elements for this dataset. Students were able to complete
the same module multiple times, and modules were not necessarily completed in a specific order,
although some tended to follow others. We initially found the dataset through a blog post[11], but
at this point the dataset is no longer available. There is a copy of this dataset available at https:
//drive.google.com/file/d/1eG0ojFRbqiWIwpdgSn32DPG4AhbC7qbb/view?usp=sharing.
4.1.2.4 Reddit
There were two datasets we tested on from the popular online forum, Reddit. In the first
dataset, an event sequence is associated to each user’s posting history. For this dataset, events are
labelled by which subreddit the user posts to. The second dataset associates an event sequence to
a post. Each event in the sequence is a comment on the post, and is labelled by the user making
the comment. This dataset was made accessible by a Reddit user at https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/ [10].
4.1.2.5 Quizlet
The Quizlet dataset is the only example of the predicting correctness task in this paper. The Quizlet dataset involves anonymized data regarding students' performance on vocabulary. Students' performance on specific vocabulary words is tracked over time, and our network attempts to
The process for comparing these two networks developed heavily over time. The network itself
was developed primarily using the tools provided by the Tensorflow package in Python 3. Throughout the course of the project, we consistently used 14 timescales for our hidden cells, which were originally set to constants but were later dynamically calculated by the method described in Chapter 3. The learning rate was set to 10^{−3}, and we used the Adam optimizer provided by Tensorflow, as introduced in Kingma & Ba, 2015 [6], with the recommended parameter values. The gradient itself was calculated through Tensorflow, as all operations, if not the cell itself, were standard in Tensorflow. All code used within
Initially, we took care to experiment with a consistent number of hidden units, typically 100.
For most datasets, the network only loaded the first 100 (or otherwise specified) events of a sequence for the prediction and predicting correctness tasks in order to help limit memory errors. We
note at this point that all experiments were run on an Nvidia GeForce GTX 970, which contributed
to some of our memory restrictions. We were able to limit some of the memory issues by reducing
minibatch sizes (from 64 to 8) for some of our datasets, but further shrinkage made training times
too lengthy for our experimentation. The sequences associated with the classification task could
not be truncated, as the entirety of the sequence might be essential to the class the sequence belongs to. For instance, consider a task where a sequence is classified solely on whether the sequence contains an event of a certain type. This classification cannot occur without full knowledge of the
sequence. Prediction, however, makes a prediction at each timestep, and it is solely dependent on
the events that occur before it. As a result, while a network might, and should, become better at
predicting further into the sequence, the predictions up to the truncation are the same regardless
When training on a specific dataset, we typically randomly drew 4000 sequences each for the
training and test sets from the dataset, and then drew 10% of the training set as a validation set.
These random draws were done according to a seed such that the same datasets could be used
for the LLM and LSTM testing. This was done because typically only 10 trials were performed for a given set of parameters, so a single dataset draw that the network struggles to learn on wouldn't artificially make one of the networks seem to perform better. (It is certainly still possible that some dataset draws favor one network over the other relative to the average; this can only be minimized by doing more trials.) The pairing of trials
with a random seed was implemented when we realized that there was a high variance in network
performance, preventing any confidence in declaring one model better than the other for the dataset.
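The seeded, paired splitting just described can be sketched as follows; the function name and split sizes mirror the text but are otherwise illustrative.

```python
import numpy as np

# Seeded draw: the same seed yields identical train/test/validation splits
# for the LLM and the LSTM runs, pairing the trials.
def draw_splits(n_total, seed, n_train=4000, n_test=4000, val_frac=0.1):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_total)
    train, test = idx[:n_train], idx[n_train:n_train + n_test]
    n_val = int(len(train) * val_frac)           # 10% of training as validation
    return train[n_val:], train[:n_val], test    # train, validation, test indices

a = draw_splits(20000, seed=7)
b = draw_splits(20000, seed=7)
print(all(np.array_equal(x, y) for x, y in zip(a, b)))  # True: paired trials
```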
The validation set that we create for each instance of the network is used for determining when
the network has finished learning and is beginning to overfit. It is also used when selecting the correct number of hidden units to build the network with for datasets, as discussed in 4.3.2. It
is important to note here that there was a period of time during testing where an incompatibility
between Python’s random package and numpy package caused the validation and training sets to
4.2.1 Metrics
Two different metrics were used for testing of the LLM. First, we used an accuracy metric,
which is simply calculated as the number of correct predictions divided by the total number of
predictions. We also used a multiclass Area Under the Curve (AUC) metric. The AUC is the
area under a ROC curve for a given classifier, which measures proportions of true positives and
true negatives for different prediction cutoffs. For more information on this metric, see https:
The multiclass metric is calculated by finding an AUC for each output class in a one-vs-all format, where we treat all other predictions as a single event type. We then calculate the weighted sum of the AUCs for
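The weighted one-vs-all computation just described can be sketched as follows; the rank-based AUC helper assumes untied scores, and all names are illustrative.

```python
import numpy as np

def auc_binary(scores, labels):
    # Rank-based AUC: probability that a random positive outranks a random
    # negative (assumes no tied scores).
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def multiclass_auc(probs, y):
    # probs: (n_samples, n_classes) predicted probabilities; y: integer labels.
    total = 0.0
    for c in range(probs.shape[1]):
        weight = np.mean(y == c)                  # weight each AUC by class frequency
        if 0 < weight < 1:                        # skip classes absent from y
            total += weight * auc_binary(probs[:, c], (y == c).astype(int))
    return total

probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])
print(multiclass_auc(probs, y))  # 1.0 for a perfect ranking
```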
4.3 Results
4.3.1 Synthetic
In our first stage of testing, we did many preliminary tests on different variations of the accumulator dataset. This enabled us not only to debug many aspects of the code we were running, but also provided a quick outlet for testing to determine any potential strengths and weaknesses of the LLM model. The initial results are displayed in Figures 4.4 and 4.5. It should be immediately
apparent that the LLM underperforms on these datasets in comparison, and also has a much larger
confidence interval. In a naive sense, this might prevent us from declaring one network as better
than the other with any confidence. However, when we pair up our trials of LLM and LSTM, we
can consider the difference in metric (Accuracy or AUC) a random variable. To pair up the testing,
in this case, means we keep consistent training, testing, and validation datasets for a single trial
of LLM and LSTM via random seeding. We can then use these differences to test the hypothesis
that the better performing network is, in fact, better. In this case we represent the difference as a
t random variable, and calculate the associated p-value, which is represented at the bottom of our
Figures. A value of p < 0.05 represents a greater than 95% confidence that the higher performing
network is actually better suited to the task. In this case, the LSTM tends to perform better with
high confidence. While it is not shocking that LSTM performs well, as it has historically been
excellent at processing sequences, it is a bit disappointing that LLM underperforms. It seems likely
that the structure of the LLM network makes it more likely to reach a local minimum that is far from the global one. We see a much higher variance in our data for LLM, and especially see certain visible
For the accumulator dataset, we next looked into predicting the next event that would occur
in an accumulator sequence. This should be a simple task if the network is able to recognize the
structure of the sequence as two Poisson processes. This is due to the fact that Poisson processes
are memory-less, so at any point in time, the predictor should simply predict the sequence with a shorter timescale, which is most likely to be the most common event up to that point.

Figure 4.4: Accuracy Results for LLM and LSTM for different Accumulator Decay Rates (λ = 1/16, 1/4, 1, 4, 16; paired p = 3.39e-05, 7.85e-04, 4.52e-02, 1.55e-01, 8.48e-03).

Figure 4.5: AUC Results for LLM and LSTM for different Accumulator Decay Rates (λ = 1/16, 1/4, 1, 4, 16; paired p = 1.77e-03, 2.22e-04, 5.34e-02, 3.07e-03, 6.75e-03).

Table 4.2: Accuracy of Network Predicting Poisson Processes Compared to Ideal Predictor

    Predictor   Accuracy
    LLM         0.7388
    LSTM        0.7388
    Ideal       0.739

We ran
several trials on this task, and present the results in Table 4.2.
It seems that this task is simple enough that both of the networks are able to learn this ideal
predictor fully.
Next we looked into the rhythm dataset. We first ran experiments on the classification task described in 4.1.1.2. We restricted the length of the sequences to 10, 30, and 100 for different trials
to explore how more information affected the networks’ ability to classify. Hypothetically, an ideal
classifier would classify the sequence as a “1” up until an anomaly is detected. The generation
process does leave the possibility of an “anomalous” sequence not registering as such. This is due
to the fact that an event with an altered lag time might not appear in the sequence. Regardless,
Interestingly the LLM performs significantly better for the shorter sequences, but there is a drop
in performance on the longer sequence. This shouldn’t happen, when a longer sequence should
only provide a greater ability to classify. In this case, one must assume that the network is either
forgetting its classification (the presence of an anomaly), or it is retaining too much information,
and is overwhelmed by too much memory in this case, not able to forget quickly enough. We ran
another test on the data, this time with a wider net of timescales, in order to attempt to discover
Figure: Accuracy results for LLM and LSTM on the rhythm dataset for sequence lengths 10, 30, and 100 (paired p = 2.15e-05, 4.59e-03, 2.76e-04).
Figure: AUC results for LLM and LSTM on the rhythm dataset for sequence lengths 10, 30, and 100 (paired p = 3.01e-05, 1.25e-05, 3.32e-06).
Figure 4.8: Accuracy of network on the rhythm dataset with a larger timescale range, for sequence lengths 10, 30, and 100 (paired p = 3.40e-02, 6.48e-02, 8.88e-03).
Figure 4.8 shows the results from this experiment. It is apparent that something else is causing the drop in performance, which would warrant further investigation in later work. We posit that it is related to the network's apparent tendency to hit local minima, as shown by the larger variances of the LLM on the length-100 sequences. Perhaps longer sequences provide more local minima for the training to get stuck on. Despite this, on sequences that both networks did "well" on, the LLM performed better at this classification task.
Finally, we tested on our last synthetic dataset, the Hawkes dataset. For this dataset, the networks should be attempting to mimic the intensity functions of the various Hawkes processes. The process with the highest intensity at a given time step should be the most likely event to occur. Using this, we built an estimator for the average performance of a predictor.
Figure 4.9 shows the accuracy both networks achieve on the Hawkes dataset. It also includes how a predictor would fare if it had knowledge of the intensity functions of each event type, i.e., an average "ideal" predictor. Both networks perform similarly well on this task, performing almost identically to the "ideal" predictor. This gives an indication that the task was too "easy" to distinguish the two networks.
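The highest-intensity predictor can be sketched as follows, assuming the standard exponential-kernel Hawkes intensity lambda(t) = mu + sum over past events of alpha * exp(-beta * (t - t_i)) [4]. The parameter values here are illustrative placeholders, not the ones used to generate our dataset.

```python
import numpy as np

def hawkes_intensity(t, events, mu, alpha, beta):
    # Conditional intensity of an exponential-kernel Hawkes process:
    # the base rate mu plus a decaying excitation from each past event.
    events = np.asarray(events, dtype=float)
    past = events[events < t]
    return mu + np.sum(alpha * np.exp(-beta * (t - past)))

def highest_intensity_prediction(t, histories, params):
    # Predict the event type whose process currently has the highest
    # intensity; `histories` and `params` hold one entry per event type.
    intensities = [hawkes_intensity(t, h, *p) for h, p in zip(histories, params)]
    return int(np.argmax(intensities))
```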
4.3.2 Natural
The natural datasets were first experimented on with a consistent number of hidden units
(100). In Figures 4.10 and 4.11, we display the accuracy results and AUC results over several of
the datasets.
The plots displayed in this section are box plots, showing the mean and the 1st and 3rd quartiles. Once again, the spreads are frequently too large to confidently declare one network the better performer.
In the figures, we list the p-value for the hypothesis that the better-performing network is better suited for the task under the given metric. There do seem to be a few tasks on which the LLM outperforms the LSTM, as well as tasks that show the opposite.

Figure 4.9: Accuracy results for LLM (green) and LSTM (blue) with 100 hidden units compared to the highest intensity (ideal) predictor (red), for sequence lengths 10, 30, and 100 (paired p = 3.07e-01, 6.15e-02, 2.85e-05).

Figure 4.10: Accuracy results for LLM and LSTM with 100 hidden units on Github, Dota, Dota Class, Freecodecamp, Reddit Thread, Reddit Comments, and Quizlet (paired p = 1.66e-01, 3.11e-02, 6.28e-08, 1.60e-06, 2.56e-06, 3.42e-10, 8.51e-11).

Figure 4.11: AUC results for LLM and LSTM with 100 hidden units on the same datasets (paired p = 8.40e-04, 1.28e-04, 2.72e-01, 9.88e-03, 4.74e-03, 4.94e-03, 1.35e-11).

Table 4.3: Number of trainable weights in each network by number of hidden cells

Number of Hidden Cells    Weights for LLM    Weights for LSTM
50                        67,110             14,050
100                       254,210            47,950
200                       988,410            175,750
400                       3,896,810          671,350

But for a given network with set parameters, even if the two networks have the same number of cells, they do not necessarily have the same number of weights to learn. This allows much more variability within the network, which could artificially inflate the LLM's apparent ability to learn tasks. The number of trainable weights in each network for different numbers of hidden cells can be seen in Table 4.3.
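The paired p-values reported with each figure can be sketched as a one-sided paired t-test over matched trials; the accuracy values below are illustrative placeholders, not our measurements.

```python
import numpy as np
from scipy import stats

# Each entry is one trial's test accuracy; the trials are paired because
# both networks are trained and evaluated on the same data split.
llm_acc = np.array([0.91, 0.93, 0.90, 0.94, 0.92])
lstm_acc = np.array([0.89, 0.90, 0.88, 0.91, 0.90])

# One-sided test of the hypothesis that the better-performing network
# (here the LLM) is genuinely better suited to the task.
t_stat, p_value = stats.ttest_rel(llm_acc, lstm_acc, alternative="greater")
```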
This discrepancy doesn't address the possibility that, even with the same number of weights to learn, one network might inherently be better suited to a larger number of weights than the other. We combat this by hypertuning the parameters: we train the networks with several different numbers of hidden cells (the values listed in Table 4.3), then use the accuracy on the validation set to determine which network size is ideal for the task. We use the validation set since it is also used to determine when the model has fully trained. We then reran the paired tests with the optimal hyperparameters to calculate new p-values; the results of these trials appear below.
There is one interesting result that we found due to an error in generating the validation set. For a while, there was an artificial correlation between the training and validation sets that did not exist with the test set. This caused the networks to naturally overtrain until the training set's loss was at a minimum. What may seem surprising is the contrast between the LLM and LSTM in terms of the difference in performance. In order to explore this, we ran a set of trials with and without validation halting to show this difference in performance. These results can be seen in Figures 4.13 and 4.14.

Figure: Accuracy results for LLM and LSTM with hypertuned hidden units on Github, Dota, Dota Class, Freecodecamp, Reddit Thread, Reddit Comments, and Quizlet; hidden units (LLM, LSTM) = (200, 200), (50, 50), (200, 50), (100, 200), (200, 200), (100, 100), (50, 50); paired p = 1.13e-01, 1.38e-02, 9.04e-04, 1.49e-04, 4.26e-10, 1.58e-09, 5.43e-04.
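Validation halting as used in these trials can be sketched as early stopping on validation accuracy. The helper names and the `patience` value below are our own illustration, not the thesis implementation.

```python
# Sketch of validation halting (early stopping): stop training once the
# validation accuracy has failed to improve for `patience` epochs.
def train_with_validation_halting(train_epoch, val_accuracy, max_epochs=100,
                                  patience=5):
    best, since_best = -1.0, 0
    for epoch in range(max_epochs):
        train_epoch()               # one pass over the training data
        acc = val_accuracy()        # evaluate on the held-out validation set
        if acc > best:
            best, since_best = acc, 0
        else:
            since_best += 1
            if since_best >= patience:
                break               # halt: validation stopped improving
    return best
```

Removing the halting criterion (training until the training loss bottoms out) is what exposed the overfitting contrast between the two networks.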
What should be immediately apparent is that the LSTM overfits to the training set (in some cases much more noticeably than others). This occurs to the detriment of the testing accuracy. Meanwhile, the LLM faces no penalty for the lack of validation; in fact, it appears to even receive a slight benefit. It makes sense that the LLM would be less prone to overfitting, since it imposes a structure on the temporal data, and an ability to forget, that the LSTM does not have; it naturally imposes a form of regularization. We also take the fact that the LLM performs similarly with and without validation as evidence that the memory of the sequence does in fact mirror the memory of the network.
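The forgetting structure that provides this regularization is the LLM's bank of leaky integrators with log-linearly spaced time constants, which can be sketched as follows. The number of timescales k and the range [tau_min, tau_max] are illustrative placeholders, not the settings used in our experiments.

```python
import numpy as np

def decay_traces(traces, dt, tau):
    """Decay each memory trace by the elapsed time dt since the last event."""
    return traces * np.exp(-dt / tau)

# Time constants spaced log-linearly between tau_min and tau_max.
k, tau_min, tau_max = 8, 0.5, 64.0
tau = np.logspace(np.log10(tau_min), np.log10(tau_max), k)

# After a gap of dt = 4 time units, short-timescale traces have forgotten
# almost everything, while long-timescale traces retain most of their value.
traces = np.ones(k)
decayed = decay_traces(traces, dt=4.0, tau=tau)
```

Because old information decays away regardless of what the training data rewards, the network cannot memorize arbitrarily long histories, which is consistent with the overfitting resistance seen in Figure 4.13.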
Before we begin a full discussion of our results, we present Table 4.4, denoting the accuracy of three naive prediction tactics. First, we examine simply guessing the previous event. Next, we try guessing the most common event in the sequence. Finally, we guess that the next event will be the sequentially next event, i.e., if one event previously followed another, predict that it will do so again. For the classification and correctness-prediction tasks, we only put forward the most common classification or binary label as the naive approach. We note here that the simple approaches brought to the synthetic data were representative of an ideal estimator, but that this is not the case for the natural datasets. The natural data should contain much more complicated relationships, and if the networks are learning them, they should outperform the naive approaches.
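The three naive tactics can be sketched as follows; the function names are ours, for illustration, and each predicts the next event from the event history alone.

```python
from collections import Counter

def predict_previous(history):
    """Guess the event that just occurred."""
    return history[-1]

def predict_most_common(history):
    """Guess the most frequent event seen so far."""
    return Counter(history).most_common(1)[0][0]

def predict_transition(history):
    """Guess the event that most recently followed the current event;
    fall back to the most common event if the current event is new."""
    last = history[-1]
    # Walk the (prev, next) pairs from most recent to oldest.
    for prev, nxt in zip(reversed(history[:-1]), reversed(history[1:])):
        if prev == last:
            return nxt
    return predict_most_common(history)
```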
Given these tables, we can conclude that neither network learned anything meaningful about the Github, Dota, or Quizlet datasets. Both the LLM and LSTM failed to learn anything about these datasets beyond naive statistics (less than a 10% relative improvement over the best naive approach).

Figure 4.13: Accuracy of LLM with overfitting prevented by validation halting, and without validation to prevent overfitting (training and testing accuracy, with and without validation, on Github, Dota, Dota Class, Reddit Thread, and Reddit Comments).

Figure 4.14: Accuracy of LSTM with overfitting prevented by validation halting, and without validation to prevent overfitting (same layout as Figure 4.13).

Table 4.4: Accuracy of Naive Prediction Tactics

Interestingly, these datasets were the ones on which the networks performed most similarly (less than a 1% total difference in accuracy from each other). Meanwhile, the LLM performed better on the classification dataset, as well as the Freecodecamp dataset. Interestingly, these were two of the three datasets in our tests that had consistently labelled events across sequences.
We once again feel that it is important to note the implications of Figures 4.13 and 4.14. The ability of the LLM to self-regularize seems to be indicative of a greater structure to the data. The LSTM is able to mimic the structure, but when given the opportunity to overfit to the training set, it does.
Conclusion
Overall, it appears that the LLM performs better than the LSTM on certain datasets. It tended to do very well on the complex classification tasks among the datasets we experimented on. It seemed like one of the networks might be better at predicting a single user's behavior in the natural datasets, but each of the networks had one "successful" dataset in predicting a single user's behavior (Freecodecamp and Reddit Threads) versus multiple users' behavior (Dota Class and Reddit Comments). It seems that the LLM doesn't always capture the underlying structure of the data, as discussed in 4.3.3. If nothing else, it does seem to be a good tool for analyzing the structure of the dataset itself; in other words, it helps one to understand whether the sequence appears to fit our notion of lifetime-limited memory.
The explorations we performed to generate Figures 4.13 and 4.14 were perhaps the most exciting results we came across. The bias we imposed on the structure of the network prevented overfitting to the dataset. This seems to imply that the proposed structure is in fact representative of the underlying structure of the data.
One of the major concerns regarding the structure is its tendency to underfit the data. A significant portion of the time, the network halts training at a local minimum. This did seem to be reduced when we halted only on the training accuracy, but it was nevertheless concerning in terms of viability as a model. The LLM also tended to take longer to train due to its larger number of trainable weights.
Following the results of our trials, it seems that the LLM would be an excellent addition to a toolbox of networks for use with event sequences. We would love to say that the LLM is always the better choice for event sequence data, but there are certainly cases where the LSTM performs better. In future work, we would hope to find more applications and datasets on which to test the networks. This would allow us to further determine the strengths of the network. It could also be interesting to explore training the networks on a more powerful machine, allowing for longer sequences and larger potential network sizes. Finally, more experimentation into reducing the variance of the LLM's performance could be extremely beneficial to showing the viability of the network as a model.
[4] A. G. Hawkes, Spectra of Some Self-Exciting and Mutually Exciting Point Processes,
Biometrika, (1971).
[10] Stuck in the Matrix, I have every publicly available Reddit comment for research. https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/, 2016.
[12] S. Wills, GitHub data, ready for you to explore with BigQuery. https://github.blog/2017-01-19-github-data-ready-for-you-to-explore-with-bigquery/, 2017.