
Lifetime Limited Memory Neural Networks

by

Jeffrey Matthew Maierhofer

B.S., University of Colorado, 2019

A thesis submitted to the

Faculty of the Graduate School of the

University of Colorado in partial fulfillment

of the requirements for the degree of

Master of Science

Department of Applied Mathematics

2019
This thesis entitled:
Lifetime Limited Memory Neural Networks
written by Jeffrey Matthew Maierhofer
has been approved for the Department of Applied Mathematics

Prof. Becker

Prof. Kleiber

Date

The final copy of this thesis has been examined by the signatories, and we find that both the
content and the form meet acceptable presentation standards of scholarly work in the above
mentioned discipline.
Maierhofer, Jeffrey Matthew (M.S., Applied Mathematics)

Lifetime Limited Memory Neural Networks

Thesis directed by Prof. Becker

In the modern digital environment, many data sources can be characterized as event sequences. An event sequence describes a series of events and their associated times of occurrence. Examples of event sequences include the call log from a cell phone, an online purchase history, or a trace of musical selections. The influx of data has led many researchers to develop deep architectures that are able to discover event sequence patterns and predict future sequences. Many of these discard temporal data and treat the sequence as if all events are equally spaced (e.g., LSTM [5], GRU [2]). There has also been previous work that treats the temporal data as continuous (e.g., CT-GRU [7]), but this work was unable to show a benefit, in prediction or classification, over the LSTM or GRU networks with temporal data appended to the input [1]. We propose a Lifetime-Limited Memory (LLM) architecture that operates under the notion that all information within a sequence is relevant for only a finite time period. The age of the information is then used to determine how much of the memory should be retained, via a hierarchy of leaky integrators with log-linearly spaced time constants. As the network trains, each cell linearly mixes the information from the different timescales and determines the most relevant timescales for each event. We believe that this architecture will be better equipped than more traditional methods to handle this specific class of tasks because it incorporates temporal dynamics into its neuron activation functions and permits the storage and utilization of information at multiple timescales.

In this paper, we perform experiments on the LLM network alongside an LSTM network with appended time data to determine the strengths and weaknesses of the LLM net. We find the LLM net to be better suited to the tasks associated with two of the natural datasets we tested on, the LSTM net to perform better on two other datasets, and the two networks to perform similarly on the remaining three. We find potential upside to using this architecture, but are unable to show better performance across the board.


Dedication

To all of my friends and family who have given me the strength to complete this chapter of

my academic journey and are always there to catch me when I fall.



Acknowledgements

Faculty Advisors: Michael Mozer and Stephen Becker

Continuing Work done by Denis Kazakov

Funding provided by the National Science Foundation's EXTREEMS grant, DMS 1407340, through Anne Dougherty.

Contents

Chapter

1 Introduction 1

2 Background 4

2.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.1.1 Training a Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.1.2 Backwards Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.1.3 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3 Lifetime Limited Memory 10

3.1 Timescale Determination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.2 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.2.1 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.2.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.2.3 Predicting Correctness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.3 Calculating Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4 Experiments 18

4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4.1.1 Synthetic Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4.1.2 Natural Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22



4.2 Experimental Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4.2.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.3.1 Synthetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.3.2 Natural . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.3.3 Discussion of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5 Conclusion 41

Bibliography 43

Tables

Table

3.1 Meaning of Indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.2 Size of Tensors in LLM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

4.1 Natural Dataset Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4.2 Accuracy of Network Predicting Poisson Processes Compared to Ideal Predictor . . . 29

4.3 Number of Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.4 Naive Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39



Figures

Figure

1.1 Event Sequence Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2.1 Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.2 Recurrent Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.3 LSTM Cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3.1 Lifetime Limited Memory Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.2 Lifetime Limited Memory Cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

4.1 Accumulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.2 Rhythm Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.3 Hawkes Process Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4.4 Accuracy of Network on Different Decay Rates . . . . . . . . . . . . . . . . . . . . . 28

4.5 Area Under Curve of Network on Different Decay Rates . . . . . . . . . . . . . . . . 28

4.6 Accuracy of Network on Rhythm Dataset . . . . . . . . . . . . . . . . . . . . . . . . 30

4.7 Area Under Curve of Network on Rhythm Datasets . . . . . . . . . . . . . . . . . . . 30

4.8 Accuracy of Network on Rhythm Dataset with larger timescale range . . . . . . . . . 31

4.9 Accuracy of Network on Hawkes Prediction . . . . . . . . . . . . . . . . . . . . . . . 33

4.10 Accuracy of Network on Natural Datasets . . . . . . . . . . . . . . . . . . . . . . . . 33

4.11 Area Under Curve of Network on Natural Datasets . . . . . . . . . . . . . . . . . . . 34

4.12 Best Accuracy of Network on Natural Datasets . . . . . . . . . . . . . . . . . . . . . 36



4.13 Accuracy of LLM with overfitting prevented by validation halting, and without validation to prevent overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.14 Accuracy of LSTM with overfitting prevented by validation halting, and without

validation to prevent overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38


Chapter 1

Introduction

In this paper we present the Lifetime-Limited Memory (LLM) network to make specific use

of the temporal aspects of event sequence data. Different forms of recurrent neural networks have

been implemented to handle sequence data (the Long Short-Term Memory (LSTM)[5] network

is particularly prolific), but they do not impose specific structure on the network's treatment of

temporal data. The LLM is a recurrent neural network that explicitly decays memory at different

exponential rates based on the time between events. This bias applied to the network’s utilization

of temporal data should more closely model human behavior and memory than an unstructured

temporal dimension allows. Due to this, it should be more readily usable for tasks that involve a

network trying to learn something about human behavior. The different decay rates of memory

traces should allow the LLM to learn data patterns occurring at many different timescales. The

network uses this cell memory to either make a prediction at each step, or classify the sequence as

a whole.

Event sequence data, which the LLM takes as input, is formatted as an ordered list of paired

labels and timestamps. For instance, one could represent someone’s text message history as an

event sequence, where the recipient is the label for the event, and the time the message was sent

as the timestamp. It can also be beneficial to represent the timestamp as time since the last event,

as we use in the context of our model. There is a graphical representation of this data in Figure 1.1.

Figure 1.1: Event Sequence Data

An example of an event sequence with 4 different event types (color coded by type)

One can easily see why a network that uses event sequence data could be applicable to a variety of

tasks. Consider, for instance, that an advertising company provides you with a shopping history

of a customer, and wants to be able to recommend the best product to them at any given time.

Or perhaps, given a user’s browsing history, you wanted to classify some aspect of the user, like

whether they are a parent or not. Other examples include recommending song choices to music

listeners, predicting what number a user will call next given their phone history, or classifying a

user’s behavior as anomalous given a log of their activities. The network is structured with these

types of tasks in mind. This type of architecture could be very useful for applications that attempt to gain insight into human behavior, and the LLM will hopefully have a memory highly

capable of mimicking that of a human.

We will seek to compare the LLM network against a similarly structured LSTM network in a

variety of trials in order to determine how the two networks compare in terms of performance on

different tasks and datasets.


Chapter 2

Background

2.1 Neural Networks

We now seek to introduce the class of models that the LLM belongs to, Recurrent Neural

Networks (RNN). First, however, we will briefly introduce neural networks. A neural network,

$NN(x)$, takes an input $x$ and has some target output $y$. Supposing the network has $n_l$ hidden layers and a training set $X_{train} = \{(x^{(s)}, y^{(s)})\}$, the vanilla neural network is defined as

$$NN = \operatorname*{argmin}_{f} \sum_{(x^{(s)},\, y^{(s)}) \in X_{train}} \mathrm{loss}\left(f\left(x^{(s)}\right), y^{(s)}\right) \qquad (2.1)$$

where $f(x) = \sigma_{n_l}\!\left(L_{n_l}\left(\sigma_{n_l-1}\left(L_{n_l-1}\left(\sigma_{n_l-2}\left(\ldots L_1(x)\right)\ldots\right)\right)\right)\right)$, with $L_i(x) = W_i x + b_i$, $W_i \in \mathbb{R}^{n_i \times m_i}$, $b_i \in \mathbb{R}^{n_i}$, and $\sigma_i : \mathbb{R}^{n_i} \to \mathbb{R}^{m_{i+1}}$ a set of fixed non-linear activation functions. The values of $W_i$ and $b_i$ are called the weights of the network. A visualization of this can be seen in Figure 2.1. It is important to note that in practice the neural net only approximates the argmin, as the optimization problem is typically non-convex.

2.1.1 Training a Neural Network

The actual training of a neural network is typically handled by the package the network is built in (in this case, TensorFlow). As long as one uses standard mathematical operations, the software is able to calculate the gradient of a function with respect to any relevant variables. Even so, we feel it is important to understand how these systems work beneath the surface.

Figure 2.1: Neural Network

Diagram of a vanilla neural network

We have previously introduced the idea that a neural network attempts to minimize the difference between the target and the output. One can then formulate the loss of the system on its training set as a function of the weights of the network. Then, one calculates the gradient of the loss function with respect to the weights. Finally, the weights are altered in the direction of greatest decrease of the loss. This process is called gradient descent. In practice, the loss and gradients are frequently calculated in batches, as opposed to on the entire set, in order to limit the overall memory cost to the computer.
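As a minimal sketch of the update just described (this is not the thesis's TensorFlow training code; the toy quadratic loss and learning rate are assumptions for illustration):

```python
import numpy as np

def sgd_step(weights, grads, lr=0.1):
    """One gradient-descent update: move each weight against its gradient."""
    return [w - lr * g for w, g in zip(weights, grads)]

# Toy loss L(w) = (w - 3)^2 with gradient 2 * (w - 3); repeated steps
# drive w toward the minimizer w = 3.
w = np.array([0.0])
for _ in range(1000):
    grad = 2.0 * (w - 3.0)
    [w] = sgd_step([w], [grad])
```

In a real network the gradient of the batch loss replaces the hand-written `grad` above.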

2.1.2 Backwards Propagation

The calculation of a neural network's gradient of the loss function is typically performed automatically by the software package the network is built in, such as TensorFlow. This is done via repeated use of the chain rule. First, the network is run forward, the output is calculated, and intermediary values of the network are stored. In the case of our vanilla network, we track the values $a_1 = W_1 x + b_1$ and $a_i = W_i\, \sigma(a_{i-1}) + b_i$ for $i \in \{2, \ldots, n_l\}$. Then the cost is, for instance,

$$C = \frac{1}{2}\left\|\sigma(a_{n_l}) - y\right\|^2$$

We calculate the first gradient,

$$\nabla_{a_{n_l}} C = \left(\sigma(a_{n_l}) - y\right) \odot \sigma'(a_{n_l})$$

and iteratively calculate the rest using the chain rule:

$$\frac{\partial C}{\partial a_i^j} = \sum_k \frac{\partial C}{\partial a_{i+1}^k}\, \frac{\partial a_{i+1}^k}{\partial a_i^j}, \qquad \frac{\partial a_{i+1}^k}{\partial a_i^j} = \left(W_{i+1}\right)_{kj} \sigma'\!\left(a_i^j\right)$$

This allows the final gradient calculations:

$$\frac{\partial C}{\partial \left(W_i\right)_{kj}} = \frac{\partial C}{\partial a_i^k}\, \sigma\!\left(a_{i-1}^j\right), \qquad \frac{\partial C}{\partial \left(b_i\right)_j} = \frac{\partial C}{\partial a_i^j}$$

The details of backwards propagation vary if the network structure is different, but the same general pattern holds: a forward pass followed by repeated use of the chain rule. In this way, a package such as TensorFlow only needs to keep track of how to calculate the gradient of each function, and of the interactions of the functions that the network uses.
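The forward pass and repeated chain rule above can be sketched for a tiny two-layer network with sigmoid activations. This is an illustrative NumPy version, not the thesis's code (which relies on TensorFlow's automatic differentiation); the finite-difference check at the end is an added sanity test:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

# Tiny two-layer network: a1 = W1 x + b1, a2 = W2 sigma(a1) + b2.
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)
x, y = rng.normal(size=2), rng.normal(size=2)

def forward(W1):
    a1 = W1 @ x + b1
    a2 = W2 @ sigma(a1) + b2
    return a1, a2, 0.5 * np.sum((sigma(a2) - y) ** 2)  # C = 1/2 ||sigma(a2) - y||^2

# Backward pass: the chain rule applied layer by layer, as in the text.
a1, a2, C = forward(W1)
dC_da2 = (sigma(a2) - y) * sigma(a2) * (1 - sigma(a2))   # gradient at the output layer
dC_da1 = (W2.T @ dC_da2) * sigma(a1) * (1 - sigma(a1))   # propagate back one layer
dC_dW1 = np.outer(dC_da1, x)                             # dC/d(W1)_{kj} = dC/da1_k * x_j

# Sanity check against a finite-difference approximation of the gradient.
eps = 1e-6
num = np.zeros_like(W1)
for k in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp = W1.copy()
        Wp[k, j] += eps
        num[k, j] = (forward(Wp)[2] - C) / eps
```

The analytic and numerical gradients agree to within the finite-difference error, which is exactly the kind of bookkeeping an autodiff package performs for every primitive operation.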

2.1.3 Recurrent Neural Networks

The recurrent neural network introduces a twist on the neural network structure, in the form

of memory. An RNN processes an entire sequence of data, $x$, iteratively handling each element $x_k$. At each step, the network's input is both the original input $x_k$ and the previous output of the network, $o_{k-1}$. This allows context to be passed from element to element, so that the network

is better able to make use of sequences of inputs that are related. We present the vanilla RNN in

Figure 2.2.

Figure 2.2: Recurrent Neural Network

Diagram of a vanilla recurrent neural network



2.1.3.1 LSTM

The LSTM is one of the most ubiquitous RNNs in practice. Its key difference from the vanilla RNN is the introduction of memory traces. While the vanilla RNN has a method of passing information to the next state, its memory is restricted to its output, which inherently limits the network's ability to maintain long-term memory. The vanilla LSTM addresses this by taking as input the current input $x_k$, the previous output $o_{k-1}$, and the previous memory $h_{k-1}$. At the start of the cell calculation, we set

$$z_k = [x_k, o_{k-1}]$$

as a concatenation of the input and previous output. Next we calculate the forget gate,

$$f_k = \mathrm{sigmoid}\left(W^f z_k + b^f\right)$$

This gate determines how much of the previous memory passes through to the next step. Because the gates are activated by a sigmoid function, all of their values are mapped between 0 and 1, so that when the elementwise multiplication ($\odot$) is calculated between the gate and the memory, it "gates" the memory: it maintains anywhere between none and all of the previous values, but cannot negate or add to them. We then calculate what our memory update will be, as well as a gate for how much of that update will be added to the memory cell:

$$g_k = \mathrm{sigmoid}\left(W^g z_k + b^g\right)$$

$$\hat{h}_k = \tanh\left(W^h z_k + b^h\right)$$

These are used to update the memory cell:

$$h_k = h_{k-1} \odot f_k + \hat{h}_k \odot g_k$$

Finally, the output is calculated from our new memory trace, as well as the input values:

$$o_k = \mathrm{sigmoid}\left(W^o z_k + b^o\right) \odot \tanh\left(h_k\right)$$



Figure 2.3: LSTM Cell

Image from http://colah.github.io/posts/2015-08-Understanding-LSTMs/

The LSTM’s introduction of memory traces allows for long term dependencies to develop. It allows

for much more robust pattern identification, and has become almost synonymous with recurrent

nets due in large part to the success it has achieved. Still, we believe that there is room for better

implementation of temporal data within the framework of a recurrent net than simply another input

fed into an LSTM. Memory decay should be more explicitly tied to the temporal data, otherwise

a burst of events in a short time can quickly decay the memory of the cells. For an

excellent blog post describing the functionality and computation behind the LSTM network, see

http://colah.github.io/posts/2015-08-Understanding-LSTMs/ [9].
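As a concrete sketch, one step of the LSTM cell described above can be written in NumPy as follows; the sizes and random weights here are illustrative assumptions, not the configuration used in the experiments:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_k, o_prev, h_prev, W, b):
    """One LSTM cell update following the equations above.
    W and b hold the parameters of the four gates, keyed 'f', 'g', 'h', 'o'."""
    z = np.concatenate([x_k, o_prev])        # z_k = [x_k, o_{k-1}]
    f = sigmoid(W['f'] @ z + b['f'])         # forget gate
    g = sigmoid(W['g'] @ z + b['g'])         # update gate
    h_hat = np.tanh(W['h'] @ z + b['h'])     # candidate memory
    h = h_prev * f + h_hat * g               # gated memory update
    o = sigmoid(W['o'] @ z + b['o']) * np.tanh(h)
    return o, h

rng = np.random.default_rng(1)
nx, nh = 4, 3                                # input and hidden sizes (assumed)
W = {k: rng.normal(size=(nh, nx + nh)) for k in 'fgho'}
b = {k: np.zeros(nh) for k in 'fgho'}
o, h = lstm_step(rng.normal(size=nx), np.zeros(nh), np.zeros(nh), W, b)
```

Note that nothing in the update depends on the time between events, which is the limitation the LLM is designed to address.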
Chapter 3

Lifetime Limited Memory

In this section, we will describe the structure of the Lifetime Limited Memory model developed through the course of the project. The LLM is meant to serve as an adaptation of the LSTM

previously described, with a focus on structuring temporal data. The LLM should be better able to

handle sequences with large varieties of time gaps between events. This is due to the fact that the

LLM memory only decays with time, whereas the LSTM forgets memory at every single element of

a sequence. It also only considers the memory traces for the output, restricting the dependency of

the prediction to the memory traces. This network differs from the previously explored CT-GRU[7] in a few key ways. First, the LLM is structured as a variation of the LSTM as opposed to a GRU unit. Second, while both contain multiple timescales, the CT-GRU mimics assigning a single timescale to each event: it determines the timescales of retrieval and storage for an event and masks signals accordingly when updating and retrieving memory, and it collapses all memory traces within a cell to one value for prediction, whereas the LLM allows the memory traces within a cell to remain separate. Finally, the CT-GRU gates off memory at each element, as

Index Meaning Range


i Output Class ∈ {1, ..., nc }
j Timescale ∈ {1, ..., ns }
k Position in Sequence ∈ {1, ..., ne }
l Hidden Unit ∈ {1, ..., nh }
m Input Class ∈ {1, ..., nx }
s Sequence in Dataset ∈ {1, ..., N }

Table 3.1: Meaning of Indices


Tensor Purpose
h ∈ Rne ×ns ×nh Memory Traces
f ∈ Rne ×nh Memory Update
g ∈ Rne ×nh Memory Update Gate
W ∗ ∈ Rnh ×nx Input Matrix
U ∗ ∈ Rnh ×ns ×nh Memory Matrix
V ∈ Rnc ×ns ×nh Output Matrix

Table 3.2: Size of Tensors in LLM

well as decays memory over time, whereas the LLM only decays memory with time.

We consider a network with $n_c$ output classes, $n_x$ input classes, $n_h$ hidden units, and $n_s$ log-linearly spaced timescales. The LLM takes as input a sequence of event labels, and produces outputs for a specified task as described in Section 3.2. Let $X_{train}$ be the training set with $N$ data points, and let $x^{(s)} \in X_{train}$ be a sequence with $n_e^{(s)}$ events. For simplicity moving forward, we drop the $(s)$ superscript and refer to a generic sequence $x$. Each event $x_k$ in the sequence is a tuple $(e_k, \Delta t_k) \in (\mathbb{Z}^{n_x} \times \mathbb{R})$, where $e_k$ is either a one-hot or signed one-hot vector depending on the task. The memory trace in hidden neuron $l$ associated with decay timescale $j$ is initialized $h_{0,j,l} = 0$, and the timescales $\gamma_j$ are fixed at the initialization of the network. Then, for each event $x_k$, $k > 0$, the following updates are made:

$$h'_{k,j} = e^{-\gamma_j \Delta t_k} \cdot h_{k-1,j}$$

$$f_k = \tanh\left(W^f e_k + \sum_{j=1}^{n_s} U_j^f h'_{k,j} + b^f\right)$$

$$g_k = \mathrm{sigmoid}\left(W^g e_k + \sum_{j=1}^{n_s} U_j^g h'_{k,j} + b^g\right) \qquad (3.1)$$

$$h_{k,j} = h'_{k,j} + f_k \odot g_k$$

$$o_k = \mathrm{activation}\left(\sum_{j=1}^{n_s} e^{-\gamma_j \Delta t_{k+1}}\, V_j h_{k,j} + b^o\right)$$

where the activation function and the loss function are determined by the task. In the above calculations, we can think of the vector $f_k$ as the new memory to be learned and the vector $g_k$ as the gate imposed on that memory, similar to the LSTM. A key point is that while each hidden cell contains $n_s$ memory traces, they are all updated by the single value $f_{k,l} \cdot g_{k,l}$, so they really represent a single "memory", but on multiple timescales. A visualization of these operations can be seen in Figures 3.1 and 3.2.

Figure 3.1: Lifetime Limited Memory Network

Overall structure of the LLM network



$$h'_{k,j} = e^{-\gamma_j \Delta t_k} \cdot h_{k-1,j} \qquad f_k = \tanh\left(W^f e_k + \sum_{j=1}^{n_s} U_j^f h'_{k,j} + b^f\right)$$

$$g_k = \mathrm{sigmoid}\left(W^g e_k + \sum_{j=1}^{n_s} U_j^g h'_{k,j} + b^g\right)$$

$$h_{k,j} = h'_{k,j} + f_k \odot g_k \qquad o_k = \mathrm{activation}\left(\sum_{j=1}^{n_s} e^{-\gamma_j \Delta t_{k+1}}\, V_j h_{k,j} + b^o\right)$$

Figure 3.2: Lifetime Limited Memory Cell
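A minimal NumPy sketch of the cell update in equation (3.1) follows; the tensor shapes follow Table 3.2, but the sizes, random weights, and einsum formulation are illustrative assumptions rather than the thesis's TensorFlow implementation:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def llm_step(e_k, dt_k, h_prev, gammas, Wf, Wg, Uf, Ug, bf, bg):
    """One LLM cell update following equation (3.1).
    h_prev has shape (ns, nh): one memory trace per timescale."""
    decay = np.exp(-gammas * dt_k)[:, None]    # e^{-gamma_j dt_k}, per timescale
    h_dec = decay * h_prev                     # decayed traces h'_{k,j}
    rec_f = np.einsum('jhl,jl->h', Uf, h_dec)  # sum_j U^f_j h'_{k,j}
    rec_g = np.einsum('jhl,jl->h', Ug, h_dec)  # sum_j U^g_j h'_{k,j}
    f = np.tanh(Wf @ e_k + rec_f + bf)         # new memory content
    g = sigmoid(Wg @ e_k + rec_g + bg)         # update gate
    return h_dec + f * g                       # same update applied on every timescale

rng = np.random.default_rng(2)
ns, nh, nx = 5, 3, 4                           # small illustrative sizes
gammas = np.logspace(-2, 2, ns)
Wf, Wg = rng.normal(size=(nh, nx)), rng.normal(size=(nh, nx))
Uf, Ug = 0.1 * rng.normal(size=(ns, nh, nh)), 0.1 * rng.normal(size=(ns, nh, nh))
h = llm_step(np.eye(nx)[0], 0.5, np.zeros((ns, nh)), gammas,
             Wf, Wg, Uf, Ug, np.zeros(nh), np.zeros(nh))
```

Because the same increment $f_k \odot g_k$ is added to every timescale's trace, the traces only come to differ through their decay rates, which is the point made in the text.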

3.1 Timescale Determination

The timescales are determined as follows. Let

$$M = \max\left(\Delta t \in X_{train}\right) \quad \text{and} \quad L = \min\left(\Delta t \in X_{train}\right)$$

Then

$$r = \left(\frac{M}{L}\right)^{\frac{1}{n_s - 3}}, \qquad d = 1 - \frac{\log(L)}{\log(r)}$$

where $r$ determines the log-linear spacing, and $d$ shifts the timescales to the correct starting point. This yields

$$\gamma_j = \frac{1}{r^{j-d-1}} = \frac{1}{L\, r^{j-2}}, \quad \text{for } j \in \{1, \ldots, n_s\}$$

This gives full coverage of the potential $\Delta t$, as well as one smaller and one larger timescale, since $1/\gamma_2 = L$ and $1/\gamma_{n_s - 1} = M$.
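The construction above can be sketched directly in code; taking the exponent's denominator to be $n_s - 3$ is an assumption consistent with the endpoint claims:

```python
import numpy as np

def timescales(dts, ns):
    """Log-linearly spaced decay rates covering [L, M] = [min dt, max dt],
    following the construction above."""
    M, L = max(dts), min(dts)
    r = (M / L) ** (1.0 / (ns - 3))      # ratio between adjacent timescales
    j = np.arange(1, ns + 1)
    return 1.0 / (L * r ** (j - 2))      # gamma_j = 1 / (L r^{j-2})

gammas = timescales([0.5, 2.0, 8.0, 128.0], ns=7)
tau = 1.0 / gammas                       # the timescales themselves
```

With these made-up $\Delta t$ values, `tau[1]` recovers $L$ and `tau[-2]` recovers $M$, with one timescale below $L$ and one above $M$.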

3.2 Tasks

There are three tasks that the model has been adapted to perform: prediction, classification,

and predicting correctness.

3.2.1 Prediction

In this task, at every element in the sequence, the network attempts to predict which event is most likely to occur next in the sequence, given the time until that event occurs. An example of this task would be: given someone's phone call history, predict who they are most likely to call next. The activation function for this task is

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{i'=1}^{n_c} e^{z_{i'}}}$$

and the associated loss function is

$$\mathrm{loss}(o, y) = -\sum_{k=1}^{n_e} \sum_{i=1}^{n_c} \left(y \odot \log(o)\right)_{i,k}$$
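A sketch of this activation and loss in NumPy, with a hypothetical 3-class, 2-event example (the logits and targets are made up for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=0)                # shift for numerical stability
    return np.exp(z) / np.exp(z).sum(axis=0)

def prediction_loss(o, y):
    """Cross-entropy summed over classes and sequence positions.
    o and y have shape (nc, ne); each column of y is one-hot."""
    return -np.sum(y * np.log(o))

logits = np.array([[2.0, 0.0],
                   [0.0, 1.0],
                   [0.0, 0.0]])          # nc = 3 classes, ne = 2 events
o = softmax(logits)
y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])               # true next events, one-hot per column
loss = prediction_loss(o, y)
```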

3.2.2 Classification

This task attempts to classify an entire event sequence with a single label. A trivial example of this task would be to learn to classify a sequence by the most frequent event occurring within it. The activation and loss functions for this task depend only upon the final output in the series. This is because there is only one "prediction" to make for the entire sequence, and we only care how correct it is at the end of the sequence. The activation function is

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{i'=1}^{n_c} e^{z_{i'}}}$$

and the associated loss function is

$$\mathrm{loss}(o, y) = -\sum_{i=1}^{n_c} \left(y \odot \log(o)\right)_{i, n_e}$$

3.2.3 Predicting Correctness

This task attempts to learn to assign binary labels to each event in an event sequence. The example that we explore in this report involves predicting whether a student will answer correctly or incorrectly on a sequence of question types. The target output here is instead a signed one-hot vector, so the activation and loss functions need to be adjusted. Further, we only care about predicting the polarity of the index associated with the upcoming event. Given that, the activation function for this task is the tanh function, and the loss function is

$$\mathrm{loss}(o, y) = -\sum_{k=1}^{n_e} \sum_{i=1}^{n_c} \left(\mathrm{abs}(y_k) \odot \log\left(\left(y \odot o + 1\right)/2\right)\right)_{i,k}$$
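A sketch of this loss in NumPy; the signed one-hot targets and tanh outputs below are small made-up values for illustration:

```python
import numpy as np

def correctness_loss(o, y):
    """Loss for the predicting-correctness task.
    o: tanh outputs in (-1, 1), shape (nc, ne).
    y: signed one-hot targets in {-1, 0, +1}; abs(y) masks out every class
    except the one associated with the upcoming event."""
    return -np.sum(np.abs(y) * np.log((y * o + 1.0) / 2.0))

o = np.array([[0.8, -0.5],
              [0.1,  0.3]])              # made-up network outputs (nc=2, ne=2)
y = np.array([[1.0,  0.0],
              [0.0, -1.0]])              # correct at event 1, incorrect at event 2
loss = correctness_loss(o, y)
```

The term $(y \odot o + 1)/2$ maps a tanh output toward 1 when its sign matches the target, so a confident correct prediction contributes little loss.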

3.3 Calculating Gradients

In most packages used for training neural networks, the gradient for each weight is automatically calculated using backpropagation for most common functions. Despite this, we take as an exercise the task of calculating the gradient of the cost function with respect to the weights of our network for the prediction task. For any iteration of the network, all variables are first calculated, then used to calculate the gradients. First, we define

$$C = -\sum_{k=1}^{n_e} \sum_{i=1}^{n_c} \left(y \odot \log(o)\right)_{i,k}$$

as our cost function for the sequence. Then

$$\frac{\partial C}{\partial o_{i,k}} = \frac{-y_{i,k}}{o_{i,k}}$$

serves as the basis for our gradient calculation. For the following calculations, we let $i$ correspond to the output class, $j$ to the timescale of a memory trace, $k$ to the element of the sequence, $l$ to the hidden cell label, and $m$ to the input class. We need to calculate $\frac{\partial C}{\partial V_{i,j,l}}$, $\frac{\partial C}{\partial W^*_{l,m}}$, and $\frac{\partial C}{\partial U^*_{j_1,l_1,l_2}}$, where $* \in \{f, g\}$.

$$\frac{\partial C}{\partial V_{i,j,l}} = \sum_k \frac{\partial C}{\partial o_{i,k}}\, o_{i,k}\left(1 - o_{i,k}\right) e^{-\Delta t_{k+1} \gamma_j}\, h_{j,k,l} = \sum_k y_{i,k}\left(o_{i,k} - 1\right) e^{-\Delta t_{k+1} \gamma_j}\, h_{j,k,l}$$

due to the fact that $\frac{d}{dx}\,\mathrm{softmax}(f(x)) = \mathrm{softmax}(f(x))\left(1 - \mathrm{softmax}(f(x))\right) f'(x)$. The other

partial derivatives need to be calculated iteratively, and depend on the derivatives of $h$:

$$\frac{\partial C}{\partial W^*_{l,m}} = \sum_k \sum_i -y_{i,k}\left(1 - o_{i,k}\right) \sum_j \sum_{l_2} V_{i,j,l_2}\, e^{-\Delta t_{k+1} \gamma_j}\, \frac{\partial h_{j,k,l_2}}{\partial W^*_{l,m}}$$

$$\frac{\partial C}{\partial U^*_{j_1,l_1,l_2}} = \sum_k \sum_i -y_{i,k}\left(1 - o_{i,k}\right) \sum_j \sum_{l_3} V_{i,j,l_3}\, e^{-\Delta t_{k+1} \gamma_j}\, \frac{\partial h_{j,k,l_3}}{\partial U^*_{j_1,l_1,l_2}}$$

Next, we need to calculate these derivatives of $h$ incrementally, noting that $\frac{\partial h_{j,k=0,l}}{\partial X} = 0$ for any weight $X$. Then, assuming we know $\frac{\partial h_{j,k-1,l_2}}{\partial W^f_{l,m}} = D_{j,l_2}$,

$$\frac{\partial h_{j,k,l_2}}{\partial W^f_{l,m}} = e^{-\Delta t_k \gamma_j} D_{j,l_2} + f_{k,l_2}\, \frac{\partial g_{k,l_2}}{\partial W^f_{l,m}} + g_{k,l_2}\, \frac{\partial f_{k,l_2}}{\partial W^f_{l,m}}$$

$$\frac{\partial f_{k,l_2}}{\partial W^f_{l,m}} = \left(1 - f_{k,l_2}^2\right) \left(\delta^l_{l_2}\, x_{k,m} + \sum_{j_2} \sum_{l_3} U^f_{j_2,l_2,l_3}\, D_{j_2,l_3}\right)$$

$$\frac{\partial g_{k,l_2}}{\partial W^f_{l,m}} = g_{k,l_2}\left(1 - g_{k,l_2}\right) \sum_{j_2} \sum_{l_3} U^g_{j_2,l_2,l_3}\, D_{j_2,l_3}$$

where $\delta^l_{l_2}$ is the Kronecker delta function ($= 1$ if $l = l_2$, $0$ otherwise). This fully gives us an algorithm for generating the derivatives of $h$ incrementally with respect to $W^f$. This can then be plugged back
into the previous derivative. Similarly, if we let $\frac{\partial h_{j,k-1,l_2}}{\partial W^g_{l,m}} = D_{j,l_2}$, we find

$$\frac{\partial h_{j,k,l_2}}{\partial W^g_{l,m}} = e^{-\Delta t_k \gamma_j} D_{j,l_2} + f_{k,l_2}\, \frac{\partial g_{k,l_2}}{\partial W^g_{l,m}} + g_{k,l_2}\, \frac{\partial f_{k,l_2}}{\partial W^g_{l,m}}$$

$$\frac{\partial f_{k,l_2}}{\partial W^g_{l,m}} = \left(1 - f_{k,l_2}^2\right) \sum_{j_2} \sum_{l_3} U^f_{j_2,l_2,l_3}\, D_{j_2,l_3}$$

$$\frac{\partial g_{k,l_2}}{\partial W^g_{l,m}} = g_{k,l_2}\left(1 - g_{k,l_2}\right) \left(\delta^l_{l_2}\, x_{k,m} + \sum_{j_2} \sum_{l_3} U^g_{j_2,l_2,l_3}\, D_{j_2,l_3}\right)$$

Finally, we calculate our derivatives of $h$ with respect to $U^*$. Letting $\frac{\partial h_{j,k-1,l_3}}{\partial U^f_{j_1,l_1,l_2}} = D_{j,l_3}$, we find

$$\frac{\partial h_{j,k,l_3}}{\partial U^f_{j_1,l_1,l_2}} = e^{-\Delta t_k \gamma_j} D_{j,l_3} + f_{k,l_3}\, \frac{\partial g_{k,l_3}}{\partial U^f_{j_1,l_1,l_2}} + g_{k,l_3}\, \frac{\partial f_{k,l_3}}{\partial U^f_{j_1,l_1,l_2}}$$

$$\frac{\partial f_{k,l_3}}{\partial U^f_{j_1,l_1,l_2}} = \left(1 - f_{k,l_3}^2\right) \left(\delta^{l_1}_{l_3}\, h'_{j_1,k,l_2} + \sum_{j_2} \sum_{l_4} U^f_{j_2,l_3,l_4}\, D_{j_2,l_4}\right)$$

$$\frac{\partial g_{k,l_3}}{\partial U^f_{j_1,l_1,l_2}} = g_{k,l_3}\left(1 - g_{k,l_3}\right) \sum_{j_2} \sum_{l_4} U^g_{j_2,l_3,l_4}\, D_{j_2,l_4}$$

and letting $\frac{\partial h_{j,k-1,l_3}}{\partial U^g_{j_1,l_1,l_2}} = D_{j,l_3}$,

$$\frac{\partial h_{j,k,l_3}}{\partial U^g_{j_1,l_1,l_2}} = e^{-\Delta t_k \gamma_j} D_{j,l_3} + f_{k,l_3}\, \frac{\partial g_{k,l_3}}{\partial U^g_{j_1,l_1,l_2}} + g_{k,l_3}\, \frac{\partial f_{k,l_3}}{\partial U^g_{j_1,l_1,l_2}}$$

$$\frac{\partial f_{k,l_3}}{\partial U^g_{j_1,l_1,l_2}} = \left(1 - f_{k,l_3}^2\right) \sum_{j_2} \sum_{l_4} U^f_{j_2,l_3,l_4}\, D_{j_2,l_4}$$

$$\frac{\partial g_{k,l_3}}{\partial U^g_{j_1,l_1,l_2}} = g_{k,l_3}\left(1 - g_{k,l_3}\right) \left(\delta^{l_1}_{l_3}\, h'_{j_1,k,l_2} + \sum_{j_2} \sum_{l_4} U^g_{j_2,l_3,l_4}\, D_{j_2,l_4}\right)$$

So, it is certainly possible to compute the gradient in this manner, but one can easily see

why, in practice, gradients are handled by the package the network is built with.
Chapter 4

Experiments

In order to test the effectiveness of the network, we train both an LLM model and an LSTM

model on several different datasets. We both developed synthetic datasets and used natural datasets found from various sources.

4.1 Datasets

4.1.1 Synthetic Datasets

The synthetic datasets discussed below were created to attempt to mimic potential tasks that

the network will face in natural datasets. These datasets were adapted from tasks introduced in

Discrete-Event Continuous-Time Recurrent Nets [7].

Dataset           Number of Classes   M            L            Number of Sequences
Github            101                 4.07 × 10^5  2.78 × 10^−4  ∼ 2.2 × 10^6
Dota              10                  4.54 × 10^3  3.33 × 10^−2  ∼ 3.0 × 10^5
Dota Class        10                  4.26 × 10^3  1.00 × 10^0   ∼ 3.8 × 10^4
Freecodecamp      452                 5.07 × 10^6  5.56 × 10^−4  ∼ 6.4 × 10^4
Reddit            50                  2.16 × 10^3  1.00 × 10^−5  ∼ 3.1 × 10^4
Reddit Comments   101                 7.08 × 10^2  2.78 × 10^−4  ∼ 2.0 × 10^4
Quizlet           101                 5.06 × 10^4  2.78 × 10^−4  ∼ 1.7 × 10^5

Table 4.1: Natural Dataset Statistics



4.1.1.1 Accumulator

The accumulator dataset has sequences of events generated by Poisson processes. Each Poisson process has a randomly generated rate, and events are labelled by the process that generated them. When an event is generated, its accumulator is incremented by 1. Over time, the accumulators for each label decay exponentially with some decay rate $\lambda$. At the end of the sequence, the sequence is labelled by the accumulator with the greatest value. An example from this dataset can be seen in Figure 4.1. In this figure, the accumulators are combined into a single function, with one accumulator represented as positive and the other as negative. Due to the linearity of the decay, this is functionally equivalent:

$$\sum_{k_1 \in K_1} e^{-\lambda t_{k_1}} > \sum_{k_2 \in K_2} e^{-\lambda t_{k_2}} \iff \sum_{k \in K_1 \cup K_2} \left(\delta_{k,K_1} - \delta_{k,K_2}\right) e^{-\lambda t_k} > 0$$

where $\delta_{k,K} = 1$ if $k \in K$ and $0$ otherwise.
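The end-of-sequence labelling rule can be sketched as follows; the event times, labels, and decay rate are made-up values for illustration (the thesis generates the events from Poisson processes with random rates):

```python
import numpy as np

def accumulator_label(times, labels, lam, n_types, t_end):
    """Label a sequence by whichever event type's exponentially decayed
    count is largest at the end of the sequence."""
    acc = np.zeros(n_types)
    for t, e in zip(times, labels):
        acc[e] += np.exp(-lam * (t_end - t))   # each event's count, decayed to t_end
    return int(np.argmax(acc))

# Type 0 fires three times early; type 1 fires twice just before the end,
# so its decayed accumulator is larger and it determines the label.
times  = [0.1, 0.2, 0.3, 5.0, 5.1]
labels = [0,   0,   0,   1,   1]
winner = accumulator_label(times, labels, lam=1.0, n_types=2, t_end=5.2)
```

This illustrates why the task requires temporal reasoning: a count-based classifier that ignores timing would pick type 0 here.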

4.1.1.2 Rhythm

The Rhythm dataset contains events followed by a set lag. Each event label corresponds

to a lag time (i.e. 1, 2, 4, 8...) between that event and when the next will occur. Each event

is chosen uniformly from the different event types. The task itself is presented as a classification,

where a sequence is labelled with a 1 if it is generated with the normal lag times, or a 2 if it is generated with altered lag times. This can be thought of as similar to anomaly

detection, where an ideal classifier would identify the sequence as 1 up until it “sees” a changed

lag time. There is an example sequence from this dataset in Figure 4.2.

4.1.1.3 Hawkes

The Hawkes dataset contains events generated by a Hawkes process. In a Hawkes process, events are generated by a point process in which the intensity associated with an event type is increased every time that event occurs, and then decays over time to a base value[4]. Such a process is called self-exciting, and it describes processes that tend to "burst", where one occurrence of an event signals a likely reoccurrence of that event. Events that are frequently modeled by this

Figure 4.1: Accumulator

Visualization of a two-event-type accumulator sequence and the associated classifier function over time.

Figure 4.2: Rhythm Data



process are earthquakes, which tend to cause aftershocks within a short time of the initial quake.

A visualization of this type of sequence can be seen in Figure 4.3. For this dataset, we attempt to

predict which event will occur next.
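A univariate Hawkes process with an exponential kernel can be simulated with Ogata's thinning algorithm; this sketch and its parameters are illustrative assumptions, not the thesis's data-generation code:

```python
import numpy as np

def simulate_hawkes(mu, alpha, beta, t_max, rng):
    """Simulate a univariate Hawkes process by Ogata's thinning, assuming
    an exponential kernel:
    intensity(t) = mu + alpha * sum_i exp(-beta * (t - t_i))."""
    events, t = [], 0.0
    while True:
        # The current intensity upper-bounds the intensity until the next
        # event, since the excitation only decays between events.
        lam_bar = mu + alpha * sum(np.exp(-beta * (t - ti)) for ti in events)
        t += rng.exponential(1.0 / lam_bar)      # candidate event time
        if t > t_max:
            return events
        lam_t = mu + alpha * sum(np.exp(-beta * (t - ti)) for ti in events)
        if rng.random() <= lam_t / lam_bar:      # accept with prob lam(t)/lam_bar
            events.append(t)

rng = np.random.default_rng(3)
ev = simulate_hawkes(mu=0.5, alpha=0.8, beta=1.5, t_max=50.0, rng=rng)
```

With `alpha / beta < 1` the process is stable, and each accepted event temporarily raises the intensity, producing the bursty inter-event gaps the prediction task exploits.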

4.1.2 Natural Datasets

4.1.2.1 Github

The Github dataset that we tested on was pulled from the Github BigQuery dataset, as

announced at https://github.blog/2017-01-19-github-data-ready-for-you-to-explore-with-bigquery/

[12]. The dataset contains lists of commits to different repositories by users, and associated timestamps. For the purposes of the model, each repository was considered a sequence, and each commit

was considered an event. The events were labelled by the user numerically, determined via order

of appearance for the specific repository. We also restricted our dataset to only include sequences

with at least two distinct committing users and at least 9 commits.

4.1.2.2 Dota

The Dota datasets are pulled from the popular online game Dota 2. They contain chat logs between two teams of five during a match. The logs record who is chatting, what is said, and when each message is sent, as well as various metadata, including which team won. Events were labelled by the anonymized identifier of the associated user. Notably, labels

are always associated to the same team (i.e. 1-5 with team 1, 6-10 with team 2). Two data sets

were developed, one for predicting the next user to chat (prediction), and another for predicting

the outcome of the match (classification). The dataset was made available at kaggle.com by user

devinanzelmo[3].

4.1.2.3 Freecodecamp

The Freecodecamp dataset contains anonymized data regarding learners’ progress through

online coding courses.

Figure 4.3: Hawkes Process Data

Hawkes Process event sequence and the associated intensities for each event type

Each student's progress was regarded as a sequence, and there were data from approximately 60,000 students, each with at least 9 events, between the test and training sets. Each completion

of a module was considered an event, with about 450 different modules. Labelling of events was

consistent for the same module across different sequences. This resulted in large (length 450) input

vectors, so we truncate sequences to 50 elements for this dataset. Students were able to complete

the same module multiple times, and modules were not necessarily completed in a specific order,

although some tended to follow others. We initially found the dataset through a blog post[11], but

at this point the dataset is no longer available. There is a copy of this dataset available at https:

//drive.google.com/file/d/1eG0ojFRbqiWIwpdgSn32DPG4AhbC7qbb/view?usp=sharing.

4.1.2.4 Reddit

There were two datasets we tested on from the popular online forum, Reddit. In the first

dataset, an event sequence is associated to each user’s posting history. For this dataset, events are

labelled by which subreddit the user posts to. The second dataset associates an event sequence to

a post. Each event in the sequence is a comment on the post, and is labelled by the user making

the comment. This dataset was made accessible by a reddit user at https://www.reddit.com/r/

datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/[10].

4.1.2.5 Quizlet

The Quizlet dataset is the only example of the predicting correctness task in this paper. The

Quizlet dataset involves anonymized data regarding students’ performance on vocabulary. Students’ performance on specific vocabulary words is tracked over time, and our network attempts to

predict how likely a student is to correctly identify a vocabulary word.

4.2 Experimental Procedure

The process for comparing these two networks evolved considerably over time. The network itself was developed primarily using the tools provided by the tensorflow package in Python3. Throughout the project, we consistently used 14 timescales for our hidden cells, which were originally set as constants but were later dynamically calculated by the method described in Chapter 3. The learning rate was set to 10^-3, and we used the Adam optimizer provided by tensorflow, as introduced in Kingma & Ba, 2015 [6] with the recommended parameter values. The gradient itself was calculated through tensorflow, as all operations, if not the cell itself, were standard in tensorflow. All code used within

this project can be found at https://github.com/mmaierhofer97/LLM.
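For illustration, the Adam update rule referenced above can be sketched in numpy with the recommended parameter values from Kingma & Ba (β1 = 0.9, β2 = 0.999, ε = 10^-8) and the learning rate used here; the experiments themselves rely on the tensorflow-provided optimizer, not this hand-rolled version.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2015): `m` and `v` are running
    estimates of the first and second moments of the gradient, and
    `t` is the 1-based step count used for bias correction."""
    m = b1 * m + (1 - b1) * grad           # update biased first moment
    v = b2 * v + (1 - b2) * grad ** 2      # update biased second moment
    m_hat = m / (1 - b1 ** t)              # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```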

Initially, we took care to experiment with a consistent number of hidden units, typically 100.

For most datasets, the network only loaded the first 100 (or otherwise specified) events of a sequence for the prediction and predicting correctness tasks in order to help limit memory errors. We

note at this point that all experiments were run on an Nvidia GeForce GTX 970, which contributed

to some of our memory restrictions. We were able to limit some of the memory issues by reducing

minibatch sizes (from 64 to 8) for some of our datasets, but further shrinkage made training times

too lengthy for our experimentation. The sequences associated with the classification task could

not be truncated, as the entirety of the sequence might be essential to the class the sequence belongs to. For instance, consider a task where a sequence is classified solely on whether it contains an event of a certain type. This classification cannot occur without full knowledge of the

sequence. Prediction, however, makes a prediction at each timestep, and it is solely dependent on

the events that occur before it. As a result, while a network might, and should, become better at

predicting further into the sequence, the predictions up to the truncation are the same regardless

of the following events.

When training on a specific dataset, we typically randomly drew 4000 sequences each for the

training and test sets from the dataset, and then drew 10% of the training set as a validation set.

These random draws were done according to a seed such that the same datasets could be used

for the LLM and LSTM testing. This was done because typically only 10 trials were performed for a given set of parameters, so a single dataset draw that a network struggles to learn on wouldn’t artificially make one of the networks seem to perform better. (It is certainly still possible that some dataset draws favor one network over the other relative to the average; this can only be minimized by doing more trials.) The pairing of trials

with a random seed was implemented when we realized that there was a high variance in network

performance, preventing any confidence in declaring one model better than the other for the dataset.
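The seeded drawing of paired datasets can be sketched as follows; the draw sizes match those described above, but the function itself is our own illustration rather than the thesis code.

```python
import random

def paired_split(sequences, seed, n_train=4000, n_test=4000, val_frac=0.1):
    """Draw train/validation/test sets from a fixed seed so that the
    LLM and LSTM trials sharing a seed see identical data."""
    rng = random.Random(seed)           # per-trial generator, fully seeded
    pool = list(sequences)
    rng.shuffle(pool)
    train = pool[:n_train]
    test = pool[n_train:n_train + n_test]
    n_val = int(len(train) * val_frac)  # 10% of training as validation
    val, train = train[:n_val], train[n_val:]
    return train, val, test
```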

The validation set that we create for each instance of the network is used for determining when

the network has finished learning and is beginning to overfit. It is also used when selecting the number of hidden units to build the network with for each dataset, as discussed in 4.3.2. It

is important to note here that there was a period of time during testing where an incompatibility

between Python’s random package and numpy package caused the validation and training sets to

contain the same data, as described in https://www.reddit.com/r/Python/comments/42uqqo/

a_nasty_little_bug_involving_random_and_numpy/. While this was later fixed, it did lead to

some interesting revelations that we will address in 4.3.1.
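Validation halting of this kind is commonly implemented as a patience loop; the sketch below is a generic version of the pattern (the patience value is an arbitrary choice, not the one used in our experiments).

```python
def train_with_halting(train_epoch, val_accuracy, max_epochs=100, patience=5):
    """Run `train_epoch()` until the validation accuracy returned by
    `val_accuracy()` fails to improve for `patience` consecutive
    epochs, which we take as the onset of overfitting."""
    best, since_best = -1.0, 0
    for _ in range(max_epochs):
        train_epoch()
        acc = val_accuracy()
        if acc > best:
            best, since_best = acc, 0   # improvement: reset patience
        else:
            since_best += 1
            if since_best >= patience:  # halt before overfitting worsens
                break
    return best
```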

4.2.1 Metrics

Two different metrics were used for testing of the LLM. First, we used an accuracy metric,

which is simply calculated as the number of correct predictions divided by the total number of

predictions. We also used a multiclass Area Under the Curve (AUC) metric. The AUC is the area under the ROC curve for a given classifier, which plots the true positive rate against the false positive rate for different prediction cutoffs. For more information on this metric, see https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5 [8]. Our multiclass

metric is calculated by finding an AUC for each output class in a one vs. all format, where we treat

all other predictions as a single event type. We then calculate the weighted sum of the AUCs for

each event type, weighted by the number of occurrences of that event.
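The weighted one-vs-all AUC just described can be sketched as follows. As a simplification, the sketch computes each binary AUC directly from the rank statistic (the probability that a random positive is scored above a random negative), which is equivalent to the area under the traced ROC curve.

```python
import numpy as np

def binary_auc(scores, positives):
    """One-vs-all AUC for a single class: the probability that a random
    positive outscores a random negative, counting ties as 1/2."""
    pos, neg = scores[positives], scores[~positives]
    if len(pos) == 0 or len(neg) == 0:
        return 0.5
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def multiclass_auc(probs, labels):
    """Weighted one-vs-all AUC: `probs` is (n_events, n_classes) network
    output, `labels` the true class of each event; each class's AUC is
    weighted by how often the class occurs."""
    total, weight = 0.0, 0
    for c in range(probs.shape[1]):
        mask = labels == c
        n_c = int(mask.sum())
        if n_c:
            total += n_c * binary_auc(probs[:, c], mask)
            weight += n_c
    return total / weight
```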



4.3 Results

4.3.1 Synthetic

In our first stage of testing, we ran many preliminary tests on different variations of the accumulator dataset. This enabled us not only to debug many aspects of the code we were running, but also provided a quick testing outlet for determining potential strengths and weaknesses of

the LLM model. The initial results are displayed in Figures 4.4 and 4.5. It should be immediately

apparent that the LLM underperforms on these datasets in comparison, and has a much larger

confidence interval. In a naive sense, this might prevent us from declaring one network as better

than the other with any confidence. However, when we pair up our trials of LLM and LSTM, we

can consider the difference in metric (Accuracy or AUC) a random variable. To pair up the testing,

in this case, means we keep consistent training, testing, and validation datasets for a single trial

of LLM and LSTM via random seeding. We can then use these differences to test the hypothesis

that the better performing network is, in fact, better. In this case we represent the difference as a

t random variable, and calculate the associated p-value, which is represented at the bottom of our

Figures. A value of p < 0.05 represents a greater than 95% confidence that the higher performing

network is actually better suited to the task. In this case, the LSTM tends to perform better with

high confidence. While it is not shocking that LSTM performs well, as it has historically been

excellent at processing sequences, it is a bit disappointing that the LLM underperforms. It seems that the structure of the LLM network makes it more likely to reach a local minimum far from the global one. We see a much higher variance in our data for the LLM, and especially see certain visible

outliers that halted much earlier than others in training.
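The paired comparison described above can be sketched with scipy; the metric values in the example are illustrative, not our actual trial results.

```python
import numpy as np
from scipy import stats

def paired_p_value(metric_a, metric_b):
    """Two-sided paired t-test on per-trial metric differences.
    Entry i of each array comes from the same seeded dataset draw, so
    the difference isolates the effect of the architecture."""
    t_stat, p = stats.ttest_rel(metric_a, metric_b)
    return t_stat, p
```

A p-value below 0.05 is then read as greater than 95% confidence that the higher-scoring network is genuinely better suited to the task.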

For the accumulator dataset, we next looked into predicting the next event that would occur

in an accumulator sequence. This should be a simple task if the network is able to recognize the

structure of the sequence as two Poisson processes. This is due to the fact that Poisson processes

are memoryless, so at any point in time, the predictor should simply predict the process with

[Figure: box plots of LLM vs LSTM accuracy on the Accumulator dataset for Lambda = 1/16, 1/4, 1, 4, 16; paired p = 3.39e-05, 7.85e-04, 4.52e-02, 1.55e-01, 8.48e-03]

Figure 4.4: Accuracy of Network on Different Decay Rates

Accuracy Results for LLM and LSTM for different Accumulator Decay Rates

[Figure: box plots of LLM vs LSTM AUC on the Accumulator dataset for Lambda = 1/16, 1/4, 1, 4, 16; paired p = 1.77e-03, 2.22e-04, 5.34e-02, 3.07e-03, 6.75e-03]

Figure 4.5: Area Under Curve of Network on Different Decay Rates

AUC Results for LLM and LSTM for different Accumulator Decay Rates

Predictor Accuracy
LLM 0.7388
LSTM 0.7388
Ideal 0.739

Table 4.2: Accuracy of Network Predicting Poisson Processes Compared to Ideal Predictor

the shorter timescale, which is most likely to be the most common event up to that point. We ran

several trials on this task, and present the results in Table 4.2.

It seems that this task is simple enough that both of the networks are able to learn this ideal

predictor fully.
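The ideal memoryless predictor just described (at each step, guess the event type that has been most common so far) can be sketched as:

```python
from collections import Counter

def memoryless_accuracy(events):
    """Accuracy of predicting, at every step, the most common event
    type seen so far; this is the optimal strategy when the sequence
    mixes memoryless Poisson processes of different rates."""
    counts = Counter()
    correct = 0
    for i, e in enumerate(events):
        if i > 0:   # no prediction is made for the very first event
            guess = counts.most_common(1)[0][0]
            correct += guess == e
        counts[e] += 1
    return correct / max(len(events) - 1, 1)
```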

Next we looked into the rhythm dataset. We first ran experiments on the classification task de-

scribed in 4.1.1.2. We restricted the length of the sequences to 10, 30, and 100 for different trials

to explore how more information affected the networks’ ability to classify. Hypothetically, an ideal

classifier would classify the sequence as a “1” up until an anomaly is detected. The generation

process does leave the possibility of an “anomalous” sequence not registering as such. This is due

to the fact that an event with an altered lag time might not appear in the sequence. Regardless,

we present the results of our experiment in Figures 4.6 and 4.7.

Interestingly, the LLM performs significantly better for the shorter sequences, but there is a drop in performance on the longest sequences. This shouldn’t happen, as a longer sequence should only provide a greater ability to classify. In this case, one must assume that the network is either

forgetting its classification (the presence of an anomaly), or it is retaining too much information and, unable to forget quickly enough, is overwhelmed by it. We ran

another test on the data, this time with a wider net of timescales, in order to attempt to discover

the cause of this discrepancy.


[Figure: box plots of LLM vs LSTM accuracy on the Rhythm dataset for sequence lengths 10, 30, 100; paired p = 2.15e-05, 4.59e-03, 2.76e-04]

Figure 4.6: Accuracy of Network on Rhythm Dataset

Accuracy Results for LLM and LSTM with different sequence lengths

[Figure: box plots of LLM vs LSTM AUC on the Rhythm dataset for sequence lengths 10, 30, 100; paired p = 3.01e-05, 1.25e-05, 3.32e-06]

Figure 4.7: Area Under Curve of Network on Rhythm Datasets

AUC Results for LLM and LSTM with different sequence lengths

[Figure: box plots of LLM vs LSTM accuracy on the Rhythm dataset with a larger timescale range for sequence lengths 10, 30, 100; paired p = 3.40e-02, 6.48e-02, 8.88e-03]

Figure 4.8: Accuracy of Network on Rhythm Dataset with larger timescale range

Figure 4.8 shows the results from this experiment. It is apparent that something else is going on to cause the drop in performance, which warrants further investigation in later work.

We posit that it is related to the tendency of the network to appear to hit local minima, as shown

by the larger variances of the LLM on the length 100 sequences. Perhaps longer sequences provide more local minima for the training to get stuck in. Despite this, on sequences that both networks

did “well” on, the LLM performed better at this classification task.

Finally, we tested on our last synthetic dataset, the Hawkes dataset. For this dataset, the net-

works should be attempting to mimic the intensity functions of the various Hawkes processes. The

process with the highest intensity at a given time step should be the most likely event to occur.

Using this, we built out an estimator for the average performance of a predictor.
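The highest-intensity predictor can be sketched as follows, assuming a standard multivariate Hawkes intensity with exponential kernels; the base rates, excitation matrix, and decay in the example are illustrative, not the parameters that generated our dataset.

```python
import numpy as np

def intensities(t, history, mu, alpha, delta):
    """Hawkes intensity of every event type at time t:
    lambda_k(t) = mu_k + sum over past events (t_i, k_i) of
    alpha[k, k_i] * exp(-delta * (t - t_i))."""
    lam = mu.astype(float).copy()
    for t_i, k_i in history:
        if t_i < t:
            lam += alpha[:, k_i] * np.exp(-delta * (t - t_i))
    return lam

def predict_next(t, history, mu, alpha, delta):
    """Guess the event type whose intensity is highest at time t."""
    return int(np.argmax(intensities(t, history, mu, alpha, delta)))
```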

Figure 4.9 shows the accuracy both networks achieve on the Hawkes dataset. It also includes

how a predictor would fare if it had knowledge of the intensity functions of each event type, or

an average “ideal” predictor. Both networks perform similarly well on this task, performing almost identically to the “ideal” predictor. This indicates that the task was too “easy”

to give meaningful insight into the difference in network performance.

4.3.2 Natural

The natural datasets were first experimented on with a consistent number of hidden units

(100). In Figures 4.10 and 4.11 , we display the accuracy results and AUC results over several of

the datasets.

The plots displayed in this section are box plots, showing the mean and 1st and 3rd quartiles.

Once again, the spreads are frequently too large to confidently declare one network as performing better.

In the figures, we list the p-value of the hypothesis that the better performing network is better

suited for the task under the given metric. There do seem to be a few tasks on which LLM
[Figure: LLM, LSTM, and average highest-intensity predictor accuracy on Hawkes sequences of length 10, 30, 100; paired p = 3.07e-01, 6.15e-02, 2.85e-05]

Figure 4.9: Accuracy of Network on Hawkes Prediction

Accuracy results for LLM (Green) and LSTM (Blue) with 100 hidden units compared to the highest intensity (ideal) predictor (Red).

[Figure: box plots of LLM vs LSTM accuracy on Github, Dota, Dota Class, Freecodecamp, Reddit Thread, Reddit Comments, and Quizlet; paired p = 1.66e-01, 3.11e-02, 6.28e-08, 1.60e-06, 2.56e-06, 3.42e-10, 8.51e-11]

Figure 4.10: Accuracy of Network on Natural Datasets

Accuracy Results for LLM and LSTM with 100 hidden units

[Figure: box plots of LLM vs LSTM AUC on Github, Dota, Dota Class, Freecodecamp, Reddit Thread, Reddit Comments, and Quizlet; paired p = 8.40e-04, 1.28e-04, 2.72e-01, 9.88e-03, 4.74e-03, 4.94e-03, 1.35e-11]

Figure 4.11: Area Under Curve of Network on Natural Datasets

AUC Results for LLM and LSTM with 100 hidden units
Table 4.3: Number of trainable weights in each network for different numbers of hidden cells

Number of Hidden Cells Number of Weights for LLM Number of Weights for LSTM
50 67,110 14,050
100 254,210 47,950
200 988,410 175,750
400 3,896,810 671,350

outperforms LSTM, as well as tasks that show the opposite. But even if the two networks have the same number of cells, they do not necessarily have the same number of weights to learn. This allows a lot more variability within the network, which could artificially inflate the LLM’s ability to learn tasks. The number of trainable weights in each network for different numbers of hidden cells can be seen in Table 4.3.

This discrepancy doesn’t address the possibility that, even with the same number of weights to learn, one network might be better suited to a larger number of weights than the other. We address this by hypertuning the parameters. We train the networks

with several different numbers of hidden cells (the numbers addressed in Table 4.3). We then use

the accuracy on the validation set to determine which network size is ideal for the task. We use

the validation set since it is used to determine when the model has fully trained. We then reran

the paired tests with the optimal hyperparameters to calculate a new p value. The results of these

tests are shown in Figure 4.12.

There is one interesting result that we found due to an error in generating the validation set.

For a while, there was an artificial correlation between the training and validation sets that didn’t

exist with the test set. This caused the networks to naturally overtrain until the training set’s loss was at a minimum. What may seem surprising is the contrast between the LLM and LSTM in terms of difference in performance. To explore this, we ran a set of trials with and without

validation halting to show this difference in performance. These results can be seen in Figures 4.13 and 4.14.

[Figure: box plots of LLM vs LSTM accuracy with hypertuned hidden units (listed as LLM, LSTM). Github (200, 200), p=1.13e-01; Dota (50, 50), p=1.38e-02; Dota Class (200, 50), p=9.04e-04; Freecodecamp (100, 200), p=1.49e-04; Reddit Thread (200, 200), p=4.26e-10; Reddit Comments (100, 100), p=1.58e-09; Quizlet (50, 50), p=5.43e-04]

Figure 4.12: Best Accuracy of Network on Natural Datasets

Accuracy Results for LLM and LSTM with hypertuned hidden units

What should be immediately apparent is that the LSTM overfits to the training set (in some cases

much more noticeably than others). This occurs to the detriment of the testing accuracy. Meanwhile, the LLM faces no penalty for the lack of validation. In fact, it appears that it even receives

a slight boost to both the training and testing accuracy.

It makes sense that the LLM would be less prone to overfit, since it imposes a structure on the time data and an ability to forget that the LSTM does not have. It naturally imposes a form of

regularization. We also take the fact that LLM performs similarly with and without validation as

evidence that the memory of the sequence does in fact mirror the memory of the network.

4.3.3 Discussion of Results

Before we begin a full discussion of our results, we present Table 4.4, denoting the accuracy

of three naive prediction tactics. First we examine simply guessing the previous event. Next, we

try guessing the most common event in the sequence. Finally, we guess that the next event will

be the event that sequentially followed it before, i.e., if event A was initially followed by event B, predict that B will follow A again. For the classification and predicting correctness tasks, the only naive prediction we put forward is the most common classification or binary label. We note here that the simple approaches brought to the

synthetic data were representative of an ideal estimator, but that this is not the case for the nat-

ural datasets. The natural data should contain much more complicated relationships, and if the

networks are learning them, they should outperform the naive approaches.
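The three naive tactics can be sketched as follows (the helper names are our own):

```python
from collections import Counter

def guess_previous_acc(events):
    """Predict that each event simply repeats the previous one."""
    hits = sum(a == b for a, b in zip(events, events[1:]))
    return hits / max(len(events) - 1, 1)

def guess_most_common_acc(events):
    """Predict the single most common event type at every step."""
    mode = Counter(events).most_common(1)[0][0]
    return sum(e == mode for e in events) / len(events)

def increment_acc(events):
    """After event a, predict the event that first followed a earlier
    in the sequence (the "increment" tactic)."""
    follower, hits = {}, 0
    for a, b in zip(events, events[1:]):
        hits += follower.get(a) == b
        follower.setdefault(a, b)   # remember the first successor of a
    return hits / max(len(events) - 1, 1)
```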

Given these tables, we can conclude that neither network learned anything meaningful about the

Github, Dota, or Quizlet datasets. Both the LLM and LSTM failed to learn anything about these

datasets beyond naive statistics (less than a relative improvement of 10% over the best naive ap-
[Figure: training and testing accuracy of the LLM, with and without validation halting, on Github, Dota, Dota Class, Reddit Thread, and Reddit Comments]

Figure 4.13: Accuracy of LLM with overfitting prevented by validation halting, and without validation to prevent overfitting

[Figure: training and testing accuracy of the LSTM, with and without validation halting, on Github, Dota, Dota Class, Reddit Thread, and Reddit Comments]

Figure 4.14: Accuracy of LSTM with overfitting prevented by validation halting, and without validation to prevent overfitting

Dataset          Guess Previous  Guess Most  Increment  Best LLM  Best LSTM  Rel. Diff. in      Rel. Diff. in
                                 Common                                      Error Rate (LLM)   Error Rate (LSTM)
Github           0.791           0.519       0.084      0.788     0.787      -1.4%              -1.9%
Dota             0.406           0.105       0.052      0.418     0.420       2.0%               2.4%
Freecodecamp     0.013           0.015       0.810      0.914     0.880      54.7%              36.8%
Reddit Thread    0.376           0.268       0.171      0.491     0.509      18.4%              21.3%
Reddit Comments  0.134           0.153       0.511      0.588     0.606      15.7%              19.4%
Dota Class       ***             0.513       ***        0.671     0.637      32.4%              25.6%
Quizlet          ***             0.766       ***        0.775     0.780       3.8%               6.0%

Table 4.4: Accuracy of various naive approaches, the best LLM and LSTM results, and the relative improvement in error rate over the best naive approach

(*** indicates that the approach is not applicable to the task associated with the dataset)

proach). Interestingly, these datasets were the ones that the networks performed most similarly on

(less than a total difference of 1% in accuracy from each other). Meanwhile, the LLM performed

better on the classification dataset, as well as the Freecodecamp dataset. Interestingly, these were

two of the three datasets in our tests that had consistently labelled events across sequences, versus

the reddit datasets, which had subreddits labelled in order.

We once again feel that it is important to note the implications of Figures 4.13 and 4.14. The

ability of the LLM to self-regularize seems to be indicative of a greater structure to the data. The

LSTM is able to mimic the structure, but when given the opportunity to overfit to the training

data, it seems to do so at the expense of the testing set.


Chapter 5

Conclusion

Overall, it appears that the LLM performs better than the LSTM on certain datasets. It

tended to do very well on complex classification tasks in the datasets we experimented on. It

seemed like one of the networks might be better at predicting a single user’s behavior in the nat-

ural datasets, but both of the networks had one “successful” dataset in predicting a single user’s

behavior (Freecodecamp and Reddit Threads) vs. multiple users’ behavior (Dota Class and Reddit

Comments). It seems that the LLM doesn’t always capture the underlying structure of the data,

as discussed in 4.3.3. If nothing else, it does seem to be a good tool for analyzing the structure

of the dataset itself. In other words, it helps one to understand if the sequence appears to fit our

model of decaying memory.

The explorations we performed to generate Figures 4.13 and 4.14 were perhaps the most exciting

results we came across. The bias we imposed on the structure of the network prevented overfitting

to the dataset. This seems to imply that the proposed structure is in fact representative of the

underlying structure of data.

One of the major concerns regarding the structure is the tendency to underfit to the data. A

significant portion of the time, the network halts training at a local minimum. This did seem to be reduced when we halted only on the training accuracy, but nevertheless was concerning in terms of viability as a model. The LLM also tended to take longer to train due to the larger number of

weights that gradients needed to be calculated for.

Following the results of our trials, it seems that the LLM would be an excellent addition to a

toolbox of networks for use with event sequences. We’d love to say that the LLM is always the bet-

ter choice for event sequence data, but there are certainly cases where the LSTM performs better.

In future work, we would hope to find more applications and datasets to test the networks on.

This would allow us to further determine the strengths of the network. It could also be interesting

to explore training the networks with a more powerful machine, allowing for longer sequences and

larger potential network sizes. Finally, more experimentation into reduction of the variance of the

LLM’s performance could be extremely beneficial to showing the viability of the network as a model

for event sequences.


Bibliography

[1] E. Choi, M. T. Bahadori, A. Schuetz, W. F. Stewart, and J. Sun, Doctor AI: Predicting Clinical Events via Recurrent Neural Networks, JMLR, (2016).

[2] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, 2014.

[3] Devinanzelmo, Dota 2 Matches. https://www.kaggle.com/devinanzelmo/dota-2-matches, 2017.

[4] A. G. Hawkes, Spectra of Some Self-Exciting and Mutually Exciting Point Processes, Biometrika, (1971).

[5] S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, (1997).

[6] D. P. Kingma and J. L. Ba, Adam: A Method for Stochastic Optimization, ICLR, (2015).

[7] M. C. Mozer, D. Kazakov, and R. V. Lindsey, Discrete-event, continuous-time recurrent nets. https://arxiv.org/pdf/1710.04110.pdf, 2017.

[8] S. Narkhede, Understanding AUC - ROC Curve. https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5, 2018.

[9] C. Olah, Understanding LSTM Networks. http://colah.github.io/posts/2015-08-Understanding-LSTMs/, 2015.

[10] Stuck in the Matrix, I have every publicly available Reddit comment for research. https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/, 2016.

[11] A. Thomas, Mapping Student Course Activity. https://amber.rbind.io/projects/2016/12/27/fcccourses/, 2016.

[12] S. Wills, GitHub data, ready for you to explore with BigQuery. https://github.blog/2017-01-19-github-data-ready-for-you-to-explore-with-bigquery/, 2017.
