multiresolution structure
by
Noam Finkelstein
Baltimore, Maryland
August, 2017
Abstract

In this work, we look to develop a method for handling time series data that can benefit from past work on wavelet decompositions and from recent architectures in the recurrent neural network (RNN) literature. By combining key elements from each tradition, we hope to better capture the multiresolution structure present in many such signals.
Past research has proven that wavelet decompositions are a useful prism through
which to work with time series data, especially with signals that display a multiresolu-
tion structure. Traditional wavelet transforms such as the discrete wavelet transform
(DWT) and the stationary wavelet transform (SWT), while very useful in many set-
tings, have properties that make them difficult to work with in a forecasting setting. We develop the sequential discrete wavelet transform (SDWT), which is closely related to the DWT and the SWT but is easier to work with for forecasting problems with non-aligned, arbitrary-length data. We then explore two different ways of applying neural networks to the SDWT to obtain forecasts that are aware of the multiresolution structure of the data.
We also note that there has been an increasing interest in certain kinds of model
interpretability. Towards that end, we develop a toolkit for visualizing the influence
of each past point in a timeseries on a forecast. We use this toolkit to explore some
We compare the WRNN to RNN models that are specifically designed with multiscale structure in mind. These comparisons are conducted on simulated data, as well as on heart rate data obtained from a publicly available database. We find that while the WRNN does not improve performance over existing models, its structure may be able to provide insight into the data that is not readily available from other models.
Acknowledgments
I would like to thank my thesis advisor Dr. Suchi Saria for her mentorship.
I would also like to thank my academic advisor and thesis reader Dr. John
Muschelli for his statistical insight, and his guidance and support throughout the
year.
Thank you to Peter Schulam for invaluable conversations throughout this project.
I would also like to thank Mary Joy Argo for her help throughout the year.
Contents

Abstract
Acknowledgments
1 Introduction and Related Work
2 Sequential Discrete Wavelet Transform
2.1 Motivation
2.2 Definition
2.3 Stability
3 Wavelet RNN
3.1 Single Network WRNN
3.2 Multi Network WRNN
4 Model Interpretability
5 Experiments and Results
6 Software
6.1 warp
6.2 dwtviz
6.3 signal generation
6.4 hmstlm
6.5 sdwt
6.6 wrnn
7 Conclusion and Next Steps
Bibliography
Vita
Chapter 1

Introduction and Related Work
In this work, we develop a wavelet and recurrent neural network based method
for time series forecasting, called Wavelet RNN (WRNN). We also develop a toolkit
for evaluating the importance of past points in the time series to forecasts.
This introduction will frame the problems we hope to address, explore related work, and introduce the building blocks of our method. Many time series, particularly in the medical domain, display multiresolution structure.
For example, heart rate data may have minute-scale structure that is influenced by treatment effects, day-scale structure following the circadian rhythm, and longer term structure. We hope to develop a model that can take advantage of this underlying multiresolution structure to enhance its predictions. We also believe that introspecting into such a model can provide insight into the structure of the data itself. We hope to apply this method more broadly to data obtained in the clinical medicine setting.
At its most basic level, a wavelet is any wave-like function with finite support and mean zero. This differentiates it from something like a sine wave, which has infinite support. Commonly used wavelets include the Haar wavelet [1] and the Daubechies wavelets.
The discrete wavelet transform (DWT) is a representation of any signal that can capture both the frequencies present and the time at which they occur [3]. For this reason, it is often better suited than the Fourier transform to signals whose frequency content changes over time.

The DWT is often discussed in terms of a mother wavelet ψ, which is the wavelet function, and some scaling function. The choice of which ψ to use depends on the
nature of the signal - different wavelet functions are adept at capturing different kinds
of structure in the signal through a DWT. To conduct the discrete wavelet transform,
the mother wavelet is then scaled by powers of two. For each scale, it is shifted so as
to cover the signal. The resulting shifted and scaled functions are called the daughter
wavelets. For some scale j and shift k, the daughter wavelet is given by:
$$\psi_{j,k}(t) = \frac{1}{\sqrt{2^j}}\,\psi\left(\frac{t - k2^j}{2^j}\right) \qquad (1.1)$$
In other words, the daughter wavelets are shifted, scaled versions of the mother
wavelet. These daughter wavelets are then convolved with the signal to produce the
detail coefficients of the discrete wavelet transform. Mallat [3] goes into great detail
about the theoretical underpinnings of the DWT and other wavelet-based techniques.
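As a concrete illustration of equation (1.1), the following sketch evaluates a Haar daughter wavelet at a chosen scale and shift. The haar and daughter functions here are our own illustrative constructions, not part of any wavelet library.

```python
import numpy as np

def haar(t):
    """Haar mother wavelet: 1 on [0, 0.5), -1 on [0.5, 1), 0 elsewhere."""
    return np.where((t >= 0) & (t < 0.5), 1.0,
                    np.where((t >= 0.5) & (t < 1.0), -1.0, 0.0))

def daughter(psi, j, k, t):
    """Daughter wavelet from equation (1.1):
    psi_{j,k}(t) = 2^(-j/2) * psi((t - k * 2^j) / 2^j)."""
    return psi((t - k * 2 ** j) / 2 ** j) / np.sqrt(2 ** j)

t = np.linspace(0, 8, 1000)
# The j = 1, k = 2 Haar daughter is supported on [4, 6) and scaled by 1/sqrt(2).
values = daughter(haar, 1, 2, t)
```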
Figure 1.1 shows the structure of coefficients that result from a DWT. This struc-
ture will be useful to keep in mind, as it will form the basic building block of the
SDWT.
In practice, the DWT is computed by passing the signal through a low pass and a high pass filter that form a Quadrature Mirror
Filter pair [4]. The results of passing the signal through the high pass filter, and then
down sampling, are called the detail coefficients. The results of passing the signal
through the low pass filter and then down sampling are called the approximation
coefficients. Each of the approximation coefficients and the detail coefficients will
Figure 1.1: This figure shows the structure of a level 3 DWT of a length 8 signal.
When the DWT is considered as a tree, as in Crouse et al. [5], each approximation
coefficient has a single child - the top level detail coefficient - and each detail coefficient
has two children. In this figure, each circle is a DWT coefficient, and each line shows a
parent-child relationship. If the DWT of a length 8 signal was only carried out to level
2, there would be 2 trees, each with a depth of 3, instead of one tree with a depth of 4.
In this work, we will always execute the DWT to the maximum possible level.
have half as many elements as the original signal. This procedure can be repeated on the approximation coefficients to produce each subsequent level of the transform.

The stationary wavelet transform (SWT) is closely related to the DWT. The only difference is that instead of down sampling the signal to match
the scaled filter at higher levels of the transform, we dilate the filter to match the
signal length. For this reason, the SWT is also sometimes called the undecimated
wavelet transform.
Once the filters are dilated past the length of the signal, it becomes necessary to wrap the start of the signal around to the end to make up for this difference. Note that this wrapping implicitly treats the signal as periodic.
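The difference in coefficient structure between the two transforms is easy to see with the PyWavelets library. PyWavelets is not used in this work (the packages in chapter 6 provide our own implementations), so the sketch below is purely illustrative.

```python
import numpy as np
import pywt  # PyWavelets, used here only for illustration

x = np.array([1., 2., 3., 4., 3., 2., 1., 0.])

# Level 3 DWT: the coefficient arrays halve in length at each level.
cA3, cD3, cD2, cD1 = pywt.wavedec(x, 'haar', level=3)
print(len(cA3), len(cD3), len(cD2), len(cD1))   # 1 1 2 4

# Level 3 SWT: no down sampling, so every band is as long as the signal.
swt_bands = pywt.swt(x, 'haar', level=3)
print([len(cD) for _, cD in swt_bands])          # [8, 8, 8]
```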
In chapter 2 we describe the shortfalls of both the DWT and the SWT for our purposes, and develop the sequential discrete wavelet transform (SDWT), a closely related transform that is easier to work with in forecasting problems.
All images in this section were generated with the dwtviz library, which is discussed
in section 6.2.
The discrete wavelet transform has been used in a wide range of signal processing tasks including functional regression [6], clustering tasks [7], and signal denoising [5]. It is especially well suited to signals that display structure at multiple resolutions.
Johnstone and Silverman [8] and others demonstrate that the discrete wavelet
transform on a wide class of signals has certain desirable properties. Crouse et al. [5]
noted that many of these properties were not adequately exploited by wavelet-based statistical models. One such property is that the DWT of many signals is sparse. This sparsity is essential in the use of the DWT in compression algorithms such as JPEG2000 [?]. In the modeling setting, sparsity allowed Crouse et al. to model coefficients with a spike-and-slab prior. Because of the sparsity of the DWT, this proved to be a better practice than using a simple Gaussian prior, which was standard at the time. Another such property is the persistence of coefficient magnitudes across scales: the children of a large coefficient in the DWT tree are themselves likely to be non-zero.

Crouse et al. [5] capture this structure by introducing the idea of Markov trees, in which the distribution of each coefficient depends on the state of its parent in the DWT tree. Ma et al. [9] exploit this structure by expanding Markov trees to a functional
ANOVA setting.
Artificial neural networks are a class of models initially inspired by the workings of the brain. The inputs to the network are connected to some number of neurons, each of which is put through some generally non-linear activation
function. Fitting a neural network to data involves learning the optimal weights for
the connections between the inputs and neurons, neurons and other neurons, and
neurons and the outputs. If no non-linear functions are used, neural networks can
only represent linear combinations of the inputs, so fitting them becomes a form of
linear regression.
The neurons between the input and output are aligned into 'hidden layers'. A network in which each node is connected to every node in the following layer is called a fully connected network.
Simple neural networks, as described above, will only work with fixed-length inputs. This makes them unsuitable for any domain in which we are interested in dealing
with signals that may have variable length, like most time series signals in the medical
domain. Moreover, like the wavelet based methods described above, a simple neural
network would need the signals on which it operated to be time aligned. This is be-
cause its treatment of each element in the input depends exclusively on that element's
position in the signal, so the significance of that position must be constant across the
data set.
Figure 1.4: This figure illustrates a simple fully connected neural network with two
hidden layers, and input and output sizes of 2. In this model, each circle represents a
neuron and each line represents a weight. Given an input vector of size 2, the model
finds the value of each neuron in the first hidden layer by applying an activation function
to the weighted sum of the values in the input vector. It finds the values of the neurons
in the second hidden layer by applying an activation function to the weighted sum of
the values in the first hidden layer. Finally, it produces an output by applying the same procedure to the values in the second hidden layer.
To extend a similar idea to sequences of arbitrary length that are not aligned in time, we turn to recurrent neural networks (RNNs).
The basic idea of a recurrent neural network is that the network maintains some
state that summarizes the relevant information in the sequence seen up to the current
time. At each step, it accepts both the state from the previous timestep and the
input for that step. These two are fed into a neural network that maintains the same
architecture between steps. This neural network then produces the new state as well as the output for that timestep.
In the simplest case, the neural network at each step has a single hidden layer.
The hidden state and the input at some time t are concatenated and treated as a
single input vector x, and the output from the network is split into the next network state and the prediction for that timestep.
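The following is a minimal sketch of this simple RNN step; the weight shapes and the rnn_step function are our own illustrative choices.

```python
import numpy as np

def rnn_step(W, b, h, x):
    """One step of the simple RNN described above: the previous state h and
    the current input x are concatenated, passed through a tanh layer, and
    the output is split into the next state and the prediction."""
    v = np.concatenate([h, x])
    out = np.tanh(W @ v + b)
    return out[:len(h)], out[len(h):]   # next state, prediction

rng = np.random.default_rng(0)
state_size, input_size, output_size = 4, 1, 1
W = rng.normal(size=(state_size + output_size, state_size + input_size))
b = np.zeros(state_size + output_size)

h = np.zeros(state_size)
for x_t in [0.1, 0.5, -0.2]:            # a toy univariate sequence
    h, y = rnn_step(W, b, h, np.array([x_t]))
```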
RNNs are often used on symbolic data. For example, they are used in natural
language processing for tasks such as language modeling. In these contexts, the idea is to predict the next symbol in the sequence given the symbols seen so far.
Inference on the weights of a neural network is generally done by reducing the loss
function of interest on a data set. Most optimization methods applied to this problem
- including backpropagation [10], which is the most common - involve calculating the
gradients of the loss with respect to each weight. Activation functions found to
be effective in neural networks, like the hyperbolic tangent function and the sigmoid
function, have gradients with absolute value less than 1. The gradient of the loss with
respect to the weights of a network before the final hidden layer must be calculated by
means of the chain rule. This means that calculating this gradient involves multiplying many of these small gradient terms together. Therefore, as the number of layers increases, these gradients must go to zero, i.e. vanish. This is known as the vanishing gradient problem [11].
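A two-line calculation makes this effect concrete: each tanh layer contributes a derivative factor of at most 1, and chaining those factors shrinks the gradient geometrically.

```python
import numpy as np

# The derivative of tanh is 1 - tanh(z)^2, which is at most 1 and usually
# well below it. Chaining n such factors shrinks the gradient geometrically.
factor = 1 - np.tanh(1.0) ** 2          # about 0.42
for n in (5, 10, 20):
    print(n, factor ** n)                # roughly 1.3e-2, 1.7e-4, 2.9e-8
```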
Recurrent neural networks are often trained using an algorithm called Backprop-
agation Through Time [12]. Instead of taking the gradient by applying the chain
rule through hidden layers, we have to incorporate gradients from many timesteps
to optimize the weights of the same recurring network. The effect of the vanishing gradient problem in this setting is that instead of losing information about how to optimize early layers of the network, we lose information about how to incorporate the distant past into our predictions.
Much work has been done on overcoming the vanishing gradient problem in the
RNN setting. Notably, the long short term memory (LSTM) network [13] approach
has seen widespread adoption. The intuition behind the LSTM is that it has separate
mechanisms for keeping track of the effects of the immediate past and the long-term
past. This allows the network to learn about the importance of distant time points despite the vanishing gradient.
More recent approaches have tended to focus on the notion of multiscale structure. In other words, they seek to allow more recent information to have an immediate effect on predictions, while information from the more distant past is summarized at coarser time scales.
Koutnik et al. [14] introduce the Clockwork RNN (CWRNN). This RNN operates
very similarly to the simple RNN described above, but the hidden layer is partitioned
into some number of modules, each of which is updated according to its own clock
speed. As such, the network is forced to use information that has not been updated in many timesteps when making predictions.
Chung et al. [15] develop the hierarchical and multiscale long short term memory
(HMLSTM) network. This network operates much like a stacked LSTM, with the
addition of a boundary detection neuron that learns to fire when there is a boundary
at that level of the underlying data. The authors apply this network to text data,
and find that at the highest resolution it detects word boundaries, and at the second
layer it detects plausible phrase boundaries. This demonstrates that the HMLSTM is able to detect boundaries at multiple time scales. It follows that the architecture develops some understanding of the multiscale structure of the data.
Van den Oord et al. [16] use a partially connected convolutional neural network.
Instead of using a RNN to handle sequential data, a Wavenet model with J hidden
layers uses the most recent 2J points in the sequence by dilating the receptive window
between layers. There are strong resonances between the Wavenet architecture and the structure of the wavelet transforms described above.
Like these methods, the WRNN, introduced in section 1.1.5, finds a way to structurally
encourage the use of information from the more distant past in making predictions.
It is useful at the outset to understand exactly what we mean when we talk about multiresolution structure.
In the wavelet literature, the term ’multiresolution’ is used to refer to the fact
that each daughter wavelet captures unique information. The information captured
at each scale represents a single resolution, such that the DWT is referred to as a multiresolution analysis. In the recurrent neural network literature, the term 'multiscale' is more common [15], [14]. This tends to simply mean that there is relevant information at variably sized time scales.
As will become evident, we make use of both concepts in our method, through
our use of wavelet transforms and recurrent neural networks. In practice, these ideas are closely related, and we will therefore use the term 'multiresolution' to refer to the general idea captured by both terms.
Our wavelet RNN method combines work in the tradition of wavelets and neural networks. We first develop the sequential discrete wavelet transform (SDWT), a variation on the discrete wavelet transform (DWT) that permits handling variable-
length sequences. We train the Wavelet RNN to predict the next coefficient at each
layer of the SDWT. We then take the inverse SDWT (ISDWT) to acquire the one-step-ahead prediction in the time domain.
We were inspired in part by Stéphane Mallat's work in [18]. He argues that, for many tasks, a scattering transform performed on image samples provides a better representation of the data than the raw image. The scattering transform is stable to small rotations in groups of pixels, in the sense that small rotations lead to small changes in the results of the scattering transform. Bruna et al. [19] then demonstrate that this representation is effective in image classification tasks.
Similarly, the WRNN takes advantage of the SDWT representation to bias the model towards learning the multiresolution structure relevant to each time point in the signal. As described above, an RNN
maintains an internal state that summarizes the relevant features of the signal it
has seen up to the current time point. By feeding the SDWT coefficients into an RNN, we encourage this state to capture structure at every resolution.
As compared to a classic RNN or LSTM [13], the WRNN does better at over-
coming the vanishing gradient problem [11]. The main reason for this is that the low resolution coefficients, on average, change more slowly and are influenced by more of the time signal than the high resolution detail coefficients. By operating directly on the SDWT, the WRNN is able to learn patterns, or motifs [20], that depend on activity at different resolutions in the history of the time series.
We were inspired by Ribeiro et al. [21], who develop a method that can be applied
to any classifier to detect which features of a sample are important to the classifier’s
prediction of its class. In a similar vein, Selvaraju et al. [22] develop a method that
can be applied to any convolutional neural network to provide visual explanations for
classification decisions.
Towards that end, we built a tool that provides a post-hoc explanation [23] for the model's prediction. It does this by calculating the gradient of the prediction with respect to each previous time point in the signal. We call this gradient the influence of that time point. Influence plots allow us to see which historical points the model "looks at" when it makes its predictions, and how strongly.
Chapter 2

Sequential Discrete Wavelet Transform
2.1 Motivation
After exploring a number of possible representations for time series data, we
adapted the discrete wavelet transform. As discussed above, the properties of the
discrete wavelet transform have been well studied, and it is frequently used in the statistics and signal processing literatures.
Despite the convenience of the DWT, it has one major issue that makes it un-
suitable for use in this case. It is not shift invariant, which means that an identical
feature in the time series will show up differently depending on where in the time
series it occurs. Figure 2.1 demonstrates this lack of shift invariance. This is not an
obstacle for dealing with time-aligned data, and the DWT has proven to be very useful in functional regression for such data [6], [24]. However, we are interested in clinical time series, where data are rarely the same length, let alone time aligned in any meaningful sense.
Further, not all lengths of time series are equally convenient to work with using
DWTs. Time series that have a length that is a power of two tend to be easiest to
work with.
More recently, the merits of the stationary wavelet transform (SWT) have been
recognized in statistical settings [25]. The major benefit of the SWT over the DWT
is that it is shift invariant in the sense that if we consider the signal as a periodic
function, with period equal to the length of the observed signal, the coefficients cal-
culated by the SWT will be shifted versions of each other regardless of where in the
period the observed signal begins. This shift invariance is illustrated in figure 2.4. This
property makes the SWT an appealing choice for any task that deals with data that
is inherently periodic.
The SWT also offers a richer representation of the signal. To see how this might
manifest itself, we will consider the Haar wavelet. In wavelet transforms using the
Haar wavelet, the highest-resolution detail coefficients are proportional to the resid-
ual difference between two adjacent time points - that is, the difference not already accounted for by coefficients at lower resolutions.
The SDWT, defined below, gives us a convenient way to think about prediction, which will allow us to bring the multiresolution power of the wavelet transform to bear on forecasting problems.
2.2 Definition
For the purposes of this work, we develop the sequential discrete wavelet transform
(SDWT). To conduct a level J SDWT of a signal with length T, we take the level J DWT of X_{t:t+2^J} for t = 1, ..., T − 2^J + 1. Because of the nature of the DWT, many
coefficients in the DWTs will be repeated across DWTs. This is illustrated in figure
2.3.
The level J SDWT will have T − 2^J + 1 approximation coefficients, and the same number of detail coefficients at level J. Then level $j$ of the transform will have $T - \sum_{i=0}^{j-1} 2^i = T - 2^j + 1$ detail coefficients.
Figure 2.3 shows how the SDWT is constructed. Conceptually, we take the DWT starting at each timestep from t = 1 to t = T − 2^J + 1, and keep the coefficients from the first DWT in which they appear.
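The construction can be written down directly from this definition. The sketch below uses PyWavelets and the Haar wavelet, for which the level-j coefficient i of the window starting at t depends only on the block x[t + i·2^j : t + (i+1)·2^j], so a (level, block start) pair identifies each coefficient. The sdwt package described in section 6.5 is the reference implementation; everything here is only an illustration of the definition.

```python
import numpy as np
import pywt

def sdwt(x, J, wavelet='haar'):
    """Sequential DWT: take the level-J DWT of each window x[t : t + 2^J]
    and keep each coefficient the first time it appears."""
    w = 2 ** J
    levels = [[] for _ in range(J + 1)]     # approximation level + J detail levels
    seen = set()
    for t in range(len(x) - w + 1):
        coeffs = pywt.wavedec(x[t:t + w], wavelet, level=J)
        # coeffs = [cA_J, cD_J, cD_{J-1}, ..., cD_1]; the block covered by a
        # coefficient has length 2^J at the top two levels, then 2^j below.
        blocks = [w] + [2 ** j for j in range(J, 0, -1)]
        for lvl, (c, b) in enumerate(zip(coeffs, blocks)):
            for i, value in enumerate(c):
                key = (lvl, t + i * b)       # (level, absolute block start)
                if key not in seen:
                    seen.add(key)
                    levels[lvl].append(value)
    return [np.array(lvl) for lvl in levels]

x = np.arange(12, dtype=float)
print([len(lvl) for lvl in sdwt(x, J=3)])    # [5, 5, 9, 11]: T - 2^j + 1 at level j
```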
The inverse SDWT (ISDWT) is performed by extracting the first DWT tree from
the coefficients acquired through the SDWT and inverting it [17] to obtain the first 2^J elements of the signal. The element at the following timestep is obtained by extracting
Figure 2.3: This graph demonstrates the relationship of the SDWT to the DWT. The
numbered circles are DWT coefficients, and the numbers represent the first timestep of
the signal for the DWT of which they were first calculated. Each tree in the figure is
a DWT, and the position of each tree marks the first DWT for which its root coefficient had to be calculated.
The nodes in the top layer are the approximation coefficients, the nodes in the second
to top layer are the level 3 detail coefficients, and so forth. Note that after t = 4, each
subsequent DWT is missing only a single coefficient at each layer. To predict X_{14}, for example, we would need to predict one additional coefficient at each level, and then take the inverse of the resulting DWT tree.
the second DWT tree, inverting it, and appending the last element to the existing
signal. This is repeated until there are no further coefficients in the SDWT.
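The round trip can be checked with PyWavelets: inverting the first tree recovers the first 2^J elements, and inverting each subsequent tree contributes one further element. This sketch builds each tree directly from the signal for simplicity; in the forecasting setting the trees are assembled from SDWT coefficients and predictions.

```python
import numpy as np
import pywt

x = np.arange(12, dtype=float)
J = 3

# Invert the first DWT tree to recover the first 2^J elements ...
signal = list(pywt.waverec(pywt.wavedec(x[:2 ** J], 'haar', level=J), 'haar'))
# ... then invert each subsequent tree and append its final element.
for t in range(1, len(x) - 2 ** J + 1):
    window = pywt.waverec(pywt.wavedec(x[t:t + 2 ** J], 'haar', level=J), 'haar')
    signal.append(window[-1])

assert np.allclose(signal, x)   # the round trip recovers the signal exactly
```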
Note that the SDWT is a subset of the coefficients in an SWT when the original
signal has a length that is a power of 2. Unlike the SWT, we stop taking DWTs once
we run out of timesteps in the original signal, so that there is no need to wrap the end
of the sequence back around to the front. Also unlike the SWT, it is always possible
to take the level J SDWT when the signal length is greater than 2^J.
The SDWT has several desirable properties. Much like the SWT, it is invariant
to some manner of shifts. The nature of the shift invariance of the SDWT is slightly
different from that of the SWT though. In particular, the SDWT is shift invariant in
the following sense. If some subset of the time series lasts for 2^j timesteps or more, the level j coefficients computed over that subset will be the same wherever in the series it occurs.
For comparison, figure 2.4 is the SDWT version of figures 2.1 and 2.2. It is worth
noting that the SDWT shift invariance property holds even for sequences that the
SWT and DWT cannot properly decompose, such as signals with an uneven number
of timesteps.
2.3 Stability
We are also interested in the SDWT's stability [17]. If a signal changes by a small amount, we would like its SDWT coefficients to change by a correspondingly small amount. If this relationship is not smooth, we will encounter many difficulties. First, small errors in
predicting the SDWT coefficients may result in massive mistakes in the time-domain
predictions. Second, we would need very powerful networks to learn anything in the
SDWT representation for even a simple class of functions in the time domain.
To verify that the SDWT is stable in the manner described, we run experiments
on signals simulated from several different generative processes. In each case, we map
the relationship between the sum of differences between points in the time domain and the sum of differences between the corresponding SDWT coefficients.

We observe that across different classes of signals, the relationship between these two quantities is well behaved, suggesting that the SDWT is stable in the sense described.
Figure 2.5 represents this relationship for a set of signals generated by the process
described in section 6.3, using a limited number of underlying ’motifs’ and minimal
noise.
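A check in this spirit can be sketched as follows, assuming the illustrative sdwt function from section 2.2; the generative process and the exact distances used in our experiments differ.

```python
import numpy as np

# Assumes the illustrative sdwt() from section 2.2. For increasingly large
# perturbations, compare the total change in the time domain with the total
# change in the SDWT coefficients.
rng = np.random.default_rng(1)
x = np.cumsum(rng.normal(size=64))              # a toy random-walk signal
for eps in (0.01, 0.1, 1.0):
    y = x + eps * rng.normal(size=64)
    dx = np.sum(np.abs(y - x))
    dc = sum(np.sum(np.abs(a - b))
             for a, b in zip(sdwt(y, 3), sdwt(x, 3)))
    print(eps, dx, dc)                          # dc should shrink along with dx
```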
Chapter 3
Wavelet RNN
Now that we have introduced the sequential discrete wavelet transform, we will use this transform in combination with a recurrent neural network to form the wavelet recurrent neural network (WRNN). To predict the next element of a signal by way of the SDWT, we simply need to predict one step ahead for the approximation coefficients and for the detail coefficients at each level.

The wavelet RNN simply uses RNNs to make these one-step-ahead predictions at each level of the SDWT; the resulting DWT tree is then inverted to produce the prediction in the time domain.
Below, we describe two variations on this basic method. In the first we use a
single RNN to predict the coefficients at every level simultaneously. In the second we
use multiple coordinated RNNs, one for each level of the SDWT.
3.1 Single Network WRNN

In this variation, a single RNN operates on the coefficients of the SDWT at every layer, padding the beginnings of the shorter layers out with zeros so that all layers have the same length. The input at each timestep is a J + 1 dimensional vector representing the
final coefficients at each layer of the SDWT of the time series up to some timestep t,
and the desired output is a J + 1 dimensional vector representing the final coefficients
of the SDWT of the time series taken up to timestep t + 1. The single layer WRNN
is trained against the mean squared loss between the actual output of the model and the desired output.
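A sketch of how the training pairs can be assembled, assuming the illustrative sdwt function from section 2.2 (recomputing the transform of each prefix is wasteful but keeps the sketch short):

```python
import numpy as np

def wrnn_io_pairs(x, J, sdwt_fn):
    """Build (input, target) pairs for the single network WRNN: at each step
    the input is the vector of J + 1 final coefficients of the SDWT of
    x[:t], and the target is the same vector for x[:t + 1]."""
    finals = []
    for t in range(2 ** J, len(x) + 1):
        levels = sdwt_fn(x[:t], J)
        finals.append([lvl[-1] for lvl in levels])   # last coefficient per level
    finals = np.array(finals)                        # shape: (steps, J + 1)
    return finals[:-1], finals[1:]                   # inputs and one-step targets
```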
In this variation, the one-step-ahead prediction of each level of the SDWT happens
simultaneously, and the prediction at any level of the SDWT is equally connected to
the preceding input at all levels. This means that we do not bias the model to exploit
the special relationship between parent and children nodes in the DWT that has been exploited in past work [5], [9]. Instead, the model is free to learn relationships between coefficients at any pair of levels, including the ones between parent and child. This flexibility seems to require more training data than the multi network model, described in the next section, but it also tends to be more stable to train.
3.2 Multi Network WRNN

A natural first approach to this variation is to model each layer of the SDWT independently with an RNN. Our experiments revealed that prediction under this technique suffers from a lack of alignment between layers. This misalignment leads to incoherent forecasts when the predicted coefficients are inverted back into the time domain. Additionally, this technique in no way takes advantage of the
highly structured relationship between parent and children coefficients exploited by [5]
and [9].
Instead, we adopt the following approach. We fit a simple LSTM to the approx-
imation coefficients. Then, we fit an LSTM with a 2-dimensional input for each of
the detail coefficients, where the prediction for the DWT parent coefficient in the
layer above, as well as the actual immediate predecessor in the current level, are
concatenated to form the input. This process is laid out in figure 3.1.
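The cascade can be summarized in a few lines. Each predictor below stands in for a trained LSTM; the multi_wrnn_step function and the stand-in callables are our own illustrative constructions.

```python
import numpy as np

def multi_wrnn_step(predictors, levels):
    """One step of the multi network WRNN cascade. predictors: one callable
    per SDWT level, approximation level first, each standing in for a trained
    LSTM. levels: the SDWT coefficient arrays, approximation level first."""
    # The approximation network sees only its own most recent coefficient
    # (plus, in the real model, its internal state).
    parent = predictors[0]([levels[0][-1]])
    preds = [parent]
    for predict, coeffs in zip(predictors[1:], levels[1:]):
        # Each detail network sees its own most recent coefficient together
        # with the prediction for its DWT parent in the level above.
        parent = predict([coeffs[-1], parent])
        preds.append(parent)
    return preds   # one predicted coefficient per level; invert to get x_{t+1}

# With trivial stand-ins (persistence at the top, zeros below):
preds = multi_wrnn_step(
    [lambda v: v[0]] + [lambda v: 0.0] * 3,
    [np.ones(5), np.ones(5), np.ones(9), np.ones(11)],
)
```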
The multi network WRNN tends to be difficult to train, and in some cases will
produce very noisy forecasts. However, it seems to perform better than the single
Because each network in the multi network WRNN is trained using the predictions
of the network trained on the layer of the SDWT above it, it is important to train the
networks from the top down, starting with the approximation coefficients and ending with the highest resolution detail coefficients.
We speculate that the multi network WRNN is prone to producing noisy forecasts
Figure 3.1: This figure shows how the WRNN technique is used for one-step-ahead pre-
diction. As in figure 2.3, the numbered circles are SDWT coefficients, and the numbers
represent the first timestep of the signal for the DWT of which they were first calculated.
The red arrows represent which coefficients are used as inputs into the RNN at each
layer, and the red nodes represent the predicted coefficients. Thus, at the approxima-
tion coefficient layer, the only input is the most recent approximation coefficient, and
the RNN’s internal state. At the top detail coefficient layer, the inputs are the most
recent detail coefficient at that layer, and the predicted approximation coefficient. The
indicated DWT tree is then extracted and inverted, providing a signal of length 2^3, from which the final element is taken as the one-step-ahead prediction.
because the error in the forecast at each layer gets amplified as it is propagated down
through the networks. When this happens, the predictions at the approximation and
high-level detail coefficients can be quite good, leading to the right 'big picture' forecasts. However, the predictions at low level detail coefficients will be quite bad, leading to high variability between forecasts at different time steps. We see this effect in some of the experiments below.
Figure 3.2 shows an interesting example case for under trained LSTM and Multi
WRNN models. Both models underestimate the upwards trajectory at the end of
the signal, but the WRNN method noticeably captures the higher resolution activity
around t = 375, while the LSTM method does not. We were not able to come up
with an evaluation metric that captures this difference, but anecdotal examples like
this one lend credence to our intuition that the WRNN should pick up and learn from
activity at different resolutions more easily than a traditional RNN. This example was hand-picked to illustrate this point, but we see this behavior frequently, especially in signals with pronounced multiresolution structure.
Chapter 4
Model Interpretability
We would like to understand which past points in a time series influence the model's predictions, and by how much. To that end, we developed influence plots that show the gradient of the prediction with respect to each past point in the time series.
Points with near-zero gradients can be thought of as having very little effect on the
prediction in that if they were moved, the prediction would remain nearly unchanged.
Points with large gradients can be thought of as having substantial influence on the
prediction of interest. If the gradient is positive, a higher observed value for that point would push the prediction higher; if it is negative, a higher observed value would push the prediction lower.

Ideally, we would conduct experiments with humans to evaluate the effects of providing an explanation for the model's decisions. Due to resource limitations, we were not able to conduct such
experiments.
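The computation behind these plots is a single gradient. The sketch below uses the autograd package (which the wrnn package also uses, per section 6.6) with a toy stand-in forecaster; the real tool differentiates the WRNN's prediction.

```python
import autograd.numpy as np
from autograd import grad

def forecast(x):
    """A toy stand-in forecaster: an exponentially weighted average of the
    past. Any autograd-differentiable signal-to-scalar function works here."""
    weights = 0.5 ** np.arange(len(x))[::-1]
    return np.sum(weights * x) / np.sum(weights)

x = np.linspace(0.0, 1.0, 16)
influence = grad(forecast)(x)   # one gradient entry per historical point:
                                # near-zero entries barely affect the forecast,
                                # large entries mark influential points
```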
From personal experience, though, we do find that these tools help with model
diagnostics, and provide insight into the differences between different models. We
also find that in simple signals, the visual explanation provided by the influence plots accords with our intuitions about which points should matter for the prediction.
In forecasting problems, the most recent observations are the most important.
This tendency is clearly reflected in our influence plots. We are also interested in
finding out which features the model finds important when making its one-step-
ahead predictions over the duration of the time series. Our software can generate an
animation that shows these influence plots at each step of the time series. Figure 4.1
shows a static version of these animations. Each plot shows the importance of every
point in the time series for the one-step-ahead prediction at a given timestep.
Figure 4.1 shows the gradient of all past timesteps for one-step-ahead prediction
for t=33 to t=49 for the WRNN model and an LSTM network. In the WRNN case,
the influence of large sections of the time series is substantial throughout the predictions.
We can see this in the figure because more of the past of the signal is visible, indicating
that the gradient of the prediction with respect to more time points is significant. In
the LSTM case, the influence of past sections of the time series fades much more
quickly.
This shows, at a minimum, that the WRNN model is making use of more historical information than the LSTM.
Figure 4.1: This figure shows influence plots for one-step-ahead prediction from time
step t = 32 to t = 51 for both the WRNN (4.1a) and LSTM (4.1b) methods on a
simulated signal. Each historical time point is colored according to the gradient of the
prediction with respect to that time point. Blue corresponds to a positive gradient, and
red corresponds to a negative gradient. The opacity of the color corresponds to the size
of the gradient. For each set of influence plots, the opacity is normalized so that the largest gradient is fully opaque.
These plots also give us some insight into which past features of the time series
the model considers to be important for its predictions at different time points. This kind of insight could be particularly valuable in the clinical setting.
We can also examine influence plots by the resolution of the time series. In figure
4.2, we look at the influence of each previous point with respect to the prediction at
each layer of the SDWT. We would expect to see that the approximation coefficient
and level 5 detail coefficient have the widest window by which they’re influenced, and
that the higher resolution coefficients are influenced by shorter term activity.
One avenue for future work suggested by these plots is the development of a method that clusters the subsignals found to be influential at each resolution into motifs. This could in some ways be considered a multiresolution extension of the existing work on motifs [20].
Chapter 5

Experiments and Results

We compare the WRNN, in both its single network (section 3.1) and multi network (section 3.2) variations, against several existing methods.
First, we use an LSTM network [13]. LSTM networks have become a popular
RNN option, and are implemented in most modern machine learning frameworks.
We use the tensorflow [27] implementation. While LSTMs are a powerful tool, they
don’t have a notion of multiple scales like the other methods do.
multiple time scales by segmenting its hidden state, and updating each segment at
varying intervals. These intervals must be specified ahead of time - they are not
inherently learned from data. We use the open source implementation in the theanets
library.
Third, we use the HMLSTM. This model was described in a very recent paper, so at the time of writing there was no open source implementation available; we therefore wrote our own implementation of it. This code is described in more detail in section 6.4. Unlike the CWRNN, the HMLSTM learns the relevant time scales from the data.

Finally, we use an ARIMA model. Instead of specifying the ARIMA parameters uniformly, we fit the parameters by using the "auto.arima()" function in the forecast package in R. This finds the best parameters as measured by the Bayesian information criterion (BIC). To make one step ahead predictions at each step, we fit a new ARIMA model at each timestep.
To evaluate these models’ performance, we use a mean squared error over all
one-step-ahead predictions.
The experiments on real data and the experiments on simulated data both com-
pare the models’ average loss on one-step-ahead predictions at every timestep of every
signal. We use two layers in the HMLSTM model. We use 1, 4, 16, and 32 as clock
speeds in the clockwork RNN. We use the "auto.arima()" function in the forecast package to select ARIMA parameters; to make one-step-ahead predictions, we fit a new ARIMA model at each timestep. We train the HMLSTM,
the CWRNN, and the LSTM directly against squared error between the signal and
the prediction. We train the WRNN methods against squared error between the pre-
dicted SDWT coefficients and the raw SDWT coefficients. Both WRNN methods are evaluated, like the other models, on their time-domain predictions.

The simulated signals are generated by a package written as part of this work and described further in section 6.3. The package generates data that share randomly generated structure across multiple time scales. We believe this makes it
a useful dataset to test the WRNN on, as one of our goals is to be able to learn
precisely such structure. In our first experiment, we compare the quality of one-step-
ahead predictions on simulated data. These results are listed in table 5.1.
Next, we add Gaussian noise to the data and evaluate one-step-ahead predictions
on the noisy data. Table 5.2 shows results for Gaussian noise with standard deviation
equal to 1, table 5.3 shows results for standard deviation equal to 5. Adding noise
degrades prediction quality across methods. In the high noise setting, we can see that
the WRNN outperforms the other neural network based methods. This is likely due to the fact that the wavelet representation helps separate the underlying structure from the noise.
method loss
lstm 0.314930
hmlstm 0.310478
cwrnn 0.294125
arima 0.304199
Table 5.1: Mean squared error of one step ahead prediction on simulated data.
method loss
lstm 1.314859
hmlstm 1.387885
cwrnn 1.290917
arima 0.859301
Table 5.2: Mean squared error on one step ahead prediction on simulated data with Gaussian noise (standard deviation 1).
method loss
lstm 18.868434
hmlstm 29.632380
cwrnn 21.201608
arima 5.185149
Table 5.3: Mean squared error on one step ahead prediction on simulated data with Gaussian noise (standard deviation 5).
Figure 5.1 shows the one-step-ahead predictions of each method evaluated in this experiment.

The data were originally collected as part of the CAST [30] trial. This trial was designed to study antiarrhythmic drug treatments in patients who had recently experienced myocardial infarction. As part of the trial, heart rate information was collected from most participants both before treatment
and after treatment. We used data collected before treatment from patients randomly
assigned to each treatment group. There was pre-treatment data available for 762
patients.
Each patient in the data set is labeled with a letter indicating which treatment
option they were assigned to - e for encainide, f for flecainide and m for moricizine -
followed by a three digit number. For example, the first patient assigned to treatment
with encainide is labeled as e001. We used the first 203 patients from each treatment
group as the training data, and the remaining patients as the test set. This comes to 609 training signals and 153 test signals.
The raw data takes the form of an annotated ECG for each patient. For our purposes, we needed to convert each ECG into a heart rate time series.
We used the ann2rr tool from the WFDB toolbox to extract RR interval infor-
mation, and to remove readings that were annotated as invalid. This gave us the heart rate over time for each patient.
In figure 5.2, we can see that most of the methods learn to follow the signal for
one step ahead prediction. We were unable to achieve a reasonable fit for the Multi
WRNN method on this data set. The Multi WRNN method is especially difficult to
train because the network at each layer of the SDWT relies on the output from the
network at the previous layer. As such, it's necessary to calibrate the layers to each other.
method loss
lstm 25.082348
hmlstm 24.320284
cwrnn 22.194518
arima 22.486587
Table 5.4: Mean squared error on one step ahead prediction of heart rate data.
We note that the WRNN seems to need more neurons to achieve performance comparable with other neural network based methods. This is likely because the WRNN must model the signal at every level of the SDWT simultaneously.
Chapter 6
Software
Over the course of conducting research into the subject matter covered in this thesis, we developed a number of software packages, which we describe here. Most of these are open source, and the last two will be open sourced at some time in the future.
6.1 warp
This package provides an array of functions that manipulate functions. The important functions in the package include 'elongate', 'compress', and 'add noise', among others. Each takes a function and parameters to describe that manipulation, and returns a new function. The package also provides utilities for generating parameterized distortion functions, and for inverting arbitrary distortion and scaling functions.
6.2 dwtviz
This package provides an array of visualization tools for the discrete wavelet trans-
form, the stationary wavelet transform, and the sequential discrete wavelet transform.
For examples of the kinds of visualizations this library can generate, please see
section 1.1.1.
6.3 signal generation

This package is built upon the warp package described in section 6.1. It generates
some number of ’motifs’, which are functions with finite, random support. These
represent the shared structure amongst the generated signals. Each motif has some
distribution over ’scale’, i.e. how long that motif can last in a signal, and ’amplitude’.
Each signal is created by optionally designating each timestep as the start of one of
the underlying motifs. Once a motif is so designated, the scale and amplitude of that
instance of the motif are determined by sampling from the appropriate distributions.
The user can determine how often motifs begin in the generated signals and how much noise is added to them.
The ’generate data’ function is available if the user is interested in viewing the
underlying motifs and the distributional parameters that determine the distributions
from which the scale, amplitude and distortion functions are sampled.
The ’generate signals’ function is available if the user is only interested in the
generated signals.
6.4 hmstlm
This package is an implementation of the hierarchical multiscale long short term
memory network, described in [15]. The package also includes utility functions for
preprocessing data, and utility functions for visualizing the predictions made and the boundaries detected by the network.
6.5 sdwt
The sdwt package performs the SDWT described in chapter 2 on any signal, to any
level. It also provides additional utilities to return SDWT coefficients in a format that
facilitates the single network WRNN. It implements the discrete wavelet transform
using autograd, to make it possible to take derivatives through the SDWT and ISDWT.
6.6 wrnn
The wrnn package implements both the single network and multi network versions of the WRNN, described in chapter 3. It also implements the ideas described in chapter 4 on interpretability. All the influence plots in that chapter were generated with this package.
This package uses the autograd package for training the wrnn, and also for calculating gradients to use in the influence plots. There is also a tensorflow [27] version
available for more computationally intensive tasks, where multiple cores or a GPU
might be needed. However, influence plots cannot currently be derived for models trained using the tensorflow version.
Chapter 7

Conclusion and Next Steps

In this work, we develop the WRNN, a method for learning multiresolution structure in time series data, and develop tools to aid with model interpretability in forecasting settings.
We find that the WRNN does not achieve state of the art results in one step
ahead forecasting, but we demonstrate some methods through which it can be used
to gain insight about the underlying data. We note, however, that the WRNN can
be especially unstable to train, due to the fact that it increases the problem from one of predicting a single value at each timestep to one of predicting a coefficient at every level of the SDWT.

We view this work as a starting point for further exploration of the multiresolution
structure present in medical time series; there are many avenues by which it can be
expanded. First, we are interested in enriching the SDWT itself. In particular, we are interested in allowing the data to dictate which resolutions are included in the transform, rather than mandatorily including every resolution 2^j for j ∈ {1, ..., J}. This would also allow us to expand the scope of each DWT in the SDWT without a corresponding increase in the number of coefficients.
Future work on the question of model interpretability in forecasting will entail refining the set of tools described, and considering the effects of the influence plot design choices. We are also interested in applying statistical methods to these gradients to characterize the structure underlying the data.
For example, we may be able to cluster the kinds of subsignals that are important at
different resolutions into a set of ’motifs’. We are also interested in exploring the pos-
sibility of generalizing the idea of influence plots to work with some of the multiscale models discussed in chapter 1.
Future work with the WRNN can proceed in several directions. First, we would
like to extend the method to multi-step-ahead prediction. We are hopeful that its
multiresolution nature will lend itself well to making appropriate predictions as the prediction horizon grows. Second, we would like to use the WRNN for classifying time series, and for predicting the class that a time series will take on in the future. In the medical setting, we might be interested in classifying the patient whose data we observe as being at risk for some future adverse event. Finally, we would like to extend the method to handle multiple streams of data in a unified manner. In the medical domain, a single patient produces many streams of data, some
of which are highly correlated because they are reflective of the same underlying sys-
tems. We are hopeful that we can extend the WRNN idea to capture multiresolution structure that is shared across these streams.
Bibliography
[1] A. Haar, "Zur Theorie der orthogonalen Funktionensysteme." [Online]. Available: http://eudml.org/doc/158469

[3] S. Mallat, A Wavelet Tour of Signal Processing: The Sparse Way. Elsevier Science, 2008.

[4] D. L. Fugal, Conceptual Wavelets in Digital Signal Processing: An In-depth, Practical Approach for the Non-mathematician. Space & Signals Technical Pub., 2009.

[5] M. S. Crouse, R. D. Nowak, and R. G. Baraniuk, "Wavelet-based statistical signal processing using hidden Markov models," IEEE Transactions on Signal Processing, vol. 46, no. 4, pp. 886–902, 1998.

[6] J. S. Morris and R. J. Carroll, "Wavelet-based functional mixed models," Journal of the Royal Statistical Society B, vol. 68, pp. 179–199, 2006.

[7] S. Ray and B. Mallick, "Functional clustering by Bayesian wavelet methods," Journal of the Royal Statistical Society B, vol. 68, pp. 305–332, 2006.

[8] I. Johnstone and B. W. Silverman, "Wavelet threshold estimators for data with correlated noise," Journal of the Royal Statistical Society B, 1997.

[11] S. Hochreiter, "The vanishing gradient problem during learning recurrent neural nets and problem solutions," Int. J. Uncertain. Fuzziness Knowl.-Based Syst., vol. 6, no. 2, pp. 107–116, Apr. 1998. [Online]. Available: http://dx.doi.org/10.1142/S0218488598000094

[13] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. [Online]. Available: http://dx.doi.org/10.1162/neco.1997.9.8.1735

[14] J. Koutnik, K. Greff, F. Gomez, and J. Schmidhuber, "A Clockwork RNN," in Proceedings of the 31st International Conference on Machine Learning, 2014.

[15] J. Chung, S. Ahn, and Y. Bengio, "Hierarchical multiscale recurrent neural networks," 2016.

[16] A. van den Oord et al., "WaveNet: A generative model for raw audio," 2016.

[18] S. Mallat, "Group invariant scattering," Communications on Pure and Applied Mathematics, 2012.

[19] J. Bruna and S. Mallat, "Invariant scattering convolution networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1872–1886, 2013.

[21] M. T. Ribeiro, S. Singh, and C. Guestrin, ""Why should I trust you?": Explaining the predictions of any classifier," 2016.

[22] R. R. Selvaraju et al., "Grad-CAM: Visual explanations from deep networks via gradient-based localization," 2016.

[25] G. Nason and B. Silverman, "The stationary wavelet transform and some statistical applications," 1995.

[26] B. Kim, R. Khanna, and S. Koyejo, "Examples are not enough, learn to criticize! Criticism for interpretability," in Advances in Neural Information Processing Systems, 2016.

[27] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. [Online]. Available: http://tensorflow.org/

[29] A. L. Goldberger et al., "PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals," Circulation, vol. 101, no. 23, pp. e215–e220, 2000 (June 13). doi: 10.1161/01.CIR.101.23.e215.

[30] "... myocardial infarction: insights from the Cardiac Arrhythmia Suppression Trial (CAST)," Clinical Cardiology, 2000.
Vita
Noam Finkelstein was born in New York City on 19 July 1990. Following his graduation from Tenafly High School in 2008, he attended Deep Springs College, a two-year college in California. After several years working as a software developer and research engineer, he enrolled in the graduate program at Johns Hopkins University, where he developed an interest in statistical methods in clinical medicine. In the fall of 2017, he will enroll in a PhD program.