
QBUS6840 Predictive Analytics
Forecasting with Neural Networks and Deep Learning
University of Sydney Business School

Recommended reading

These lecture slides are comprehensive enough for your study. Optional readings include:
- Online textbook Sections 9.1 and 9.3: introduce (very briefly) some concepts in neural networks.
- A comprehensive book is Deep Learning by Goodfellow, Bengio and Courville, freely available at https://www.deeplearningbook.org

Table of contents

- Introduction
- Fundamental concepts

Learning objectives

- Understand the importance of data representation in data analysis, and that neural network modelling and deep learning are efficient data representation tools
- Understand some basic concepts of neural networks (NN) and deep learning (DL)

Introduction

- In regression modelling, it is sometimes advisable to add interaction terms X_i × X_j or quadratic terms X_i² to the model.
- These terms are examples of non-linear effects: when appropriate non-linear effect terms are added to the regression/classification model, the prediction accuracy improves.
- How do we select non-linear effect terms? When should they be added?
- Sometimes this can be done manually, but it requires domain knowledge, trial and error: not efficient and not always possible!

Introduction: a simple example

- Let's look at the Direct Marketing dataset (provided on Canvas).
- There are 11 covariates in total. The response is AmountSpent.
- Let's use the first 900 observations as training data and the remaining 100 as test data (in practice, the data should be shuffled first).

Introduction: first, try the full linear regression model

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

DM = pd.read_csv('DirectMarketing.csv')
n = 900; n_test = 1000 - n
lm = smf.ols('AmountSpent ~ Children + Catalogs + Salary + Gender_b + Married_b + Location_b + Ownhome_b \
              + Age_y + Age_m + Hist_m + Hist_h', DM.head(n)).fit()
predictions = lm.predict(DM.tail(n_test))
DM = DM.to_numpy().astype(float)     # as_matrix() in the original slides is deprecated in recent pandas
y_test = DM[n:, 11]                  # column 11 is the response AmountSpent
MSE_lm = np.mean((predictions - y_test)**2)
print('Root of MSE on the test data for linear regression: ', np.sqrt(MSE_lm))

Root of MSE on the test data for linear regression: 604.499026646

The MSE of the prediction on the test data D_test is defined as

  MSE = (1/n_test) Σ_{y_i ∈ D_test} (ŷ_i − y_i)²

To ease comparison, let's use the square root, RMSE = √MSE, to get back to the original scale of the response.

A better linear regression model

DM = pd.read_csv('DirectMarketing.csv')
lm = smf.ols('AmountSpent ~ Children + Catalogs + Salary + Children*Salary + Location_b + Hist_m', DM.head(n)).fit()
lm.summary()
predictions = lm.predict(DM.tail(n_test))
DM = DM.to_numpy().astype(float)
y_test = DM[n:, 11]
MSE_lm = np.mean((predictions - y_test)**2)
print('Root of MSE on the test data for linear regression: ', np.sqrt(MSE_lm))

Root of MSE on the test data for linear regression: 584.887682063

Now use a neural network model

np.random.seed(1000)                       # fix random seed
DM = pd.read_csv('DirectMarketing.csv')    # import data
DM = DM.to_numpy().astype(float)
DM_train = DM[0:n, :]; DM_test = DM[n:, :]
X_train = DM_train[:, 0:11]; y_train = DM_train[:, 11]
X_test = DM_test[:, 0:11]; y_test = DM_test[:, 11]

# standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
# apply the same transformation to the test data
X_test = scaler.transform(X_test)

# now build the neural net model
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(11, input_dim=11, activation='relu'))   # the first hidden layer has 11 units, the input has 11
model.add(Dense(11, activation='relu'))                  # add another hidden layer with 11 units
model.add(Dense(1, activation='linear'))                 # the output layer has 1 unit with the linear activation

# Compile the model
model.compile(loss='MSE', optimizer='adam')
# Fit the model
model.fit(X_train, y_train, epochs=100, batch_size=10)
# Evaluate the model
MSE_nn = model.evaluate(X_test, y_test)
print('\n Root of MSE on the test data for neural net: ', np.sqrt(MSE_nn))

Root of MSE on the test data for neural net: 502.117223614

Introduction: a simple example

So for this dataset, which model is better in terms of prediction accuracy?

Introduction

- Neural networks and deep neural networks (called deep learning) have become an exciting research and application area in the last few years.
- Deep learning is widely known for its high prediction accuracy.
- It has been successfully applied to many large-scale industry problems, such as image recognition and language processing.
- Its secret is Data Representation Learning.

Introduction: Representation Learning

- We want to predict a response Y, based on raw/original covariates X = (X_1, ..., X_p), using linear regression modelling.
- Usually, before doing regression modelling, some appropriate transformations of the covariates X_i are needed: Z_1 = φ_1(X), ..., Z_d = φ_d(X).
- The Z_i are called predictors or features.
- Then we model

  E(Y|X) = β_0 + β_1 Z_1 + ... + β_d Z_d

- Selection of the transformations φ_i(X) is an art!
- Z = (Z_1, ..., Z_d) is a representation of X = (X_1, ..., X_p). A better representation (in terms of predicting Y) leads to better prediction accuracy.

Introduction: Representation Learning

Neural network modelling is a representation learning method. It provides an efficient way to design a representation Z = φ(X) that is effective for predicting the response Y.

[Image credit: Ding Phung]

What are neural networks?

They are a set of very flexible non-linear methods for regression/classification and other tasks.

A neural network, also called an artificial neural network (ANN), is a computational model that is inspired by the network of neurons in the human brain.

- A neural network is an interconnected assembly of simple processing units or neurons, which communicate by sending signals to each other over weighted connections.
- A neural network is made of layers of similar neurons: an input layer, (one or many) hidden layers, and an output layer.
- The input layer receives data from outside the network. The output layer sends data out of the network. Hidden layers receive/process/send data within the network.
- A neural network is said to be deep if it has many hidden layers. Deep neural network modelling is collectively referred to as deep learning.

What are neural networks?

- In a nutshell, a neural net is a multivariate function: the output η is a function of the inputs X = (X_1, ..., X_p)':

  η = f(X_1, ..., X_p)

- More precisely, this function is a layered composite function:

  Z_1 = f_1(X)
  Z_2 = f_2(Z_1)
  ...
  Z_L = f_L(Z_{L−1})
  η = f_{L+1}(Z_L)

What are neural networks?

A neural network provides a mechanism for functional approximation.
- Suppose that f_true(X) is a true, yet unknown, function that we want to estimate. E.g., f_true(X) = E(Y|X): the conditional mean of a response Y given X.
- A neural net with output η = f(X) provides an approximation of f_true(X), i.e. we use f(X) to approximate f_true(X).

Note

There are several variants of neural networks:
- The network structure considered so far is often called a feed-forward neural network, which is most suitable for cross-sectional data. It can be used for time series data too.
- In the next lecture, you will study recurrent neural networks, which are most suitable for time series data.

Fundamental concepts

Elements of a neural network

A (feedforward) neural net includes
- a set of processing units (also called neurons or nodes)
- weights w_ik, which are connection strengths from unit i to unit k
- a propagation rule that determines the total input S_k of unit k, from the units that send information to unit k
- the output Z_k for each unit k, which is a function of the input S_k
- an activation function h_k that determines the output Z_k based on the input S_k: Z_k = h_k(S_k)

Elements of a neural network

It's useful to distinguish three types of units:
- input units (often denoted by X): receive data from outside the network
- hidden units (often denoted by Z): receive data from and send data to units within the network
- output units: send data out of the network. The type of the output depends on the task (regression, binary classification or multinomial regression). In many cases, there is only one scalar output unit.

Given the signal from a set of inputs X, an NN produces an output.

Elements of a neural network

The total input sent to unit k is

  S_k = Σ_i w_ik Z_i + w_0k

which is a weighted sum of the outputs from all units i that are connected to unit k, plus a bias/intercept term w_0k.

Then, the output of unit k is

  Z_k = h_k(S_k) = h_k( Σ_i w_ik Z_i + w_0k )

Usually, we use the same activation function h_k = h for all units.

Elements of a neural network

Popular activation functions:
- Sigmoid activation function: h(S) = 1 / (1 + e^{−S})
- Tanh activation function: h(S) = (e^S − e^{−S}) / (e^S + e^{−S})
- Rectified (ReLU) activation function: h(S) = max(0, S), i.e. h(S) = S if S > 0 and h(S) = 0 if S ≤ 0
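
For illustration, here is a minimal NumPy sketch (not part of the lecture code) of the three activation functions above and of a single unit's output Z_k = h(S_k); the numbers are made up.

import numpy as np

def sigmoid(S):
    return 1.0 / (1.0 + np.exp(-S))

def tanh(S):
    return (np.exp(S) - np.exp(-S)) / (np.exp(S) + np.exp(-S))   # same as np.tanh(S)

def relu(S):
    return np.maximum(0.0, S)

# one unit k receiving signals Z_i from three other units
Z_in = np.array([0.2, -1.0, 0.5])    # outputs of the sending units
w_k  = np.array([0.4,  0.1, -0.3])   # connection weights w_ik
w_0k = 0.3                           # bias/intercept term
S_k = w_k @ Z_in + w_0k              # total input to unit k
Z_k = relu(S_k)                      # output of unit k with the ReLU activation
print(S_k, Z_k)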

Neural Net as a Data Representation Learning tool

- We want to predict a response Y, based on p raw covariates X = (X_1, ..., X_p)'.
- We want to represent X by d predictors/features Z = (Z_1, ..., Z_d)' = φ(X), before predicting Y based on Z.
- Neural network modelling is a data representation learning method that transforms X into

  Z = φ(X) = φ(X, w)

  with the hope that predicting Y using linear regression/classification techniques based on Z is more accurate than predicting based on X directly.
- The idea is that we tune/train w to achieve this goal.

Neural Net as a Data Representation Learning tool

Graphical representation of a neural net with L = 3 hidden layers. The input layer represents the raw covariates X. The last hidden layer (hidden layer 3) represents the predictors Z.
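
To make the representation view concrete, here is a hedged Keras sketch (not from the lecture files): after training a model like the one built earlier, the learned representation Z = φ(X, w) is simply the output of the last hidden layer, which can be read off with an auxiliary Model. The layer sizes follow the earlier Direct Marketing example; details may vary with your Keras version.

from keras.models import Sequential, Model
from keras.layers import Dense

model = Sequential()
model.add(Dense(11, input_dim=11, activation='relu'))
model.add(Dense(11, activation='relu'))    # last hidden layer: the representation Z
model.add(Dense(1, activation='linear'))   # linear output
model.compile(loss='mse', optimizer='adam')
# ... fit the model on (X_train, y_train) as before ...

# auxiliary model mapping X to the last hidden layer's output Z
representation = Model(inputs=model.input, outputs=model.layers[-2].output)
# Z_train = representation.predict(X_train)   # d = 11 learned features

One could then fit any standard regression or classification model on these learned features instead of the raw X.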


Neural Net as a Representation Learning tool

Denote the final output of the neural net as

  η = γ_0 + γ_1 Z_1 + ... + γ_d Z_d

with γ = (γ_0, ..., γ_d)'.

Note that η is a function of X and depends on w and γ:

  η = η(X, w, γ)

w is the set of weights that connects the covariates X to the predictors Z, and γ is the set of weights that connects Z to η. We will use η(X, w, γ) to approximate f_true(X).

Forward propagation algorithm*

Slides with * are highly technical. You're encouraged to go through them, but they are not tested in the exams.

Forward propagation algorithm for computing the output:
- Consider a neural net with the structure (p, ℓ^(1), ..., ℓ^(L), 1).
- The input layer has p covariates X_1, ..., X_p.
- There are L hidden layers: the first hidden layer has ℓ^(1) units, the second hidden layer has ℓ^(2), etc.
- The last layer is a single output η.

Forward propagation algorithm*

- Let w_uv^(j) be the weight from unit u in the previous layer j−1 to unit v in layer j. Layer j = 0 is the input layer, ℓ^(0) := p.
- The total input to unit v of layer j is

  S_v^(j) = w_0v^(j) + Σ_{u=1}^{ℓ^(j−1)} w_uv^(j) Z_u^(j−1) = w_v^(j)' Z^(j−1)

  Its output is Z_v^(j) = h(S_v^(j)).

Forward propagation algorithm*

Notation:
- w_v^(j) = (w_0v^(j), w_1v^(j), ..., w_{ℓ^(j−1)}v^(j))' : the set of weights that sends signals to unit v of layer j
- S^(j) = (S_1^(j), ..., S_{ℓ^(j)}^(j))' : vector of total inputs to layer j, j = 1, ..., L
- Z^(j) = (1, Z_1^(j), ..., Z_{ℓ^(j)}^(j))' : vector of outputs from layer j, with Z^(0) := (1, X_1, ..., X_p)'

Forward propagation algorithm*

- Let W^(j) be the matrix of all weights from layer j−1 to layer j: its v-th row is w_v^(j)', so W^(j) has ℓ^(j) rows, and its columns correspond to the bias w_0v^(j) and the ℓ^(j−1) units of layer j−1.
- Then

  S^(j) = W^(j) Z^(j−1)

- The final output of the network is

  η = γ_0 + γ_1 Z_1^(L) + ... + γ_{ℓ^(L)} Z_{ℓ^(L)}^(L) = γ' Z^(L)

Forward propagation algorithm*

Pseudo-code algorithm for computing the output.

Input: covariates X_1, ..., X_p and weights w = (W^(1), ..., W^(L)), γ
Output: η
- Z^(0) := (1, X_1, ..., X_p)'
- For j = 1, ..., L:
  - S^(j) = W^(j) Z^(j−1)
  - Z^(j) = (1, h(S^(j)))'
- η = γ' Z^(L)
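
The pseudo-code above maps directly to a few lines of NumPy. The following is a minimal sketch (not from the lecture code) with ReLU as the common activation h; the weights are random just to make it runnable.

import numpy as np

def forward(X, W_list, gamma, h=lambda S: np.maximum(0.0, S)):
    """Compute the network output eta for one input vector X."""
    Z = np.concatenate(([1.0], X))           # Z^(0) = (1, X_1, ..., X_p)'
    for W in W_list:                         # layers j = 1, ..., L
        S = W @ Z                            # S^(j) = W^(j) Z^(j-1)
        Z = np.concatenate(([1.0], h(S)))    # Z^(j) = (1, h(S^(j)))'
    return gamma @ Z                         # eta = gamma' Z^(L)

# tiny example: p = 3 inputs, two hidden layers with 4 and 2 units
rng = np.random.default_rng(0)
W_list = [rng.normal(size=(4, 4)), rng.normal(size=(2, 5))]   # each row starts with the bias weight
gamma = rng.normal(size=3)                                     # gamma_0, gamma_1, gamma_2
print(forward(np.array([0.5, -1.0, 2.0]), W_list, gamma))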

Neural net for regression

Suppose that the response Y is numerical. The model is

  Y = η(X, w, γ) + ε = γ_0 + γ_1 Z_1 + ... + γ_d Z_d + ε

where ε is an error term with mean 0 and variance σ². Often, we assume ε ~ N(0, σ²).

The least squares method can now be used to estimate the model parameters θ = (w, γ, σ²).

Note on Python: in Python, the activation function of the output unit for regression is defined as the identity function, named linear.

Neural net for forecasting

Given a neural network, we now know how to compute its output η from an input vector X.
How is this output used for forecasting?

Training a neural net

- Given that a neural net model has been developed, and given a dataset {y_i, x_i = (x_i1, ..., x_ip)'}, i = 1, ..., n, the most difficult task is to estimate the model parameters θ.
- Other problems in neural network modelling:
  - How to select the number of hidden layers?
  - How to select the number of units in each hidden layer?
  - How to perform variable selection?
  - etc.

Next...

- We look at neural nets for regression in detail
- How to train a neural net model
- How to use a neural net for prediction with cross-sectional data and time series data

QBUS6840 Predictive Analytics
Forecasting with Neural Networks and Deep Learning 2
University of Sydney Business School

Table of contents

- Neural nets for cross-sectional data
- Neural networks for time series

Learning objectives

- Know the methods used to train/estimate a neural network model, and the difficulties in training
- Know how to use a neural network for prediction with cross-sectional data and time series data

Neural Networks for cross-sectional data

Neural net for cross-sectional data

Suppose that the response Y is numerical. From now on, we will use w to denote ALL the weights in the neural net. The output is η(X, w).

The neural net model for regression is

  Y = η(X, w) + ε

where ε is an error term with mean 0 and variance σ². Often, we assume ε ~ N(0, σ²).

The least squares method can now be used to estimate the model parameters.

Least squares method for training

- Let {y_i, x_i = (x_i1, ..., x_ip)'}, i = 1, ..., n, be the training dataset.
- The neural net regression model can be written as

  y_i = η(x_i, w) + ε_i,   i = 1, ..., n

- The parameters are θ = (w, σ²).
- Define the loss function to be the sum of squared errors:

  Loss(θ) = Σ_i ℓ_i(θ),   ℓ_i(θ) = (y_i − η(x_i, w))²

- Least squares (LS) minimizes Loss(θ) to estimate θ.

Difficulties in training a neural net

We need to solve an optimization problem: find θ that minimizes Loss(θ).

Difficulties:
- There are a huge number of parameters.
- The surface of the loss function is often highly multimodal.
- We often need big data, so training is computationally expensive.

In most cases, neural net models are trained by the Stochastic Gradient Descent (SGD) method.

Multimodality issue

Gradient descent method for optimization

Suppose that we want to minimise a function Loss(θ).
- Start from an initial θ^(0).
- Update

  θ^(t+1) = θ^(t) − a_t ∇_θ Loss(θ^(t)),   t = 0, 1, 2, ...

  until some convergence condition is met.

Here ∇_θ Loss(θ) is the gradient vector of Loss(θ).

a_t > 0 is called the learning rate or step size; if a_t → 0 as t → ∞, θ^(t) is guaranteed to converge to a local minimum of Loss(θ).
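
As a toy illustration (not from the lecture code), the update rule above applied to Loss(θ) = (θ − 3)², with a constant step size for simplicity, drives θ^(t) towards the minimiser θ = 3:

import numpy as np

def grad_loss(theta):
    return 2.0 * (theta - 3.0)    # gradient of Loss(theta) = (theta - 3)^2

theta = 0.0                       # theta^(0): initial value
a_t = 0.1                         # learning rate / step size
for t in range(200):
    theta = theta - a_t * grad_loss(theta)   # theta^(t+1) = theta^(t) - a_t * gradient
print(theta)                      # converges to the minimiser theta = 3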

Stochastic gradient descent

- For training a NN model, Loss(θ) = Σ_{i=1}^n ℓ_i(θ), so

  ∇_θ Loss(θ) = Σ_{i=1}^n ∇_θ ℓ_i(θ)

- In deep learning, we often need BIG DATA, so n is HUGE. The gradient ∇_θ Loss(θ) is then a BIG sum and computationally expensive to compute. We often use stochastic gradient descent (SGD).
- In SGD, instead of using the exact/full-data gradient ∇_θ Loss(θ), we use an estimate ∇̂_θ Loss(θ) based on a random subset of the data (of size m ≪ n):

  ∇̂_θ Loss(θ) = (n/m) Σ_{i ∈ subset} ∇_θ ℓ_i(θ)
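
A hedged sketch of the idea (not from the lecture code): for a toy model y ≈ θx with squared-error terms ℓ_i(θ), the full-data gradient is replaced by the rescaled sum over a random minibatch of size m ≪ n.

import numpy as np

def grad_li(theta, xi, yi):
    # gradient of one term l_i(theta) = (y_i - theta*x_i)^2
    return -2.0 * xi * (yi - theta * xi)

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)
y = 2.5 * x + rng.normal(scale=0.1, size=n)    # true theta = 2.5

theta, m = 0.0, 50                             # minibatch size m << n
for t in range(1000):
    idx = rng.choice(n, size=m, replace=False)                    # random subset of the data
    grad_hat = (n / m) * np.sum(grad_li(theta, x[idx], y[idx]))   # estimated full-data gradient
    theta -= (0.01 / n) * grad_hat                                 # small SGD step
print(theta)    # close to the true value 2.5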

How to select the learning rate a_t?

- In SGD, and stochastic optimization in general, the convergence depends enormously on the step size a_t.
- In DL in particular, a naive choice of a_t, such as a_t = 1/t, doesn't work at all.
- a_t must be designed adaptively in a smarter way:
  - select a_t based on the behaviour of θ^(0), θ^(1), ..., θ^(t−1) so far
  - popular methods: ADAM, AdaGrad, etc.

Back-propagation algorithm

- All we need now is to compute the gradient of the neural net output, ∇_θ ℓ(θ). This is often a very long vector.
- At first glance, it seems difficult to compute this gradient vector, but luckily...
- An efficient algorithm for computing it is the back-propagation algorithm. This algorithm has been implemented in Python.
- The back-propagation algorithm is crucial for the success of DL.

Practice recommendations

- It's not unusual that people have to spend a few days, or even weeks and months, to train a deep learning model.
- If you can train a deep learning model successfully, then in most cases you beat conventional models (linear regression, logistic regression, LDA, etc.) in terms of prediction accuracy.
- But training a deep learning model successfully requires some effort. There are some implementation tricks that you might find useful in practice.

Practice recommendations

Data standardization
- The input/covariate data should be standardized before training your model.
- Each numerical column in the training dataset should be standardized so that it has a zero sample mean and standard deviation 1. Note: part of this set, called the validation set, is used for tuning hyperparameters; the rest is still called the training set.
- Often, there is no need to standardize binary columns.

Practice recommendations

Activation function
- The sigmoid activation has long been used in neural nets, but it has a major drawback: the gradient vector becomes almost zero when the neural net is deep.
- The rectified activation function h(S) = max(0, S) stands out as the option in many cases.
- Designing activation functions is still a "hot" research topic.

Practice recommendations

Fix the random seed
- All the computer can do when generating random variables is to generate a set of pseudo-uniform random numbers using a deterministic sequential rule:
  - start from u_0
  - u_n = rule(u_{n−1}), n = 1, 2, ...
- The starting number u_0 is referred to as the random seed (an integer number).
- If you fix the random seed, then you fix the randomness (yes, this sentence doesn't make sense philosophically!).
- It's advisable to fix the random seed when training deep learning models. Then all the results are reproducible.
- Typically in DL, the results will change when you change the random seed.

Practice recommendations

Early stopping
- Because of their flexibility, DL models often overfit the data, even when regularization priors are used.
- It's often observed that, if the model has too many layers/units, the training loss decreases steadily over the updates, but the validation loss starts increasing at some point.
- Therefore the updating procedure should be stopped if the validation loss has not decreased after a certain number of iterations (called the patience).
- Then the set of model parameters corresponding to the lowest validation loss is returned/used.
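
In Keras, this idea is available as the EarlyStopping callback. A hedged sketch (assuming a compiled model and training arrays X_train, y_train as in the earlier examples, and a reasonably recent Keras version):

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss',          # watch the validation loss
                           patience=10,                 # stop after 10 epochs without improvement
                           restore_best_weights=True)   # keep the parameters with the lowest validation loss

model.fit(X_train, y_train,
          validation_split=0.2,        # hold out 20% of the training data for validation
          epochs=500, batch_size=10,
          callbacks=[early_stop])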

Practice recommendations

How to select the structure of the neural net?
How many hidden layers should we use? How many units in each layer?
- This is a challenging model selection problem. There is no definite answer.
- But there are basically three shapes to choose from: left-base pyramid (large to small layers), right-base pyramid (small to large layers) and rectangular (roughly equal-size layers).
- In many cases, a rectangular-shaped neural net works well.
- For the number of hidden layers, simply start from 1 layer, adding more until the validation loss doesn't get smaller.

Neural Networks for Time Series

Neural Networks for Time Series: Non-linear Autoregression

- For time series, we shall use lagged values of the time series as inputs to a neural network, and the output as the prediction.
- This means, in general, the number of input neurons of the neural network is the number of time lags, and there is only one output neuron.
- We will first consider feed-forward networks with one hidden layer.

Neural Networks for Time Series: Non-linear Autoregression

- We denote by NNAR(p, k) the NN with p lagged inputs, k neurons in the hidden layer, and one output as the forecast.
- For example, an NNAR(12, 10) model is a neural network that uses the last 12 observations (y_{t−1}, y_{t−2}, ..., y_{t−12}) to fit y_t at any time step t, with 10 neurons in the hidden layer.
- An NNAR(p, 0) model is equivalent to an ARIMA(p, 0, 0) model without the restrictions on the parameters that ensure stationarity.

Neural Networks for Time Series: Training

- Consider the neural network NNAR(p, k). For a given section of the time series y_{t−p+1}, y_{t−p+2}, ..., y_t, denote the output from NNAR(p, k) by ŷ_{t+1} = η(y_{t−p+1}, y_{t−p+2}, ..., y_t; w), where w collects all the weights in the network.
- Given training data {y_1, ..., y_n}, we form the training patterns as follows:

  y_1, y_2, ..., y_p → y_{p+1} :  ε_{p+1} = y_{p+1} − η(y_1, y_2, ..., y_p; w)
  y_2, y_3, ..., y_{p+1} → y_{p+2} :  ε_{p+2} = y_{p+2} − η(y_2, y_3, ..., y_{p+1}; w)
  y_3, y_4, ..., y_{p+2} → y_{p+3} :  ε_{p+3} = y_{p+3} − η(y_3, y_4, ..., y_{p+2}; w)
  ...
  y_{n−p}, y_{n−p+1}, ..., y_{n−1} → y_n :  ε_n = y_n − η(y_{n−p}, y_{n−p+1}, ..., y_{n−1}; w)

- To estimate the weights w, we minimise

  min_w { Loss(w) = Σ_{i=p+1}^n ε_i² }

Neural Networks for Seasonal Time Series

- In addition to the lagged data as inputs, for seasonal time series it is useful to add the last observed values from the same seasons as inputs.
- The notation NNAR(p, P, k)_m means a model with inputs

  (y_{t−1}, y_{t−2}, ..., y_{t−p}, y_{t−m}, y_{t−2m}, ..., y_{t−Pm})

  and k neurons in the hidden layer.
- What is the input for NNAR(5, 4, 10)_6?

A simple Recipe

- Exploratory data analysis: apply some of the traditional time series analysis methods to identify the lag values p and P (e.g. the ACF).
- Split your data into two main sections: a training section {Y_1, Y_2, ..., Y_n} and a validation section {Y_{n+1}, Y_{n+2}, ..., Y_T}. For example, for a time series of three years of data, you may use the data from the first two years for training and the data from the third year for validation.
- Define the neural network architecture.

A simple Recipe

- Create the training patterns: e.g., for NNAR(4, k), each training pattern consists of five values, with the first four corresponding to the input neurons and the last one defining the prediction as the output unit. Here are all the data patterns for the training data {Y_1, Y_2, ..., Y_n} of an NNAR(4, k):

  Training Pattern 1:    Y_1, Y_2, Y_3, Y_4 → Y_5
  Training Pattern 2:    Y_2, Y_3, Y_4, Y_5 → Y_6
  Training Pattern 3:    Y_3, Y_4, Y_5, Y_6 → Y_7
  ...
  Training Pattern n−4:  Y_{n−4}, Y_{n−3}, Y_{n−2}, Y_{n−1} → Y_n

- Train the neural network on these patterns (a small code sketch follows).
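
A minimal sketch (not from Lecture10_Example01.py) of building these NNAR(4, k) patterns with a sliding window:

import numpy as np

def make_patterns(series, p=4):
    """Build lagged input/output pairs: (Y_t-p+1, ..., Y_t) -> Y_t+1."""
    X, y = [], []
    for t in range(p, len(series)):
        X.append(series[t - p:t])    # the p most recent values as inputs
        y.append(series[t])          # the next value as the target
    return np.array(X), np.array(y)

series = np.arange(1, 11, dtype=float)      # toy series Y_1, ..., Y_10
X_train, y_train = make_patterns(series, p=4)
print(X_train[0], y_train[0])               # [1. 2. 3. 4.] 5.0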

A simple Recipe

- Forecast using the network on the validation set {Y_{n+1}, Y_{n+2}, ..., Y_T}: here you pass in four values as the input layer and see what the output node gives.

  Validation Pattern 1: Y_{n−3}, Y_{n−2}, Y_{n−1}, Y_n → Ŷ_{n+1}
  Validation Pattern 2: Y_{n−2}, Y_{n−1}, Y_n, Y_{n+1} → Ŷ_{n+2}
  Validation Pattern 3: Y_{n−1}, Y_n, Y_{n+1}, Y_{n+2} → Ŷ_{n+3}
  ...

- Then, form the validation/test error.

The Example

- Dataset: International Airport Arriving Passengers, in csv format.
- Prepare the data: Xtrain, Ytrain, Xtest and Ytest with time_lag = 4.
- Define the NN architecture:
  model = Sequential()
  model.add(Dense(30, input_dim=time_lag, activation='relu'))
  model.add(Dense(1))
- This defines a network with 30 neurons in the hidden layer.
- Test on Lecture10_Example01.py
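
A hedged sketch of the validation step (the variable names full_series and n are illustrative, not from Lecture10_Example01.py); it reuses the make_patterns helper and a trained Keras model from the earlier sketches, and forms the one-step-ahead forecasts and the validation error:

import numpy as np

# full_series: array holding Y_1, ..., Y_T; the first n values were used for training
X_val, y_val = make_patterns(full_series[n - 4:], p=4)   # inputs Y_{n-3},...,Y_n -> target Y_{n+1}, etc.
y_hat = model.predict(X_val).flatten()                    # forecasts Y_{n+1}, Y_{n+2}, ...
rmse_val = np.sqrt(np.mean((y_hat - y_val) ** 2))         # validation/test error
print(rmse_val)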

The example: Forecasting

[Figure: forecasts on the validation set for networks with 3, 30, 100 and 500 hidden units]

QBUS6840 Predictive Analytics
Recurrent Neural Networks
Dr. Minh-Ngoc Tran
University of Sydney Business School

Recommended reading

- Online textbook Sections 9.1 and 9.3: introduce (very briefly) some concepts in neural networks.
- Deep Learning, Chapter 10, by Goodfellow, Bengio and Courville, freely available at https://www.deeplearningbook.org

Table of contents

- Recurrent neural networks (RNN)
- Final exam

Learning objectives

- Know how to do forecasting with recurrent neural networks (RNN)
- Know how RNNs can be used for financial time series data

Recurrent neural networks (RNN)

Recurrent neural network (RNN)

- There are at least two approaches to modelling time series data.
- One approach is to represent time effects explicitly via some simple functions, often linear functions, of the lagged values of the time series.
  - This is the mainstream time series data analysis approach in the statistics literature.
  - Well-known models: AR, ARMA, etc.

Recurrent neural network (RNN)

- The alternative approach is to represent time effects implicitly via latent variables (also called hidden states).
- The hidden states are designed to store the memory of the dynamics in the data.
- They are updated in a recurrent manner, using the information carried over by their values from the previous time steps and the information from the data at the current time step.
- Such models are called Recurrent Neural Networks (RNN), first developed in cognitive science and successfully used in computer science and other fields.

Recurrent neural network (RNN)

- Let the time series data be {D_t = (x_t, y_t), t = 1, 2, ...}, where x_t is the vector of inputs and y_t the output.
- E.g., y_t: sales at time t; x_{t,1} = y_{t−1}, x_{t,2}: ads hours at time t−1, x_{t,3}: consumption expenditure index at time t−1, etc.
- For ease of understanding, it might be useful to think of x_t as a scalar; however, RNNs are often efficiently used to model multivariate time series.
- If the time series of interest has the form {y_t, t = 1, 2, ...}, it can be written as {(x_t, y_t), t = 2, ...} with x_t = y_{t−1}, or {(x_t, y_t), t = p+1, ...} with x_t = (y_{t−1}, y_{t−2}, ..., y_{t−p}).
- The goal is to estimate the prediction E(y_t | x_t, D_{1:t−1}).

Recurrent neural network (RNN)

- First, let's use a feedforward neural network (FNN) to transform the raw input data x_t into a set of hidden units h_t (for the purpose of predicting y_t).
- But we need to take into account the dynamics/serial correlation of the time series data.
- The main idea behind RNNs is to let the set of hidden units h_t feed itself using its value h_{t−1} from the previous time step t−1.
- Hence, an RNN can be best thought of as an FNN that allows a connection of the hidden units to their value from the previous time step, which enables the network to possess memory.

Recurrent neural network (RNN)

- Mathematically, this basic RNN model is written as

  h_t = h(v x_t + w h_{t−1} + b)
  η_t = γ_0 + γ_1 h_t
  y_t = η_t + ε_t

  where the ε_t are white noise with mean 0 and variance σ².
- v, w, b, γ_0 and γ_1 are model parameters (which need to be estimated).
- h(·) is a non-linear activation function (e.g. the tanh or sigmoid functions).
- Usually we can set h_0 = 0, i.e. the neural network initially doesn't have any memory.

Recurrent neural network (RNN)

Graphical representation of the basic RNN model: the black square indicates the delay of one time step.

Recurrent neural network (RNN)

Special case: if the time series of interest has the form {y_t, t = 1, 2, ...} and we take x_t = y_{t−1} as the input:
- RNN model:

  h_t = h(v y_{t−1} + w h_{t−1} + b)
  η_t = γ_0 + γ_1 h_t
  y_t = η_t + ε_t

- Forecast and variance:

  ŷ_{t|1:t−1} = E(y_t | y_{1:t−1}) = η_t
  V(y_t | y_{1:t−1}) = σ²
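
A minimal NumPy sketch (not from the lecture code) of this recursion for the special case x_t = y_{t−1}, with made-up parameter values (in practice v, w, b, γ_0 and γ_1 are estimated):

import numpy as np

def rnn_forecasts(y, v, w, b, gamma0, gamma1):
    """One-step-ahead forecasts eta_t for t = 2, ..., len(y)."""
    h_prev, forecasts = 0.0, []                         # h_0 = 0: no memory initially
    for t in range(1, len(y)):
        h_t = np.tanh(v * y[t - 1] + w * h_prev + b)    # hidden-state update
        eta_t = gamma0 + gamma1 * h_t                   # forecast of y_t
        forecasts.append(eta_t)
        h_prev = h_t                                    # carry the memory forward
    return np.array(forecasts)

y = np.array([0.5, 0.7, 0.6, 0.9, 1.1])
print(rnn_forecasts(y, v=0.8, w=0.3, b=0.1, gamma0=0.0, gamma1=1.0))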

Recurrent neural network (RNN): training

- Let {D_t = (x_t, y_t), t = 1, 2, ..., T} be the training dataset.
- Sum of squared errors:

  SSE = Σ_{t=1}^T (y_t − ŷ_{t|1:t−1})²

- The model parameters θ = (v, w, b, γ_0, γ_1) are estimated by minimizing the SSE.
- Training an RNN is similar to training a feedforward neural network.

Advanced variants of RNN

- The RNN considered so far is a basic RNN: simple, and it might not be flexible enough. Training a basic RNN is often challenging: its gradient either vanishes or explodes.
- Many variants of RNN have been proposed to overcome these issues: the Long Short-Term Memory (LSTM) network is one of the most widely used RNNs, applied to large-scale industry-level applications: language processing, video data processing, etc.
- We won't discuss LSTM in detail here, but we show how to use it in the tutorial.

An example

- Dataset: International Airport Arriving Passengers, in csv format.
- Prepare the data: Xtrain, Ytrain, Xtest and Ytest as we have done for traditional neural networks.
- Define the RNN architecture:
  - We use one layer of LSTM.
  - Choose the size of the time window p = 4: this should be determined by exploring the ACF or PACF.
  - Choose the dimension of the hidden states d = 10.
  - model = Sequential()
    model.add(LSTM(input_dim=4, output_dim=10, return_sequences=False))
    model.add(Dropout(0.2))
    model.add(Dense(output_dim=1))
    model.add(Activation("linear"))

An example: Training

- One statement trains the model:
  model.fit(Xtrain, Ytrain, batch_size=100, nb_epoch=10, validation_split=0.05)
- batch_size defines the batch size in the stochastic optimisation process. You may use the default value.
- nb_epoch defines the number of epochs of optimisation. Start with a smaller value for testing.
- validation_split=0.05 means that in the training process, about 5% of the data from Xtrain will be used for validation.
- Please read the program Lecture12_Example01.py
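
The arguments input_dim, output_dim and nb_epoch come from an older Keras release; on a current Keras installation the same architecture would be written roughly as below (a hedged sketch, the actual Lecture12_Example01.py may differ). Note that LSTM inputs must be 3-dimensional: (samples, time steps, features).

from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

p, d = 4, 10                                   # time window and hidden-state dimension
model = Sequential()
model.add(LSTM(d, input_shape=(p, 1)))         # one LSTM layer with 10 hidden units
model.add(Dropout(0.2))
model.add(Dense(1, activation='linear'))       # linear output for forecasting
model.compile(loss='mse', optimizer='adam')

# Xtrain must be reshaped to (samples, p, 1) before fitting, e.g.:
# Xtrain = Xtrain.reshape((Xtrain.shape[0], p, 1))
# model.fit(Xtrain, Ytrain, batch_size=100, epochs=10, validation_split=0.05)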

An example: Forecasting

[Figure: forecasts after 1, 20, 100 and 200 training epochs]

An example: Share Price

- Please read the program Lecture12_Example02.py, where we use two layers of LSTMs.

Final exam

- Everything covered in the lectures and tutorials, including Python, can be tested, except the technical slides marked with "*".
- This is an open book exam with some limitations (as in the midterm exam).
- The exam is proctored using ProctorU - please read carefully the Canvas announcements related to the final exam.
- An exam sample for practice will be available.

Final exam: some advice

- Start your revision as soon as possible (is it too late? :-)
- Study carefully the final exam practice questions.
- In the exam, read through the whole exam and carefully read the instructions.
- Organise your time effectively, i.e. allocate time for each question.
- Always answer ALL the questions. Even if you think you know nothing about a topic, you might get A FEW marks by providing a partially complete answer OR making some sensible comments. Remember, any unanswered question scores zero!
- Try to leave some time to review your answers at the end.

Consultation hours

- I still run consultations as usual until the exam day. Or send me an email to make an appointment at other times.
- Check with your tutor/lecturer as they might extend their consultation hours as well.

Study hard and play hard! All the best with your exams!

Hope you'll be like this after the final exams...

Don't forget to give the teaching team your feedback on the course! If you can, please do it now. You might win a Macbook!
