
Knowledge Discovery and Data Mining

Lecture 20 - More Neural Nets

Tom Kelsey

School of Computer Science


University of St Andrews
http://www.cs.st-andrews.ac.uk/~tom/
tom@cs.st-andrews.ac.uk

ID5059, March 2021


Neural Nets - Reprise

Powerful – highly complex, nonlinear models
Easy to use
Far from easy to design
In and out of fashion depending on AI hype-levels
Crude low-level model of biological neural systems
No operational difference between NN and nonlinear regression/classification
At this point for us, a NN is a perceptron



Neural Nets - Design
1 Select an initial network configuration - start simple-ish, e.g. one hidden layer with the number of hidden units set to half the sum of the number of input and output units
2 Iteratively conduct a number of experiments with each configuration, retaining the best network (in terms of generalisation error) found
a number of experiments are required with each configuration to avoid being fooled if training locates a local minimum (initial weights are randomly allocated)
3 On each experiment, if under-learning occurs try adding more neurons to the hidden layer(s); if this doesn't help, try adding an extra hidden layer
4 If over-learning occurs, try removing hidden units (and possibly layers)
5 Once you have experimentally determined an effective configuration for your networks, resample and generate new networks with that configuration (a minimal version of this loop is sketched below).
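A minimal sketch of steps 1-2, assuming scikit-learn's MLPRegressor and a single held-out validation split; the dataset, split size and number of repeats are placeholders, not from the lecture:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

# Placeholder data: six inputs, one output
X, y = np.random.rand(500, 6), np.random.rand(500)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

n_in, n_out = X.shape[1], 1
hidden = (n_in + n_out) // 2           # heuristic: half the sum of inputs and outputs

best_err, best_net = np.inf, None
for seed in range(5):                  # several experiments per configuration
    net = MLPRegressor(hidden_layer_sizes=(hidden,), random_state=seed,
                       max_iter=2000)  # random initial weights differ per seed
    net.fit(X_tr, y_tr)
    err = np.mean((net.predict(X_val) - y_val) ** 2)   # generalisation error estimate
    if err < best_err:
        best_err, best_net = err, net  # retain the best network found
```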
Gathering Data

Neural networks process numeric data - care required in the nature of data provided.
Problems if data is in an unusual range . . .
. . . if there is missing data . . .
. . . or if data is non-numeric
Numeric data can be scaled
Missing values can be imputed e.g. replaced by means (or something else)
Non-numeric data must be converted (scaling and imputation are sketched below)
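A minimal sketch of mean imputation and scaling, assuming pandas; the column names and values are placeholders:

```python
import pandas as pd

# Placeholder data with an unusual range and a missing value
df = pd.DataFrame({"income": [12000.0, 55000.0, None, 230000.0],
                   "age": [21, 34, 52, 45]})

df["income"] = df["income"].fillna(df["income"].mean())   # impute by the mean
scaled = (df - df.mean()) / df.std()                      # centre and scale each column
```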



Gathering Data

These are not problems restricted to NN - but some methods are less affected e.g. trees: categorical inputs & missing data.
Gender = (Male, Female) -> (0,1), or Size = (S,M,L) -> (0,1,2)
Animal = (reptile, bird, mammal) – some sort of binary encoding required (e.g. dummy variables)
Postcodes = (A1 1AA, . . . , Z99 9ZZ) is a non-starter
Solution – assign expert ratings to subsets
Dates & times can be offset from a chosen base (dummy coding and date offsets are sketched below)
Unconstrained text fields (such as names) should be discarded
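A minimal sketch of dummy coding and date offsets, again assuming pandas; the categories and base date are placeholders:

```python
import pandas as pd

animals = pd.Series(["reptile", "bird", "mammal", "bird"])
dummies = pd.get_dummies(animals, prefix="animal")   # one 0/1 column per category

dates = pd.to_datetime(["2021-03-01", "2021-03-15"])
base = pd.Timestamp("2021-01-01")                    # chosen base date
offsets = (dates - base).days                        # days since the base
```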



How much data?

Standard heuristic – n ≈ 10 × C, where C is the number of connections (a count is sketched below)
There is no correct answer – depends on the nonlinear model, which is unknown
Also depends on variance in noise, also unknown a priori
If data is sparse – or expensive – NN is probably not the right choice.
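A small sketch of the heuristic for a single-hidden-layer network; whether bias terms count as connections is an assumption made here, not stated on the slide:

```python
def data_needed(n_inputs, n_hidden, n_outputs, count_biases=True):
    # Fully connected: input->hidden and hidden->output weights
    connections = n_inputs * n_hidden + n_hidden * n_outputs
    if count_biases:
        connections += n_hidden + n_outputs
    return 10 * connections            # n ~ 10 x C

print(data_needed(6, 3, 1))            # e.g. 6 inputs, 3 hidden units, 1 output -> 250
```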



Outliers

NNs tolerate noise well – up to a limit
If possible, detect & remove before training
Again, no right answers, only heuristics & folklore



Other data issues

The future is not the past – changes in circumstance invalidate historic data
Data has to cover all eventualities – a NN trained on low incomes will predict nothing about high incomes
Easiest features are learned first – even if they aren't the features of interest
Unbalanced data
True disease rate is 5%
Data collected and NN trained from general population
NN used on visitors to a clinic for whom the disease rate is 60%
The NN will react over-cautiously and fail to recognize disease in some unhealthy patients



Fitting NNs

Simple in principle:
Given weights - the NN gives a prediction ŷ
ŷ compared to y gives an error measure (RSS say)
Changing the weights can make this bigger or smaller
Want to change weights to make this smaller
Error is a function of weights - so numerically optimise to reduce it
It's a search over multiple dimensions (dictated by the number of parameters/weights).



Fitting Method 1 – Back propagation

Simple in principle:
Set some initial weights (can't estimate error without a parameterised model) - software deals with this - probably random uniform.
Calculate an initial error (based on observed versus current predicted).
For each weight determine whether increasing or decreasing the weight increases or decreases the error.
Move a bit in the correct direction. Recalculate the error with the new parameters. Repeat.
Stop at some point, i.e. when further weight alterations make no/little improvement.
This is a gradient search, iterating over multiple dimensions (dictated by the number of parameters/weights) – a numerical sketch follows.
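A minimal numerical sketch of this loop for a tiny one-hidden-unit network; the data, step size and network shape are placeholder assumptions, and finite differences stand in for the chain-rule calculation that real back-propagation uses:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.random((50, 2)), rng.random(50)            # placeholder data

def predict(w, X):
    hidden = 1.0 / (1.0 + np.exp(-(X @ w[:2] + w[2])))   # one sigmoid hidden unit
    return w[3] * hidden + w[4]                           # linear output

def rss(w):
    return np.sum((y - predict(w, X)) ** 2)               # error as a function of weights

w = rng.uniform(-1, 1, size=5)     # random initial weights
step, eps = 0.01, 1e-6
for epoch in range(200):           # one pass changing all the weights per iteration
    for i in range(len(w)):
        bumped = w.copy(); bumped[i] += eps
        slope = (rss(bumped) - rss(w)) / eps   # does increasing w[i] increase the error?
        w[i] -= step * np.sign(slope)          # move a bit in the correct direction
print(rss(w))                                  # error shrinks as the loop repeats
```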



Error Surface

(Two slides of error-surface figures; source: WikiMedia Commons)


Fitting Method 1 – Back propagation

Take Rθ as the resubstitution error given parameters θ1, θ2, ... etc. (e.g. RSS).
Create little local problems to solve at each non-input node.
A little bit of calculus gets you there: application of the chain rule allows determination of error changes at each non-input node.
Keep track of how Rθ (R for simplicity) changes for changes in each parameter, i.e. ∂R/∂θi for the i-th parameter.



Fitting Method 1 – Back propagation

Create little local problems to solve at each non-input node.
Iteration r + 1:

    β^(r+1) = β^r − γ_r ∂R/∂β

So, if R increases with increasing β^r, decrease it by step γ to create β^(r+1).
Keep doing this until R gets small (a one-parameter sketch of this update follows).
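A sketch of the update for a single parameter, with a model simple enough that ∂R/∂β can be written down by hand via the chain rule; the data and learning rate γ are placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random(100)
y = 3.0 * x + rng.normal(0, 0.1, 100)        # placeholder data with a known slope

beta, gamma = 0.0, 0.005                      # initial weight and learning rate
for r in range(200):
    residuals = y - beta * x
    dR_dbeta = -2.0 * np.sum(residuals * x)   # chain rule on R = sum((y - beta*x)^2)
    beta = beta - gamma * dR_dbeta            # beta^(r+1) = beta^r - gamma * dR/dbeta
print(beta)                                   # settles near the true slope of ~3
```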



Fitting Method 1 – Back propagation

Note:
the NN starts simple (boring set of parameters), gets more complicated as we iterate.
the step size (γ) controls how rapidly we fluctuate the parameters ('learning rate').
so complexity can be controlled by stopping the optimisation process.
one pass through all the data, changing weights, is called an epoch.



A visual pass through

The following gives a flavour of what is involved.
Source: Bernacki & Wlodarczyk, AGH
You should note:
How computationally intensive this can potentially be
How the objective is to alter the weights at each pass by some (carefully calculated) amount to reduce prediction error



Back propagation

(Figures: visual pass through of back-propagation, Bernacki & Wlodarczyk, AGH)


Fitting Methods - there are several

You should be aware there are many ways to optimise problems like this. We'll only mention 3:
Back-propagation (BP)
Quasi-Newton (QN)
Conjugate Gradient Descent (CGD)



Fitting Method 2 – Quasi-Newton

Back-propagation performed at all layers simultaneously.

    x_(t+1) = x_t − h H_f(x_t)^(−1) ∇f(x_t)

Quasi-Newton works by exploiting the observation that, on a quadratic error surface, one can step directly to the minimum using the Newton step. Any error surface is approximately quadratic "close to" a minimum.
Since the Hessian is expensive to calculate, and since the Newton step is likely to be wrong on a non-quadratic surface, Quasi-Newton iteratively builds up an approximation to the inverse Hessian.
The approximation at first follows the line of steepest descent, and later follows the estimated Hessian more closely.



Fitting Method 3 – Conjugate Gradient Descent

Works by constructing a series of line searches across the error surface.
It first works out the direction of steepest descent, then locates a minimum in this direction.
The conjugate directions are chosen to try to ensure that the directions that have already been minimized stay minimized.
The conjugate directions are calculated on the assumption that the error surface is quadratic. If the algorithm discovers that the current line search direction isn't actually downhill, it simply calculates the line of steepest descent and restarts the search in that direction. Once a point close to a minimum is found, the quadratic assumption holds true and the minimum can be located very quickly.
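Neither method needs to be hand-rolled; a minimal sketch using SciPy's generic optimiser, which provides a quasi-Newton method (BFGS) and conjugate gradients (CG). The error function here is a placeholder bowl, not a real network:

```python
import numpy as np
from scipy.optimize import minimize

def error(w):
    # Placeholder "error surface": a bent quadratic bowl in two weights
    return (w[0] - 1.0) ** 2 + 10.0 * (w[1] + 0.5) ** 2 + 0.1 * w[0] * w[1]

w0 = np.array([5.0, 5.0])                       # initial weights
qn = minimize(error, w0, method="BFGS")         # quasi-Newton: builds up an inverse Hessian
cgd = minimize(error, w0, method="CG")          # conjugate gradient line searches
print(qn.x, cgd.x)                              # both land near the same minimum
```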



Fitting Methods Summary

All methods can get stuck at a local minimum
All methods can converge very slowly due to the gradients involved being extremely small
There are complex tradeoffs in computational complexity, convergence rates, numerical stability, heuristic choices, . . .
The specific activation function(s) are likely to affect performance



Activation functions

Each internal layer will apply a function to the weighted sums of its inputs
There are several types in common use, each with strengths and weaknesses
Which (if any) is preferred for a specific problem depends on the data, the fitting method, the number & size of the layers, etc.



Step function

Yes or no, depending on a threshold
Seems ideal for a binary classifier...
...but how do we collate outputs from several nodes?
Gradient descent methods are not applicable as the function has zero gradient (where defined)



Linear function A = cx

Not shown in the previous figure as it is not a good activation function
The gradient is always c, so improvement based on gradient won't work
A linear function applied to a linear combination is also a linear combination...
...So hidden layers can be replaced by a single linear formula



Sigmoid

    1 / (1 + e^(−x))

Sigmoid means "shaped like an S" so there are many sigmoid functions
This one is sometimes referred to as the sigmoid function
Nonlinear, and combinations are also nonlinear
Activations won't blow up in size, since the range is (0, 1)
Smooth gradient for all values, but often a small gradient, so convergence may be slow
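A small sketch illustrating the small-gradient point: the derivative of this sigmoid is sigmoid(x) × (1 − sigmoid(x)), which never exceeds 0.25 (the sample points below are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
grad = sigmoid(x) * (1.0 - sigmoid(x))   # derivative of the sigmoid
print(grad)                              # peaks at 0.25 at x = 0, tiny in the tails
```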



Hyperbolic tangent

    tanh(x) = 2 × sigmoid(2x) − 1

A scaled and shifted sigmoid function
Activations won't blow up in size, since the range is (−1, 1)
Steeper derivatives, so stronger gradient



Rectified Linear Unit – ReLu

    max(0, x)

Looks like a combination of step and linear, and hence the worst of both worlds
In fact nonlinear, with nonlinear combinations
Unbounded, so activation can blow up
Has the advantage that many neurons don't fire, making activations sparse and efficient
If we choose initial weights to be random values between -1 and 1, then almost 50% of the network yields zero activation because of the characteristic of ReLu, and the network is lighter



The dying ReLu problem

The horizontal line in ReLu sends gradients towards zero
For activations in that region of ReLu, the gradient will be 0, so the weights will not get adjusted during descent
So those neurons which go into that state will stop responding to variations in error/input (simply because the gradient is 0, nothing changes)
This problem can cause several neurons to just die (i.e. not respond), making a substantial part of the network passive
Several solutions – a common one is leaky ReLu
A = 0.01x for x < 0 gives a slightly inclined line rather than a horizontal line
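A minimal sketch comparing ReLu with leaky ReLu; the 0.01 slope for x < 0 matches the slide, everything else is a placeholder:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    return np.where(x < 0, slope * x, x)   # slightly inclined line for x < 0

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))        # [0. 0. 0. 2.] - negative inputs give zero output and zero gradient
print(leaky_relu(x))  # [-0.03 -0.005 0. 2.] - a small signal survives, so weights can still adjust
```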



Others

Hard Tanh – max(−1, min(1, x))
LeCun's Tanh – 1.7159 × tanh((2/3) x)
Complementary log-log – 1 − exp(−exp(x))
Gaussian Error Linear Unit (GELU)
Exponential linear unit (ELU)
Scaled exponential linear unit (SELU)
S-shaped rectified linear activation unit (SReLU)
etc.



Overfitting

A NN can be a very rich class of functions, even with just a single hidden layer with a few hidden units
So we are likely to have a model with sufficient inherent complexity to model complex systems
This presents a problem too - the model can easily overfit, i.e. learn the training dataset very well, giving a model with poor generality
This is the standard problem that we have encountered throughout our consideration of automated model selection. Two approaches are considered here.



Validation

Maintain an independent dataset which is not used to develop the model, but is used to measure model performance/generality
Seek a model that predicts data we have not yet seen - the use of validation or cross-validation data simulates this scenario
Simplest method is to use a single validation dataset, and 'stop' fitting when the performance of the model against the validation dataset begins to deteriorate
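One possible sketch of this single-validation-set approach, assuming scikit-learn's MLPRegressor, which can hold out a validation fraction and stop once the validation score stops improving; the data and settings are placeholders:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

X, y = np.random.rand(500, 4), np.random.rand(500)    # placeholder data

net = MLPRegressor(hidden_layer_sizes=(10,),
                   early_stopping=True,       # hold out part of the training data...
                   validation_fraction=0.2,   # ...as a single validation dataset
                   n_iter_no_change=10,       # stop once validation score stops improving
                   max_iter=2000, random_state=0)
net.fit(X, y)
```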



Weight decay

Similar to the approach in tree-methods, we can balance our raw model fit against a measure of model complexity
Using Rθ as our measure of resubstitution error with a given set of parameters θ:

    Rθ + λ J(θ)

R and J are effectively in competition, and as we are using a gradient search, you can think of λJ as preventing us from reaching our global minimum for R
We must estimate λ, and the usual approach would be via validation or cross-validation performance
This reveals that we have in effect just considered a more explicit phrasing of the validation approach above
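A minimal sketch of the penalised objective, assuming J(θ) is the sum of squared weights (a common weight-decay choice, not stated on the slide) and a placeholder linear model for Rθ:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X, y = rng.random((100, 3)), rng.random(100)   # placeholder data
lam = 0.1                                      # lambda: usually chosen by (cross-)validation

def penalised_error(theta):
    R = np.sum((y - X @ theta) ** 2)           # resubstitution error R_theta (RSS)
    J = np.sum(theta ** 2)                     # complexity penalty J(theta): squared weights
    return R + lam * J                         # R_theta + lambda * J(theta)

fit = minimize(penalised_error, np.zeros(3), method="BFGS")
print(fit.x)                                   # weights shrunk towards zero by the penalty
```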



NN problems overview

Lack of interpretability: these models are effectively black-box
Over-fitting: NNs are clearly prone to overfitting if proper controls are not put in place
Specification decisions: there is a bewildering array of activation functions, combination functions, output functions, training methods, parameters (e.g. number of hidden units and layers), standardisations etc.
Local minima: as for standard non-linear regression, we may require multiple fits to ensure we have not been trapped in a sub-optimal solution by local minima in the error function
Long run-times – these models can take a very long time to fit



Summary

Consider Neural Nets when all or most of the following apply:
Lots of cheap, numeric data in reasonable ranges
Not too many outliers
Don't know or don't care about the model relating input to response
Happy to wait a long time for a predictor
Don't care about interpretability

