
Knowledge Discovery and Data Mining

Lecture 20 - More Neural Nets

Tom Kelsey

School of Computer Science


University of St Andrews
http://www.cs.st-andrews.ac.uk/~tom/
tom@cs.st-andrews.ac.uk

ID5059, March 2021


Neural Nets - Reprise

Powerful – highly complex, nonlinear models
Easy to use
Far from easy to design
In and out of fashion depending on AI hype-levels
Crude low-level model of biological neural systems
No operational difference between NN and nonlinear regression/classification
At this point for us, a NN is a perceptron



Neural Nets - Design
1 Select an initial network configuration - start simple-ish, e.g. one hidden layer with the number of hidden units set to half the sum of the number of input and output units
2 Iteratively conduct a number of experiments with each configuration, retaining the best network (in terms of generalisation error) found
a number of experiments are required with each configuration to avoid being fooled if training locates a local minimum (initial weights are randomly allocated)
3 On each experiment, if under-learning occurs try adding more neurons to the hidden layer(s); if this doesn't help, try adding an extra hidden layer
4 If over-learning occurs, try removing hidden units (and possibly layers)
5 Once you have experimentally determined an effective configuration for your networks, resample and generate new networks with that configuration (a minimal version of this loop is sketched below).
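A minimal sketch of steps 1-2, assuming scikit-learn's MLPRegressor and a single held-out validation split; the dataset, split size and number of repeats are placeholders, not from the lecture:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

# Placeholder data: six inputs, one output
X, y = np.random.rand(500, 6), np.random.rand(500)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

n_in, n_out = X.shape[1], 1
hidden = (n_in + n_out) // 2           # heuristic: half the sum of inputs and outputs

best_err, best_net = np.inf, None
for seed in range(5):                  # several experiments per configuration
    net = MLPRegressor(hidden_layer_sizes=(hidden,), random_state=seed,
                       max_iter=2000)  # random initial weights differ per seed
    net.fit(X_tr, y_tr)
    err = np.mean((net.predict(X_val) - y_val) ** 2)   # generalisation error estimate
    if err < best_err:
        best_err, best_net = err, net  # retain the best network found
```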
Gathering Data

Neural networks process numeric data - care required in the nature of data provided.
Problems if data is in an unusual range . . .
. . . if there is missing data . . .
. . . or if data is non-numeric
Numeric data can be scaled
Missing values can be imputed e.g. replaced by means (or something else)
Non-numeric data must be converted (scaling and imputation are sketched below)
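A minimal sketch of mean imputation and scaling, assuming pandas; the column names and values are placeholders:

```python
import pandas as pd

# Placeholder data with an unusual range and a missing value
df = pd.DataFrame({"income": [12000.0, 55000.0, None, 230000.0],
                   "age": [21, 34, 52, 45]})

df["income"] = df["income"].fillna(df["income"].mean())   # impute by the mean
scaled = (df - df.mean()) / df.std()                      # centre and scale each column
```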



Gathering Data

These are not problems restricted to NN - but some methods are less affected e.g. trees: categorical inputs & missing data.
Gender = (Male, Female) -> (0,1), or Size = (S,M,L) -> (0,1,2)
Animal = (reptile, bird, mammal) – some sort of binary encoding required (e.g. dummy variables)
Postcodes = (A1 1AA, . . . , Z99 9ZZ) is a non-starter
Solution – assign expert ratings to subsets
Dates & times can be offset from a chosen base (dummy coding and date offsets are sketched below)
Unconstrained text fields (such as names) should be discarded
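A minimal sketch of dummy coding and date offsets, again assuming pandas; the categories and base date are placeholders:

```python
import pandas as pd

animals = pd.Series(["reptile", "bird", "mammal", "bird"])
dummies = pd.get_dummies(animals, prefix="animal")   # one 0/1 column per category

dates = pd.to_datetime(["2021-03-01", "2021-03-15"])
base = pd.Timestamp("2021-01-01")                    # chosen base date
offsets = (dates - base).days                        # days since the base
```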



How much data?

Standard heuristic – n ≈ 10 × C, where C is the number of connections (a count is sketched below)
There is no correct answer – depends on the nonlinear model, which is unknown
Also depends on variance in noise, also unknown a priori
If data is sparse – or expensive – NN is probably not the right choice.
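A small sketch of the heuristic for a single-hidden-layer network; whether bias terms count as connections is an assumption made here, not stated on the slide:

```python
def data_needed(n_inputs, n_hidden, n_outputs, count_biases=True):
    # Fully connected: input->hidden and hidden->output weights
    connections = n_inputs * n_hidden + n_hidden * n_outputs
    if count_biases:
        connections += n_hidden + n_outputs
    return 10 * connections            # n ~ 10 x C

print(data_needed(6, 3, 1))            # e.g. 6 inputs, 3 hidden units, 1 output -> 250
```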



Outliers

NNs tolerate noise well – up to a limit
If possible, detect & remove before training
Again, no right answers, only heuristics & folklore



Other data issues

The future is not the past – changes in circumstance invalidate historic data
Data has to cover all eventualities – a NN trained on low incomes will predict nothing about high incomes
Easiest features are learned first – even if they aren't the features of interest
Unbalanced data
True disease rate is 5%
Data collected and NN trained from general population
NN used on visitors to a clinic for whom the disease rate is 60%
The NN will react over-cautiously and fail to recognize disease in some unhealthy patients



Fitting NNs

Simple in principle:
Given weights - the NN gives a prediction ŷ
ŷ compared to y gives an error measure (RSS say)
Changing the weights can make this bigger or smaller
Want to change weights to make this smaller
Error is a function of weights - so numerically optimise to reduce it
It's a search over multiple dimensions (dictated by the number of parameters/weights).



Fitting Method 1 – Back propagation

Simple in principle:
Set some initial weights (can't estimate error without a parameterised model) - software deals with this - probably random uniform.
Calculate an initial error (based on observed versus current predicted).
For each weight determine whether increasing or decreasing the weight increases or decreases the error.
Move a bit in the correct direction. Recalculate the error with the new parameters. Repeat.
Stop at some point, i.e. when further weight alterations make no/little improvement.
This is a gradient search, iterating over multiple dimensions (dictated by the number of parameters/weights) – a numerical sketch follows.
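A minimal numerical sketch of this loop for a tiny one-hidden-unit network; the data, step size and network shape are placeholder assumptions, and finite differences stand in for the chain-rule calculation that real back-propagation uses:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.random((50, 2)), rng.random(50)            # placeholder data

def predict(w, X):
    hidden = 1.0 / (1.0 + np.exp(-(X @ w[:2] + w[2])))   # one sigmoid hidden unit
    return w[3] * hidden + w[4]                           # linear output

def rss(w):
    return np.sum((y - predict(w, X)) ** 2)               # error as a function of weights

w = rng.uniform(-1, 1, size=5)     # random initial weights
step, eps = 0.01, 1e-6
for epoch in range(200):           # one pass changing all the weights per iteration
    for i in range(len(w)):
        bumped = w.copy(); bumped[i] += eps
        slope = (rss(bumped) - rss(w)) / eps   # does increasing w[i] increase the error?
        w[i] -= step * np.sign(slope)          # move a bit in the correct direction
print(rss(w))                                  # error shrinks as the loop repeats
```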



Error Surface

(Two slides of error-surface figures; source: WikiMedia Commons)


Fitting Method 1 – Back propagation

Take Rθ as the resubstitution error given parameters θ1, θ2, ... etc. (e.g. RSS).
Create little local problems to solve at each non-input node.
A little bit of calculus gets you there: application of the chain rule allows determination of error changes at each non-input node.
Keep track of how Rθ (R for simplicity) changes for changes in each parameter, i.e. ∂R/∂θi for the i-th parameter.



Fitting Method 1 – Back propagation

Create little local problems to solve at each non-input node.
Iteration r + 1:

    β^(r+1) = β^r − γ_r ∂R/∂β

So, if R increases with increasing β^r, decrease it by step γ to create β^(r+1).
Keep doing this until R gets small (a one-parameter sketch of this update follows).
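A sketch of the update for a single parameter, with a model simple enough that ∂R/∂β can be written down by hand via the chain rule; the data and learning rate γ are placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random(100)
y = 3.0 * x + rng.normal(0, 0.1, 100)        # placeholder data with a known slope

beta, gamma = 0.0, 0.005                      # initial weight and learning rate
for r in range(200):
    residuals = y - beta * x
    dR_dbeta = -2.0 * np.sum(residuals * x)   # chain rule on R = sum((y - beta*x)^2)
    beta = beta - gamma * dR_dbeta            # beta^(r+1) = beta^r - gamma * dR/dbeta
print(beta)                                   # settles near the true slope of ~3
```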



Fitting Method 1 – Back propagation

Note:
the NN starts simple (boring set of parameters), gets more complicated as we iterate.
the step size (γ) controls how rapidly we fluctuate the parameters ('learning rate').
so complexity can be controlled by stopping the optimisation process.
one pass through all the data, changing weights, is called an epoch.



A visual pass through

The following gives a flavour of what is involved.
Source: Bernacki & Wlodarczyk, AGH
You should note:
How computationally intensive this can potentially be
How the objective is to alter the weights at each pass by some (carefully calculated) amount to reduce prediction error



Back propagation

(Figures: visual pass through of back-propagation, Bernacki & Wlodarczyk, AGH)


Fitting Methods - there are several

You should be aware there are many ways to optimise problems like this. We'll only mention 3:
Back-propagation (BP)
Quasi-Newton (QN)
Conjugate Gradient Descent (CGD)



Fitting Method 2 – Quasi-Newton

Back-propagation performed at all layers simultaneously.

    x_(t+1) = x_t − h H_f(x_t)^(−1) ∇f(x_t)

Quasi-Newton works by exploiting the observation that, on a quadratic error surface, one can step directly to the minimum using the Newton step. Any error surface is approximately quadratic "close to" a minimum.
Since the Hessian is expensive to calculate, and since the Newton step is likely to be wrong on a non-quadratic surface, Quasi-Newton iteratively builds up an approximation to the inverse Hessian.
The approximation at first follows the line of steepest descent, and later follows the estimated Hessian more closely.



Fitting Method 3 – Conjugate Gradient Descent

Works by constructing a series of line searches across the error surface.
It first works out the direction of steepest descent, then locates a minimum in this direction.
The conjugate directions are chosen to try to ensure that the directions that have already been minimized stay minimized.
The conjugate directions are calculated on the assumption that the error surface is quadratic. If the algorithm discovers that the current line search direction isn't actually downhill, it simply calculates the line of steepest descent and restarts the search in that direction. Once a point close to a minimum is found, the quadratic assumption holds true and the minimum can be located very quickly.
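Neither method needs to be hand-rolled; a minimal sketch using SciPy's generic optimiser, which provides a quasi-Newton method (BFGS) and conjugate gradients (CG). The error function here is a placeholder bowl, not a real network:

```python
import numpy as np
from scipy.optimize import minimize

def error(w):
    # Placeholder "error surface": a bent quadratic bowl in two weights
    return (w[0] - 1.0) ** 2 + 10.0 * (w[1] + 0.5) ** 2 + 0.1 * w[0] * w[1]

w0 = np.array([5.0, 5.0])                       # initial weights
qn = minimize(error, w0, method="BFGS")         # quasi-Newton: builds up an inverse Hessian
cgd = minimize(error, w0, method="CG")          # conjugate gradient line searches
print(qn.x, cgd.x)                              # both land near the same minimum
```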



Fitting Methods Summary

All methods can get stuck at a local minimum
All methods can converge very slowly due to the gradients involved being extremely small
There are complex tradeoffs in computational complexity, convergence rates, numerical stability, heuristic choices, . . .
The specific activation function(s) are likely to affect performance



Activation functions

Each internal layer will apply a function to the weighted sums of its inputs
There are several types in common use, each with strengths and weaknesses
Which (if any) is preferred for a specific problem depends on the data, the fitting method, the number & size of the layers, etc.



Step function

Yes or no, depending on a threshold
Seems ideal for a binary classifier...
...but how do we collate outputs from several nodes?
Gradient descent methods are not applicable as the function has zero gradient (where defined)



Linear function A = cx

Not shown in the previous figure as it is not a good activation function
The gradient is always c, so improvement based on gradient won't work
A linear function applied to a linear combination is also a linear combination...
...So hidden layers can be replaced by a single linear formula



Sigmoid

    1 / (1 + e^(−x))

Sigmoid means "shaped like an S" so there are many sigmoid functions
This one is sometimes referred to as the sigmoid function
Nonlinear, and combinations are also nonlinear
Activations won't blow up in size, since the range is (0, 1)
Smooth gradient for all values, but often a small gradient, so convergence may be slow
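A small sketch illustrating the small-gradient point: the derivative of this sigmoid is sigmoid(x) × (1 − sigmoid(x)), which never exceeds 0.25 (the sample points below are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
grad = sigmoid(x) * (1.0 - sigmoid(x))   # derivative of the sigmoid
print(grad)                              # peaks at 0.25 at x = 0, tiny in the tails
```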



Hyperbolic tangent

    tanh(x) = 2 × sigmoid(2x) − 1

A scaled and shifted sigmoid function
Activations won't blow up in size, since the range is (−1, 1)
Steeper derivatives, so stronger gradient



Rectified Linear Unit – ReLu

    max(0, x)

Looks like a combination of step and linear, and hence the worst of both worlds
In fact nonlinear, with nonlinear combinations
Unbounded, so activation can blow up
Has the advantage that many neurons don't fire, making activations sparse and efficient
If we choose initial weights to be random values between -1 and 1, then almost 50% of the network yields zero activation because of the characteristic of ReLu, and the network is lighter



The dying ReLu problem

The horizontal line in ReLu sends gradients towards zero
For activations in that region of ReLu, the gradient will be 0, so the weights will not get adjusted during descent
So those neurons which go into that state will stop responding to variations in error/input (simply because the gradient is 0, nothing changes)
This problem can cause several neurons to just die (i.e. not respond), making a substantial part of the network passive
Several solutions – a common one is leaky ReLu
A = 0.01x for x < 0 gives a slightly inclined line rather than a horizontal line
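A minimal sketch comparing ReLu with leaky ReLu; the 0.01 slope for x < 0 matches the slide, everything else is a placeholder:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    return np.where(x < 0, slope * x, x)   # slightly inclined line for x < 0

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))        # [0. 0. 0. 2.] - negative inputs give zero output and zero gradient
print(leaky_relu(x))  # [-0.03 -0.005 0. 2.] - a small signal survives, so weights can still adjust
```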



Others

Hard Tanh – max(−1, min(1, x))
LeCun's Tanh – 1.7159 × tanh((2/3) x)
Complementary log-log – 1 − exp(−exp(x))
Gaussian Error Linear Unit (GELU)
Exponential linear unit (ELU)
Scaled exponential linear unit (SELU)
S-shaped rectified linear activation unit (SReLU)
etc.



Overfitting

A NN can be a very rich class of functions, even with just a single hidden layer with a few hidden units
So we are likely to have a model with sufficient inherent complexity to model complex systems
This presents a problem too - the model can easily overfit, i.e. learn the training dataset very well, giving a model with poor generality
This is the standard problem that we have encountered throughout our consideration of automated model selection. Two approaches are considered here.



Validation

Maintain an independent dataset which is not used to develop the model, but is used to measure model performance/generality
Seek a model that predicts data we have not yet seen - the use of validation or cross-validation data simulates this scenario
Simplest method is to use a single validation dataset, and 'stop' fitting when the performance of the model against the validation dataset begins to deteriorate
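One possible sketch of this single-validation-set approach, assuming scikit-learn's MLPRegressor, which can hold out a validation fraction and stop once the validation score stops improving; the data and settings are placeholders:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

X, y = np.random.rand(500, 4), np.random.rand(500)    # placeholder data

net = MLPRegressor(hidden_layer_sizes=(10,),
                   early_stopping=True,       # hold out part of the training data...
                   validation_fraction=0.2,   # ...as a single validation dataset
                   n_iter_no_change=10,       # stop once validation score stops improving
                   max_iter=2000, random_state=0)
net.fit(X, y)
```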



Weight decay

Similar to the approach in tree-methods, we can balance our raw model fit against a measure of model complexity
Using Rθ as our measure of resubstitution error with a given set of parameters θ:

    Rθ + λ J(θ)

R and J are effectively in competition, and as we are using a gradient search, you can think of λJ as preventing us from reaching our global minimum for R
We must estimate λ, and the usual approach would be via validation or cross-validation performance
This reveals that we have in effect just considered a more explicit phrasing of the validation approach above
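A minimal sketch of the penalised objective, assuming J(θ) is the sum of squared weights (a common weight-decay choice, not stated on the slide) and a placeholder linear model for Rθ:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X, y = rng.random((100, 3)), rng.random(100)   # placeholder data
lam = 0.1                                      # lambda: usually chosen by (cross-)validation

def penalised_error(theta):
    R = np.sum((y - X @ theta) ** 2)           # resubstitution error R_theta (RSS)
    J = np.sum(theta ** 2)                     # complexity penalty J(theta): squared weights
    return R + lam * J                         # R_theta + lambda * J(theta)

fit = minimize(penalised_error, np.zeros(3), method="BFGS")
print(fit.x)                                   # weights shrunk towards zero by the penalty
```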



NN problems overview

Lack of interpretability: these models are effectively black-box
Over-fitting: NNs are clearly prone to overfitting if proper controls are not put in place
Specification decisions: there is a bewildering array of activation functions, combination functions, output functions, training methods, parameters (e.g. number of hidden units and layers), standardisations etc.
Local minima: as for standard non-linear regression, we may require multiple fits to ensure we have not been trapped in a sub-optimal solution by local minima in the error function
Long run-times – these models can take a very long time to fit



Summary

Consider Neural Nets when all or most of the following apply:
Lots of cheap, numeric data in reasonable ranges
Not too many outliers
Don't know or don't care about the model relating input to response
Happy to wait a long time for a predictor
Don't care about interpretability

