
Knowledge Discovery and Data Mining

Lecture 19 - Neural Nets

Tom Kelsey

School of Computer Science


University of St Andrews
http://www.cs.st-andrews.ac.uk/~tom/
tom@cs.st-andrews.ac.uk



Neural Nets

Ought-to-knows:
1 How a general NN can be displayed graphically
2 The NN terminology exemplified by such a diagram
3 How a relatively simple single-hidden-layer, two-input NN
can produce a complex non-linear prediction surface
4 The form of given activation functions (both the equation
and sketch)
5 How NN weights and biases are derived



Neural Nets

Some contentions/comments to start:


NNs seem complex (indeed, a fitted one can be) and a bit
‘magical’.
They are, however, built (as usual) from simple components.
The problem is similar to previous ones:
Create a model for signal that is capable of being complex.
Fit the model (estimate parameters) so it approximates some
data (specify an objective function).
Ensure the generality of predictions by controlling
complexity in the fitting process (e.g. optimise complexity
using some measure of generalisation error).
All very familiar - let’s begin.



A simple NN as a Mathematical Formula

 

ln( p̂ / (1 − p̂) ) = β̂0 + β̂1 z1 + β̂2 z2 + β̂3 z3

where

z1 = tanh(α̂4 + α̂5 x1 + α̂6 x2)
z2 = tanh(α̂7 + α̂8 x1 + α̂9 x2)
z3 = tanh(α̂10 + α̂11 x1 + α̂12 x2)
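
As a concrete R sketch of this formula, here is the forward pass for a single observation. The α̂ and β̂ values below are made up for illustration, not fitted values:

# Forward pass of the formula above for one observation (weights are made up, not fitted)
x1 <- 0.5;  x2 <- -1.0
a  <- c(0.1, 0.8, -0.5,  -0.3, 0.4, 0.9,  0.2, -0.7, 0.6)   # α̂4 ... α̂12
b  <- c(0.5, 1.2, -0.8, 2.0)                                # β̂0 ... β̂3
z1 <- tanh(a[1] + a[2] * x1 + a[3] * x2)
z2 <- tanh(a[4] + a[5] * x1 + a[6] * x2)
z3 <- tanh(a[7] + a[8] * x1 + a[9] * x2)
logit <- b[1] + b[2] * z1 + b[3] * z2 + b[4] * z3           # the left-hand side, ln(p̂/(1−p̂))
p_hat <- 1 / (1 + exp(-logit))                              # invert to get the probability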



What did all that mean?

The output is an optimal probability

p̂ = 1 / (1 + e^(−θ))

θ is a linear weighted sum of zi terms, with optimal weights β̂ i


There is an additional optimal weight β̂ 0 that is an intercept or
bias term
The zi are formed by
1 weighting inputs xi with optimal α̂k
2 adding another α̂ bias term
3 taking the hyperbolic tangent of the sum
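
A small worked example with made-up numbers: if the hidden nodes output z = (0.2, −0.3, 0.1) and β̂ = (0.5, 1, −1, 2), then θ = 0.5 + 1(0.2) − 1(−0.3) + 2(0.1) = 1.2, and p̂ = 1/(1 + e^(−1.2)) ≈ 0.77.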



Combining building blocks

Let’s look at this graphically in R (script: L19-NNsurfaces.r).


Begin with 2 inputs (so we can plot them easily).
Specify 3 nodes, and use tanh activation functions and a linear
combination function.
Combine these for an output surface under different
example weights.
How complex can we get? (A sketch of such a surface plot
follows below.)
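
A minimal R sketch of this kind of surface exploration. The weights are arbitrary examples, and this is not the actual L19-NNsurfaces.r script:

# Prediction surface for a 2-input, 3-hidden-node net with tanh activations
surface <- function(x1, x2, alpha, beta) {
  z1 <- tanh(alpha[1] + alpha[2] * x1 + alpha[3] * x2)
  z2 <- tanh(alpha[4] + alpha[5] * x1 + alpha[6] * x2)
  z3 <- tanh(alpha[7] + alpha[8] * x1 + alpha[9] * x2)
  1 / (1 + exp(-(beta[1] + beta[2] * z1 + beta[3] * z2 + beta[4] * z3)))
}

x1 <- seq(-3, 3, length.out = 50)
x2 <- seq(-3, 3, length.out = 50)
alpha <- c(0, 2, -1,   1, -2, 2,   -1, 1, 3)             # try different values here
beta  <- c(0, 4, -3, 2)
p <- outer(x1, x2, surface, alpha = alpha, beta = beta)
persp(x1, x2, p, theta = 30, phi = 25, zlab = "p-hat")   # a complex non-linear surface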

Conversion to a diagrammatic form

For ease of understanding by non-mathematicians


1 We have two input sources, x1 and x2
2 We have an input bias source
3 There is an internal layer of three z nodes, each taking in
weighted inputs and outputting tanh of the summed inputs
4 There is an internal bias source for β̂ 0
5 There is an output layer with one node, producing the
logistic function of the weighted sum of the internal layer
outputs
6 The numeric output is a probability between 0 and 1



Examples

[A sequence of slides showing example neural network diagrams and architectures. Source: Google Images]


NN components

Weights and biases: from a statistical perspective the
weights are simply parameters of a potentially non-linear
function, and the biases are the intercept terms for the linear
components.
‘Combination Functions’: in our example equations above
these are the linear combinations (expressible in matrix form);
they combine the input variables or the hidden-node outputs.
A small matrix example follows below.
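
A minimal R illustration of the combination step written in matrix form; the values of A are made up, and this is not the lecture's own notation:

# Linear combination function in matrix form: one row of A per hidden node
X <- cbind(1, x1 = c(0.5, -1.2), x2 = c(1.0, 0.3))   # two observations plus a bias column
A <- matrix(c( 0.1,  0.8, -0.5,
              -0.3,  0.4,  0.9,
               0.2, -0.7,  0.6), nrow = 3, byrow = TRUE)
combined <- X %*% t(A)    # the combination function: weighted sums, with biases as intercepts
Z <- tanh(combined)       # the activation function, applied element-wise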



NN components
Activation functions: these are the functions wrapping the
combination functions; several variants are commonly used
(their shapes are sketched in a short code example after this list):
Identity Function - does not alter the value of the argument;
the resulting values may be anywhere in R.
Sigmoid Functions - S-shaped functions with the logistic or
hyperbolic tangent functions being common. The resulting
values will be bounded - (0, 1) or (−1, 1) respectively. The
logistic is given by:

φ(θ) = 1 / (1 + e^(−θ))
for some argument value θ.
tanh - hyperbolic tangent gives real values within (−1, 1)
Others: Gaussian functions (bell-shaped); functions bounded
below by zero but unbounded above, e.g. Exponential and
Reciprocal Functions.
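
A minimal R sketch of the main shapes (illustrative only):

# Plot the identity, logistic and tanh activation functions on one set of axes
curve(x + 0,             -4, 4, lty = 1, ylab = "activation")  # identity: unbounded, range R
curve(1 / (1 + exp(-x)), -4, 4, lty = 2, add = TRUE)           # logistic: bounded in (0, 1)
curve(tanh(x),           -4, 4, lty = 3, add = TRUE)           # tanh: bounded in (-1, 1)
legend("topleft", legend = c("identity", "logistic", "tanh"), lty = 1:3)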



NN components

Network Layers: the hidden layers are contrivances under the
control of the analyst, so the number of layers and the number
of units within them can be large. The layering is partly for
convenience: all the nodes/units in a layer share similar
characteristics, such as their activation and combination
functions. All the nodes in a layer are (generally, to start with)
connected to all the nodes in the next layer.



Main components

Layers: input, hidden, output.


Connections and weights.
Combination functions: linear.
Activation functions: Identity, tanh, exp, logistic.
Output functions: Back to response scale - Identity,
(multiple) Logistic.



Overview of our coverage

NNs are an ‘art’


Jargon is very inconsistent.
A huge number of decisions can be made in their
construction, and the results are sensitive to these choices.
We’ll look at the general ideas and very few specific
implementations.



Worked examples

Taken directly from jee3 at Rice University.


Excellent empirical framework
Define signal, add controlled amount of noise
Visualise for each covariate in turn
Use test data from the same population to investigate overfit
and underfit
Demonstrates the time needed to learn a NN
Uses ROC and AUC to compare models (a minimal sketch of this kind of workflow follows below)
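
A minimal sketch of that empirical framework, not the actual Rice worked example; it assumes the nnet and pROC packages are available:

library(nnet)     # single-hidden-layer neural nets (assumed package, not the original code)
library(pROC)     # ROC / AUC (assumed package)

set.seed(1)
n  <- 2000
x1 <- runif(n, -3, 3);  x2 <- runif(n, -3, 3)
signal <- 2 * tanh(x1) - x2^2 / 3                      # define the signal
p  <- 1 / (1 + exp(-(signal + rnorm(n, sd = 0.5))))    # add a controlled amount of noise
y  <- rbinom(n, 1, p)
dat   <- data.frame(y, x1, x2)
train <- sample(n, n / 2)                              # train/test split from the same population

fit  <- nnet(y ~ x1 + x2, data = dat[train, ], size = 3, decay = 0.01, maxit = 500)
pred <- predict(fit, dat[-train, ])                    # predicted probabilities on the test data
auc(roc(dat$y[-train], as.vector(pred)))               # AUC on held-out data to compare models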



Fitting a Neural Net

Start with arbitrary weights and biases. Define an error function.
Search for update values that reduce the error. Iterate until
convergence (hopefully).
This is numerical optimisation
Non-linear problem with large numbers of parameters.
You will not find a general analytic solution for the weights.
All methods implemented are iterative numerical
approaches - trial-and-error searches.
What we want to do is conceptually simple, once we define
‘best’ (a sketch follows below).
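
A toy illustration of the idea in R, using a general-purpose optimiser (optim) rather than the purpose-built training algorithms real NN software uses; the data and starting weights below are made up:

# Toy data (made up) and the 13-parameter single-hidden-layer net from earlier
set.seed(2)
x1 <- runif(200, -3, 3);  x2 <- runif(200, -3, 3)
y  <- rbinom(200, 1, 1 / (1 + exp(-(x1 - x2))))

forward <- function(w, x1, x2) {
  z1 <- tanh(w[1] + w[2] * x1 + w[3] * x2)
  z2 <- tanh(w[4] + w[5] * x1 + w[6] * x2)
  z3 <- tanh(w[7] + w[8] * x1 + w[9] * x2)
  1 / (1 + exp(-(w[10] + w[11] * z1 + w[12] * z2 + w[13] * z3)))
}

error <- function(w, x1, x2, y) {                 # the error ('best') criterion: cross-entropy
  p <- pmin(pmax(forward(w, x1, x2), 1e-12), 1 - 1e-12)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

w0  <- runif(13, -0.5, 0.5)                       # start with arbitrary weights and biases
fit <- optim(w0, error, x1 = x1, x2 = x2, y = y,  # iterative numerical search for better weights
             method = "BFGS", control = list(maxit = 1000))
fit$par                                           # fitted weights: a local optimum, hopefully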

