
Knowledge Discovery and Data Mining

Lecture 19 - Neural Nets

Tom Kelsey

School of Computer Science


University of St Andrews
http://www.cs.st-andrews.ac.uk/~tom/
tom@cs.st-andrews.ac.uk



Neural Nets

Ought-to-knows:
1 How a general NN can be displayed graphically
2 The NN terminology exemplified by such a diagram
3 How a relatively simple single-hidden-layer, two-input NN
can produce a complex non-linear prediction surface
4 The form of given activation functions (both the equation
and sketch)
5 How NN weights and biases are derived



Neural Nets

Some contentions/comments to start:


NNs seem complex (indeed, a fitted one can be) and a bit
‘magical’.
They are, however, built (as usual) from simple components.
The problem is similar to previous ones:
Create a model for signal that is capable of being complex.
Fit the model (estimate parameters) so it approximates some
data (specify an objective function).
Ensure the generality of predictions by controlling
complexity in the fitting process (e.g. optimise complexity
using some measure of generalisation error).
All very familiar - let’s begin.



A simple NN as a Mathematical Formula

 

ln( p̂ / (1 − p̂) ) = β̂0 + β̂1 z1 + β̂2 z2 + β̂3 z3

where

z1 = tanh(α̂4 + α̂5 x1 + α̂6 x2)
z2 = tanh(α̂7 + α̂8 x1 + α̂9 x2)
z3 = tanh(α̂10 + α̂11 x1 + α̂12 x2)
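
As a concrete R sketch of this formula, here is the forward pass for a single observation. The α̂ and β̂ values below are made up for illustration, not fitted values:

# Forward pass of the formula above for one observation (weights are made up, not fitted)
x1 <- 0.5;  x2 <- -1.0
a  <- c(0.1, 0.8, -0.5,  -0.3, 0.4, 0.9,  0.2, -0.7, 0.6)   # α̂4 ... α̂12
b  <- c(0.5, 1.2, -0.8, 2.0)                                # β̂0 ... β̂3
z1 <- tanh(a[1] + a[2] * x1 + a[3] * x2)
z2 <- tanh(a[4] + a[5] * x1 + a[6] * x2)
z3 <- tanh(a[7] + a[8] * x1 + a[9] * x2)
logit <- b[1] + b[2] * z1 + b[3] * z2 + b[4] * z3           # the left-hand side, ln(p̂/(1−p̂))
p_hat <- 1 / (1 + exp(-logit))                              # invert to get the probability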



What did all that mean?

The output is an optimal probability

p̂ = 1 / (1 + e^(−θ))

θ is a linear weighted sum of zi terms, with optimal weights β̂ i


There is an additional optimal weight β̂ 0 that is an intercept or
bias term
The zi are formed by
1 weighting inputs xi with optimal α̂k
2 adding another α̂ bias term
3 taking the hyperbolic tangent of the sum
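
A small worked example with made-up numbers: if the hidden nodes output z = (0.2, −0.3, 0.1) and β̂ = (0.5, 1, −1, 2), then θ = 0.5 + 1(0.2) − 1(−0.3) + 2(0.1) = 1.2, and p̂ = 1/(1 + e^(−1.2)) ≈ 0.77.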



Combining building blocks

Let’s look at this graphically in R (script: L19-NNsurfaces.r).


Begin with 2 inputs (so we can plot them easily).
Specify 3 nodes, and use tanh activation functions and a linear
combination function.
Combine these for an output surface under different
example weights.
How complex can we get? (A sketch of such a surface plot
follows below.)
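
A minimal R sketch of this kind of surface exploration. The weights are arbitrary examples, and this is not the actual L19-NNsurfaces.r script:

# Prediction surface for a 2-input, 3-hidden-node net with tanh activations
surface <- function(x1, x2, alpha, beta) {
  z1 <- tanh(alpha[1] + alpha[2] * x1 + alpha[3] * x2)
  z2 <- tanh(alpha[4] + alpha[5] * x1 + alpha[6] * x2)
  z3 <- tanh(alpha[7] + alpha[8] * x1 + alpha[9] * x2)
  1 / (1 + exp(-(beta[1] + beta[2] * z1 + beta[3] * z2 + beta[4] * z3)))
}

x1 <- seq(-3, 3, length.out = 50)
x2 <- seq(-3, 3, length.out = 50)
alpha <- c(0, 2, -1,   1, -2, 2,   -1, 1, 3)             # try different values here
beta  <- c(0, 4, -3, 2)
p <- outer(x1, x2, surface, alpha = alpha, beta = beta)
persp(x1, x2, p, theta = 30, phi = 25, zlab = "p-hat")   # a complex non-linear surface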

Conversion to a diagrammatic form

For ease of understanding by non-mathematicians


1 We have two input sources, x1 and x2
2 We have an input bias source
3 There is an internal layer of three z nodes, each taking in
weighted inputs and outputting tanh of the summed inputs
4 There is an internal bias source for β̂ 0
5 There is an output layer with one node, producing the
logistic function of the weighted sum of the internal layer
outputs
6 The numeric output is a probability between 0 and 1



Examples

[A sequence of slides showing example neural network diagrams and architectures. Source: Google Images]


NN components

Weights and biases: from a statistical perspective the
weights are simply parameters of a potentially non-linear
function, and the biases are the intercept terms for the linear
components.
‘Combination Functions’: in our example equations above
these are the linear combinations (expressible in matrix form);
they combine the input variables or the hidden-node outputs.
A small matrix example follows below.
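
A minimal R illustration of the combination step written in matrix form; the values of A are made up, and this is not the lecture's own notation:

# Linear combination function in matrix form: one row of A per hidden node
X <- cbind(1, x1 = c(0.5, -1.2), x2 = c(1.0, 0.3))   # two observations plus a bias column
A <- matrix(c( 0.1,  0.8, -0.5,
              -0.3,  0.4,  0.9,
               0.2, -0.7,  0.6), nrow = 3, byrow = TRUE)
combined <- X %*% t(A)    # the combination function: weighted sums, with biases as intercepts
Z <- tanh(combined)       # the activation function, applied element-wise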



NN components
Activation functions: these are the functions wrapping the
combination functions; several variants are commonly used
(their shapes are sketched in a short code example after this list):
Identity Function - does not alter the value of the argument;
the resulting values may be anywhere in R.
Sigmoid Functions - S-shaped functions with the logistic or
hyperbolic tangent functions being common. The resulting
values will be bounded - (0, 1) or (−1, 1) respectively. The
logistic is given by:

φ(θ) = 1 / (1 + e^(−θ))
for some argument value θ.
tanh - hyperbolic tangent gives real values within (−1, 1)
Others: Gaussian functions (bell-shaped); functions bounded
below by zero but unbounded above, e.g. Exponential and
Reciprocal Functions.
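
A minimal R sketch of the main shapes (illustrative only):

# Plot the identity, logistic and tanh activation functions on one set of axes
curve(x + 0,             -4, 4, lty = 1, ylab = "activation")  # identity: unbounded, range R
curve(1 / (1 + exp(-x)), -4, 4, lty = 2, add = TRUE)           # logistic: bounded in (0, 1)
curve(tanh(x),           -4, 4, lty = 3, add = TRUE)           # tanh: bounded in (-1, 1)
legend("topleft", legend = c("identity", "logistic", "tanh"), lty = 1:3)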



NN components

Network Layers: the hidden layers are contrivances under the
control of the analyst, so the number of layers and the number
of units within them can be large. The layering is partly for
convenience: all the nodes/units in a layer share similar
characteristics, such as their activation and combination
functions. All the nodes in a layer are (generally, to start with)
connected to all the nodes in the next layer.



Main components

Layers: input, hidden, output.


Connections and weights.
Combination functions: linear.
Activation functions: Identity, tanh, exp, logistic.
Output functions: Back to response scale - Identity,
(multiple) Logistic.



Overview of our coverage

NNs are an ‘art’


Jargon is very inconsistent.
A huge number of decisions can be made in their
construction, and the results are sensitive to these choices.
We’ll look at the general ideas and very few specific
implementations.



Worked examples

Taken directly from jee3 at Rice University.


Excellent empirical framework
Define signal, add controlled amount of noise
Visualise for each covariate in turn
Use test data from the same population to investigate overfit
and underfit
Demonstrates the time needed to learn a NN
Uses ROC and AUC to compare models (a minimal sketch of this kind of workflow follows below)
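
A minimal sketch of that empirical framework, not the actual Rice worked example; it assumes the nnet and pROC packages are available:

library(nnet)     # single-hidden-layer neural nets (assumed package, not the original code)
library(pROC)     # ROC / AUC (assumed package)

set.seed(1)
n  <- 2000
x1 <- runif(n, -3, 3);  x2 <- runif(n, -3, 3)
signal <- 2 * tanh(x1) - x2^2 / 3                      # define the signal
p  <- 1 / (1 + exp(-(signal + rnorm(n, sd = 0.5))))    # add a controlled amount of noise
y  <- rbinom(n, 1, p)
dat   <- data.frame(y, x1, x2)
train <- sample(n, n / 2)                              # train/test split from the same population

fit  <- nnet(y ~ x1 + x2, data = dat[train, ], size = 3, decay = 0.01, maxit = 500)
pred <- predict(fit, dat[-train, ])                    # predicted probabilities on the test data
auc(roc(dat$y[-train], as.vector(pred)))               # AUC on held-out data to compare models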



Fitting a Neural Net

Start with arbitrary weights and biases. Define an error function.
Search for update values that reduce the error. Iterate until
convergence (hopefully).
This is numerical optimisation
Non-linear problem with large numbers of parameters.
You will not find a general analytic solution for the weights.
All methods implemented are iterative numerical
approaches - trial-and-error searches.
What we want to do is conceptually simple, once we define
‘best’ (a sketch follows below).
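
A toy illustration of the idea in R, using a general-purpose optimiser (optim) rather than the purpose-built training algorithms real NN software uses; the data and starting weights below are made up:

# Toy data (made up) and the 13-parameter single-hidden-layer net from earlier
set.seed(2)
x1 <- runif(200, -3, 3);  x2 <- runif(200, -3, 3)
y  <- rbinom(200, 1, 1 / (1 + exp(-(x1 - x2))))

forward <- function(w, x1, x2) {
  z1 <- tanh(w[1] + w[2] * x1 + w[3] * x2)
  z2 <- tanh(w[4] + w[5] * x1 + w[6] * x2)
  z3 <- tanh(w[7] + w[8] * x1 + w[9] * x2)
  1 / (1 + exp(-(w[10] + w[11] * z1 + w[12] * z2 + w[13] * z3)))
}

error <- function(w, x1, x2, y) {                 # the error ('best') criterion: cross-entropy
  p <- pmin(pmax(forward(w, x1, x2), 1e-12), 1 - 1e-12)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

w0  <- runif(13, -0.5, 0.5)                       # start with arbitrary weights and biases
fit <- optim(w0, error, x1 = x1, x2 = x2, y = y,  # iterative numerical search for better weights
             method = "BFGS", control = list(maxit = 1000))
fit$par                                           # fitted weights: a local optimum, hopefully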

