
Part III: Multilayer Perceptrons (MLPs)

Motivation

Multilayer networks are more powerful than single-layer nets.

Example: the XOR problem cannot be solved by a perceptron but is solvable with a 2-layer perceptron.

[Figure: a two-layer network solving XOR — inputs x1 and x2 (plus a bias input of 1) feed two threshold units whose outputs are combined by an AND unit; alongside, the XOR input space with the four corners (±1, ±1) labeled o/x by class.]
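To make the claim concrete, here is a minimal Python sketch of a two-layer threshold network computing XOR. The weights are hand-picked for illustration (one construction among many), and 0/1 inputs are used rather than the ±1 coding in the figure:

```python
import numpy as np

def step(v):
    return (v >= 0).astype(int)   # threshold activation

# Hidden layer: an OR-like unit and an AND unit (weights picked by hand)
W1 = np.array([[1.0, 1.0],        # unit 1 fires when x1 + x2 >= 0.5 (OR)
               [1.0, 1.0]])       # unit 2 fires when x1 + x2 >= 1.5 (AND)
b1 = np.array([-0.5, -1.5])

# Output layer: fires for OR-but-not-AND, which is exactly XOR
W2 = np.array([1.0, -1.0])
b2 = -0.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = step(W1 @ np.array(x, dtype=float) + b1)
    y = int(W2 @ h + b2 >= 0)
    print(x, "->", y)   # prints 0, 1, 1, 0
```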

Power of nonlinearity

For linear neurons, a multilayer net is equivalent to a single-layer net. This is not the case for nonlinear neurons.

Why?
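One way to see it: with linear (identity) activations, stacking layers just composes linear maps, which collapses to a single linear map; a nonlinearity between the layers blocks this collapse:

```latex
y = W_2 (W_1 x) = (W_2 W_1)\,x \equiv W x,
\qquad\text{whereas in general}\qquad
W_2\,\varphi(W_1 x) \neq W x .
```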

MLP architecture


Notations

Notation for one hidden layer:

The format of presentation to an MLP is the same as to a perceptron.

[Diagram: input node i with signal x_i feeds hidden neuron j through weight w_ji; hidden neuron j feeds output neuron k through weight w_kj.]

Backpropagation

Derive the backpropagation (backprop) algorithm by cost minimization via gradient descent.

For the output layer at iteration n, we have

E(n) = \frac{1}{2} \sum_k \big[ d_k(n) - y_k(n) \big]^2
     = \frac{1}{2} \sum_k \Big[ d_k(n) - \varphi\Big( \sum_j w_{kj}\, y_j(n) \Big) \Big]^2

Backprop (cont.)

Then

\frac{\partial E}{\partial w_{kj}} = \frac{\partial E}{\partial y_k}\, \frac{\partial y_k}{\partial v_k}\, \frac{\partial v_k}{\partial w_{kj}}
 = -e_k(n)\, \varphi'(v_k)\, \frac{\partial v_k}{\partial w_{kj}}
 = -e_k(n)\, \varphi'(v_k)\, y_j(n)

where e_k(n) = d_k(n) - y_k(n) is the output error.

Backprop (cont.)

Hence

w_{kj}(n+1) = w_{kj}(n) - \eta\, \frac{\partial E}{\partial w_{kj}}   (the δ rule)
            = w_{kj}(n) + \eta\, e_k(n)\, \varphi'(v_k(n))\, y_j(n)
            = w_{kj}(n) + \eta\, \delta_k(n)\, y_j(n)

where \delta_k(n) = e_k(n)\, \varphi'(v_k(n)) is the local gradient of output neuron k.

Backprop (cont.)

For the hidden layer at iteration n,

\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial y_j}\, \frac{\partial y_j}{\partial v_j}\, \frac{\partial v_j}{\partial w_{ji}}
 = \frac{\partial}{\partial y_j} \Big\{ \frac{1}{2} \sum_k \big[ d_k - \varphi\big( \sum_j w_{kj}\, y_j \big) \big]^2 \Big\}\, \varphi'(v_j)\, x_i
 = -\sum_k \big[ d_k - \varphi\big( \sum_j w_{kj}\, y_j \big) \big]\, \varphi'\big( \sum_j w_{kj}\, y_j \big)\, w_{kj}\, \varphi'(v_j)\, x_i
 = -\sum_k (d_k - y_k)\, \varphi'(v_k)\, w_{kj}\, \varphi'(v_j)\, x_i
 = -\Big( \varphi'(v_j) \sum_k \delta_k\, w_{kj} \Big)\, x_i

Backprop (cont.)

Therefore,

w_{ji}(n+1) = w_{ji}(n) + \eta\, \varphi'(v_j(n)) \Big[ \sum_k \delta_k(n)\, w_{kj}(n) \Big] x_i(n)
            = w_{ji}(n) + \eta\, \delta_j(n)\, x_i(n)

where

\delta_j(n) = \varphi'(v_j(n)) \sum_k \delta_k(n)\, w_{kj}(n)

The above is called the generalized δ rule (textbook correction).

Backprop (cont.)

Illustration of the generalized δ rule:

The generalized δ rule gives a solution to the credit (blame) assignment problem.

[Diagram: the local gradients δ_k of all output neurons k fed by hidden neuron j are propagated back through the weights w_kj and combined with input x_i to form δ_j.]

Backprop (cont.)

For the logistic sigmoid activation, we have

\varphi'(v) = a\, \varphi(v)\, [1 - \varphi(v)]

hence

\delta_k(n) = e_k(n)\, a\, y_k(n)\, [1 - y_k(n)] = a\, y_k(n)\, [1 - y_k(n)]\, [d_k(n) - y_k(n)]

\delta_j(n) = a\, y_j(n)\, [1 - y_j(n)] \sum_k \delta_k(n)\, w_{kj}(n)
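For reference, the derivative identity follows directly from the sigmoid's definition (assuming the slope-a form φ(v) = 1/(1 + e^{-av}), which the parameter a above suggests):

```latex
\varphi(v) = \frac{1}{1 + e^{-av}}
\;\Rightarrow\;
\varphi'(v) = \frac{a\,e^{-av}}{(1 + e^{-av})^{2}}
            = a\,\varphi(v)\,\bigl[1 - \varphi(v)\bigr]
```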

Backprop (cont.)

Extension to more hidden layers is straightforward. In general we have

\Delta w_{ji}(n) = \eta\, \delta_j(n)\, y_i(n)

The δ rule applies to the output layer, and the generalized δ rule applies to hidden layers, layer by layer from the output end.

The entire procedure is called backprop (the error is back-propagated).
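The derivation above translates almost line-for-line into code. Below is a minimal Python sketch of one online backprop step for a 2-3-1 network with logistic activations; the network size, learning rate, and the XOR training loop are illustrative choices, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
a, eta = 1.0, 0.5                       # sigmoid slope and learning rate

def phi(v):                             # logistic sigmoid with slope a
    return 1.0 / (1.0 + np.exp(-a * v))

# x (index i) -> hidden y_j (index j) -> output y_k (index k)
W_ji, b_j = rng.normal(0, 0.5, (3, 2)), np.zeros(3)
W_kj, b_k = rng.normal(0, 0.5, (1, 3)), np.zeros(1)

def backprop_step(x, d):
    """One online update: delta rule at the output layer,
    generalized delta rule at the hidden layer."""
    global W_ji, b_j, W_kj, b_k
    y_j = phi(W_ji @ x + b_j)                            # forward, hidden
    y_k = phi(W_kj @ y_j + b_k)                          # forward, output
    e_k = d - y_k                                        # e_k = d_k - y_k
    delta_k = a * y_k * (1 - y_k) * e_k                  # output local gradient
    delta_j = a * y_j * (1 - y_j) * (W_kj.T @ delta_k)   # back-propagated
    W_kj += eta * np.outer(delta_k, y_j); b_k += eta * delta_k
    W_ji += eta * np.outer(delta_j, x);   b_j += eta * delta_j
    return float(0.5 * np.sum(e_k ** 2))

# Example: online training on XOR with 0/1 coding
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
for epoch in range(5000):
    for x, d in data:
        backprop_step(np.array(x, float), np.array([d], float))
```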

Backprop illustration


Learning rate control: momentum

To ease oscillating weights due to a large learning rate η, some inertia (momentum) of the weight update is added:

\Delta w_{ji}(n) = \eta\, \delta_j(n)\, y_i(n) + \alpha\, \Delta w_{ji}(n-1), \qquad 0 < \alpha < 1

In the downhill situation,

\Delta w_{ji}(n) \approx \frac{\eta}{1-\alpha}\, \delta_j(n)\, y_i(n)

thus accelerating learning by a factor of 1/(1 - α). In the oscillating situation, it smooths the weight changes, thus stabilizing oscillations.
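To see where the 1/(1 - α) factor comes from, suppose in the downhill situation the gradient term stays roughly constant, g ≈ η δ_j(n) y_i(n), over consecutive steps; unrolling the momentum recursion then gives a geometric series:

```latex
\Delta w_{ji}(n) = g + \alpha\,\Delta w_{ji}(n-1)
                 = g\,(1 + \alpha + \alpha^{2} + \cdots)
                 \;\longrightarrow\; \frac{g}{1-\alpha}
```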

Remarks on backprop

Backprop learning is local, concerning presynaptic and postsynaptic neurons only.

Local minima are possible due to nonlinearity, but infrequent due to online update and the practice of randomizing the order of presentations in different epochs.

[Figure: error surface with a local minimum and the global minimum.]

Remarks (cont.)

An MLP can learn to approximate any function, given sufficient layers and neurons (an existence proof).

At most two hidden layers are sufficient to approximate any function. One hidden layer is sufficient for any continuous function.

Gradient descent is a simple method for cost optimization. More advanced methods exist, such as conjugate gradient and quasi-Newton methods (see textbook).

Much design is still needed, such as the number of hidden layers, the number of neurons for each layer, initial weights, the learning rate, stopping criteria, etc.

MLP Design

Besides the use of momentum, learning rate annealing can be applied as for the LMS algorithm.

Batch update versus online update (see the sketch below):
- For batch update, weight adjustment occurs after one presentation of all training patterns (one epoch)
- For online update, weight adjustment occurs after the presentation of each pattern

For backprop, online learning is commonly used.
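A minimal sketch of the two update schedules, with the per-pattern gradient supplied by the caller (grad is a hypothetical callback, not a function defined in the slides):

```python
def batch_epoch(w, data, eta, grad):
    """One epoch, batch update: accumulate the gradient over all
    patterns, then adjust the weights once."""
    g = sum(grad(w, x, d) for x, d in data)
    return w - eta * g

def online_epoch(w, data, eta, grad):
    """One epoch, online update: adjust the weights after every pattern."""
    for x, d in data:
        w = w - eta * grad(w, x, d)
    return w
```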

MLP Design (cont.)

Weight initialization: to prevent saturating neurons and break symmetry that can stall learning, initial weights (including biases) are typically randomized to produce zero mean and activation potentials away from the saturated parts of the activation function.

For the hyperbolic tangent activation function, avoiding saturation can be achieved by initializing weights so that the variance equals the reciprocal of the number of weights of a neuron, as sketched below.
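A sketch of that initialization rule in Python (the function name and layer sizes are illustrative; the rule itself is zero mean with variance 1 over a neuron's number of weights):

```python
import numpy as np

def init_layer(fan_in, n_neurons, rng=np.random.default_rng(0)):
    """Zero-mean Gaussian weights with variance 1/fan_in, so that for
    roughly unit-variance inputs the activation potential v also has
    variance near 1, away from tanh saturation."""
    return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(n_neurons, fan_in))

W1 = init_layer(fan_in=16, n_neurons=8)   # e.g. 16 inputs -> 8 tanh units
```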

Hyperbolic tangent function


When to stop learning

One could stop after a predetermined number of epochs or when the MSE decrease is below a given criterion.

Early stopping with cross validation: keep part of the training set, called the validation subset, as a test for generalization performance.

Cross validation

To do it systematically:
- Divide the entire training sample into an estimation subset and a validation subset (e.g. an 80-20 split)
- Every few epochs (e.g. 5), check the error on the validation subset
- Stop training when the validation error starts to increase

To avoid systematic bias, the validation error can be measured repeatedly for different validation subsets and taken as the average.

See next slide.

Cross validation illustration


Cross validation (cont.)

Cross validation is widely used as a general approach for model selection. Hence it can also be used to decide on the number of hidden layers and the number of neurons for a given hidden layer.
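A minimal sketch of the early-stopping procedure from the preceding slides; train_epoch and val_error are hypothetical caller-supplied callbacks, and rolling back to the best weights seen is one reasonable variant:

```python
def train_with_early_stopping(w, train_epoch, val_error,
                              check_every=5, max_epochs=1000):
    """train_epoch(w) -> w runs one epoch of backprop on the estimation
    subset; val_error(w) -> float is the MSE on the validation subset.
    Stop (and roll back) once the validation error starts to increase."""
    best_err, best_w = float("inf"), w
    for epoch in range(1, max_epochs + 1):
        w = train_epoch(w)
        if epoch % check_every == 0:
            err = val_error(w)
            if err > best_err:          # validation error turned upward
                return best_w
            best_err, best_w = err, w
    return best_w
```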

MLP applications

Task: handwritten zipcode recognition (1989)

Network description:
- Input: binary pixels for each digit
- Output: 10 digits
- Architecture: 4 layers (16x16 → 12x8x8 → 12x4x4 → 30 → 10)

Each feature detector encodes only one feature within a local input region. Different detectors in the same module respond to the same feature at different locations through weight sharing. Such a layout is called a convolutional net.
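Weight sharing can be sketched as 2-D correlation: one small kernel of weights is swept over the input, so every unit in a feature map detects the same feature at a different location. A Python sketch (the 16x16 input and 3x3 kernel sizes are illustrative):

```python
import numpy as np

def feature_map(image, kernel):
    """Apply one shared weight kernel at every valid position of the
    input; each output unit sees only a local region (its receptive
    field), and all units share the same weights."""
    H, W = image.shape
    h, w = kernel.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + h, c:c + w] * kernel)
    return out

img = np.zeros((16, 16)); img[:, 8] = 1.0     # a vertical stroke
edge = np.array([[-1., 0., 1.]] * 3)          # 3x3 vertical-edge detector
fmap = feature_map(img, edge)                 # one 14x14 feature map
```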

Zipcode recognizer architecture


Zipcode recognition (cont.)

Performance: trained on 7300 digits and tested on 2000 new ones.
- Achieved 1% error on the training set and 5% error on the test set
- If rejection (no decision) is allowed, 1% error on the test set

The task is not easy (see a handwriting example).

Remark: constraining network design is a way of incorporating prior knowledge about a specific problem. Backprop applies whether or not the network is constrained.

Letter recognition example

The convolutional net has subsequently been applied to a number of pattern recognition tasks with state-of-the-art results, e.g. handwritten letter recognition.

Other MLP applications

ALVINN (autonomous land vehicle in a neural network)
- Later success: Stanley, winner of the $2M DARPA Grand Challenge in 2005

NETtalk, a speech synthesizer

GloveTalk, which converts hand gestures to speech
