
**Single Layer Feedforward Networks**

Perceptrons

• By Rosenblatt (1962)

– For modeling visual perception (retina)

– A feedforward network of three layers of units: Sensory (S), Association (A), and Response (R)

– Learning occurs only on weights from A units to R units (weights from S units to A units are fixed)

– Each R unit receives inputs from n A units

– For a given training sample s:t, change weights between A and R only if the computed output y is different from the target output t (error driven)

[Figure: S units → A units → R units, with fixed weights w_SA and trainable weights w_AR]

Perceptrons

• A simple perceptron

– Structure:

• Single output node with threshold function

• n input nodes with weights w_i, i = 1, …, n

– Classifies input patterns into one of two classes (depending on whether output = 0 or 1)

– Example: input patterns (x_1, x_2)

• Two groups of input patterns:
(0, 0), (0, 1), (1, 0), (-1, -1);
(2.1, 0), (0, -2.5), (1.6, -1.6)

• They can be separated by a line on the (x_1, x_2) plane: x_1 - x_2 = 2

• Classification by a perceptron with w_1 = 1, w_2 = -1, threshold = 2
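A minimal sketch of this classifier (plain Python; the code and names are mine, not from the slides), checking that the two groups above fall on opposite sides of the threshold:

```python
# Simple perceptron from the example: output 1 if w1*x1 + w2*x2 >= threshold, else 0.
w1, w2, threshold = 1.0, -1.0, 2.0

def perceptron(x1, x2):
    net = w1 * x1 + w2 * x2              # weighted sum of the two inputs
    return 1 if net >= threshold else 0

group_a = [(0, 0), (0, 1), (1, 0), (-1, -1)]
group_b = [(2.1, 0), (0, -2.5), (1.6, -1.6)]

print([perceptron(*p) for p in group_a])   # [0, 0, 0, 0]
print([perceptron(*p) for p in group_b])   # [1, 1, 1, 1]
```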

Perceptrons

• Implement the threshold by a node x_0

– Constant output 1

– Weight w_0 = -threshold

– A common practice in NN design

[Plot: the example patterns, e.g. (1.6, -1.6) and (-1, -1), on either side of the decision line]
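A small sketch (illustrative, same example as above) of this trick: with a constant input x_0 = 1 and weight w_0 = -threshold, comparing the net input against 0 gives exactly the same classification as comparing against the threshold:

```python
# Bias-node form of the same perceptron: x0 = 1 with weight w0 = -threshold.
w0, w1, w2 = -2.0, 1.0, -1.0             # w0 = -threshold

def perceptron_with_bias(x1, x2):
    net = w0 * 1 + w1 * x1 + w2 * x2     # threshold is now folded into the weights
    return 1 if net >= 0 else 0

for p in [(0, 0), (-1, -1), (2.1, 0), (1.6, -1.6)]:
    print(p, perceptron_with_bias(*p))   # same classes as without the bias node
```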

Perceptrons

• Linear separability

– A set of (2D) patterns (x_1, x_2) of two classes is linearly separable if there exists a line on the (x_1, x_2) plane

• w_0 + w_1 x_1 + w_2 x_2 = 0

• that separates all patterns of one class from those of the other class

– Such a perceptron can be built with

• 3 inputs x_0 = 1, x_1, x_2 with weights w_0, w_1, w_2

– For n-dimensional patterns (x_1, …, x_n)

• Hyperplane w_0 + w_1 x_1 + w_2 x_2 + … + w_n x_n = 0 dividing the space into two regions

– Can we get the weights from a set of sample patterns?

• If the problem is linearly separable, then YES (by perceptron learning)

• Examples of linearly separable classes

– Logical AND function (bipolar patterns)

x1   x2   output        decision boundary
-1   -1    -1           w1 = 1, w2 = 1, w0 = -1
-1    1    -1           -1 + x1 + x2 = 0
 1   -1    -1
 1    1     1

[Plot: x = class I (output = 1), o = class II (output = -1), separated by the line -1 + x1 + x2 = 0]

– Logical OR function (bipolar patterns)

x1   x2   output        decision boundary
-1   -1    -1           w1 = 1, w2 = 1, w0 = 1
-1    1     1           1 + x1 + x2 = 0
 1   -1     1
 1    1     1

[Plot: x = class I (output = 1), o = class II (output = -1), separated by the line 1 + x1 + x2 = 0]
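A quick sketch (illustrative) verifying both tables: a bipolar sign unit with the weights listed above reproduces the AND and OR outputs:

```python
# Check the bipolar AND and OR tables with the weights given above.
def sign_unit(w0, w1, w2, x1, x2):
    net = w0 + w1 * x1 + w2 * x2
    return 1 if net >= 0 else -1

patterns = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

# AND: w0 = -1, w1 = 1, w2 = 1  (boundary -1 + x1 + x2 = 0)
print("AND:", [sign_unit(-1, 1, 1, *p) for p in patterns])   # [-1, -1, -1, 1]

# OR:  w0 =  1, w1 = 1, w2 = 1  (boundary  1 + x1 + x2 = 0)
print("OR: ", [sign_unit(1, 1, 1, *p) for p in patterns])    # [-1, 1, 1, 1]
```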

Perceptron Learning

• The network

– Input vector i_j (including the threshold input x_0 = 1)

– Weight vector w = (w_0, w_1, …, w_n)

– Output: bipolar (-1, 1) using the sign node function

• Training samples

– Pairs (i_j, class(i_j)) where class(i_j) is the correct classification of i_j

• Training:

– Update w so that all sample inputs are correctly classified (if possible)

– If an input i_j is misclassified by the current w, i.e.

class(i_j) · (w · i_j) < 0

change w to w + Δw so that (w + Δw) · i_j is closer to class(i_j)

net_j = w · i_j = Σ_{k=0..n} w_k · i_{k,j}

output = 1 if w · i_j ≥ 0, and -1 otherwise
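A minimal sketch of the training loop described above (illustrative; inputs are augmented with the constant threshold input x_0 = 1, and the bipolar AND patterns are used as sample data). Weights change only when a sample is misclassified, using Δw = η · class(i_j) · i_j:

```python
# Perceptron learning on the bipolar AND patterns (augmented with x0 = 1).
samples = [((1, -1, -1), -1),
           ((1, -1,  1), -1),
           ((1,  1, -1), -1),
           ((1,  1,  1),  1)]

def sign(net):
    return 1 if net >= 0 else -1

w = [0.0, 0.0, 0.0]          # start weights (w0, w1, w2)
eta = 1.0                    # learning rate (common choice)

for epoch in range(100):
    errors = 0
    for x, cls in samples:
        net = sum(wk * xk for wk, xk in zip(w, x))
        if sign(net) != cls:                                   # error driven: update only on a mistake
            w = [wk + eta * cls * xk for wk, xk in zip(w, x)]  # w <- w + eta * class(i_j) * i_j
            errors += 1
    if errors == 0:                                            # all samples correctly classified
        break

print(w)   # a separating weight vector, e.g. [-1.0, 1.0, 1.0]
```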

Perceptron Learning

• The learning rule: when a sample i_j is misclassified, update

w ← w + Δw,   Δw = η · class(i_j) · i_j

where η > 0 is the learning rate

Perceptron Learning

• Justification

(w + Δw) · i_j = (w + η · class(i_j) · i_j) · i_j
              = w · i_j + η · class(i_j) · (i_j · i_j)

Since i_j · i_j > 0, the added term η · class(i_j) · (i_j · i_j) is

> 0 if class(i_j) = 1
< 0 if class(i_j) = -1

so the new net moves toward class(i_j).

• Perceptron learning convergence theorem

– Informal: any problem that can be represented by a perceptron can be learned by the learning rule

– Theorem: If there is a weight vector w_1 such that f(i_p · w_1) = class(i_p) for all P training sample patterns {i_p, class(i_p)}, then for any start weight vector w_0, the perceptron learning rule will converge to a weight vector w* such that f(i_p · w*) = class(i_p) for all p
(w* and w_1 may not be the same)

– Proof: reading for grad students (Sec. 2.4)

Perceptron Learning

• Note:

– It is supervised learning (class(i_j) is given for every sample input i_j)

– Learning occurs only when a sample input is misclassified (error driven)

• Termination criterion: learning stops when all samples are correctly classified

– Assuming the problem is linearly separable

– Assuming the learning rate η is sufficiently small

• Choice of learning rate:

– If η is too large: existing weights are overtaken by Δw = η · class(i_j) · i_j

– If η is too small (≈ 0): very slow to converge

– Common choice: η = 1

• Non-numeric input:

– Use an encoding scheme
e.g. Color ∈ {red, blue, green, yellow}: (0, 0, 1, 0) encodes "green"
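A tiny sketch (illustrative) of such an encoding scheme, one-hot vectors for a categorical attribute:

```python
# One-hot encoding of a non-numeric attribute.
colors = ["red", "blue", "green", "yellow"]

def one_hot(value):
    return [1 if c == value else 0 for c in colors]

print(one_hot("green"))   # [0, 0, 1, 0]
```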

Perceptron Learning

• Learning quality

– Generalization: can a trained perceptron correctly classify patterns not included in the training samples?

• A common problem for many NN learning models

– Depends on the quality of the training samples selected

– Also depends, to some extent, on the learning rate and the initial weights

– How can we know the learning is OK?

• Reserve a few samples for testing

Adaline

• By Widrow and Hoff (~1960)

– Adaptive linear elements for signal processing

– The same architecture as perceptrons

– Learning method: the delta rule (another error-driven method), also called the Widrow-Hoff learning rule

– Tries to reduce the mean squared error (MSE) between the net input and the desired output

Adaline

• Delta rule

– Let i_j = (i_{0,j}, i_{1,j}, …, i_{n,j}) be an input vector with desired output d_j

– The squared error

E = (d_j - net_j)^2 = (d_j - Σ_l w_l · i_{l,j})^2

• Its value is determined by the weights w_l

– Modify the weights by a gradient descent approach

• Δw_k = η · (d_j - Σ_l w_l · i_{l,j}) · i_{k,j} = η · (d_j - net_j) · i_{k,j}

• i.e. change the weights in the direction opposite to the gradient ∂E/∂w_k
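A minimal sketch (illustrative) of one per-sample delta rule step: the update is proportional to the error d_j - net_j and is applied whether or not the sample is already classified correctly:

```python
# One incremental Widrow-Hoff (delta rule) update for an Adaline unit.
def delta_rule_step(w, x, d, eta=0.1):
    """w_k <- w_k + eta * (d - net) * x_k, where net = w . x."""
    net = sum(wk * xk for wk, xk in zip(w, x))       # linear net input
    error = d - net                                  # d_j - net_j
    return [wk + eta * error * xk for wk, xk in zip(w, x)]

w = [0.0, 0.0, 0.0]                                  # (w0, w1, w2); x0 = 1 is the bias input
w = delta_rule_step(w, (1, -1, -1), d=-1)
print(w)                                             # weights move so that net is closer to -1
```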

Adaline Learning Algorithm

• Delta rule in batch mode

– Based on the mean squared error over all P samples

E = (1/P) · Σ_{p=1..P} (d_p - net_p)^2

• E is again a function of w = (w_0, w_1, …, w_n)

• The gradient of E:

∂E/∂w_k = (2/P) · Σ_{p=1..P} [(d_p - net_p) · ∂(d_p - net_p)/∂w_k]
        = -(2/P) · Σ_{p=1..P} [(d_p - net_p) · i_{k,p}]

• Therefore

Δw_k = -η · ∂E/∂w_k = η · Σ_{p=1..P} [(d_p - net_p) · i_{k,p}]

(with the constant 2/P absorbed into η)
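A sketch of the batch-mode rule (illustrative, using NumPy; the bipolar AND patterns serve as sample data). The error over all P samples is accumulated before the weights change, with the constant 2/P folded into η as above:

```python
import numpy as np

# Batch-mode delta rule: repeated gradient steps on the MSE over all P samples.
X = np.array([[1, -1, -1],                 # rows are augmented inputs (x0 = 1, x1, x2)
              [1, -1,  1],
              [1,  1, -1],
              [1,  1,  1]], dtype=float)
d = np.array([-1, -1, -1, 1], dtype=float) # desired outputs (bipolar AND)

w = np.zeros(3)
eta = 0.1

for _ in range(200):
    error = d - X @ w                      # (d_p - net_p) for every sample
    w += eta * X.T @ error                 # delta w_k = eta * sum_p (d_p - net_p) * i_{k,p}

print(w, X @ w)   # w tends to the least-squares weights; note net_p need not equal d_p exactly
```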

Adaline Learning

• Notes:

– Weights will be changed even if an input is classified correctly

– E decreases monotonically until the system reaches a state with (locally) minimum E (where a small change of any w_i will cause E to increase)

– At a local-minimum state, ∂E/∂w_i = 0 for all i, but E is not guaranteed to be zero (net_j ≠ d_j)

• This is why Adaline uses a threshold function rather than the linear function when classifying

Linear Separability Again

• Examples of linearly inseparable classes

– Logical XOR (exclusive OR) function (bipolar patterns)

x1   x2   output
-1   -1    -1
-1    1     1
 1   -1     1
 1    1    -1

[Plot: x = class I (output = 1), o = class II (output = -1); no single line separates the two classes]

No line can separate these two classes, as can be seen from the fact that the following linear inequality system has no solution:

w0 - w1 - w2 < 0   (1)
w0 - w1 + w2 ≥ 0   (2)
w0 + w1 - w2 ≥ 0   (3)
w0 + w1 + w2 < 0   (4)

because we have w0 < 0 from (1) + (4), and w0 ≥ 0 from (2) + (3), which is a contradiction.
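As a sanity check (illustrative; a coarse grid search cannot replace the proof above, which rules out all weight vectors), no sign unit on a grid of candidate weights classifies the four XOR patterns correctly:

```python
import itertools

# Brute-force search: no sign unit w0 + w1*x1 + w2*x2 gets all four XOR patterns right.
patterns = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]

def classifies_all(w0, w1, w2):
    return all((1 if w0 + w1 * x1 + w2 * x2 >= 0 else -1) == target
               for (x1, x2), target in patterns)

grid = [i / 4 for i in range(-20, 21)]     # candidate weights from -5 to 5 in steps of 0.25
found = any(classifies_all(*w) for w in itertools.product(grid, repeat=3))
print("linear separator found:", found)    # False
```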

Why hidden units must be non-linear?

• A multi-layer net with linear hidden layers is equivalent to a single-layer net

– Because z1 and z2 are linear units:

z1 = a1 · (x1·v11 + x2·v21) + b1
z2 = a2 · (x1·v12 + x2·v22) + b2

– net_y = z1·w1 + z2·w2 = x1·u1 + x2·u2 + (b1·w1 + b2·w2), where

u1 = a1·v11·w1 + a2·v12·w2
u2 = a1·v21·w1 + a2·v22·w2

so net_y is still a linear combination of x1 and x2.

[Figure: x1, x2 → linear hidden units z1, z2 (weights v11, v21, v12, v22) → output Y (weights w1, w2, threshold = 0)]
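A short sketch (illustrative; the numeric values of a1, a2, b1, b2 and the weights are arbitrary) confirming the algebra: the two-layer net with linear hidden units computes exactly the same net_y as a single linear unit with the collapsed weights u1, u2 and bias b1·w1 + b2·w2:

```python
# Linear hidden units collapse: two-layer linear net == one linear unit.
a1, a2, b1, b2 = 0.5, 2.0, 0.3, -0.7            # slopes and offsets of the linear hidden units
v11, v21, v12, v22 = 1.0, -2.0, 0.4, 1.5        # input-to-hidden weights
w1, w2 = 0.8, -1.1                              # hidden-to-output weights

def net_y_two_layer(x1, x2):
    z1 = a1 * (x1 * v11 + x2 * v21) + b1
    z2 = a2 * (x1 * v12 + x2 * v22) + b2
    return z1 * w1 + z2 * w2

u1 = a1 * v11 * w1 + a2 * v12 * w2              # collapsed weight on x1
u2 = a1 * v21 * w1 + a2 * v22 * w2              # collapsed weight on x2
bias = b1 * w1 + b2 * w2

def net_y_single_layer(x1, x2):
    return x1 * u1 + x2 * u2 + bias

for x1, x2 in [(0, 0), (1, -1), (2.5, 3.0)]:
    print(abs(net_y_two_layer(x1, x2) - net_y_single_layer(x1, x2)) < 1e-12)   # True, True, True
```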

– XOR can be solved by a more complex network with (non-linear) hidden units

[Figure: x1, x2 → threshold hidden units z1, z2 with weights ±2 and threshold = 1; z1, z2 → output Y with weights 2, 2 and threshold = 0]

(x1, x2)     (z1, z2)     y
(-1, -1)     (-1, -1)    -1
(-1,  1)     (-1,  1)     1
( 1, -1)     ( 1, -1)     1
( 1,  1)     (-1, -1)    -1
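A sketch of this network as I read the figure (assumed weight placement: z1 sees x1 with +2 and x2 with -2, z2 sees x1 with -2 and x2 with +2, both with threshold 1; Y sees z1, z2 with weights 2, 2 and threshold 0). It reproduces the table above:

```python
# XOR with one layer of non-linear (threshold) hidden units; weight placement assumed from the figure.
def step(net, threshold):
    return 1 if net >= threshold else -1

def xor_net(x1, x2):
    z1 = step(2 * x1 - 2 * x2, threshold=1)      # hidden unit 1
    z2 = step(-2 * x1 + 2 * x2, threshold=1)     # hidden unit 2
    y = step(2 * z1 + 2 * z2, threshold=0)       # output unit
    return (z1, z2), y

for x in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(x, *xor_net(*x))    # matches the table: y = -1, 1, 1, -1
```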

Summary

• Single-layer nets have limited representation power (the linear separability problem)

• Error-driven learning seems a good way to train a net

• Multi-layer nets (nets with non-linear hidden units) may overcome the linear inseparability problem, but learning methods for such nets are needed

• Threshold/step output functions hinder the effort to develop learning methods for multi-layer nets
