where
$$\theta(z) = \begin{cases} 1 & z \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
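As a minimal sketch (Python, not from the slides), a single perceptron unit applies θ to a weighted sum of its inputs; the AND-gate weights below are an illustrative choice:

```python
def theta(z):
    """Step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def perceptron(x, w, b):
    """A single perceptron unit: theta(w . x + b)."""
    return theta(sum(wi * xi for wi, xi in zip(w, x)) + b)

# One linear boundary can realize AND: x1 + x2 - 1.5 >= 0
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, (1, 1), -1.5))   # 0, 0, 0, 1
```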
Beyond Linear Separability
• Since a single perceptron unit can only define a single linear boundary, it is limited to solving linearly separable problems.
• A problem like that illustrated by the values of the XOR boolean function cannot be solved by a single perceptron unit.
[Figure: the four XOR points in the (X1, X2) plane; no single line separates the two classes.]
Multi-Layer Perceptron
• What if we consider more than one linear separator and combine their outputs; can we get a more powerful classifier?
• Yes. The introduction of "hidden" units into these networks makes them much more powerful:
– they are no longer limited to linearly separable problems.
• Earlier layers transform the problem into more tractable problems for the later layers.
Example: XOR problem
[Figure: the XOR training points in the (X1, X2) plane.]
See the explanations below.
Explanations
• To see how having hidden units can help, let us see how a two-layer perceptron network can solve the XOR problem that a single unit failed to solve; a small code sketch follows these notes.
• We see that each hidden unit defines its own "decision boundary," and the output from each of these units is fed to the output unit, which returns a solution to the whole problem. Let's look in detail at each of these boundaries and its effect.
• If we focus on the first decision boundary, we see that only one of the training points (the one with feature values (1,1)) is in the half space that the normal points into.
– This is the only point with a positive distance, and thus the output of the perceptron unit is 1.
– The other points have negative distance, and the output of the perceptron unit is 0.
– Those are shown in the shaded column in the table.
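To make this concrete, here is a minimal two-layer sketch with hand-set weights (one illustrative choice; the slides' exact weights are not given). The first hidden unit puts only (1,1) on the positive side of its boundary, the second puts everything except (0,0) on the positive side, and the output unit combines the two:

```python
def theta(z):
    return 1 if z >= 0 else 0

def xor_net(x1, x2):
    h1 = theta(x1 + x2 - 1.5)    # boundary 1: only (1,1) on the positive side
    h2 = theta(x1 + x2 - 0.5)    # boundary 2: all but (0,0) on the positive side
    return theta(h2 - h1 - 0.5)  # output unit: h2 AND NOT h1

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), xor_net(x1, x2))   # reproduces the XOR truth table
```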
Example: XOR problem
[Figure: the first hidden unit's decision boundary in the (X1, X2) plane, with only the point (1,1) on its positive side.]
Example: XOR problem
[Figure: the second hidden unit's decision boundary in the (X1, X2) plane.]
Multi-Layer Perceptron Learning
• Any set of training points can be separated by a three-layer perceptron network.
• "Almost any" set of points is separable by a two-layer perceptron network.
• We could also use the mean squared error (MSE), which divides the sum of the squared errors by the number of training points instead of by 2. Since the number of training points is a constant, the weights at which the minimum is attained are not affected, as the formulas below make explicit.
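With N training points, the two criteria differ only by the constant factor 2/N, so they share the same minimizing weights:

$$E_{\text{SSE}} = \frac{1}{2}\sum_{m=1}^{N}\bigl(y(\mathbf{x}^m,\mathbf{w}) - y^m\bigr)^2, \qquad E_{\text{MSE}} = \frac{1}{N}\sum_{m=1}^{N}\bigl(y(\mathbf{x}^m,\mathbf{w}) - y^m\bigr)^2 = \frac{2}{N}\,E_{\text{SSE}}$$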
Training
Gradient Descent
We've seen that the simplest method for minimizing a differentiable function is gradient descent (or ascent if we're maximizing). Recall that we are trying to find the weights that lead to a minimum value of the training error.

Online version: we consider each time only the error for one data item:

$$E = \frac{1}{2}\bigl(y(\mathbf{x}^m,\mathbf{w}) - y^m\bigr)^2$$

$$\nabla_{\mathbf{w}} E = \bigl(y(\mathbf{x}^m,\mathbf{w}) - y^m\bigr)\,\nabla_{\mathbf{w}}\,y(\mathbf{x}^m,\mathbf{w})$$
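As a hedged sketch of the online update for a single sigmoid unit (the training set, learning rate, and iteration count below are arbitrary illustrative choices; for a logistic unit, the gradient of y with respect to a weight is y(1−y) times the corresponding input):

```python
import math, random

def s(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative training set for a single sigmoid unit (made-up values).
data = [((0.0, 0.0), 0.0), ((1.0, 1.0), 1.0), ((1.0, 0.5), 1.0), ((0.2, 0.1), 0.0)]
w, b, eta = [0.0, 0.0], 0.0, 0.5

for step in range(5000):
    (x1, x2), ym = random.choice(data)   # online: one data item at a time
    y = s(w[0] * x1 + w[1] * x2 + b)     # y(x^m, w)
    g = (y - ym) * y * (1 - y)           # (y - y^m) * dy/dz
    w[0] -= eta * g * x1                 # grad_w E = (y - y^m) grad_w y
    w[1] -= eta * g * x2
    b    -= eta * g
```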
$$\delta_4 = \frac{\partial E}{\partial z_4}$$

Since z_4 influences E only indirectly, through z_5 and z_6:

$$\delta_4 = \frac{\partial E}{\partial z_4} = \frac{\partial E}{\partial z_5}\frac{\partial z_5}{\partial z_4} + \frac{\partial E}{\partial z_6}\frac{\partial z_6}{\partial z_4} = \delta_5\frac{\partial z_5}{\partial z_4} + \delta_6\frac{\partial z_6}{\partial z_4}$$

$$= \delta_5\frac{\partial z_5}{\partial y_4}\frac{\partial y_4}{\partial z_4} + \delta_6\frac{\partial z_6}{\partial y_4}\frac{\partial y_4}{\partial z_4}$$

$$= \delta_5\frac{\partial(\cdots + w_{4\to 5}\,y_4 + \cdots)}{\partial y_4}\frac{\partial y_4}{\partial z_4} + \delta_6\frac{\partial(\cdots + w_{4\to 6}\,y_4 + \cdots)}{\partial y_4}\frac{\partial y_4}{\partial z_4}$$

$$= \delta_5\,w_{4\to 5}\,\frac{\partial y_4}{\partial z_4} + \delta_6\,w_{4\to 6}\,\frac{\partial y_4}{\partial z_4} = \bigl(\delta_5\,w_{4\to 5} + \delta_6\,w_{4\to 6}\bigr)\frac{ds(z_4)}{dz_4}$$

$$= s(z_4)\bigl(1 - s(z_4)\bigr)\bigl(\delta_5\,w_{4\to 5} + \delta_6\,w_{4\to 6}\bigr) = y_4(1 - y_4)\bigl(\delta_5\,w_{4\to 5} + \delta_6\,w_{4\to 6}\bigr)$$
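A quick numerical sanity check of this derivation (a sketch with made-up weights and targets; units 5 and 6 are treated as sigmoid output units fed only by unit 4):

```python
import math

def s(z):
    return 1.0 / (1.0 + math.exp(-z))

w45, w46 = 0.7, -0.3    # made-up weights from unit 4 to units 5 and 6
t5, t6 = 1.0, 0.0       # made-up targets for the two output units

def forward(z4):
    y4 = s(z4)
    y5, y6 = s(w45 * y4), s(w46 * y4)
    return y4, y5, y6, 0.5 * ((y5 - t5) ** 2 + (y6 - t6) ** 2)

z4 = 0.2
y4, y5, y6, E = forward(z4)

d5 = y5 * (1 - y5) * (y5 - t5)              # output deltas
d6 = y6 * (1 - y6) * (y6 - t6)
d4 = y4 * (1 - y4) * (d5 * w45 + d6 * w46)  # the formula just derived

eps = 1e-6                                  # central-difference dE/dz4
numeric = (forward(z4 + eps)[3] - forward(z4 - eps)[3]) / (2 * eps)
print(d4, numeric)                          # the two values agree closely
```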
Generalized Delta Rule

$$\delta_j = y_j\,(1 - y_j)\sum_{k \in \text{downstream}(j)} \delta_k\,w_{j\to k}$$

$$w_{i\to j} \leftarrow w_{i\to j} - \eta\,\delta_j\,y_i$$
Generalized Delta Rule

For an output unit we have

$$\delta_n = \frac{\partial E}{\partial z_n} = \frac{\partial E}{\partial y_n}\frac{\partial y_n}{\partial z_n} = \frac{ds(z_n)}{dz_n}\frac{\partial E}{\partial y_n} = y_n(1 - y_n)\frac{\partial E}{\partial y_n}$$

$$= y_n(1 - y_n)\,\frac{\partial\left(\tfrac{1}{2}(y_n - y^m)^2\right)}{\partial y_n} = y_n(1 - y_n)(y_n - y^m)$$

and the corresponding weight change is

$$\Delta w_{i\to n} = -\eta\,\delta_n\,y_i$$
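Plugging in made-up numbers (all values below are purely illustrative):

```python
yn, ym, yi, eta = 0.8, 1.0, 0.6, 0.1    # output, target, input from unit i, step size
dn = yn * (1 - yn) * (yn - ym)          # delta_n = y_n (1 - y_n)(y_n - y^m) = -0.032
dw = -eta * dn * yi                     # Delta w_{i->n} = 0.00192 (weight increases,
print(dn, dw)                           # pushing y_n toward the target)
```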
Backpropagation Algorithm
1. Initialize the weights to small random values.
2. Choose a random training sample, say (x^m, y^m).
3. Compute the total input z_j and output y_j of each unit (forward propagation).
4. Compute δ_n for the output layer: δ_n = y_n(1 − y_n)(y_n − y_n^m).
5. Compute δ_j for all preceding layers by the backpropagation rule.
6. Compute the weight change by the descent rule (repeat for all weights).
For the example network (inputs x_1, x_2 and a bias input of −1 at every unit; hidden units 1 and 2 feed output unit 3), the descent rule gives one update per weight:

$$w_{03} \leftarrow w_{03} - \eta\,\delta_3\,(-1) \qquad w_{13} \leftarrow w_{13} - \eta\,\delta_3\,y_1 \qquad w_{23} \leftarrow w_{23} - \eta\,\delta_3\,y_2$$

$$w_{02} \leftarrow w_{02} - \eta\,\delta_2\,(-1) \qquad w_{12} \leftarrow w_{12} - \eta\,\delta_2\,x_1 \qquad w_{22} \leftarrow w_{22} - \eta\,\delta_2\,x_2$$

$$w_{01} \leftarrow w_{01} - \eta\,\delta_1\,(-1) \qquad w_{11} \leftarrow w_{11} - \eta\,\delta_1\,x_1 \qquad w_{21} \leftarrow w_{21} - \eta\,\delta_1\,x_2$$
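Putting steps 1–6 together for this 2-2-1 network, here is a hedged end-to-end sketch (the learning rate, epoch count, and seed are arbitrary choices; plain online backprop on XOR can occasionally stall in a poor local minimum, so a restart may be needed):

```python
import math, random

def s(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
# Per-unit weight lists [w0, w_from_first_input, w_from_second_input];
# the bias input is fixed at -1, as in the updates above.
w1 = [random.uniform(-1, 1) for _ in range(3)]   # hidden unit 1
w2 = [random.uniform(-1, 1) for _ in range(3)]   # hidden unit 2
w3 = [random.uniform(-1, 1) for _ in range(3)]   # output unit 3

def forward(x1, x2):
    y1 = s(-w1[0] + w1[1] * x1 + w1[2] * x2)
    y2 = s(-w2[0] + w2[1] * x1 + w2[2] * x2)
    y3 = s(-w3[0] + w3[1] * y1 + w3[2] * y2)
    return y1, y2, y3

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
eta = 0.5

for _ in range(20000):
    (x1, x2), t = random.choice(data)            # step 2: random sample
    y1, y2, y3 = forward(x1, x2)                 # step 3: forward prop
    d3 = y3 * (1 - y3) * (y3 - t)                # step 4: output delta
    d1 = y1 * (1 - y1) * d3 * w3[1]              # step 5: backprop rule
    d2 = y2 * (1 - y2) * d3 * w3[2]
    for w, d, ins in ((w3, d3, (-1, y1, y2)),    # step 6: descent rule,
                      (w1, d1, (-1, x1, x2)),    # one update per weight
                      (w2, d2, (-1, x1, x2))):
        for i, yi in enumerate(ins):
            w[i] -= eta * d * yi

for (x1, x2), t in data:
    print((x1, x2), t, round(forward(x1, x2)[2], 3))
```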
Input Representation
• An issue has to do with the representation of discrete data (also known as "categorical" data).
• We could think of representing these as either unary or binary numbers.
– Binary numbers are generally a bad choice;
– unary (one-hot) is much preferable.
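A brief illustration of why (the category set below is made up): with a binary code, unrelated categories share input bits, imposing spurious similarity that the network must learn to undo; a unary (one-hot) code gives each category its own input.

```python
categories = ["red", "green", "blue", "yellow", "black"]   # made-up example

def unary(value):
    """One-hot encoding: one input per category."""
    return [1 if c == value else 0 for c in categories]

def binary(value):
    """Binary encoding of the category index: 3 bits cover 5 values."""
    i = categories.index(value)
    return [(i >> b) & 1 for b in range(3)]

print(unary("blue"))    # [0, 0, 1, 0, 0] -- orthogonal to every other category
print(binary("blue"))   # [0, 1, 0] -- shares a bit with "yellow" ([1, 1, 0])
```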