6

Linear Transformations
6 Hopfield Network Questions
[Figure: initial condition / recurrent layer. The input p (S×1) sets the initial condition of a recurrent layer with weight matrix W (S×S), bias b (S×1), and a one-step delay D; the layer output is a(t+1) (S×1).]

$$\mathbf{a}(0) = \mathbf{p} \qquad \mathbf{a}(t+1) = \mathbf{satlins}(\mathbf{W}\mathbf{a}(t) + \mathbf{b})$$

The network output is repeatedly multiplied by the weight matrix W.
What is the effect of this repeated operation?
Will the output converge, go to infinity, or oscillate?
In this chapter we investigate matrix multiplication, which represents a general linear transformation.
6 Linear Transformations

A transformation consists of three parts:

1. A set of elements X = {x_i}, called the domain,
2. A set of elements Y = {y_i}, called the range, and
3. A rule relating each x_i ∈ X to an element y_i ∈ Y.

A transformation is linear if:

1. For all x_1, x_2 ∈ X:  A(x_1 + x_2) = A(x_1) + A(x_2),
2. For all x ∈ X and all scalars a:  A(a x) = a A(x).
6 Example - Rotation
Is rotation linear?
1. [Figure: scaling then rotating gives the same result as rotating then scaling, A(a x) = a A(x).]

2. [Figure: rotating a sum of vectors gives the sum of the rotated vectors, A(x_1 + x_2) = A(x_1) + A(x_2).]
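A quick numeric check of the two linearity conditions for rotation. This is a minimal NumPy sketch; the angle, scalar, and test vectors are arbitrary choices, not values from the slides:

```python
import numpy as np

def rotate(x, theta):
    """Apply a 2-D rotation by angle theta (radians) to vector x."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ x

theta = 0.7              # arbitrary rotation angle
a = 2.5                  # arbitrary scalar
x1 = np.array([1.0, 2.0])
x2 = np.array([-0.5, 3.0])

# Homogeneity: A(a x) = a A(x)
print(np.allclose(rotate(a * x1, theta), a * rotate(x1, theta)))                    # True

# Additivity: A(x1 + x2) = A(x1) + A(x2)
print(np.allclose(rotate(x1 + x2, theta), rotate(x1, theta) + rotate(x2, theta)))   # True
```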
6 Matrix Representation - (1)
Any linear transformation between two finite-dimensional
vector spaces can be represented by matrix multiplication.

Let {v1, v2, ..., vn} be a basis for X, and let {u1, u2, ..., um} be
a basis for Y.
$$\mathbf{x} = \sum_{i=1}^{n} x_i \mathbf{v}_i \qquad \mathbf{y} = \sum_{i=1}^{m} y_i \mathbf{u}_i$$

Let A: X → Y, with A(x) = y. Then

$$A\!\left(\sum_{j=1}^{n} x_j \mathbf{v}_j\right) = \sum_{i=1}^{m} y_i \mathbf{u}_i$$
6 Matrix Representation - (2)
Since A is a linear operator,
$$\sum_{j=1}^{n} x_j A(\mathbf{v}_j) = \sum_{i=1}^{m} y_i \mathbf{u}_i$$

Since the u_i are a basis for Y,

$$A(\mathbf{v}_j) = \sum_{i=1}^{m} a_{ij} \mathbf{u}_i$$

(The coefficients a_ij will make up the matrix representation of the transformation.)

$$\sum_{j=1}^{n} x_j \sum_{i=1}^{m} a_{ij} \mathbf{u}_i = \sum_{i=1}^{m} y_i \mathbf{u}_i$$
6 Matrix Representation - (3)
$$\sum_{i=1}^{m} \mathbf{u}_i \sum_{j=1}^{n} a_{ij} x_j = \sum_{i=1}^{m} y_i \mathbf{u}_i$$

$$\sum_{i=1}^{m} \mathbf{u}_i \left( \sum_{j=1}^{n} a_{ij} x_j - y_i \right) = 0$$

Because the u_i are independent,

$$\sum_{j=1}^{n} a_{ij} x_j = y_i$$

This is equivalent to matrix multiplication:

$$\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}$$
6 Summary

A linear transformation can be represented by matrix multiplication.

To find the matrix which represents the transformation, we must transform each basis vector for the domain and then expand the result in terms of the basis vectors of the range:

$$A(\mathbf{v}_j) = \sum_{i=1}^{m} a_{ij} \mathbf{u}_i$$

Each of these equations gives us one column of the matrix.
6 Example - (1)
Stand a deck of playing cards on edge so that you are looking at the deck sideways. Draw a vector x on the edge of the deck. Now skew the deck by an angle θ, as shown below, and note the new vector y = A(x). What is the matrix of this transformation in terms of the standard basis set?

[Figure: the deck before and after skewing by θ; the vector x maps to y = A(x), with s_1 and s_2 the standard basis vectors.]
6 Example - (2)
To find the matrix we need to transform each of the basis vectors.
$$A(\mathbf{v}_j) = \sum_{i=1}^{m} a_{ij} \mathbf{u}_i$$

We will use the standard basis vectors for both the domain and the range:

$$A(\mathbf{s}_j) = \sum_{i=1}^{2} a_{ij} \mathbf{s}_i = a_{1j} \mathbf{s}_1 + a_{2j} \mathbf{s}_2$$
6 Example - (3)
We begin with s1:
If we draw a line on the bottom card and then skew the
deck, the line will not change.

[Figure: s_1 lies along the bottom card, so the skew leaves it unchanged: A(s_1) = s_1.]

$$A(\mathbf{s}_1) = 1\,\mathbf{s}_1 + 0\,\mathbf{s}_2 = \sum_{i=1}^{2} a_{i1} \mathbf{s}_i = a_{11} \mathbf{s}_1 + a_{21} \mathbf{s}_2$$

This gives us the first column of the matrix.
6 Example - (4)
Next, we skew s2:

[Figure: s_2 tilts with the deck; A(s_2) has horizontal component tan(θ) and vertical component 1.]

$$A(\mathbf{s}_2) = \tan(\theta)\,\mathbf{s}_1 + 1\,\mathbf{s}_2 = \sum_{i=1}^{2} a_{i2} \mathbf{s}_i = a_{12} \mathbf{s}_1 + a_{22} \mathbf{s}_2$$

This gives us the second column of the matrix.
6 Example - (5)

The matrix of the transformation is:

$$\mathbf{A} = \begin{bmatrix} 1 & \tan(\theta) \\ 0 & 1 \end{bmatrix}$$
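The columns of this matrix can be built exactly as described above, by transforming each basis vector. A minimal NumPy sketch (the 45° angle matches the later slides; the function name skew is mine):

```python
import numpy as np

def skew(x, theta):
    """Skew (shear) transformation from the card-deck example."""
    A = np.array([[1.0, np.tan(theta)],
                  [0.0, 1.0]])
    return A @ x

theta = np.deg2rad(45)
s1 = np.array([1.0, 0.0])
s2 = np.array([0.0, 1.0])

# Transforming each basis vector gives one column of the matrix.
col1 = skew(s1, theta)                 # [1, 0]                -> first column
col2 = skew(s2, theta)                 # [tan(45 deg), 1] = [1, 1] -> second column
A = np.column_stack([col1, col2])
print(A)                               # [[1. 1.], [0. 1.]]
```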
6 Change of Basis
Consider the linear transformation A:XY. Let {v1, v2, ..., vn} be
a basis for X, and let {u1, u2, ..., um} be a basis for Y.
$$\mathbf{x} = \sum_{i=1}^{n} x_i \mathbf{v}_i \qquad \mathbf{y} = \sum_{i=1}^{m} y_i \mathbf{u}_i \qquad A(\mathbf{x}) = \mathbf{y}$$

The matrix representation is:

$$\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} \qquad \mathbf{A}\mathbf{x} = \mathbf{y}$$
6 New Basis Sets
Now let's consider different basis sets. Let {t1, t2, ..., tn} be a basis for X, and let {w1, w2, ..., wm} be a basis for Y.
$$\mathbf{x} = \sum_{i=1}^{n} x'_i \mathbf{t}_i \qquad \mathbf{y} = \sum_{i=1}^{m} y'_i \mathbf{w}_i$$

The new matrix representation is:

$$\begin{bmatrix} a'_{11} & a'_{12} & \cdots & a'_{1n} \\ a'_{21} & a'_{22} & \cdots & a'_{2n} \\ \vdots & & & \vdots \\ a'_{m1} & a'_{m2} & \cdots & a'_{mn} \end{bmatrix} \begin{bmatrix} x'_1 \\ x'_2 \\ \vdots \\ x'_n \end{bmatrix} = \begin{bmatrix} y'_1 \\ y'_2 \\ \vdots \\ y'_m \end{bmatrix} \qquad \mathbf{A}'\mathbf{x}' = \mathbf{y}'$$
6 How are A and A' related?
Expand ti in terms of the original basis vectors for X.
$$\mathbf{t}_i = \sum_{j=1}^{n} t_{ji} \mathbf{v}_j \qquad \mathbf{t}_i = \begin{bmatrix} t_{1i} \\ t_{2i} \\ \vdots \\ t_{ni} \end{bmatrix}$$
Expand w i in terms of the original basis vectors for Y.

$$\mathbf{w}_i = \sum_{j=1}^{m} w_{ji} \mathbf{u}_j \qquad \mathbf{w}_i = \begin{bmatrix} w_{1i} \\ w_{2i} \\ \vdots \\ w_{mi} \end{bmatrix}$$
6 How are A and A' related?

$$\mathbf{B}_t = \begin{bmatrix} \mathbf{t}_1 & \mathbf{t}_2 & \cdots & \mathbf{t}_n \end{bmatrix} \qquad \mathbf{x} = x'_1 \mathbf{t}_1 + x'_2 \mathbf{t}_2 + \cdots + x'_n \mathbf{t}_n = \mathbf{B}_t \mathbf{x}'$$

$$\mathbf{B}_w = \begin{bmatrix} \mathbf{w}_1 & \mathbf{w}_2 & \cdots & \mathbf{w}_m \end{bmatrix} \qquad \mathbf{y} = \mathbf{B}_w \mathbf{y}'$$

$$\mathbf{A}\mathbf{x} = \mathbf{y} \;\Rightarrow\; \mathbf{A}\mathbf{B}_t \mathbf{x}' = \mathbf{B}_w \mathbf{y}' \;\Rightarrow\; [\mathbf{B}_w^{-1} \mathbf{A} \mathbf{B}_t]\, \mathbf{x}' = \mathbf{y}'$$

$$\mathbf{A}' = \mathbf{B}_w^{-1} \mathbf{A} \mathbf{B}_t \qquad \mathbf{A}'\mathbf{x}' = \mathbf{y}' \qquad \text{(Similarity Transform)}$$
6 Example - (1)
Take the skewing problem described previously, and find the new matrix representation using the basis set {t1, t2}.

[Figure: the new basis vectors t_1 and t_2 drawn relative to s_1 and s_2.]

$$\mathbf{t}_1 = 0.5\,\mathbf{s}_1 + \mathbf{s}_2 = \begin{bmatrix} 0.5 \\ 1 \end{bmatrix} \qquad \mathbf{t}_2 = -\mathbf{s}_1 + \mathbf{s}_2 = \begin{bmatrix} -1 \\ 1 \end{bmatrix}$$

$$\mathbf{B}_t = \begin{bmatrix} \mathbf{t}_1 & \mathbf{t}_2 \end{bmatrix} = \begin{bmatrix} 0.5 & -1 \\ 1 & 1 \end{bmatrix} \qquad \mathbf{B}_w = \mathbf{B}_t = \begin{bmatrix} 0.5 & -1 \\ 1 & 1 \end{bmatrix} \quad \text{(Same basis for domain and range.)}$$
6 Example - (2)

$$\mathbf{A}' = \mathbf{B}_w^{-1} \mathbf{A} \mathbf{B}_t = \begin{bmatrix} 2/3 & 2/3 \\ -2/3 & 1/3 \end{bmatrix} \begin{bmatrix} 1 & \tan\theta \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 0.5 & -1 \\ 1 & 1 \end{bmatrix}$$

$$\mathbf{A}' = \begin{bmatrix} (2/3)\tan\theta + 1 & (2/3)\tan\theta \\ -(2/3)\tan\theta & -(2/3)\tan\theta + 1 \end{bmatrix}$$

For θ = 45°:

$$\mathbf{A}' = \begin{bmatrix} 5/3 & 2/3 \\ -2/3 & 1/3 \end{bmatrix} \qquad \mathbf{A} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$$
6 Example - (3)
Try a test vector:

$$\mathbf{x} = \begin{bmatrix} 0.5 \\ 1 \end{bmatrix} \qquad \mathbf{x}' = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$$

$$\mathbf{y} = \mathbf{A}\mathbf{x} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 0.5 \\ 1 \end{bmatrix} = \begin{bmatrix} 1.5 \\ 1 \end{bmatrix} \qquad \mathbf{y}' = \mathbf{A}'\mathbf{x}' = \begin{bmatrix} 5/3 & 2/3 \\ -2/3 & 1/3 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 5/3 \\ -2/3 \end{bmatrix}$$

[Figure: the test vector t_1 = x and its image y = A(x) on the skewed deck.]

Check using reciprocal basis vectors:

$$\mathbf{y}' = \mathbf{B}^{-1}\mathbf{y} = \begin{bmatrix} 0.5 & -1 \\ 1 & 1 \end{bmatrix}^{-1} \begin{bmatrix} 1.5 \\ 1 \end{bmatrix} = \begin{bmatrix} 2/3 & 2/3 \\ -2/3 & 1/3 \end{bmatrix} \begin{bmatrix} 1.5 \\ 1 \end{bmatrix} = \begin{bmatrix} 5/3 \\ -2/3 \end{bmatrix}$$
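The similarity transform and the test-vector check can be reproduced numerically. A short NumPy sketch of the same computation, using the values from the example above:

```python
import numpy as np

theta = np.deg2rad(45)
A  = np.array([[1.0, np.tan(theta)],
               [0.0, 1.0]])           # skew matrix in the standard basis
Bt = np.array([[0.5, -1.0],
               [1.0,  1.0]])          # columns are t1 and t2
Bw = Bt                               # same basis for domain and range

# Similarity transform: A' = Bw^-1 A Bt
A_prime = np.linalg.inv(Bw) @ A @ Bt
print(np.round(A_prime, 4))           # [[ 1.6667  0.6667], [-0.6667  0.3333]]

# Test vector x = t1, i.e. x' = [1, 0]
x, x_prime = np.array([0.5, 1.0]), np.array([1.0, 0.0])
y       = A @ x                       # [1.5, 1.0]
y_prime = A_prime @ x_prime           # [ 5/3, -2/3]
print(np.allclose(np.linalg.inv(Bw) @ y, y_prime))   # True
```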
6 Eigenvalues and Eigenvectors
Let A: X → X be a linear transformation. Those vectors z ∈ X which are not equal to zero, and those scalars λ, which satisfy

$$A(\mathbf{z}) = \lambda \mathbf{z}$$

are called eigenvectors and eigenvalues, respectively.

[Figure: the skewed deck again, with x and y = A(x).] Can you find an eigenvector for this transformation?
6 Computing the Eigenvalues
$$\mathbf{A}\mathbf{z} = \lambda \mathbf{z} \qquad [\mathbf{A} - \lambda\mathbf{I}]\mathbf{z} = \mathbf{0} \qquad |[\mathbf{A} - \lambda\mathbf{I}]| = 0$$

Skewing example (θ = 45°):

$$\mathbf{A} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \qquad \left| \begin{bmatrix} 1-\lambda & 1 \\ 0 & 1-\lambda \end{bmatrix} \right| = (1-\lambda)^2 = 0 \quad\Rightarrow\quad \lambda_1 = \lambda_2 = 1$$

$$[\mathbf{A} - \lambda_1\mathbf{I}]\mathbf{z}_1 = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} z_{11} \\ z_{21} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \quad\Rightarrow\quad z_{21} = 0 \quad\Rightarrow\quad \mathbf{z}_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$$

For this transformation there is only one eigenvector.
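The same result can be checked with NumPy's eigen-solver. A sketch; because the eigenvalue is repeated, the returned eigenvector columns point in essentially the same direction:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])            # 45-degree skew

eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)                        # [1. 1.]  -> repeated eigenvalue
print(eigvecs)                        # both columns lie (numerically) along the s1 axis

# Any vector along s1 is unchanged by the skew:
z = np.array([1.0, 0.0])
print(np.allclose(A @ z, 1.0 * z))    # True
```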
6 Diagonalization

Perform a change of basis (similarity transformation) using the eigenvectors as the basis vectors. If the eigenvalues are distinct, the new matrix will be diagonal.

$$\mathbf{B} = \begin{bmatrix} \mathbf{z}_1 & \mathbf{z}_2 & \cdots & \mathbf{z}_n \end{bmatrix} \qquad \{\mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_n\} \text{ eigenvectors} \qquad \{\lambda_1, \lambda_2, \dots, \lambda_n\} \text{ eigenvalues}$$

$$[\mathbf{B}^{-1}\mathbf{A}\mathbf{B}] = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix}$$
6 Example
$$\mathbf{A} = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} \qquad \left| \begin{bmatrix} 1-\lambda & 1 \\ 1 & 1-\lambda \end{bmatrix} \right| = \lambda^2 - 2\lambda = \lambda(\lambda - 2) = 0 \quad\Rightarrow\quad \lambda_1 = 0,\; \lambda_2 = 2$$

$$\lambda_1 = 0: \quad \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} z_{11} \\ z_{21} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \;\Rightarrow\; z_{21} = -z_{11} \;\Rightarrow\; \mathbf{z}_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$

$$\lambda_2 = 2: \quad \begin{bmatrix} -1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} z_{12} \\ z_{22} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \;\Rightarrow\; z_{22} = z_{12} \;\Rightarrow\; \mathbf{z}_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

Diagonal form:

$$\mathbf{A}' = [\mathbf{B}^{-1}\mathbf{A}\mathbf{B}] = \begin{bmatrix} 1/2 & -1/2 \\ 1/2 & 1/2 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & 2 \end{bmatrix}$$
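A quick NumPy confirmation of the diagonalization. This is a sketch; eig may return the eigenvalues in a different order and with normalized eigenvectors, but B⁻¹AB is still diagonal:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0]])

eigvals, B = np.linalg.eig(A)          # columns of B are the eigenvectors
A_diag = np.linalg.inv(B) @ A @ B      # similarity transform with the eigenvector basis
print(eigvals)                         # [2. 0.]  (order may differ from the slides)
print(np.round(A_diag, 10))            # diagonal matrix of the eigenvalues
```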
7

Supervised Hebbian Learning

7 Hebb's Postulate
"When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
D. O. Hebb, 1949
[Figure: neuron anatomy, showing dendrites, cell body, axon, and a synapse onto cell B.]
7 Linear Associator
[Figure: linear associator, input p (R×1), weight matrix W (S×R), output a = purelin(Wp) (S×1).]

$$\mathbf{a} = \mathbf{W}\mathbf{p} \qquad a_i = \sum_{j=1}^{R} w_{ij} p_j$$

Training Set:

$$\{\mathbf{p}_1, \mathbf{t}_1\},\; \{\mathbf{p}_2, \mathbf{t}_2\},\; \dots,\; \{\mathbf{p}_Q, \mathbf{t}_Q\}$$
7 Hebb Rule
$$w_{ij}^{new} = w_{ij}^{old} + f_i(a_{iq})\, g_j(p_{jq})$$

(f_i(a_iq) is the postsynaptic signal; g_j(p_jq) is the presynaptic signal.)

Simplified Form:

$$w_{ij}^{new} = w_{ij}^{old} + a_{iq}\, p_{jq}$$

Supervised Form:

$$w_{ij}^{new} = w_{ij}^{old} + t_{iq}\, p_{jq}$$

Matrix Form:

$$\mathbf{W}^{new} = \mathbf{W}^{old} + \mathbf{t}_q \mathbf{p}_q^T$$
7 Batch Operation
$$\mathbf{W} = \mathbf{t}_1\mathbf{p}_1^T + \mathbf{t}_2\mathbf{p}_2^T + \cdots + \mathbf{t}_Q\mathbf{p}_Q^T = \sum_{q=1}^{Q} \mathbf{t}_q \mathbf{p}_q^T \qquad \text{(zero initial weights)}$$

Matrix Form:

$$\mathbf{W} = \begin{bmatrix} \mathbf{t}_1 & \mathbf{t}_2 & \cdots & \mathbf{t}_Q \end{bmatrix} \begin{bmatrix} \mathbf{p}_1^T \\ \mathbf{p}_2^T \\ \vdots \\ \mathbf{p}_Q^T \end{bmatrix} = \mathbf{T}\mathbf{P}^T$$

where

$$\mathbf{T} = \begin{bmatrix} \mathbf{t}_1 & \mathbf{t}_2 & \cdots & \mathbf{t}_Q \end{bmatrix} \qquad \mathbf{P} = \begin{bmatrix} \mathbf{p}_1 & \mathbf{p}_2 & \cdots & \mathbf{p}_Q \end{bmatrix}$$
7 Performance Analysis
$$\mathbf{a} = \mathbf{W}\mathbf{p}_k = \left( \sum_{q=1}^{Q} \mathbf{t}_q \mathbf{p}_q^T \right) \mathbf{p}_k = \sum_{q=1}^{Q} \mathbf{t}_q (\mathbf{p}_q^T \mathbf{p}_k)$$

Case I: the input patterns are orthonormal.

$$(\mathbf{p}_q^T \mathbf{p}_k) = 1 \quad (q = k), \qquad (\mathbf{p}_q^T \mathbf{p}_k) = 0 \quad (q \neq k)$$

Therefore the network output equals the target:

$$\mathbf{a} = \mathbf{W}\mathbf{p}_k = \mathbf{t}_k$$

Case II: the input patterns are normalized, but not orthogonal.

$$\mathbf{a} = \mathbf{W}\mathbf{p}_k = \mathbf{t}_k + \sum_{q \neq k} \mathbf{t}_q (\mathbf{p}_q^T \mathbf{p}_k)$$

The second term is the error.
7 Example
Banana and apple prototype patterns, normalized:

$$\mathbf{p}_1 = \begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix} \;\rightarrow\; \mathbf{p}_1 = \begin{bmatrix} -0.5774 \\ 0.5774 \\ -0.5774 \end{bmatrix},\; t_1 = -1 \qquad \mathbf{p}_2 = \begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix} \;\rightarrow\; \mathbf{p}_2 = \begin{bmatrix} 0.5774 \\ 0.5774 \\ -0.5774 \end{bmatrix},\; t_2 = 1$$

Weight Matrix (Hebb Rule):

$$\mathbf{W} = \mathbf{T}\mathbf{P}^T = \begin{bmatrix} -1 & 1 \end{bmatrix} \begin{bmatrix} -0.5774 & 0.5774 & -0.5774 \\ 0.5774 & 0.5774 & -0.5774 \end{bmatrix} = \begin{bmatrix} 1.1548 & 0 & 0 \end{bmatrix}$$

Tests:

$$\text{Banana: } \mathbf{W}\mathbf{p}_1 = \begin{bmatrix} 1.1548 & 0 & 0 \end{bmatrix} \begin{bmatrix} -0.5774 \\ 0.5774 \\ -0.5774 \end{bmatrix} = -0.6668$$

$$\text{Apple: } \mathbf{W}\mathbf{p}_2 = \begin{bmatrix} 1.1548 & 0 & 0 \end{bmatrix} \begin{bmatrix} 0.5774 \\ 0.5774 \\ -0.5774 \end{bmatrix} = 0.6668$$
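A NumPy sketch of the same calculation, using the pattern values and targets as reconstructed above. Because the normalized patterns are not orthogonal, the outputs miss the ±1 targets:

```python
import numpy as np

# Prototype patterns and their targets
p1 = np.array([-1.0, 1.0, -1.0])   # banana, target t1 = -1
p2 = np.array([ 1.0, 1.0, -1.0])   # apple,  target t2 = +1

P = np.column_stack([p1, p2]) / np.linalg.norm(p1)   # both columns have norm sqrt(3)
T = np.array([[-1.0, 1.0]])                          # row of targets

W = T @ P.T                        # Hebb rule, W = T P^T
print(np.round(W, 4))              # [[1.1547  0.  0.]]
print(W @ P[:, 0], W @ P[:, 1])    # approx -0.667 (banana), +0.667 (apple)
```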
7 Pseudoinverse Rule - (1)
Performance Index: we want

$$\mathbf{W}\mathbf{p}_q = \mathbf{t}_q, \qquad q = 1, 2, \dots, Q$$

$$F(\mathbf{W}) = \sum_{q=1}^{Q} \| \mathbf{t}_q - \mathbf{W}\mathbf{p}_q \|^2$$

Matrix Form:

$$\mathbf{W}\mathbf{P} = \mathbf{T}, \qquad \mathbf{T} = \begin{bmatrix} \mathbf{t}_1 & \mathbf{t}_2 & \cdots & \mathbf{t}_Q \end{bmatrix}, \qquad \mathbf{P} = \begin{bmatrix} \mathbf{p}_1 & \mathbf{p}_2 & \cdots & \mathbf{p}_Q \end{bmatrix}$$

$$F(\mathbf{W}) = \| \mathbf{T} - \mathbf{W}\mathbf{P} \|^2 = \| \mathbf{E} \|^2, \qquad \| \mathbf{E} \|^2 = \sum_i \sum_j e_{ij}^2$$
7 Pseudoinverse Rule - (2)
$$\mathbf{W}\mathbf{P} = \mathbf{T}$$

Minimize:

$$F(\mathbf{W}) = \| \mathbf{T} - \mathbf{W}\mathbf{P} \|^2 = \| \mathbf{E} \|^2$$

If an inverse exists for P, F(W) can be made zero:

$$\mathbf{W} = \mathbf{T}\mathbf{P}^{-1}$$

When an inverse does not exist, F(W) can be minimized using the pseudoinverse:

$$\mathbf{W} = \mathbf{T}\mathbf{P}^{+}, \qquad \mathbf{P}^{+} = (\mathbf{P}^T\mathbf{P})^{-1}\mathbf{P}^T$$
7 Relationship to the Hebb Rule
Hebb Rule:

$$\mathbf{W} = \mathbf{T}\mathbf{P}^T$$

Pseudoinverse Rule:

$$\mathbf{W} = \mathbf{T}\mathbf{P}^{+}, \qquad \mathbf{P}^{+} = (\mathbf{P}^T\mathbf{P})^{-1}\mathbf{P}^T$$

If the prototype patterns are orthonormal:

$$\mathbf{P}^T\mathbf{P} = \mathbf{I} \quad\Rightarrow\quad \mathbf{P}^{+} = (\mathbf{P}^T\mathbf{P})^{-1}\mathbf{P}^T = \mathbf{P}^T$$

and the two rules coincide.
7 Example
$$\mathbf{p}_1 = \begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix},\; t_1 = -1 \qquad \mathbf{p}_2 = \begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix},\; t_2 = 1 \qquad \mathbf{W} = \mathbf{T}\mathbf{P}^{+} = \begin{bmatrix} -1 & 1 \end{bmatrix} \begin{bmatrix} -1 & 1 \\ 1 & 1 \\ -1 & -1 \end{bmatrix}^{+}$$

$$\mathbf{P}^{+} = (\mathbf{P}^T\mathbf{P})^{-1}\mathbf{P}^T = \begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix}^{-1} \begin{bmatrix} -1 & 1 & -1 \\ 1 & 1 & -1 \end{bmatrix} = \begin{bmatrix} -0.5 & 0.25 & -0.25 \\ 0.5 & 0.25 & -0.25 \end{bmatrix}$$

$$\mathbf{W} = \mathbf{T}\mathbf{P}^{+} = \begin{bmatrix} -1 & 1 \end{bmatrix} \begin{bmatrix} -0.5 & 0.25 & -0.25 \\ 0.5 & 0.25 & -0.25 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}$$

$$\mathbf{W}\mathbf{p}_1 = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix} = -1 \qquad \mathbf{W}\mathbf{p}_2 = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix} = 1$$
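The pseudoinverse rule is one call in NumPy. A sketch using np.linalg.pinv, which computes P⁺; unlike the Hebb rule, the outputs now hit the targets exactly:

```python
import numpy as np

P = np.array([[-1.0,  1.0],
              [ 1.0,  1.0],
              [-1.0, -1.0]])          # columns are p1 and p2
T = np.array([[-1.0, 1.0]])           # targets

W = T @ np.linalg.pinv(P)             # pseudoinverse rule, W = T P+
print(np.round(W, 10))                # [[1. 0. 0.]]
print(np.round(W @ P, 10))            # [[-1.  1.]]  -> exact targets
```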
7 Autoassociative Memory

[Figure: three prototype patterns (p1, t1), (p2, t2), (p3, t3); each is a 30-pixel image whose pixels are coded as ±1, so each p_q = t_q is a 30-element vector.]

[Figure: autoassociator network, input p (30×1), weight matrix W (30×30), symmetric hard limit layer, a = hardlims(Wp).]

$$\mathbf{W} = \mathbf{p}_1\mathbf{p}_1^T + \mathbf{p}_2\mathbf{p}_2^T + \mathbf{p}_3\mathbf{p}_3^T$$
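A sketch of the autoassociator in NumPy. The slides store digit images; random bipolar patterns are used here as stand-ins, so the exact recall quality will differ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in prototypes: three random 30-pixel patterns with +/-1 values.
prototypes = [np.sign(rng.standard_normal(30)) for _ in range(3)]

# Hebb-rule autoassociator: W is the sum of the outer products p_q p_q^T.
W = sum(np.outer(p, p) for p in prototypes)

def recall(p):
    """One pass through the symmetric hard limit layer: a = hardlims(W p)."""
    return np.where(W @ p >= 0, 1.0, -1.0)

# Occlude half of the first prototype (set its first 15 pixels to -1) and recall.
p_occluded = prototypes[0].copy()
p_occluded[:15] = -1.0
print((recall(p_occluded) == prototypes[0]).sum(), "of 30 pixels recovered")
```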
7 Tests
[Figure: recall results for three test conditions: 50% occluded patterns, 67% occluded patterns, and noisy patterns (7 pixels changed).]
7 Variations of Hebbian Learning
Basic Rule:
$$\mathbf{W}^{new} = \mathbf{W}^{old} + \mathbf{t}_q \mathbf{p}_q^T$$

Learning Rate:
$$\mathbf{W}^{new} = \mathbf{W}^{old} + \alpha\, \mathbf{t}_q \mathbf{p}_q^T$$

Smoothing (decay term γ):
$$\mathbf{W}^{new} = \mathbf{W}^{old} + \alpha\, \mathbf{t}_q \mathbf{p}_q^T - \gamma\, \mathbf{W}^{old} = (1 - \gamma)\mathbf{W}^{old} + \alpha\, \mathbf{t}_q \mathbf{p}_q^T$$

Delta Rule:
$$\mathbf{W}^{new} = \mathbf{W}^{old} + \alpha\, (\mathbf{t}_q - \mathbf{a}_q)\, \mathbf{p}_q^T$$

Unsupervised:
$$\mathbf{W}^{new} = \mathbf{W}^{old} + \alpha\, \mathbf{a}_q \mathbf{p}_q^T$$
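A sketch of the delta rule as an iterative update. The learning rate, epoch count, and the reuse of the banana/apple patterns are my choices; with enough passes it converges toward the pseudoinverse solution:

```python
import numpy as np

def delta_rule(P, T, alpha=0.1, epochs=200):
    """Iterative delta-rule training: W_new = W_old + alpha (t - a) p^T."""
    S, R = T.shape[0], P.shape[0]
    W = np.zeros((S, R))
    for _ in range(epochs):
        for q in range(P.shape[1]):
            p, t = P[:, q], T[:, q]
            a = W @ p                           # linear layer output
            W = W + alpha * np.outer(t - a, p)  # move toward the target
    return W

# Banana/apple patterns from the earlier example
P = np.array([[-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0]])
T = np.array([[-1.0, 1.0]])
W = delta_rule(P, T)
print(np.round(W @ P, 3))   # approaches [[-1.  1.]], the pseudoinverse solution
```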
8

Performance Surfaces

8 Taylor Series Expansion

$$F(x) = F(x^*) + \left.\frac{dF(x)}{dx}\right|_{x=x^*} (x - x^*) + \frac{1}{2} \left.\frac{d^2F(x)}{dx^2}\right|_{x=x^*} (x - x^*)^2 + \cdots + \frac{1}{n!} \left.\frac{d^nF(x)}{dx^n}\right|_{x=x^*} (x - x^*)^n + \cdots$$
8 Example
$$F(x) = e^{-x}$$

Taylor series of F(x) about x* = 0:

$$F(x) = e^{-x} = e^{-0} - e^{-0}(x - 0) + \frac{1}{2} e^{-0} (x - 0)^2 - \frac{1}{6} e^{-0} (x - 0)^3 + \cdots$$

$$F(x) = 1 - x + \frac{1}{2} x^2 - \frac{1}{6} x^3 + \cdots$$

Taylor series approximations:

$$F(x) \approx F_0(x) = 1$$

$$F(x) \approx F_1(x) = 1 - x$$

$$F(x) \approx F_2(x) = 1 - x + \frac{1}{2} x^2$$
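A few lines of NumPy show how each added term tightens the approximation near the expansion point (the sample points are my choice):

```python
import numpy as np

def F(x):  return np.exp(-x)
def F0(x): return np.ones_like(x)
def F1(x): return 1.0 - x
def F2(x): return 1.0 - x + 0.5 * x**2

x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
# Near x* = 0 each added term reduces the error; far from 0 the approximations diverge.
for name, f in [("F0", F0), ("F1", F1), ("F2", F2)]:
    print(name, np.round(np.abs(F(x) - f(x)), 4))
```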
8 Plot of Approximations

[Figure: the Taylor approximations F_0(x), F_1(x), and F_2(x) of F(x) = e^{-x}, plotted for -2 ≤ x ≤ 2.]
8 Vector Case

$$F(\mathbf{x}) = F(x_1, x_2, \dots, x_n)$$

$$F(\mathbf{x}) = F(\mathbf{x}^*) + \left.\frac{\partial F(\mathbf{x})}{\partial x_1}\right|_{\mathbf{x}=\mathbf{x}^*} (x_1 - x_1^*) + \left.\frac{\partial F(\mathbf{x})}{\partial x_2}\right|_{\mathbf{x}=\mathbf{x}^*} (x_2 - x_2^*) + \cdots + \left.\frac{\partial F(\mathbf{x})}{\partial x_n}\right|_{\mathbf{x}=\mathbf{x}^*} (x_n - x_n^*)$$

$$\;+\; \frac{1}{2} \left.\frac{\partial^2 F(\mathbf{x})}{\partial x_1^2}\right|_{\mathbf{x}=\mathbf{x}^*} (x_1 - x_1^*)^2 + \frac{1}{2} \left.\frac{\partial^2 F(\mathbf{x})}{\partial x_1 \partial x_2}\right|_{\mathbf{x}=\mathbf{x}^*} (x_1 - x_1^*)(x_2 - x_2^*) + \cdots$$
8 Matrix Form
$$F(\mathbf{x}) = F(\mathbf{x}^*) + \nabla F(\mathbf{x})^T\big|_{\mathbf{x}=\mathbf{x}^*} (\mathbf{x} - \mathbf{x}^*) + \frac{1}{2} (\mathbf{x} - \mathbf{x}^*)^T\, \nabla^2 F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}^*}\, (\mathbf{x} - \mathbf{x}^*) + \cdots$$

Gradient:

$$\nabla F(\mathbf{x}) = \begin{bmatrix} \frac{\partial F(\mathbf{x})}{\partial x_1} \\ \frac{\partial F(\mathbf{x})}{\partial x_2} \\ \vdots \\ \frac{\partial F(\mathbf{x})}{\partial x_n} \end{bmatrix}$$

Hessian:

$$\nabla^2 F(\mathbf{x}) = \begin{bmatrix} \frac{\partial^2 F(\mathbf{x})}{\partial x_1^2} & \frac{\partial^2 F(\mathbf{x})}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 F(\mathbf{x})}{\partial x_1 \partial x_n} \\ \frac{\partial^2 F(\mathbf{x})}{\partial x_2 \partial x_1} & \frac{\partial^2 F(\mathbf{x})}{\partial x_2^2} & \cdots & \frac{\partial^2 F(\mathbf{x})}{\partial x_2 \partial x_n} \\ \vdots & \vdots & & \vdots \\ \frac{\partial^2 F(\mathbf{x})}{\partial x_n \partial x_1} & \frac{\partial^2 F(\mathbf{x})}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 F(\mathbf{x})}{\partial x_n^2} \end{bmatrix}$$
8 Directional Derivatives

First derivative (slope) of F(x) along the x_i axis: ∂F(x)/∂x_i (the i-th element of the gradient).

Second derivative (curvature) of F(x) along the x_i axis: ∂²F(x)/∂x_i² (the i,i element of the Hessian).

First derivative (slope) of F(x) along a vector p:

$$\frac{\mathbf{p}^T \nabla F(\mathbf{x})}{\|\mathbf{p}\|}$$

Second derivative (curvature) of F(x) along a vector p:

$$\frac{\mathbf{p}^T \nabla^2 F(\mathbf{x})\, \mathbf{p}}{\|\mathbf{p}\|^2}$$
8 Example
$$F(\mathbf{x}) = x_1^2 + 2 x_1 x_2 + 2 x_2^2 \qquad \mathbf{x}^* = \begin{bmatrix} 0.5 \\ 0 \end{bmatrix} \qquad \mathbf{p} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$

$$\nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}^*} = \begin{bmatrix} 2x_1 + 2x_2 \\ 2x_1 + 4x_2 \end{bmatrix}_{\mathbf{x}=\mathbf{x}^*} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

$$\frac{\mathbf{p}^T \nabla F(\mathbf{x})}{\|\mathbf{p}\|} = \frac{\begin{bmatrix} 1 & -1 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix}}{\sqrt{2}} = \frac{0}{\sqrt{2}} = 0$$
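The same directional-derivative computation in NumPy. A sketch; the Hessian and the second directional derivative along p are added for completeness:

```python
import numpy as np

def F(x):
    return x[0]**2 + 2*x[0]*x[1] + 2*x[1]**2

def grad_F(x):
    # Analytic gradient of F
    return np.array([2*x[0] + 2*x[1], 2*x[0] + 4*x[1]])

x_star = np.array([0.5, 0.0])
p = np.array([1.0, -1.0])

g = grad_F(x_star)
slope = p @ g / np.linalg.norm(p)       # first directional derivative along p
print(g, slope)                         # [1. 1.]  0.0  -> no slope along p

H = np.array([[2.0, 2.0],
              [2.0, 4.0]])              # Hessian of F (constant for a quadratic)
curvature = p @ H @ p / (p @ p)         # second directional derivative along p
print(curvature)                        # 1.0
```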
8 Plots
[Figure: surface and contour plots of F(x) = x_1^2 + 2 x_1 x_2 + 2 x_2^2, with the directional derivative at x* shown for several directions (values from 0.0 up to 1.4).]
8 Minima
Strong Minimum
The point x* is a strong minimum of F(x) if a scalar δ > 0 exists such that F(x*) < F(x* + Δx) for all Δx such that δ > ||Δx|| > 0.

Global Minimum
The point x* is a unique global minimum of F(x) if F(x*) < F(x* + Δx) for all Δx ≠ 0.

Weak Minimum
The point x* is a weak minimum of F(x) if it is not a strong minimum, and a scalar δ > 0 exists such that F(x*) ≤ F(x* + Δx) for all Δx such that δ > ||Δx|| > 0.
8 Scalar Example
$$F(x) = 3x^4 - 7x^2 - \frac{1}{2}x + 6$$

[Figure: plot of F(x) over -2 ≤ x ≤ 2, showing a strong maximum near x = 0, a strong local minimum, and the global minimum.]
8 Vector Example
$$F(\mathbf{x}) = (x_2 - x_1)^4 + 8 x_1 x_2 - x_1 + x_2 + 3 \qquad\qquad F(\mathbf{x}) = (x_1^2 - 1.5 x_1 x_2 + 2 x_2^2)\, x_1^2$$

[Figure: contour and surface plots of the two example functions over -2 ≤ x_1, x_2 ≤ 2.]
8 First-Order Optimality Condition
$$F(\mathbf{x}) = F(\mathbf{x}^* + \Delta\mathbf{x}) = F(\mathbf{x}^*) + \nabla F(\mathbf{x})^T\big|_{\mathbf{x}=\mathbf{x}^*} \Delta\mathbf{x} + \frac{1}{2} \Delta\mathbf{x}^T\, \nabla^2 F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}^*}\, \Delta\mathbf{x} + \cdots \qquad (\Delta\mathbf{x} = \mathbf{x} - \mathbf{x}^*)$$

For small Δx:

$$F(\mathbf{x}^* + \Delta\mathbf{x}) \approx F(\mathbf{x}^*) + \nabla F(\mathbf{x})^T\big|_{\mathbf{x}=\mathbf{x}^*} \Delta\mathbf{x}$$

If x* is a minimum, this implies:

$$\nabla F(\mathbf{x})^T\big|_{\mathbf{x}=\mathbf{x}^*} \Delta\mathbf{x} \geq 0$$

If instead $\nabla F(\mathbf{x})^T\big|_{\mathbf{x}=\mathbf{x}^*} \Delta\mathbf{x} > 0$, then

$$F(\mathbf{x}^* - \Delta\mathbf{x}) \approx F(\mathbf{x}^*) - \nabla F(\mathbf{x})^T\big|_{\mathbf{x}=\mathbf{x}^*} \Delta\mathbf{x} < F(\mathbf{x}^*)$$

But this would imply that x* is not a minimum. Therefore

$$\nabla F(\mathbf{x})^T\big|_{\mathbf{x}=\mathbf{x}^*} \Delta\mathbf{x} = 0$$

Since this must be true for every Δx,

$$\nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}^*} = \mathbf{0}$$
8 Second-Order Condition
If the first-order condition is satisfied (zero gradient), then

$$F(\mathbf{x}^* + \Delta\mathbf{x}) = F(\mathbf{x}^*) + \frac{1}{2} \Delta\mathbf{x}^T\, \nabla^2 F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}^*}\, \Delta\mathbf{x} + \cdots$$

A strong minimum will exist at x* if

$$\Delta\mathbf{x}^T\, \nabla^2 F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}^*}\, \Delta\mathbf{x} > 0 \quad \text{for any } \Delta\mathbf{x} \neq \mathbf{0}.$$

Therefore the Hessian matrix must be positive definite. A matrix A is positive definite if

$$\mathbf{z}^T \mathbf{A} \mathbf{z} > 0 \quad \text{for any } \mathbf{z} \neq \mathbf{0}.$$

This is a sufficient condition for optimality.

A necessary condition is that the Hessian matrix be positive semidefinite. A matrix A is positive semidefinite if

$$\mathbf{z}^T \mathbf{A} \mathbf{z} \geq 0 \quad \text{for any } \mathbf{z}.$$
8 Example
$$F(\mathbf{x}) = x_1^2 + 2 x_1 x_2 + 2 x_2^2 + x_1$$

$$\nabla F(\mathbf{x}) = \begin{bmatrix} 2x_1 + 2x_2 + 1 \\ 2x_1 + 4x_2 \end{bmatrix} = \mathbf{0} \quad\Rightarrow\quad \mathbf{x}^* = \begin{bmatrix} -1 \\ 0.5 \end{bmatrix}$$

$$\nabla^2 F(\mathbf{x}) = \begin{bmatrix} 2 & 2 \\ 2 & 4 \end{bmatrix} \quad \text{(not a function of } \mathbf{x} \text{ in this case)}$$

To test the definiteness, check the eigenvalues of the Hessian. If the eigenvalues are all greater than zero, the Hessian is positive definite.

$$\left| \nabla^2 F(\mathbf{x}) - \lambda\mathbf{I} \right| = \left| \begin{bmatrix} 2-\lambda & 2 \\ 2 & 4-\lambda \end{bmatrix} \right| = \lambda^2 - 6\lambda + 4 = (\lambda - 0.76)(\lambda - 5.24)$$

$$\lambda = 0.76,\; 5.24 \qquad \text{Both eigenvalues are positive, therefore } \mathbf{x}^* \text{ is a strong minimum.}$$
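Both the stationary point and the definiteness test reduce to a couple of NumPy calls. A sketch; the quadratic-form coefficients are read off the example above:

```python
import numpy as np

# F(x) = x1^2 + 2 x1 x2 + 2 x2^2 + x1  written as  0.5 x^T A x + d^T x
A = np.array([[2.0, 2.0],
              [2.0, 4.0]])        # Hessian
d = np.array([1.0, 0.0])

x_star = np.linalg.solve(A, -d)   # gradient A x + d = 0
print(x_star)                     # [-1.   0.5]

eigvals = np.linalg.eigvalsh(A)   # Hessian is symmetric
print(np.round(eigvals, 2))       # [0.76 5.24] -> positive definite, strong minimum
```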
8 Quadratic Functions
$$F(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T \mathbf{A} \mathbf{x} + \mathbf{d}^T \mathbf{x} + c \qquad \text{(symmetric } \mathbf{A}\text{)}$$

Gradient and Hessian. Useful properties of gradients:

$$\nabla(\mathbf{h}^T\mathbf{x}) = \nabla(\mathbf{x}^T\mathbf{h}) = \mathbf{h}$$

$$\nabla(\mathbf{x}^T\mathbf{Q}\mathbf{x}) = \mathbf{Q}\mathbf{x} + \mathbf{Q}^T\mathbf{x} = 2\mathbf{Q}\mathbf{x} \quad \text{(for symmetric } \mathbf{Q}\text{)}$$

Gradient of the quadratic function:

$$\nabla F(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{d}$$

Hessian of the quadratic function:

$$\nabla^2 F(\mathbf{x}) = \mathbf{A}$$
8 Eigensystem of the Hessian
Consider a quadratic function which has a stationary point at the origin, and whose value there is zero:

$$F(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T \mathbf{A} \mathbf{x}$$

Perform a similarity transform on the Hessian matrix, using the eigenvectors as the new basis vectors:

$$\mathbf{B} = \begin{bmatrix} \mathbf{z}_1 & \mathbf{z}_2 & \cdots & \mathbf{z}_n \end{bmatrix}$$

Since the Hessian matrix is symmetric, its eigenvectors are orthogonal, so (with normalized eigenvectors)

$$\mathbf{B}^{-1} = \mathbf{B}^T$$

$$\mathbf{A}' = [\mathbf{B}^T \mathbf{A} \mathbf{B}] = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} = \mathbf{\Lambda} \qquad \mathbf{A} = \mathbf{B}\mathbf{\Lambda}\mathbf{B}^T$$
8 Second Directional Derivative
$$\frac{\mathbf{p}^T \nabla^2 F(\mathbf{x})\, \mathbf{p}}{\|\mathbf{p}\|^2} = \frac{\mathbf{p}^T \mathbf{A} \mathbf{p}}{\|\mathbf{p}\|^2}$$

Represent p with respect to the eigenvectors (the new basis): p = Bc.

$$\frac{\mathbf{p}^T \mathbf{A} \mathbf{p}}{\|\mathbf{p}\|^2} = \frac{\mathbf{c}^T \mathbf{B}^T (\mathbf{B}\mathbf{\Lambda}\mathbf{B}^T) \mathbf{B} \mathbf{c}}{\mathbf{c}^T \mathbf{B}^T \mathbf{B} \mathbf{c}} = \frac{\mathbf{c}^T \mathbf{\Lambda} \mathbf{c}}{\mathbf{c}^T \mathbf{c}} = \frac{\sum_{i=1}^{n} \lambda_i c_i^2}{\sum_{i=1}^{n} c_i^2}$$

$$\lambda_{min} \leq \frac{\mathbf{p}^T \mathbf{A} \mathbf{p}}{\|\mathbf{p}\|^2} \leq \lambda_{max}$$
8 Eigenvector (Largest Eigenvalue)
$$\mathbf{p} = \mathbf{z}_{max} \qquad \mathbf{c} = \mathbf{B}^T\mathbf{p} = \mathbf{B}^T\mathbf{z}_{max} = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \quad \text{(a 1 in the position of } \lambda_{max}\text{)}$$

$$\frac{\mathbf{z}_{max}^T \mathbf{A}\, \mathbf{z}_{max}}{\|\mathbf{z}_{max}\|^2} = \frac{\sum_{i=1}^{n} \lambda_i c_i^2}{\sum_{i=1}^{n} c_i^2} = \lambda_{max}$$

[Figure: contour plot with the eigenvector directions z_1 and z_2, marking the directions of minimum and maximum curvature.]

The eigenvalues represent curvature (second derivatives) along the eigenvectors (the principal axes).
8 Circular Hollow
$$F(\mathbf{x}) = x_1^2 + x_2^2 = \frac{1}{2} \mathbf{x}^T \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \mathbf{x}$$

$$\nabla^2 F(\mathbf{x}) = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \qquad \lambda_1 = 2,\; \mathbf{z}_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \qquad \lambda_2 = 2,\; \mathbf{z}_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$

(Any two independent vectors in the plane would work as eigenvectors.)

[Figure: circular contour and surface plots of F(x).]
8 Elliptical Hollow
$$F(\mathbf{x}) = x_1^2 + x_1 x_2 + x_2^2 = \frac{1}{2} \mathbf{x}^T \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \mathbf{x}$$

$$\nabla^2 F(\mathbf{x}) = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \qquad \lambda_1 = 1,\; \mathbf{z}_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix} \qquad \lambda_2 = 3,\; \mathbf{z}_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

[Figure: elliptical contour and surface plots of F(x).]
8 Elongated Saddle
$$F(\mathbf{x}) = -\frac{1}{4} x_1^2 - \frac{3}{2} x_1 x_2 - \frac{1}{4} x_2^2 = \frac{1}{2} \mathbf{x}^T \begin{bmatrix} -0.5 & -1.5 \\ -1.5 & -0.5 \end{bmatrix} \mathbf{x}$$

$$\nabla^2 F(\mathbf{x}) = \begin{bmatrix} -0.5 & -1.5 \\ -1.5 & -0.5 \end{bmatrix} \qquad \lambda_1 = 1,\; \mathbf{z}_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix} \qquad \lambda_2 = -2,\; \mathbf{z}_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

[Figure: contour and surface plots of the saddle; curvature is positive along z_1 and negative along z_2.]
8 Stationary Valley
$$F(\mathbf{x}) = \frac{1}{2} x_1^2 - x_1 x_2 + \frac{1}{2} x_2^2 = \frac{1}{2} \mathbf{x}^T \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} \mathbf{x}$$

$$\nabla^2 F(\mathbf{x}) = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} \qquad \lambda_1 = 2,\; \mathbf{z}_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix} \qquad \lambda_2 = 0,\; \mathbf{z}_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

[Figure: contour and surface plots; the function has zero curvature along z_2, forming a stationary valley of weak minima.]
8 Quadratic Function Summary
If the eigenvalues of the Hessian matrix are all positive, the
function will have a single strong minimum.
If the eigenvalues are all negative, the function will have a
single strong maximum.
If some eigenvalues are positive and other eigenvalues are
negative, the function will have a single saddle point.
If the eigenvalues are all nonnegative, but some
eigenvalues are zero, then the function will either have a
weak minimum or will have no stationary point.
If the eigenvalues are all nonpositive, but some
eigenvalues are zero, then the function will either have a
weak maximum or will have no stationary point.

Stationary Point:

$$\mathbf{x}^* = -\mathbf{A}^{-1}\mathbf{d}$$
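The summary above translates directly into a small eigenvalue test. A sketch that classifies each of the chapter's example Hessians and, where it exists, computes the stationary point x* = -A⁻¹d (the function name and thresholds are my choices):

```python
import numpy as np

def analyze_quadratic(A, d):
    """Classify F(x) = 0.5 x^T A x + d^T x + c by the eigenvalues of its Hessian A."""
    eigvals = np.linalg.eigvalsh(A)
    if np.all(eigvals > 0):
        kind = "strong minimum"
    elif np.all(eigvals < 0):
        kind = "strong maximum"
    elif np.any(eigvals > 0) and np.any(eigvals < 0):
        kind = "saddle point"
    else:
        kind = "weak minimum/maximum or no stationary point"
    # Stationary point x* = -A^-1 d exists only when A is invertible
    x_star = -np.linalg.solve(A, d) if abs(np.linalg.det(A)) > 1e-12 else None
    return eigvals, kind, x_star

# The four example Hessians from this chapter (with d = 0)
for A in [np.array([[2.0, 0.0], [0.0, 2.0]]),      # circular hollow
          np.array([[2.0, 1.0], [1.0, 2.0]]),      # elliptical hollow
          np.array([[-0.5, -1.5], [-1.5, -0.5]]),  # elongated saddle
          np.array([[1.0, -1.0], [-1.0, 1.0]])]:   # stationary valley
    print(analyze_quadratic(A, np.zeros(2)))
```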
