
Following conventions similar to those of the ML class, let $a_i^{(\ell)}$ denote the activation level of unit $i$ of layer $\ell$, $a^{(\ell)}$ the column vector whose $i$-th entry is $a_i^{(\ell)}$, $g(\cdot)$ the activation function (usually a sigmoid), $\Theta_{ij}^{(\ell)}$ the weight connecting unit $j$ of layer $\ell$ to unit $i$ of layer $\ell+1$, $\Theta^{(\ell)}$ the weight matrix whose $(i,j)$-th entry is $\Theta_{ij}^{(\ell)}$, $x = a^{(1)}$ the input vector, and $y$ the target vector. For convenience, we denote the size of each layer (excluding the bias unit) by $N^{(\ell)}$. When the $i$-th unit is not a bias unit (i.e. $i \neq 0$ if we follow the ML class's convention), define the net input to (or excitation level of) unit $i$ of a hidden or output layer as

$$ z_i^{(\ell)} = \sum_{j=0}^{N^{(\ell-1)}} \Theta_{ij}^{(\ell-1)} a_j^{(\ell-1)}, \qquad 1 \le i \le N^{(\ell)},\ \ell \ge 2. $$

(Be careful with the ranges of the indices $i$, $j$, $\ell$ in the above.) That is, $z_i^{(\ell)}$ is the argument we supply to $g$ in order to obtain the activation $a_i^{(\ell)}$: if $\ell \ge 2$ and $i \neq 0$,

$$ a_i^{(\ell)} = g(z_i^{(\ell)}) = g\!\left( \sum_{j=0}^{N^{(\ell-1)}} \Theta_{ij}^{(\ell-1)} a_j^{(\ell-1)} \right), \qquad 1 \le i \le N^{(\ell)},\ \ell \ge 2. $$
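To make the indexing concrete, here is a minimal NumPy sketch of this forward computation for a single sample. The function name, the sigmoid choice, and the layout of the weight matrices as a Python list are illustrative assumptions, not part of the note.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(x, Thetas):
    """Forward pass for a single sample.

    x      : input vector a^(1) (length N^(1), no bias entry).
    Thetas : list [Theta^(1), ..., Theta^(L-1)]; Theta^(l) has shape
             (N^(l+1), N^(l) + 1), its first column acting on the bias unit.
    Returns the net inputs zs = [z^(2), ..., z^(L)] and
            the activations acts = [a^(1), ..., a^(L)].
    """
    acts, zs = [x], []
    a = x
    for Theta in Thetas:
        a_bias = np.concatenate(([1.0], a))  # prepend the bias unit a_0 = 1
        z = Theta @ a_bias                   # z^(l+1)_i = sum_j Theta^(l)_ij a^(l)_j
        a = sigmoid(z)                       # a^(l+1) = g(z^(l+1))
        zs.append(z)
        acts.append(a)
    return zs, acts
```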

In neural network literature, the net input is usually denoted as $\mathrm{net}_i^{(\ell)}$, but we follow the notation of the ML class here.

We have suppressed indices to training samples in the above. When there is more than one sample, let $x(s)$, $y(s)$, $a^{(\ell)}(s)$ and $z^{(\ell)}(s)$ denote the relevant quantities for the $s$-th sample. This deviates slightly from the notation of the ML class, where what we call $y_i(s)$ here (i.e. the $i$-th entry of the output vector for sample $s$), for example, is denoted by $y_i^{(s)}$ instead. In this short note, superscripts are always reserved for layer indices and subscripts are reserved for row/column indices of vectors and matrices. The index $s$ is neither superscripted nor subscripted.

Suppose there are $L$ layers in total, including the input and output layers. Let $J$ be the training error for the network. For instance, if we want to evaluate the sum of squared errors, we may take $J = \sum_{s=1}^{\#\text{samples}} \| a^{(L)}(s) - y(s) \|^2$. In the ML class, $J$ is taken as

$$ J = -\frac{1}{m} \sum_{s=1}^{m} \sum_{k=1}^{N^{(L)}} \left[ y_k(s) \log a_k^{(L)}(s) + \bigl(1 - y_k(s)\bigr) \log\bigl(1 - a_k^{(L)}(s)\bigr) \right] + \frac{\lambda}{2m} \|\theta\|^2 = \frac{1}{m} \sum_{s=1}^{m} E(s) + \frac{\lambda}{2m} \|\theta\|^2 \quad \text{(say)}, $$

where $m$ is the number of training samples (i.e. the size of the training set), $\theta$ is the unrolled vector obtained by stacking the columns of all $\Theta^{(\ell)}$ together, and $E(s)$ is the training error (without regularization) for sample $s$. (Again, be careful with the range of the indices.)

To train the neural network, we mean to find the weights that minimize $J$. Many optimization methods can be used to find the (locally) optimal weights. However, most of them (gradient descent, CG or BFGS, to name a few) require calculations of not only $J$, but also the gradient of $J$. Thus we need some way to evaluate $\partial J / \partial \Theta_{ij}^{(\ell)}$. Practically, the error function can be broken down into training errors for individual samples, and penalties for individual weights. Therefore the gradient is evaluated and accumulated sample-by-sample and penalty-by-penalty. For instance, in the last example, we have

$$ \frac{\partial J}{\partial \Theta_{ij}^{(\ell)}} = \frac{1}{m} \sum_{s=1}^{m} \frac{\partial E(s)}{\partial \Theta_{ij}^{(\ell)}} + \frac{\lambda}{m} \Theta_{ij}^{(\ell)}. $$
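The sketch below illustrates this sample-by-sample and penalty-by-penalty accumulation. The helper `backprop_sample` (returning $E(s)$ and the matrices $\partial E(s)/\partial \Theta^{(\ell)}$ for one sample) is hypothetical here and is sketched at the end of this note; following the ML class convention, the bias column is assumed not to be penalized.

```python
import numpy as np

def regularized_cost_and_grad(X, Y, Thetas, lam, backprop_sample):
    """Accumulate J and dJ/dTheta^(l) over the m training samples.

    X, Y            : lists of the m input vectors x(s) and target vectors y(s).
    Thetas          : list of weight matrices Theta^(l), bias column first.
    lam             : the regularization parameter lambda.
    backprop_sample : assumed helper, (x, y, Thetas) -> (E(s), [dE(s)/dTheta^(l)]).
    """
    m = len(X)
    J = 0.0
    grads = [np.zeros_like(Th) for Th in Thetas]
    for x, y in zip(X, Y):                     # sample-by-sample
        E, dE = backprop_sample(x, y, Thetas)
        J += E / m
        for l, dTheta in enumerate(dE):
            grads[l] += dTheta / m
    for l, Theta in enumerate(Thetas):         # penalty-by-penalty
        penalty = Theta.copy()
        penalty[:, 0] = 0.0                    # ML-class convention: bias column not penalized
        J += lam / (2.0 * m) * np.sum(penalty ** 2)
        grads[l] += lam / m * penalty
    return J, grads
```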

If we can evaluate the gradient of $E(s)$ for each $s$, the gradient of the overall training error can also be obtained easily. The backpropagation algorithm is essentially a way to compute, by the chain rule, the gradients $\partial E(s)/\partial \Theta_{ij}^{(\ell)}$ of the training error $E(s)$ for a single sample $s$. Let us first define

$$ \delta_i^{(\ell)}(s) = \frac{\partial E(s)}{\partial z_i^{(\ell)}} \qquad (\text{with } \ell \ge 2 \text{ and } 1 \le i \le N^{(\ell)}). $$

For convenience, we drop the index $s$ in the sequel. Let $\delta^{(\ell)} = (\delta_1^{(\ell)}, \delta_2^{(\ell)}, \ldots, \delta_{N^{(\ell)}}^{(\ell)})^T$. By the chain rule,

$$ \frac{\partial E}{\partial \Theta_{ij}^{(\ell)}} = \frac{\partial E}{\partial z_i^{(\ell+1)}} \, \frac{\partial z_i^{(\ell+1)}}{\partial \Theta_{ij}^{(\ell)}} = \delta_i^{(\ell+1)} a_j^{(\ell)} \qquad (\ell \ge 1,\ 1 \le i \le N^{(\ell+1)} \text{ and } 0 \le j \le N^{(\ell)}). $$
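Equivalently, in matrix form the whole gradient for layer $\ell$ is the outer product $\delta^{(\ell+1)} (a^{(\ell)})^T$, with the bias entry $a_0^{(\ell)} = 1$ prepended to $a^{(\ell)}$. A small sketch (the variable names are illustrative):

```python
import numpy as np

def layer_gradient(delta_next, a):
    """dE/dTheta^(l) = delta^(l+1) (a^(l))^T, with the bias unit prepended to a^(l).

    delta_next : vector delta^(l+1), length N^(l+1) (no bias entry).
    a          : activation vector a^(l), length N^(l) (no bias entry).
    Returns a matrix of shape (N^(l+1), N^(l) + 1).
    """
    a_bias = np.concatenate(([1.0], a))   # j runs from 0 (bias) to N^(l)
    return np.outer(delta_next, a_bias)   # (i, j) entry = delta_i^(l+1) * a_j^(l)
```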

(Note that in the above, the index $j$ may refer to the bias unit, but $i$ does not.) So, we are able to compute $\partial E/\partial \Theta_{ij}^{(\ell)}$ provided that we know how to compute $\delta^{(\ell+1)}$. But how do we compute $\delta^{(\ell+1)}$? For the output layer, the quantity $\delta_i^{(L)}$ can be computed directly:

$$ \delta_i^{(L)} = \frac{\partial E}{\partial z_i^{(L)}} = \frac{\partial E}{\partial a_i^{(L)}} \, \frac{d a_i^{(L)}}{d z_i^{(L)}} = \frac{\partial E}{\partial a_i^{(L)}} \, g'(z_i^{(L)}) \qquad (1 \le i \le N^{(L)}). $$
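For example, with the cross-entropy error $E$ above and sigmoid units, $\partial E/\partial a_i^{(L)} = -y_i/a_i^{(L)} + (1 - y_i)/(1 - a_i^{(L)})$ and $g'(z_i^{(L)}) = a_i^{(L)}\bigl(1 - a_i^{(L)}\bigr)$, so the expression simplifies to $\delta_i^{(L)} = a_i^{(L)} - y_i$. This simplification is not needed for the derivation, but it is the form usually coded in practice.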

For hidden layers, by the chain rule, we have

$$ \delta_j^{(\ell+1)} = \frac{\partial E}{\partial z_j^{(\ell+1)}} = \sum_{i=1}^{N^{(\ell+2)}} \frac{\partial E}{\partial z_i^{(\ell+2)}} \, \frac{\partial z_i^{(\ell+2)}}{\partial a_j^{(\ell+1)}} \, \frac{\partial a_j^{(\ell+1)}}{\partial z_j^{(\ell+1)}} = \sum_{i=1}^{N^{(\ell+2)}} \delta_i^{(\ell+2)} \, \Theta_{ij}^{(\ell+1)} \, g'(z_j^{(\ell+1)}) \qquad (2 \le \ell+1 \le L-1). $$

(Note: Be very careful with the indices here. In the above, both $i$ and $j$ are nonzero. Hence the column of $\Theta^{(\ell+1)}$ that corresponds to the bias unit, i.e. the first column if we follow the ML class's convention, is not involved in the calculation.) So, given that we have already evaluated and stored the values of $z^{(\ell+1)}$ and $a^{(\ell+1)}$ for each layer in the forward pass, we can use the above recurrence relation to compute $\delta^{(\ell+1)}$ in a backward manner (where $\ell+1$ runs from $L-1$ down to $2$).
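Putting the pieces together, a per-sample backpropagation routine might look as follows. This is only a sketch under stated assumptions: sigmoid units and the cross-entropy $E(s)$ from above (so that $\delta^{(L)} = a^{(L)} - y$); the function and variable names are not from the note.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_sample(x, y, Thetas):
    """Return E(s) and the list of matrices dE(s)/dTheta^(l) for one sample.

    Thetas[l-1] is Theta^(l) of shape (N^(l+1), N^(l) + 1), bias column first.
    Assumes sigmoid units and the cross-entropy error, so delta^(L) = a^(L) - y.
    """
    # Forward pass: store a^(l) for every layer (z^(l) is not needed again for a sigmoid).
    acts = [x]
    a = x
    for Theta in Thetas:
        a_bias = np.concatenate(([1.0], a))
        a = sigmoid(Theta @ a_bias)
        acts.append(a)

    E = -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))   # E(s), without regularization

    grads = [None] * len(Thetas)
    delta = a - y                                           # delta^(L) for sigmoid + cross-entropy
    for l in range(len(Thetas), 0, -1):                     # layer index l+1 runs from L down to 2
        a_bias = np.concatenate(([1.0], acts[l - 1]))
        grads[l - 1] = np.outer(delta, a_bias)              # dE/dTheta^(l) = delta^(l+1) (a^(l))^T
        if l > 1:
            g_prime = acts[l - 1] * (1 - acts[l - 1])        # g'(z^(l)) for a sigmoid
            delta = (Thetas[l - 1][:, 1:].T @ delta) * g_prime   # bias column dropped, as noted above
    return E, grads
```

Feeding this `backprop_sample` into the accumulation loop sketched earlier yields $J$ and its full gradient, which can then be handed to gradient descent, CG or BFGS.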
