
Chapter 3: Supervised Learning: Neural Network


Contents from lesson plan:
• Introduction to perceptron learning, Model representation
• Gradient checking, Back propagation algorithm
• Multi-class classification, and Application- classifying digits
• Support vector machines
• Nonlinear systems
Artificial neural network - representation of a neuron
•In an artificial neural network, a neuron is a logistic unit
• Feed input via input wires
• Logistic unit does computation
• Sends output down output wires
•That logistic computation is just like our previous logistic regression hypothesis calculation
•Very simple model of a neuron's computation
• Often good to include an x0 input - the bias unit
• This is equal to 1
•This is an artificial neuron with a sigmoid (logistic) activation function
• The Ɵ vector may also be called the weights of the model
•The above diagram is a single neuron - a minimal computational sketch is given below
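As an illustrative sketch (not part of the original slides), the single-neuron computation can be written in a few lines of Python/NumPy; the weight values and inputs below are made-up placeholders:

import numpy as np

def sigmoid(z):
    # Logistic activation function g(z) = 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights Ɵ = [theta0, theta1, theta2, theta3] and inputs x1..x3
theta = np.array([-1.0, 0.5, 2.0, -0.5])   # theta0 multiplies the bias unit
x = np.array([1.0, 0.2, 0.4, 0.6])         # x0 = 1 is the bias unit

# Neuron output: h(x) = g(Ɵᵀx), exactly the logistic regression hypothesis
h = sigmoid(theta @ x)
print(h)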

•Here, the inputs are x1, x2 and x3
• We could also call the inputs the activations of the first layer - i.e. (a_1^(1), a_2^(1) and a_3^(1))
• Three neurons in layer 2 (a_1^(2), a_2^(2) and a_3^(2))
• A final fourth neuron produces the output
• Which again we could call a_1^(3)
Neural networks - notation
a_i^(j) - activation of unit i in layer j

So, a_1^(2) is the activation of the 1st unit in the second layer


By activation, we mean the value which is computed and output by that node
Ɵ^(j) - matrix of parameters controlling the function mapping from layer j to layer j + 1
Parameters for controlling mapping from one layer to the next
If the network has
s_j units in layer j, and
s_(j+1) units in layer j + 1,
then Ɵ^(j) will be of dimensions [s_(j+1) x (s_j + 1)]
Because
s_(j+1) is the number of units in layer (j + 1)
(s_j + 1) is the number of units in layer j, plus an additional unit (the bias)
Looking at the Ɵ^(j) matrix
The column length (number of rows) is the number of units in the following layer
The row length (number of columns) is the number of units in the current layer + 1 (because we also have to map from the bias unit)
So, if we had two layers with 101 and 21 units respectively,
then Ɵ^(j) would be of dimensions [21 x 102]
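As a quick sanity check (a made-up sketch, not from the slides), the shape works out as follows in NumPy, assuming a layer of 101 units feeding a layer of 21 units:

import numpy as np

s_j, s_j_plus_1 = 101, 21                    # units in layer j and in layer j + 1
Theta_j = np.zeros((s_j_plus_1, s_j + 1))    # the extra column maps from the bias unit
print(Theta_j.shape)                         # -> (21, 102)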
What are the computations which occur?
We have to calculate the activation for each node
That activation depends on
The input(s) to the node
The parameter associated with that node (from the Ɵ vector associated with that layer)
Below we have an example network, with the associated calculations for its four nodes
Something conceptually important is that
Every input/activation goes to every node in the following layer
Which means each "layer transition" uses a matrix of parameters with the following significance
For the sake of consistency with later nomenclature, we're using j, i and l as our variables here
(although later in this section we use j to denote the layer we're on)

Ɵ^(l)_ji
• j (first of the two subscript numbers) ranges from 1 to the number of units in layer l + 1
• i (second of the two subscript numbers) ranges from 0 to the number of units in layer l
• l (the superscript) is the layer you're moving FROM
This is perhaps more clearly shown in the slightly over the top example below
For example
Ɵ^(1)_13 means
1 - we're mapping to node 1 in layer l + 1
3 - we're mapping from node 3 in layer l
(1) - we're mapping from layer 1
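To tie the notation together, here is a minimal forward-propagation sketch (illustrative only, not from the slides) for a network like the one described above, with 3 inputs, 3 hidden units and 1 output; the weight matrices are randomly initialised placeholders:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Ɵ^(1): maps layer 1 (3 units + bias) to layer 2 (3 units) -> shape (3, 4)
# Ɵ^(2): maps layer 2 (3 units + bias) to layer 3 (1 unit)  -> shape (1, 4)
Theta1 = rng.normal(size=(3, 4))
Theta2 = rng.normal(size=(1, 4))

x = np.array([0.5, -1.2, 3.0])           # inputs x1, x2, x3

a1 = np.concatenate(([1.0], x))          # add bias unit a_0^(1) = 1
a2 = sigmoid(Theta1 @ a1)                # activations of layer 2
a2 = np.concatenate(([1.0], a2))         # add bias unit a_0^(2) = 1
h  = sigmoid(Theta2 @ a2)                # output a_1^(3) = h(x)
print(h)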
Model representation II
Back propagation
Simple expressions and interpretation of the gradient
• Consider a simple multiplication function of two numbers, f(x, y) = xy. It is a matter of simple calculus to derive the partial derivative for either input:

∂f/∂x = y,  ∂f/∂y = x
• Interpretation: the derivatives indicate the rate of change of the function with respect to that variable in an infinitesimally small region around a particular point:

df(x)/dx = lim(h→0) [f(x + h) − f(x)] / h
• The derivative on each variable tells you the sensitivity of the whole expression to its value. For example, if x = 4, y = −3 then f(x, y) = −12 and the derivative on x is ∂f/∂x = −3.
• This tells us that if we were to increase the value of this variable by a tiny amount, the effect on the whole expression would be to decrease it (due to the negative sign), and by three times that amount. This can be seen by rearranging the previous equation to f(x + h) ≈ f(x) + h · df(x)/dx.
• Analogously, since ∂f/∂y = 4, we expect that increasing the value of y by some very small amount h would also increase the output of the function (due to the positive sign), and by 4h.
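The lesson plan also lists gradient checking; as an illustrative sketch (not from the original notes), the analytic derivatives of f(x, y) = xy can be compared against a numerical finite-difference estimate:

def f(x, y):
    return x * y

def numerical_gradient(x, y, h=1e-5):
    # Centred finite differences approximate ∂f/∂x and ∂f/∂y
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy

x, y = 4.0, -3.0
analytic = (y, x)                       # ∂f/∂x = y, ∂f/∂y = x
numeric = numerical_gradient(x, y)
print(analytic, numeric)                # both ≈ (-3.0, 4.0)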
For more understanding, go through the links below

https://google-developers.appspot.com/machine-learning/crash-course/backprop-scroll/

https://playground.tensorflow.org/

https://www.youtube.com/watch?v=I2I5ztVfUSE
Sample example trace
