
CSD311: Artificial Intelligence

Neural n/ws: practical aspects

I Encoding output.
I Data pre-processing.
I Ordering exposure of instances.
I Weight initialization.
I Stochastic or batch.
I Choice of activation function.
I Convergence of gradient descent.
I Effect of η(t) on convergence. How to choose η(t).
Encoding output

I While many ways to code are possible, the standard way to
encode output for classification is to have one output neuron
for each class. Ideally, the output should be 1 for the
predicted class and 0 for the others.
I Normally, the output values are in the range 0 to 1 and the
maximum value neuron is chosen as the class label.
I Softmax is a way to get a probability distribution over the
output labels and the highest probability label is chosen as the
prediction. It uses an exponential activation function for the
output neurons and normalizes the total to 1.0. If there are
C classes then z_k = e^{net_k} / Σ_{i=1}^{C} e^{net_i}
(a minimal sketch follows this list).
I If the output prediction is a function value (as in RL
applications) then there is only one output neuron that
outputs the predicted function value.
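The following is a minimal NumPy sketch of the softmax computation described above; the variable name net for the vector of output-neuron activations and the 3-class example values are assumptions, and the maximum is subtracted before exponentiating purely for numerical stability.

```python
import numpy as np

def softmax(net):
    """Map raw output activations net_1..net_C to z_k = e^{net_k} / sum_i e^{net_i}."""
    e = np.exp(net - np.max(net))   # subtracting the max does not change z but avoids overflow
    return e / e.sum()

net = np.array([2.0, 1.0, 0.1])     # hypothetical activations for C = 3 output neurons
z = softmax(net)
print(z, z.sum())                   # a probability distribution summing to 1.0
print(np.argmax(z))                 # index of the highest-probability (predicted) class
```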
Data pre-processing and weight initialization

I Actual values of attributes can vary widely in magnitude, due
to measurement units and even otherwise.
I This can lead to very high/low weights and problems of
convergence during training.
I So, each data attribute is typically normalized to range
between −1.0 and 1.0.
I This is an important step and should be done unless there are
good reasons not to do it.
I In a neural n/w the weights are the actual model. They are
usually initialized at the start.
I Frequently, they are initialized randomly between −1.0 and
1.0, avoiding 0.0 (a minimal sketch of both steps follows this list).
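Below is a minimal sketch of the two steps above, assuming the data sits in a NumPy matrix X with one row per instance: each attribute (column) is rescaled linearly into [−1.0, 1.0], and weights are drawn uniformly from [−1.0, 1.0] with values too close to 0.0 resampled. The helper names and the toy data are illustrative only.

```python
import numpy as np

def normalize(X):
    """Rescale each attribute (column) of X linearly into [-1, 1].
    Assumes every column has a nonzero range."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return 2.0 * (X - lo) / (hi - lo) - 1.0

def init_weights(shape, eps=1e-2, seed=0):
    """Uniform random weights in [-1, 1], resampled if within eps of 0."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=shape)
    while (mask := np.abs(W) < eps).any():
        W[mask] = rng.uniform(-1.0, 1.0, size=mask.sum())
    return W

X = np.array([[150.0, 0.02], [180.0, 0.05], [165.0, 0.01]])  # toy data with mixed units
print(normalize(X))         # every column now lies in [-1, 1]
print(init_weights((2, 3)))
```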
Exposure of instances

I Ideally, exposure should be randomized. That is, in every
epoch the order in which learning items are fed to the n/w
should be different.
I Most training happens in mini-batch mode with
randomization (a minimal sketch follows this list).
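A minimal sketch of randomized exposure with mini-batches, as described above; the batch size is an arbitrary choice and the training-loop body is only indicated in comments.

```python
import numpy as np

def minibatches(X, y, batch_size=32, rng=np.random.default_rng(0)):
    """Yield mini-batches in a freshly shuffled order on every call (i.e. every epoch)."""
    idx = rng.permutation(len(X))                 # a new random ordering of the instances
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]

# Hypothetical training loop:
# for epoch in range(num_epochs):
#     for Xb, yb in minibatches(X_train, y_train):
#         ...compute gradients on (Xb, yb) and update the weights...
```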
Choice of activation function

I Required properties of an activation fn.: a) non-linear,
b) saturating at both ends, c) smooth and continuous,
d) approximately linear for small argument values.
I Above properties imply the function should be some form of
sigmoid.
I Avoid always positive/negative activation functions. So tanh
is better than the standard sigmoid, which is always positive.
I Recommended: f(x) = 1.716 tanh((2/3)x). Also f(±1) = ±1.
I Try adding a small linear term to the activation function to
negotiate flat regions.
Choice of activation function-2

The recommended activation function is

f(net) = a (e^{b·net} − e^{−b·net}) / (e^{b·net} + e^{−b·net}) = a tanh(b·net)

with a = 1.716 and b = 2/3.


The above function has f'(0) = ab ≈ 1.14, it is almost linear in
the range −1 < net < 1, and the extrema of f''(·) occur at
approximately net ≈ ±1 (a numerical check follows).
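The numbers above can be checked numerically; this small sketch (assuming NumPy) evaluates f(net) = 1.716 tanh((2/3) net), its first derivative, and its second derivative on a grid.

```python
import numpy as np

a, b = 1.716, 2.0 / 3.0

f   = lambda x: a * np.tanh(b * x)
df  = lambda x: a * b / np.cosh(b * x) ** 2                             # f'(x)
d2f = lambda x: -2 * a * b ** 2 * np.tanh(b * x) / np.cosh(b * x) ** 2  # f''(x)

print(f(1.0), f(-1.0))    # ~ +1.0 and -1.0, i.e. f(+-1) = +-1
print(df(0.0))            # slope at the origin, a*b ~ 1.14

x = np.linspace(-4.0, 4.0, 8001)
print(x[np.argmin(d2f(x))], x[np.argmax(d2f(x))])   # extrema of f'' near net ~ +-1
```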
Stochastic or batch I
I In practice stochastic is the norm. Batch is easier to analyse and
conditions for convergence are better understood.
Stochastic advantages:
I Stochastic is usually faster, especially when there is some
redundancy in the data set (esp. when using exposure control).
I The gradient is noisy, so there is a better chance of not getting
stuck in a local minimum. However, this can lead to slower
convergence at the end. For batch, the gradients are more stable;
this may mean the descent remains in the same basin, but
convergence towards the end is better.
I Stochastic is also better when the data is changing slowly over
time - so the iid assumption does not hold exactly.
Batch advantages:
I Conditions for convergence well understood.
I Theoretical analysis of weight dynamics and convergence rates
is simpler.
Stochastic or batch II

I Acceleration techniques (e.g. 2nd order methods, a momentum
term in the update) are usually only usable in batch mode
(a minimal sketch of a momentum update follows).
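As a concrete illustration, here is a minimal sketch of a batch gradient step with a momentum term; grad_E is a hypothetical function returning the gradient of E over the full batch, and the coefficient mu = 0.9 is a common but arbitrary choice.

```python
import numpy as np

def batch_momentum_step(w, v, grad_E, eta=0.1, mu=0.9):
    """One batch update with momentum:
         v <- mu * v - eta * dE/dw   (velocity accumulates past gradients)
         w <- w + v
    grad_E(w) must return the gradient of E over the full batch."""
    v = mu * v - eta * grad_E(w)
    return w + v, v

# Hypothetical usage:
# v = np.zeros_like(w)
# for epoch in range(num_epochs):
#     w, v = batch_momentum_step(w, v, grad_E)
```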
Choosing the right η(t)

Figure: How η affects gradient descent - just one weight, quadratic
approximation. (From LeCun et al.)
Understanding error weight surface and optimal η

Figure: Error-weight surface and optimal value of η. (From LeCun et al.)
Calculating ηopt
For the update we have: w(t + 1) = w(t) − η(t) ∂E/∂w.
Expand E(w) in a Taylor series around the current value wc:

E(w) = E(wc) + (w − wc) dE/dw|w=wc + (1/2)(w − wc)^2 d^2E/dw^2|w=wc + · · ·

Since E is quadratic in w, d^2E/dw^2 is a constant and the higher
order terms are 0. Differentiating both sides w.r.t. w:

dE/dw = dE/dw|w=wc + (w − wc) d^2E/dw^2|w=wc

The LHS is 0 at w = wmin. This gives:

wmin = wc − (dE/dw|w=wc) × (d^2E/dw^2|w=wc)^{-1}

Comparing the above with the update equation we get:

ηopt = (d^2E/dw^2|w=wc)^{-1}

Assuming the E vs. w graph is symmetric, the largest η that will
still allow convergence satisfies η < 2ηopt.
If E is not quadratic in w then the higher order terms will be
non-zero, the ηopt calculated above is only an approximation, and
some iterations will be needed to converge (see the sketch below).
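The behaviour around ηopt can be seen on an assumed quadratic E(w) = (1/2) k (w − w*)^2, for which d^2E/dw^2 = k: with η = ηopt = 1/k the minimum is reached in a single step, just below 2ηopt the iterates still converge, and above 2ηopt they diverge.

```python
def descend(eta, k=4.0, w_star=3.0, w=0.0, steps=20):
    """Gradient descent on E(w) = 0.5 * k * (w - w_star)^2, so dE/dw = k * (w - w_star)."""
    for _ in range(steps):
        w = w - eta * k * (w - w_star)
    return w

eta_opt = 1.0 / 4.0                   # (d^2E/dw^2)^{-1} = 1/k
print(descend(eta_opt, steps=1))      # lands on w_star = 3.0 in one step
print(descend(1.9 * eta_opt))         # still converges (eta < 2 * eta_opt)
print(descend(2.1 * eta_opt))         # diverges (eta > 2 * eta_opt)
```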
In higher dimensions finding ηopt is harder since we now have a
Hessian matrix Hij = ∂^2E/(∂wi ∂wj), which gives the curvatures of
the E versus w surface.
By diagonalizing H we line up the axes along the eigenvectors, so H
is aligned with the coordinate axes. This leads to ηopt,i = 1/λi and
each wi update can be treated independently. If we want a single η
then for convergence we need η < 2/λmax (a small sketch follows).
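A small sketch of the multi-dimensional case, assuming a quadratic error E(w) = (1/2) w^T H w with a hypothetical 2x2 Hessian: the eigenvalues of H give the per-direction rates ηopt,i = 1/λi, and a single learning rate must satisfy η < 2/λmax.

```python
import numpy as np

H = np.array([[4.0, 1.0],
              [1.0, 2.0]])            # hypothetical symmetric Hessian

lam, V = np.linalg.eigh(H)            # eigenvalues (curvatures) and eigenvectors
print(lam)                            # curvature along each eigen-direction
print(1.0 / lam)                      # per-direction eta_opt_i = 1 / lambda_i
print(2.0 / lam.max())                # a single eta must satisfy eta < 2 / lambda_max

def run(eta, steps=100):
    """Gradient descent on E(w) = 0.5 * w^T H w (gradient H w, minimum at w = 0)."""
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - eta * (H @ w)
    return np.linalg.norm(w)

print(run(0.9 * 2.0 / lam.max()))     # converges towards 0
print(run(1.1 * 2.0 / lam.max()))     # blows up
```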
Example data for effect of η

Figure: Example data: Gaussians centred at (−0.4, −0.8) and (0.4, 0.8).
(From LeCun et al.)
Picture of convergence with different η

Figure: Trajectory in weight space and log(error) vs. epochs, with
η = 1.5 and η = 2.5, where ηmax = 2.38. (From LeCun et al.)
