
CSD311: Artificial Intelligence

Neural n/ws: practical aspects

I Encoding output.
I Data pre-processing.
I Ordering exposure of instances.
I Weight initialization.
I Stochastic or batch.
I Choice of activation function.
I Convergence of gradient descent.
I Effect of η(t) on convergence. How to choose η(t).
Encoding output

I While many ways to code are possible, the standard way to
encode output for classification is to have one output neuron
for each class. Ideally, the output should be 1 for the
predicted class and 0 for the others.
I Normally, the output values are in the range 0 to 1 and the
maximum value neuron is chosen as the class label.
I Softmax is a way to get a probability distribution over the
output labels and the highest probability label is chosen as the
prediction. It uses an exponential activation function for the
output neurons and normalizes the total to 1.0. If there are
C classes then z_k = e^{net_k} / Σ_{i=1}^{C} e^{net_i}
(a minimal sketch follows this list).
I If the output prediction is a function value (as in RL
applications) then there is only one output neuron that
outputs the predicted function value.
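The following is a minimal NumPy sketch of the softmax computation described above; the variable name net for the vector of output-neuron activations and the 3-class example values are assumptions, and the maximum is subtracted before exponentiating purely for numerical stability.

```python
import numpy as np

def softmax(net):
    """Map raw output activations net_1..net_C to z_k = e^{net_k} / sum_i e^{net_i}."""
    e = np.exp(net - np.max(net))   # subtracting the max does not change z but avoids overflow
    return e / e.sum()

net = np.array([2.0, 1.0, 0.1])     # hypothetical activations for C = 3 output neurons
z = softmax(net)
print(z, z.sum())                   # a probability distribution summing to 1.0
print(np.argmax(z))                 # index of the highest-probability (predicted) class
```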
Data pre-processing and weight initialization

I Actual values of attributes can vary widely in magnitude, due
to measurement units and even otherwise.
I This can lead to very high/low weights and problems of
convergence during training.
I So, each data attribute is typically normalized to range
between −1.0 and 1.0.
I This is an important step and should be done unless there are
good reasons not to do it.
I In a neural n/w the weights are the actual model. They are
usually initialized at the start.
I Frequently, they are initialized randomly between −1.0 and
1.0, avoiding 0.0 (a minimal sketch of both steps follows this list).
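Below is a minimal sketch of the two steps above, assuming the data sits in a NumPy matrix X with one row per instance: each attribute (column) is rescaled linearly into [−1.0, 1.0], and weights are drawn uniformly from [−1.0, 1.0] with values too close to 0.0 resampled. The helper names and the toy data are illustrative only.

```python
import numpy as np

def normalize(X):
    """Rescale each attribute (column) of X linearly into [-1, 1].
    Assumes every column has a nonzero range."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return 2.0 * (X - lo) / (hi - lo) - 1.0

def init_weights(shape, eps=1e-2, seed=0):
    """Uniform random weights in [-1, 1], resampled if within eps of 0."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=shape)
    while (mask := np.abs(W) < eps).any():
        W[mask] = rng.uniform(-1.0, 1.0, size=mask.sum())
    return W

X = np.array([[150.0, 0.02], [180.0, 0.05], [165.0, 0.01]])  # toy data with mixed units
print(normalize(X))         # every column now lies in [-1, 1]
print(init_weights((2, 3)))
```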
Exposure of instances

I Ideally, exposure should be randomized. That is, in every
epoch the order in which learning items are fed to the n/w
should be different.
I Most training happens in mini-batch mode with
randomization (a minimal sketch follows this list).
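A minimal sketch of randomized exposure with mini-batches, as described above; the batch size is an arbitrary choice and the training-loop body is only indicated in comments.

```python
import numpy as np

def minibatches(X, y, batch_size=32, rng=np.random.default_rng(0)):
    """Yield mini-batches in a freshly shuffled order on every call (i.e. every epoch)."""
    idx = rng.permutation(len(X))                 # a new random ordering of the instances
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]

# Hypothetical training loop:
# for epoch in range(num_epochs):
#     for Xb, yb in minibatches(X_train, y_train):
#         ...compute gradients on (Xb, yb) and update the weights...
```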
Choice of activation function

I Required properties of an activation fn.: a) non-linear,
b) saturating at both ends, c) smooth and continuous,
d) approximately linear for small argument values.
I Above properties imply the function should be some form of
sigmoid.
I Avoid always positive/negative activation functions. So tanh
is better than the standard sigmoid, which is always positive.
I Recommended: f(x) = 1.716 tanh((2/3)x). Also f(±1) = ±1.
I Try adding a small linear term to the activation function to
negotiate flat regions.
Choice of activation function-2

The recommended activation function is

f(net) = a (e^{b·net} − e^{−b·net}) / (e^{b·net} + e^{−b·net}) = a tanh(b·net)

with a = 1.716 and b = 2/3.


The above function has f'(0) = ab ≈ 1.14, it is almost linear in
the range −1 < net < 1, and the extrema of f''(·) occur at
approximately net ≈ ±1 (a numerical check follows).
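The numbers above can be checked numerically; this small sketch (assuming NumPy) evaluates f(net) = 1.716 tanh((2/3) net), its first derivative, and its second derivative on a grid.

```python
import numpy as np

a, b = 1.716, 2.0 / 3.0

f   = lambda x: a * np.tanh(b * x)
df  = lambda x: a * b / np.cosh(b * x) ** 2                             # f'(x)
d2f = lambda x: -2 * a * b ** 2 * np.tanh(b * x) / np.cosh(b * x) ** 2  # f''(x)

print(f(1.0), f(-1.0))    # ~ +1.0 and -1.0, i.e. f(+-1) = +-1
print(df(0.0))            # slope at the origin, a*b ~ 1.14

x = np.linspace(-4.0, 4.0, 8001)
print(x[np.argmin(d2f(x))], x[np.argmax(d2f(x))])   # extrema of f'' near net ~ +-1
```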
Stochastic or batch I
I In practice stochastic is the norm. Batch is easier to analyse and
conditions for convergence are better understood.
Stochastic advantages:
I Stochastic is usually faster, especially when there is some
redundancy in the data set (esp. when using exposure control).
I The gradient is noisy, so there is a better chance of not getting
stuck in a local minimum. However, this can lead to slower
convergence at the end. For batch, the gradients are more stable;
this may mean the descent remains in the same basin, but
convergence towards the end is better.
I Stochastic is also better when the data is changing slowly over
time - so the iid assumption does not hold exactly.
Batch advantages:
I Conditions for convergence well understood.
I Theoretical analysis of weight dynamics and convergence rates
is simpler.
Stochastic or batch II

I Acceleration techniques (e.g. 2nd order methods, a momentum
term in the update) are usually only usable in batch mode
(a minimal sketch of a momentum update follows).
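As a concrete illustration, here is a minimal sketch of a batch gradient step with a momentum term; grad_E is a hypothetical function returning the gradient of E over the full batch, and the coefficient mu = 0.9 is a common but arbitrary choice.

```python
import numpy as np

def batch_momentum_step(w, v, grad_E, eta=0.1, mu=0.9):
    """One batch update with momentum:
         v <- mu * v - eta * dE/dw   (velocity accumulates past gradients)
         w <- w + v
    grad_E(w) must return the gradient of E over the full batch."""
    v = mu * v - eta * grad_E(w)
    return w + v, v

# Hypothetical usage:
# v = np.zeros_like(w)
# for epoch in range(num_epochs):
#     w, v = batch_momentum_step(w, v, grad_E)
```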
Choosing the right η(t)

Figure: How η affects gradient descent - just one weight, quadratic
approximation. (From LeCun et al.)
Understanding error weight surface and optimal η

Figure: Error-weight surface and optimal value of η. (From LeCun et al.)
Calculating ηopt
For the update we have: w(t + 1) = w(t) − η(t) ∂E/∂w.
Expand E(w) in a Taylor series around the current value wc:

E(w) = E(wc) + (w − wc) dE/dw|w=wc + (1/2)(w − wc)^2 d^2E/dw^2|w=wc + · · ·

Since E is quadratic in w, d^2E/dw^2 is a constant and the higher
order terms are 0. Differentiating both sides w.r.t. w:

dE/dw = dE/dw|w=wc + (w − wc) d^2E/dw^2|w=wc

The LHS is 0 at w = wmin. This gives:

wmin = wc − (dE/dw|w=wc) × (d^2E/dw^2|w=wc)^{-1}

Comparing the above with the update equation we get:

ηopt = (d^2E/dw^2|w=wc)^{-1}

Assuming the E vs. w graph is symmetric, the largest η that will
still allow convergence satisfies η < 2ηopt.
If E is not quadratic in w then the higher order terms will be
non-zero, the ηopt calculated above is only an approximation, and
some iterations will be needed to converge (see the sketch below).
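The behaviour around ηopt can be seen on an assumed quadratic E(w) = (1/2) k (w − w*)^2, for which d^2E/dw^2 = k: with η = ηopt = 1/k the minimum is reached in a single step, just below 2ηopt the iterates still converge, and above 2ηopt they diverge.

```python
def descend(eta, k=4.0, w_star=3.0, w=0.0, steps=20):
    """Gradient descent on E(w) = 0.5 * k * (w - w_star)^2, so dE/dw = k * (w - w_star)."""
    for _ in range(steps):
        w = w - eta * k * (w - w_star)
    return w

eta_opt = 1.0 / 4.0                   # (d^2E/dw^2)^{-1} = 1/k
print(descend(eta_opt, steps=1))      # lands on w_star = 3.0 in one step
print(descend(1.9 * eta_opt))         # still converges (eta < 2 * eta_opt)
print(descend(2.1 * eta_opt))         # diverges (eta > 2 * eta_opt)
```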
In higher dimensions finding ηopt is harder since we now have a
Hessian matrix Hij = ∂^2E/(∂wi ∂wj), which gives the curvatures of
the E versus w surface.
By diagonalizing H we line up the axes along the eigenvectors, so H
is aligned with the coordinate axes. This leads to ηopt,i = 1/λi and
each wi update can be treated independently. If we want a single η
then for convergence we need η < 2/λmax (a small sketch follows).
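A small sketch of the multi-dimensional case, assuming a quadratic error E(w) = (1/2) w^T H w with a hypothetical 2x2 Hessian: the eigenvalues of H give the per-direction rates ηopt,i = 1/λi, and a single learning rate must satisfy η < 2/λmax.

```python
import numpy as np

H = np.array([[4.0, 1.0],
              [1.0, 2.0]])            # hypothetical symmetric Hessian

lam, V = np.linalg.eigh(H)            # eigenvalues (curvatures) and eigenvectors
print(lam)                            # curvature along each eigen-direction
print(1.0 / lam)                      # per-direction eta_opt_i = 1 / lambda_i
print(2.0 / lam.max())                # a single eta must satisfy eta < 2 / lambda_max

def run(eta, steps=100):
    """Gradient descent on E(w) = 0.5 * w^T H w (gradient H w, minimum at w = 0)."""
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - eta * (H @ w)
    return np.linalg.norm(w)

print(run(0.9 * 2.0 / lam.max()))     # converges towards 0
print(run(1.1 * 2.0 / lam.max()))     # blows up
```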
Example data for effect of η

Figure: Example data: Gaussians centred at (−0.4, −0.8) and (0.4, 0.8).
(From LeCun et al.)
Picture of convergence with different η

Figure: Trajectory in weight space and log(error) vs. epochs, with
η = 1.5 and η = 2.5, where ηmax = 2.38. (From LeCun et al.)
