• Encoding output.
• Data pre-processing.
• Ordering exposure of instances.
• Weight initialization.
• Stochastic or batch.
• Choice of activation function.
• Convergence of gradient descent.
• Effect of η(t) on convergence. How to choose η(t).
Encoding output
Since $E$ is quadratic in $w$, $\frac{d^2E}{dw^2}$ is a constant and the higher-order terms are 0. Differentiating both sides w.r.t. $w$:

$$\frac{dE}{dw} = \left.\frac{dE}{dw}\right|_{w=w_c} + (w - w_c)\left.\frac{d^2E}{dw^2}\right|_{w=w_c}$$
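The identity above can be checked numerically. The following is a minimal sketch, assuming a one-dimensional quadratic error E(w) = a·w² + b·w + c. The coefficients `a`, `b`, `c`, the expansion point `wc`, and the helper names `dE` and `dE_expanded` are hypothetical and chosen only for illustration.

```python
# Assumed 1-D quadratic error: E(w) = a*w**2 + b*w + c (illustrative coefficients).
a, b, c = 2.0, -3.0, 1.0


def dE(w):
    """Exact first derivative dE/dw = 2*a*w + b."""
    return 2.0 * a * w + b


# For a quadratic E, the second derivative is the constant 2*a.
d2E = 2.0 * a


def dE_expanded(w, wc):
    """Right-hand side of the expansion: dE/dw at wc plus (w - wc) * d2E/dw2."""
    return dE(wc) + (w - wc) * d2E


wc = 0.5  # assumed expansion point (the current weight)
for w in (-1.0, 0.0, 2.5):
    # Both sides agree at every w, because the higher-order terms vanish.
    assert abs(dE(w) - dE_expanded(w, wc)) < 1e-12
```

Because the second derivative is constant, the two sides agree for any choice of `w` and `wc`, not only for points near `wc`.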