posed weight initialization as New Weight Initialization (NWI) routines and label
them as NWI1 to mean that a = 1, and NWI½ to mean that a = ½.

4 Function Approximation Tasks and Experiment Design

4.1 Function Approximation Tasks


Eight function approximation tasks taken from the literature [4,6,7,8] are considered. They
are presented in Table 1, and a short code sketch of a few of the fully specified functions follows the table.

Table 1. Function approximation problems and the associated network architecture (I: number
of inputs, H: number of hidden nodes, O: number of output nodes).

Task | Function | I | H | O
F1 | F1(x) = 1/((x − 0.3)² + 0.01) + 1/((x − 0.9)² + 0.04) − 6,  where x ∈ (0, 1) | 1 | 20 | 1
F2 | F2(x, y) = 3(1 − x)² e^(−x² − (y+1)²) − 10(x/5 − x³ − y⁵) e^(−x² − y²) − (1/3) e^(−(x+1)² − y²),  where x, y ∈ (−3, 3) | 2 | 15 | 1
F3 | F3(x, y) = e^(…) sin(…),  where x, y ∈ (−1, 1) | 2 | 10 | 1
F4 | F4(x, y) = (1 + sin(2x + 3y)) / (3.5 + sin(x − y)),  where x, y ∈ (−2, 2) | 2 | 64 | 1
F5 | F5(x, y) = 42.659 (0.1 + x (0.05 + x⁴ − 10x²y² + 5y⁴)),  where x, y ∈ (−0.5, 0.5) | 2 | 17 | 1
F6 | F6(x, y) = 1.9 (1.35 + e^x sin(13(x − 0.6)²) e^(−y) sin(7y)),  where x, y ∈ (0, 1) | 2 | 18 | 1
F7 | F7(x, y) = sin(2π √(x² + y²)),  where x, y ∈ (−1, 1) | 2 | 24 | 1
F8 | F8(x₁, …, x₆) = 10 sin(πx₁x₂) + 20(x₃ − 0.5)² + 10x₄ + 5x₅ + 0·x₆,  where x₁, …, x₆ ∈ (−1, 1) | 6 | 30 | 1
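As an illustration, a minimal Python sketch of three of the fully specified functions (F5, F7 and F8) is given below; the function names and the use of NumPy are illustrative choices, not part of the original experimental code.

```python
import numpy as np

def f5(x, y):
    # Harmonic-style benchmark F5 over x, y in (-0.5, 0.5)
    return 42.659 * (0.1 + x * (0.05 + x**4 - 10.0 * x**2 * y**2 + 5.0 * y**4))

def f7(x, y):
    # Radial sine benchmark F7 over x, y in (-1, 1)
    return np.sin(2.0 * np.pi * np.sqrt(x**2 + y**2))

def f8(x1, x2, x3, x4, x5, x6):
    # Friedman-style benchmark F8 over inputs in (-1, 1);
    # the sixth input is irrelevant (coefficient 0)
    return (10.0 * np.sin(np.pi * x1 * x2) + 20.0 * (x3 - 0.5) ** 2
            + 10.0 * x4 + 5.0 * x5 + 0.0 * x6)
```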

4.2 Experiment Design


The training algorithm used is a variant of the standard back-propagation algorithm
[9,10], known as the improved resilient back-propagation with weight back-tracking
algorithm. The resilient back-propagation algorithm was proposed in [11,12], and the
improved resilient back-propagation with weight back-tracking algorithm (iRPROP+)
was proposed in [13]. iRPROP+ is a first-order (gradient-based) non-linear optimization
technique whose performance is comparable to, if not better than, that of conjugate
gradient, BFGS and Levenberg-Marquardt (second-order methods) for the training of
ANNs [7,8,13].
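For reference, the per-weight update rule of iRPROP+ can be summarized as in the following NumPy sketch; the step-size bounds and increase/decrease factors are the commonly used defaults from the RPROP literature, and the function signature is an illustrative assumption rather than the implementation used in the paper.

```python
import numpy as np

def irprop_plus_step(w, grad, prev_grad, delta, prev_dw, err, prev_err,
                     eta_plus=1.2, eta_minus=0.5, delta_min=1e-6, delta_max=50.0):
    """One iRPROP+ update over a flat weight vector (illustrative sketch)."""
    sign_change = grad * prev_grad
    grad, delta = grad.copy(), delta.copy()
    dw = np.zeros_like(w)

    same = sign_change > 0            # gradient kept its sign: grow the step size
    delta[same] = np.minimum(delta[same] * eta_plus, delta_max)
    dw[same] = -np.sign(grad[same]) * delta[same]

    flipped = sign_change < 0         # gradient changed sign: shrink the step size
    delta[flipped] = np.maximum(delta[flipped] * eta_minus, delta_min)
    if err > prev_err:                # weight back-tracking only if the error grew
        dw[flipped] = -prev_dw[flipped]
    grad[flipped] = 0.0               # skip the sign comparison on the next step

    zero = sign_change == 0           # first step, or step after a sign change
    dw[zero] = -np.sign(grad[zero]) * delta[zero]

    return w + dw, grad, delta, dw    # grad, delta and dw feed the next call
```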

For each learning task, a set of 1000 data points was generated from the domain
of the inputs using a uniform random number generator. These points, together with
the corresponding output values of the function, constitute the data set for the
experiments. Both the inputs and the outputs are scaled to the interval [−1, 1]
(min-max normalization). All experiments are conducted on these scaled variables and
the results are reported for the same. The 1000-tuple data set is divided into two
parts: 500 tuples constitute the training set (TRS) and the other 500 tuples
constitute the test data set (TES).
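A minimal sketch of this data generation, scaling and splitting procedure might look as follows; the random seed, the helper names and the use of NumPy are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

def make_dataset(func, low, high, n_inputs, n_points=1000):
    """Sample inputs uniformly, evaluate the target function, scale both
    inputs and outputs to [-1, 1], and split 500/500 into TRS and TES."""
    X = rng.uniform(low, high, size=(n_points, n_inputs))
    y = np.apply_along_axis(lambda row: func(*row), 1, X).reshape(-1, 1)

    def to_unit_interval(a):
        lo, hi = a.min(axis=0), a.max(axis=0)
        return 2.0 * (a - lo) / (hi - lo) - 1.0   # min-max normalization to [-1, 1]

    X, y = to_unit_interval(X), to_unit_interval(y)
    return (X[:500], y[:500]), (X[500:], y[500:])  # (TRS, TES)
```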
The architecture was fixed by exploratory experiments in which the number of
nodes in the hidden layer was varied from 1 to 100, and the smallest network that
provided a satisfactory training error was taken as the architecture for further
experimentation. The architecture summary is part of Table 1. The exploratory
experiments were conducted for 500 epochs of training.
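A sketch of this exploratory architecture search is given below; `train_network` and the acceptance threshold `tolerance` are hypothetical placeholders, since the paper does not state the exact error criterion used to judge a network as satisfactory.

```python
def select_architecture(trs, max_hidden=100, epochs=500, tolerance=1e-3):
    """Pick the smallest hidden-layer size whose training error is acceptable.
    `train_network` (returns the training MSE) and `tolerance` are placeholders."""
    for hidden in range(1, max_hidden + 1):
        mse = train_network(trs, hidden_nodes=hidden, epochs=epochs)  # hypothetical helper
        if mse <= tolerance:
            return hidden
    return max_hidden
```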
For each learning task (function approximation problem), an ensemble of 50 initial
networks is created by initializing the weights/thresholds using each of the weight
initialization routines, namely: (a) URW, (b) NWI1, and (c) NWI½. Thus, for each
problem we have 150 networks and, with 8 tasks, 1200 networks are trained in all.
The training is conducted for 2000 epochs in the detailed experiments comparing the
weight initialization techniques.
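The overall experiment therefore has the structure sketched below; `TASKS`, `INITIALIZERS` (URW, NWI1, NWI½) and `train_network` stand for routines defined elsewhere and are placeholders here, not part of the paper's code.

```python
# Illustrative structure of the main experiment: 8 tasks x 3 initializers x 50
# networks = 1200 training runs of 2000 epochs each.
results = {}
for task_name, task in TASKS.items():                    # 8 function approximation tasks
    for init_name, initializer in INITIALIZERS.items():  # URW, NWI1, NWI1/2
        mses = []
        for seed in range(50):                           # ensemble of 50 initial networks
            net = initializer(task.architecture, seed=seed)
            mses.append(train_network(net, task.trs, epochs=2000))  # returns training MSE
        results[(task_name, init_name)] = mses
```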
For the measurement of the training error (over the training data set) and the
generalization error (over the test data set) we use the mean squared error (MSE)
measure. Since, for each weight initialization technique and each problem, the
ensemble contains 50 networks, the average of the MSEs over the ensemble is
reported as MMSE. We also report the median of the ensemble MSEs as MeMSE, as
the median is deemed to be a more robust estimator of central tendency [14].
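A minimal sketch of this ensemble summary (the helper name is illustrative):

```python
import numpy as np

def summarize_ensemble(mses):
    """MMSE: mean of the 50 ensemble MSEs; MeMSE: their median (more robust)."""
    mses = np.asarray(mses)
    return {"MMSE": mses.mean(), "MeMSE": np.median(mses)}
```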
The values of the MMSE and MeMSE allow us to compare the different initialization
techniques, but to assess whether the differences in the MMSE values are statistically
significant, we use the one-tailed Student's t-test [15,16] at a significance level of
0.05; similarly, to assess the statistical significance of the differences in MeMSE
for the different weight initialization techniques, we use the one-tailed Wilcoxon
rank-sum test [15,16] at a significance level of 0.05.
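These tests could be carried out, for example, with SciPy as sketched below; the direction of the one-tailed alternative (the proposed routine yielding lower errors) is an assumption about how the comparison is set up, not a detail stated in the text.

```python
from scipy import stats

def compare_initializers(mse_a, mse_b, alpha=0.05):
    """One-tailed tests of whether initializer A yields lower errors than B."""
    # One-tailed Student's t-test on the ensemble means (difference in MMSE)
    t_res = stats.ttest_ind(mse_a, mse_b, alternative="less")
    # One-tailed Wilcoxon rank-sum test on the distributions (difference in MeMSE)
    w_res = stats.ranksums(mse_a, mse_b, alternative="less")
    return {"t_pvalue": t_res.pvalue, "t_significant": t_res.pvalue < alpha,
            "w_pvalue": w_res.pvalue, "w_significant": w_res.pvalue < alpha}
```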

5 Results

The experimental results for training (over the training data set, TRS) and the
generalization error (over the test data set, TES) are summarized in Table 2 and
Table 3, respectively. From these tables, it is clear that the proposed weight
initialization routines have better training behavior (that is, they achieve lower
values of MMSE and MeMSE) and also have better generalization ability. From Table 2,
we observe that in 6 of the tasks NWI½ performs the best during training, while in
2 tasks NWI1 has the best performance. A similar trend is observed from the
generalization error summary (Table 3), though in this case, for F8, NWI1 performs
better than NWI½ on the basis of MMSE. A similar trend is shown on the basis of the
ratios calculated. The ratios reflect
