
New Weight Initialization Methods for ANNs Validated on Function Approximation Tasks

Rijul Singh Malik1, Apoorvi Sood2, and Pravin Chandra3[0000-0002-6555-3832]

University School of Information, Communication & Technology,
Guru Gobind Singh Indraprastha University, Delhi (India) - 110078
1rijulsingh1@gmail.com; 2soodapoorvi@yahoo.com;
3pchandra@gmail.com; chandra.pravin@gmail.com

Abstract. The choice of initial weights/thresholds plays an important role in the training and generalization behavior of artificial neural networks. In this paper we consider two weight initialization techniques and compare them with the generally used method of initializing weights/thresholds to uniform random values. The comparison is performed on a set of 8 function approximation problems. The experimental results demonstrate the efficiency and efficacy of the proposed methods of weight/threshold initialization.

Keywords: Weight Initialization, Artificial Neural Network Training, Network Initialization.

1 Introduction

The training and generalization capability of Artificial Neural Networks (ANNs) has
been shown to be dependent on the initial choice of the weights and thresholds [1].
The generally used convention is to initialize all weights and thresholds by uniform sampling of the range [−a, a] [2]. Generally, a = 1 or 0.5. Random initialization of the weights and thresholds (together, for the sake of brevity, called weights) allows the weights to evolve independently during training (that is, it breaks the symmetry of the weights). Weights that are small in magnitude lead to stability in training, but weights that are too small can lead to very slow convergence of the ANN during training [2,3]. Conversely, very large weight magnitudes can lead to premature saturation of the sigmoidal activation functions used; in this case the derivatives of the activation functions vanish or become very small, again leading to very slow convergence of the ANN training algorithm [2,3].
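As a concrete illustration of this convention (a minimal sketch only; the layer sizes and the value a = 0.5 below are assumptions for the example, not settings used in the experiments), the weights and thresholds can be drawn uniformly from [−a, a], and the saturation of the logistic sigmoid's derivative for large net inputs is easy to observe:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_uniform(n_in, n_hidden, a=0.5, rng=None):
    """Conventional initialization: sample every weight/threshold from U[-a, a]."""
    rng = np.random.default_rng() if rng is None else rng
    W = rng.uniform(-a, a, size=(n_hidden, n_in))   # input-to-hidden weights
    theta = rng.uniform(-a, a, size=n_hidden)       # hidden-node thresholds
    return W, theta

# Derivative of the logistic sigmoid: sigma'(x) = sigma(x) * (1 - sigma(x)).
# Near zero net input it is close to its maximum of 0.25; for large net inputs
# (produced by large weights) it is nearly zero, i.e. the node saturates and
# gradient-based training slows down.
for net in (0.1, 2.0, 10.0):
    s = sigmoid(net)
    print(f"net={net:5.1f}  sigma'={s * (1 - s):.6f}")
```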
On close examination of the randomness criterion for the initial weights, it is observed that the weights need not be random. The requirement that the initial weights are different is sufficient to break the weight symmetry. Therefore, in this paper we demonstrate the efficacy and efficiency of a weight initialization method in which the network thresholds are allocated in a deterministic manner while the weights are initialized by a method that takes into account the fan-in as well as the fan-out of a node. The suitability of the weight initialization method is demonstrated by comparing the training and generalization errors on 8 function approximation tasks. Any supervised learning task can be considered a regression problem, and the function approximation tasks are equivalent to regression tasks without the error of estimation or measurement [4]. That is, function approximation may be considered a regression problem without any error in the measurement of the independent variables and is therefore a suitable task for demonstrating the learning methodology.
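For intuition only, the sketch below shows one way deterministic thresholds and a fan-in/fan-out dependent weight scale could be combined; the particular bound and threshold spacing are illustrative assumptions and are not the initialization routines of Section 3:

```python
import numpy as np

def init_fan_aware(n_in, n_hidden, n_out, rng=None):
    """Illustrative sketch: the weight range depends on fan-in and fan-out
    (a Glorot-style bound), while thresholds are placed deterministically.
    This is NOT the paper's Section 3 routine, only an analogous example."""
    rng = np.random.default_rng() if rng is None else rng
    bound = np.sqrt(6.0 / (n_in + n_out))            # fan-in/fan-out dependent bound
    W = rng.uniform(-bound, bound, size=(n_hidden, n_in))
    # Deterministic thresholds: evenly spaced values, so they all differ
    # (breaking symmetry) without requiring randomness.
    theta = np.linspace(-1.0, 1.0, n_hidden)
    return W, theta

W, theta = init_fan_aware(n_in=1, n_hidden=10, n_out=1)
print(W.shape, theta)
```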
The paper is organized as follows: Section 2 describes the architecture design of the ANNs used in the experiments. Section 3 describes the weight initialization routines. Section 4 describes the design of experiments. The results are presented in Section 5, while the conclusions are presented in Section 6.

2 ANN Architecture

The architecture of the ANNs used in this work is the feed-forward architecture without short-cut connections. That is, the nodes in the layers between the input and the output layer are connected to the preceding and succeeding layer node(s) only. The universal approximation results for feed-forward ANNs require at least one hidden layer with a sufficient number of non-linear (sigmoidal) nodes (see [5] for a survey and further references). Since the minimal number of hidden layers required for the universal approximation property is one, in this work we use networks with only one hidden layer of sigmoidal nodes. Fig. 1 shows the schematic diagram of a single hidden layer network used in this work. In general, more than one node may be present in the output layer.

Fig. 1. The schematic diagram of a single hidden layer feed-forward network with I inputs and
one output (O).
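For reference, a minimal forward pass of such a network is sketched below (the sigmoidal hidden layer follows the architecture above; the linear output node and all variable names are assumptions made for the example):

```python
import numpy as np

def forward(x, W_h, theta_h, w_o, theta_o):
    """Single-hidden-layer feed-forward network without short-cut connections.
    x:       input vector of length I
    W_h:     input-to-hidden weights, shape (H, I)
    theta_h: hidden-node thresholds, shape (H,)
    w_o:     hidden-to-output weights, shape (H,)
    theta_o: output-node threshold (scalar)
    """
    net_h = W_h @ x + theta_h            # net input of each hidden node
    h = 1.0 / (1.0 + np.exp(-net_h))     # sigmoidal hidden activations
    return w_o @ h + theta_o             # single output node (assumed linear)

# Example: I = 2 inputs, H = 3 hidden nodes, one output.
rng = np.random.default_rng(0)
y = forward(rng.uniform(-1, 1, 2),
            rng.uniform(-0.5, 0.5, (3, 2)), rng.uniform(-0.5, 0.5, 3),
            rng.uniform(-0.5, 0.5, 3), 0.0)
print(y)
```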

The weights between the input and the hidden layer, the weights between the hidden and the output layer, and the thresholds of the hidden (and output) nodes constitute the network parameters. The net input to the ith hidden node (with x_j being the jth input) is a weighted sum of the inputs plus the node's threshold.