IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 17, NO. 6, NOVEMBER 2006

A Fast and Accurate Online Sequential Learning Algorithm for Feedforward Networks

Nan-Ying Liang, Guang-Bin Huang, Senior Member, IEEE, P. Saratchandran, Senior Member, IEEE, and N. Sundararajan, Fellow, IEEE

Abstract—In this paper, we develop an online sequential learning algorithm for single hidden layer feedforward networks (SLFNs) with additive or radial basis function (RBF) hidden nodes in a unified framework. The algorithm is referred to as online sequential extreme learning machine (OS-ELM) and can learn data one-by-one or chunk-by-chunk (a block of data) with fixed or varying chunk size. The activation functions for additive nodes in OS-ELM can be any bounded nonconstant piecewise continuous functions, and the activation functions for RBF nodes can be any integrable piecewise continuous functions. In OS-ELM, the parameters of hidden nodes (the input weights and biases of additive nodes or the centers and impact factors of RBF nodes) are randomly selected, and the output weights are analytically determined based on the sequentially arriving data. The algorithm uses the ideas of the ELM of Huang et al. developed for batch learning, which has been shown to be extremely fast with generalization performance better than other batch training methods. Apart from selecting the number of hidden nodes, no other control parameters have to be manually chosen. A detailed performance comparison of OS-ELM is done with other popular sequential learning algorithms on benchmark problems drawn from the regression, classification, and time-series prediction areas. The results show that OS-ELM is faster than the other sequential algorithms and produces better generalization performance.

Index Terms—Extreme learning machine (ELM), growing and pruning RBF network (GAP-RBF), GGAP-RBF, minimal resource allocation network (MRAN), online sequential ELM (OS-ELM), resource allocation network (RAN), resource allocation network via extended Kalman filter (RANEKF), stochastic gradient descent back-propagation (SGBP).

Manuscript received September 23, 2005; revised May 17, 2006. The authors are with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore (e-mail: gbhuang@ieee.org). Digital Object Identifier 10.1109/TNN.2006.880583

I. INTRODUCTION

OVER THE past two decades, single hidden layer feedforward neural networks (SLFNs) have been discussed thoroughly by many researchers [1]–[10]. Two main architectures exist for SLFNs, namely: 1) those with additive hidden nodes, and 2) those with radial basis function (RBF) hidden nodes. For many of the applications using SLFNs, training methods are usually of the batch-learning type. Batch learning is usually a time-consuming affair as it may involve many iterations through the training data. In most applications, this may take several minutes to several hours, and further the learning parameters (i.e., learning rate, number of learning epochs, stopping criteria, and other predefined parameters) must be properly chosen to ensure convergence. Also, whenever new data is received, batch learning uses the past data together with the new data and performs a retraining, thus consuming a lot of time. There are many industrial applications where online sequential learning algorithms are preferred over batch learning algorithms, as sequential learning algorithms do not require retraining whenever new data is received.

The back-propagation (BP) algorithm and its variants have been the backbone for training SLFNs with additive hidden nodes. It is to be noted that BP is basically a batch learning algorithm. Stochastic gradient descent BP (SGBP) [11] is one of the main variants of BP for sequential learning applications. In SGBP, network parameters are learned at each iteration on the basis of first-order information of the instantaneous value of the cost function using the current training pattern. SGBP suffers from slow training error convergence, as a large number of training data may be required. To overcome this deficiency, researchers have proposed to use second-order information in the network parameter learning process, such as the recursive Levenberg–Marquardt algorithm [12], [13]. Even though second-order methods can shorten the overall convergence time, they may need more time for processing each data item, and this may pose problems in sequential learning if the data arrive quickly. The network size of SGBP needs to be predefined and fixed.

Sequential learning algorithms have also become popular for feedforward networks with RBF nodes. These include the resource allocation network (RAN) [14] and its extensions [15]–[19]. Different from SGBP, the number of RBF hidden nodes in RAN [14] and its variants [15]–[19] is not predefined. RAN [14] and RANEKF [15] determine whether to add a new node based on the novelty of incoming data. Besides growing nodes based on the novelty, MRAN [16], [17], GAP-RBF [18], and GGAP-RBF [19] can also prune insignificant nodes from the networks. RAN, RANEKF, and MRAN require many control parameters to be tuned, and in the case of large problems the learning speed may be slow [18], [19]. Although GAP-RBF [18] and GGAP-RBF [19] have tried to simplify the sequential learning algorithms and increase the learning speed, they need information about the input sampling distribution or input sampling range, and the learning may still be slow for large applications. Also, all the aforementioned sequential learning algorithms handle data one by one only and cannot handle data on a chunk (block of data) by chunk basis. It is worth noting that all the BP-based and RAN-based sequential learning algorithms can only handle specific types of hidden (additive or RBF) nodes and not both.

In this paper, a sequential learning algorithm referred to as online sequential extreme learning machine (OS-ELM), which can handle both additive and RBF nodes in a unified framework, is introduced. OS-ELM can learn the training data not only one-by-one but also chunk-by-chunk (with fixed or varying length) and discard the data for which the training has already been done. It is a versatile sequential learning algorithm in the following sense.
1) The training observations are sequentially (one-by-one or chunk-by-chunk with varying or fixed chunk length) presented to the learning algorithm.
2) At any time, only the newly arrived single or chunk of observations (instead of the entire past data) are seen and learned.
3) A single or a chunk of training observations is discarded as soon as the learning procedure for that particular (single or chunk of) observation(s) is completed.
4) The learning algorithm has no prior knowledge as to how many training observations will be presented.

OS-ELM originates from the batch learning extreme learning machine (ELM) [20]–[22], [27], [30] developed for SLFNs with additive and RBF nodes. The performance of ELM has been evaluated on a number of benchmark problems from the function regression and classification areas. Results show that, compared with other gradient-descent-based learning algorithms (including BP algorithms), ELM provides better generalization performance at higher learning speed, and the learning phase in many applications is completed within seconds [20]–[23].

In OS-ELM with additive nodes, the input weights (of the connections linking the input nodes to hidden nodes) and biases are randomly generated, and based on this the output weights are analytically determined. Similarly, in OS-ELM with RBF nodes, the centers and widths of the nodes are randomly generated and fixed, and then, based on this, the output weights are analytically determined. Unlike other sequential learning algorithms which have many control parameters to be tuned, OS-ELM only requires the number of hidden nodes to be specified.

The performance of the proposed OS-ELM is evaluated by comparing it with other sequential learning algorithms such as SGBP, RAN, RANEKF, MRAN, GAP-RBF, and GGAP-RBF. Experimental results on benchmark problems from the regression, classification, and time-series prediction areas show that the proposed OS-ELM produces better generalization performance at a very fast learning speed. For regression, the benchmark problems considered are three higher dimensional real-world problems from the University of California at Irvine (UCI) machine learning repository [24], i.e.: 1) abalone—determination of the age of abalone using the abalone database; 2) auto-MPG—determination of the fuel consumption of different models of cars using the auto-MPG database; and 3) California housing—estimation of the median house prices in the California area using the California housing database. For classification, the comparison is done based on the following real-world benchmark problems [24], i.e.: 1) the image segment problem; 2) the satellite image problem; and 3) the DNA problem. For time-series prediction, the Mackey–Glass chaotic time series [25] is used.

The paper is organized as follows. Section II gives a brief review of the batch ELM. Section III presents the derivation of OS-ELM. Section IV highlights the differences between OS-ELM and other popular sequential learning algorithms such as SGBP, GAP-RBF, GGAP-RBF, RAN, RANEKF, and MRAN. Performance evaluation of OS-ELM is shown in Section V based on the benchmark problems in the areas of regression, classification, and time-series prediction. Conclusions based on the study are highlighted in Section VI.

II. REVIEW OF ELM

This section briefly reviews the batch ELM developed by Huang et al. [20]–[22], [27], [30] to provide the necessary background for the development of OS-ELM in Section III. A brief mathematical description of SLFNs incorporating both additive and RBF hidden nodes in a unified way is given first.

A. Mathematical Description of Unified SLFN

The output of an SLFN with $\tilde{N}$ hidden nodes (additive or RBF nodes) can be represented by

$$f_{\tilde{N}}(\mathbf{x}) = \sum_{i=1}^{\tilde{N}} \beta_i G(\mathbf{a}_i, b_i, \mathbf{x}), \qquad \mathbf{x} \in \mathbf{R}^n,\; \mathbf{a}_i \in \mathbf{R}^n \tag{1}$$

where $\mathbf{a}_i$ and $b_i$ are the learning parameters of the hidden nodes and $\beta_i$ is the weight connecting the $i$th hidden node to the output node. $G(\mathbf{a}_i, b_i, \mathbf{x})$ is the output of the $i$th hidden node with respect to the input $\mathbf{x}$. For an additive hidden node with the activation function $g(x): \mathbf{R} \rightarrow \mathbf{R}$ (e.g., sigmoid and threshold), $G(\mathbf{a}_i, b_i, \mathbf{x})$ is given by

$$G(\mathbf{a}_i, b_i, \mathbf{x}) = g(\mathbf{a}_i \cdot \mathbf{x} + b_i), \qquad b_i \in \mathbf{R} \tag{2}$$

where $\mathbf{a}_i$ is the weight vector connecting the input layer to the $i$th hidden node and $b_i$ is the bias of the $i$th hidden node. $\mathbf{a}_i \cdot \mathbf{x}$ denotes the inner product of vectors $\mathbf{a}_i$ and $\mathbf{x}$ in $\mathbf{R}^n$.

For an RBF hidden node with activation function $g(x): \mathbf{R} \rightarrow \mathbf{R}$ (e.g., Gaussian), $G(\mathbf{a}_i, b_i, \mathbf{x})$ is given by

$$G(\mathbf{a}_i, b_i, \mathbf{x}) = g\left(b_i \|\mathbf{x} - \mathbf{a}_i\|\right), \qquad b_i \in \mathbf{R}^{+} \tag{3}$$

where $\mathbf{a}_i$ and $b_i$ are the center and impact factor of the $i$th RBF node. $\mathbf{R}^{+}$ indicates the set of all positive real values. The RBF network is a special case of the SLFN with RBF nodes in its hidden layer. Each RBF node has its own centroid and impact factor, and its output is given by a radially symmetric function of the distance between the input and the center.
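To make the unified hidden-node notation concrete, the following short sketch (in Python with NumPy; the specific sigmoid and Gaussian forms, the function names, and the array layout are illustrative assumptions, not definitions taken from the paper) evaluates node outputs of the kind described by (2) and (3) for a batch of inputs.

    import numpy as np

    def additive_node_output(A, b, X):
        """Additive nodes, cf. Eq. (2): g(a_i . x + b_i) with a sigmoid g.
        A: (n_hidden, n_in) input weights; b: (n_hidden,) biases; X: (N, n_in) inputs."""
        return 1.0 / (1.0 + np.exp(-(X @ A.T + b)))              # shape (N, n_hidden)

    def rbf_node_output(A, b, X):
        """RBF nodes, cf. Eq. (3): a Gaussian-type response of b_i and ||x - a_i||.
        A: (n_hidden, n_in) centers; b: (n_hidden,) impact factors; X: (N, n_in) inputs."""
        sq_dist = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)   # ||x - a_i||^2
        return np.exp(-b * sq_dist)                               # shape (N, n_hidden)

Stacking either of these outputs column-wise over all hidden nodes yields one row of the hidden layer output matrix introduced in (6) below.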
B. ELM

In supervised batch learning, the learning algorithms use a finite number of input–output samples for training. Consider $N$ arbitrary distinct samples $(\mathbf{x}_i, \mathbf{t}_i) \in \mathbf{R}^n \times \mathbf{R}^m$, where $\mathbf{x}_i$ is an $n \times 1$ input vector and $\mathbf{t}_i$ is an $m \times 1$ target vector.

If an SLFN with $\tilde{N}$ hidden nodes can approximate these $N$ samples with zero error, it then implies that there exist $\beta_i$, $\mathbf{a}_i$, and $b_i$ such that

$$\sum_{i=1}^{\tilde{N}} \beta_i G(\mathbf{a}_i, b_i, \mathbf{x}_j) = \mathbf{t}_j, \qquad j = 1, \ldots, N. \tag{4}$$

Equation (4) can be written compactly as

$$\mathbf{H}\beta = \mathbf{T} \tag{5}$$

where

$$\mathbf{H}(\mathbf{a}_1, \ldots, \mathbf{a}_{\tilde{N}}, b_1, \ldots, b_{\tilde{N}}, \mathbf{x}_1, \ldots, \mathbf{x}_N) = \begin{bmatrix} G(\mathbf{a}_1, b_1, \mathbf{x}_1) & \cdots & G(\mathbf{a}_{\tilde{N}}, b_{\tilde{N}}, \mathbf{x}_1) \\ \vdots & \cdots & \vdots \\ G(\mathbf{a}_1, b_1, \mathbf{x}_N) & \cdots & G(\mathbf{a}_{\tilde{N}}, b_{\tilde{N}}, \mathbf{x}_N) \end{bmatrix}_{N \times \tilde{N}} \tag{6}$$

$$\beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_{\tilde{N}}^T \end{bmatrix}_{\tilde{N} \times m} \quad \text{and} \quad \mathbf{T} = \begin{bmatrix} \mathbf{t}_1^T \\ \vdots \\ \mathbf{t}_N^T \end{bmatrix}_{N \times m}. \tag{7}$$

$\mathbf{H}$ is called the hidden layer output matrix of the network [26]; the $i$th column of $\mathbf{H}$ is the $i$th hidden node's output vector with respect to inputs $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$, and the $j$th row of $\mathbf{H}$ is the output vector of the hidden layer with respect to input $\mathbf{x}_j$.

The ELM algorithm is based on the following two principles.
1) When the number of training samples equals the number of hidden nodes, i.e., $N = \tilde{N}$, one can randomly assign the parameters of hidden nodes (the input weights and biases for additive hidden nodes or the centers and impact factors for RBF nodes) and, based on this, analytically calculate the output weights by simply inverting $\mathbf{H}$, realizing zero training error. Calculation of the output weights is done in a single step here. There is no need for any lengthy training procedure where the network parameters are adjusted iteratively with appropriately chosen control parameters (learning rate, learning epochs, etc.).
2) When the number of training samples is greater than the number of hidden nodes, i.e., $N > \tilde{N}$, one can still randomly assign the parameters of hidden nodes and calculate the output weights by using a pseudoinverse of $\mathbf{H}$, giving a small nonzero training error $\epsilon$. Here also the output weights' calculation is done in a single step and does not need a lengthy training procedure.
These have been formally stated in the following theorems [27].

Theorem II.1: Let an SLFN with $\tilde{N}$ additive or RBF hidden nodes and an activation function $g(x)$ which is infinitely differentiable in any interval of $\mathbf{R}$ be given.¹ Then, for arbitrary $\tilde{N}$ distinct input vectors $\{\mathbf{x}_i \mid \mathbf{x}_i \in \mathbf{R}^n, i = 1, \ldots, \tilde{N}\}$ and $\{(\mathbf{a}_i, b_i)\}_{i=1}^{\tilde{N}}$ randomly generated with any continuous probability distribution, respectively, the hidden layer output matrix $\mathbf{H}$ is invertible with probability one.

¹Detailed discussions on threshold networks have been given in [22].

Proof: Instead of repeating the rigorous proof provided by Huang et al. [27], the basic idea of the proof can be summarized as follows. Let us consider a vector $\mathbf{c}(b_i)$, the $i$th column of $\mathbf{H}$, in the Euclidean space $\mathbf{R}^{\tilde{N}}$, where $b_i \in (a, b)$ and $(a, b)$ is any interval of $\mathbf{R}$. Following the same proof method of Tamura and Tateishi [28, p. 252] and Huang [26, Th. 2.1], it can easily be proved that the vector $\mathbf{c}$ does not belong to any subspace whose dimension is less than $\tilde{N}$. Hence, from any interval $(a, b)$ it is possible to choose $\tilde{N}$ bias values $b_1, \ldots, b_{\tilde{N}}$ for the $\tilde{N}$ hidden neurons such that the corresponding vectors $\mathbf{c}(b_1), \mathbf{c}(b_2), \ldots, \mathbf{c}(b_{\tilde{N}})$ span $\mathbf{R}^{\tilde{N}}$. This means that, for hidden node parameters randomly generated based on any continuous probability distribution, $\mathbf{H}$ can be made full-rank with probability one.

Theorem II.1 implies that SLFNs with randomly generated additive or RBF hidden nodes can learn distinct samples with zero error. In real applications, the number of hidden nodes $\tilde{N}$ will always be less than the number of training samples $N$ and, hence, the training error cannot be made exactly zero but can approach a nonzero training error $\epsilon$. The following theorem formally states this fact [27].

Theorem II.2: Given any small positive value $\epsilon > 0$ and an activation function $g(x): \mathbf{R} \rightarrow \mathbf{R}$ which is infinitely differentiable in any interval, there exists $\tilde{N} \leq N$ such that, for arbitrary $N$ distinct input vectors $\{\mathbf{x}_i \mid \mathbf{x}_i \in \mathbf{R}^n, i = 1, \ldots, N\}$ and for any $\{(\mathbf{a}_i, b_i)\}_{i=1}^{\tilde{N}}$ randomly generated according to any continuous probability distribution, $\|\mathbf{H}_{N \times \tilde{N}}\beta_{\tilde{N} \times m} - \mathbf{T}_{N \times m}\| < \epsilon$ with probability one.

Proof: According to Theorem II.1, for $\tilde{N} = N$ we have $\|\mathbf{H}_{N \times \tilde{N}}\beta_{\tilde{N} \times m} - \mathbf{T}_{N \times m}\| = 0$. Thus, there should exist $\tilde{N} \leq N$ which makes $\|\mathbf{H}_{N \times \tilde{N}}\beta_{\tilde{N} \times m} - \mathbf{T}_{N \times m}\| < \epsilon$.

According to Theorems II.1 and II.2, the hidden node parameters $\mathbf{a}_i$ and $b_i$ (input weights and biases or centers and impact factors) of SLFNs need not be tuned during training and may simply be assigned with random values. Equation (5) then becomes a linear system, and the output weights $\beta$ are estimated as

$$\hat{\beta} = \mathbf{H}^{\dagger}\mathbf{T} \tag{8}$$

where $\mathbf{H}^{\dagger}$ is the Moore–Penrose generalized inverse [29] of the hidden layer output matrix $\mathbf{H}$. There are several ways to calculate the Moore–Penrose generalized inverse of a matrix, including the orthogonal projection method, the orthogonalization method, the iterative method, and singular value decomposition (SVD) [29]. The orthogonal projection method can be used when $\mathbf{H}^T\mathbf{H}$ is nonsingular and $\mathbf{H}^{\dagger} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T$. However, $\mathbf{H}^T\mathbf{H}$ may tend to become singular in some applications, so the orthogonal projection method may not perform well in all applications. The orthogonalization method and the iterative method have limitations since searching and iterations are used. The SVD can always be used to calculate the Moore–Penrose generalized inverse of $\mathbf{H}$ and thus is used in most implementations of ELM.

It should be noted that the aforementioned theorems assume the availability of the complete training data and use (8) for calculating the output weights. ELM is thus a batch learning method. The universal approximation capability of ELM has been analyzed in [30] in an incremental method,² and it has been shown that SLFNs with randomly generated additive or RBF nodes, with a widespread of piecewise continuous activation functions, can universally approximate any continuous target function on any compact subspace of the Euclidean space $\mathbf{R}^n$. In the implementation of ELM, the activation functions for additive nodes can be any bounded nonconstant piecewise continuous functions, and the activation functions for RBF nodes can be any integrable piecewise continuous functions.

²Huang et al. [30] learn the data in batch mode but the hidden nodes of the network grow one by one. However, the sequential learning algorithm proposed in this paper has a fixed network architecture and learns the training data in sequential mode.
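As a concrete illustration of the batch ELM just reviewed, the following sketch (Python/NumPy; the sigmoid activation, the uniform random ranges, and the function names are illustrative assumptions rather than prescriptions from the paper) trains an SLFN with additive nodes by randomly assigning the hidden parameters and solving (8) with a pseudoinverse.

    import numpy as np

    def elm_train(X, T, n_hidden, rng=np.random.default_rng(0)):
        """Batch ELM: random additive hidden nodes + least-squares output weights, Eq. (8)."""
        n_in = X.shape[1]
        A = rng.uniform(-1.0, 1.0, size=(n_hidden, n_in))   # random input weights a_i
        b = rng.uniform(-1.0, 1.0, size=n_hidden)            # random biases b_i
        H = 1.0 / (1.0 + np.exp(-(X @ A.T + b)))             # hidden layer output matrix, Eq. (6)
        beta = np.linalg.pinv(H) @ T                          # Moore-Penrose inverse (computed via SVD)
        return A, b, beta

    def elm_predict(X, A, b, beta):
        H = 1.0 / (1.0 + np.exp(-(X @ A.T + b)))
        return H @ beta

Here np.linalg.pinv computes the generalized inverse through an SVD, which is consistent with the SVD-based route mentioned above for the cases where the orthogonal projection method would fail.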

III. OS-ELM

The batch ELM described previously assumes that all the training data ($N$ samples) are available for training. However, in real applications, the training data may arrive chunk-by-chunk or one-by-one (a special case of chunk) and, hence, the batch ELM algorithm has to be modified so as to make it online sequential.

The output weight matrix $\hat{\beta}$ given in (8) is a least-squares solution of (5). Here, we consider the case where $\mathrm{rank}(\mathbf{H}) = \tilde{N}$, the number of hidden nodes. Under this condition, $\mathbf{H}^{\dagger}$ of (8) is given by

$$\mathbf{H}^{\dagger} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T. \tag{9}$$

This is also called the left pseudoinverse of $\mathbf{H}$ from the fact that $\mathbf{H}^{\dagger}\mathbf{H} = \mathbf{I}_{\tilde{N}}$. If $\mathbf{H}^T\mathbf{H}$ tends to become singular, one can make it nonsingular by choosing a smaller network size $\tilde{N}$ or by increasing the data number $N_0$ in the initialization phase of OS-ELM. Substituting (9) into (8), $\hat{\beta}$ becomes

$$\hat{\beta} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{T}. \tag{10}$$

Equation (10) is called the least-squares solution to $\mathbf{H}\beta = \mathbf{T}$. Sequential implementation of the least-squares solution (10) results in the OS-ELM.

Given a chunk of initial training data $\aleph_0 = \{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=1}^{N_0}$ with $N_0 \geq \tilde{N}$, if one considers using the batch ELM algorithm, one needs to consider only the problem of minimizing $\|\mathbf{H}_0\beta - \mathbf{T}_0\|$, where

$$\mathbf{H}_0 = \begin{bmatrix} G(\mathbf{a}_1, b_1, \mathbf{x}_1) & \cdots & G(\mathbf{a}_{\tilde{N}}, b_{\tilde{N}}, \mathbf{x}_1) \\ \vdots & \cdots & \vdots \\ G(\mathbf{a}_1, b_1, \mathbf{x}_{N_0}) & \cdots & G(\mathbf{a}_{\tilde{N}}, b_{\tilde{N}}, \mathbf{x}_{N_0}) \end{bmatrix}_{N_0 \times \tilde{N}} \quad \text{and} \quad \mathbf{T}_0 = \begin{bmatrix} \mathbf{t}_1^T \\ \vdots \\ \mathbf{t}_{N_0}^T \end{bmatrix}_{N_0 \times m}. \tag{11}$$

By Theorem II.1, the solution to minimizing $\|\mathbf{H}_0\beta - \mathbf{T}_0\|$ is given by $\beta^{(0)} = \mathbf{K}_0^{-1}\mathbf{H}_0^T\mathbf{T}_0$, where $\mathbf{K}_0 = \mathbf{H}_0^T\mathbf{H}_0$.

Suppose now that we are given another chunk of data $\aleph_1 = \{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=N_0+1}^{N_0+N_1}$, where $N_1$ denotes the number of observations in this chunk; the problem then becomes minimizing

$$\left\| \begin{bmatrix} \mathbf{H}_0 \\ \mathbf{H}_1 \end{bmatrix} \beta - \begin{bmatrix} \mathbf{T}_0 \\ \mathbf{T}_1 \end{bmatrix} \right\| \tag{12}$$

where

$$\mathbf{H}_1 = \begin{bmatrix} G(\mathbf{a}_1, b_1, \mathbf{x}_{N_0+1}) & \cdots & G(\mathbf{a}_{\tilde{N}}, b_{\tilde{N}}, \mathbf{x}_{N_0+1}) \\ \vdots & \cdots & \vdots \\ G(\mathbf{a}_1, b_1, \mathbf{x}_{N_0+N_1}) & \cdots & G(\mathbf{a}_{\tilde{N}}, b_{\tilde{N}}, \mathbf{x}_{N_0+N_1}) \end{bmatrix}_{N_1 \times \tilde{N}} \quad \text{and} \quad \mathbf{T}_1 = \begin{bmatrix} \mathbf{t}_{N_0+1}^T \\ \vdots \\ \mathbf{t}_{N_0+N_1}^T \end{bmatrix}_{N_1 \times m}. \tag{13}$$

Considering both chunks of training data $\aleph_0$ and $\aleph_1$, the output weight $\beta$ becomes

$$\beta^{(1)} = \mathbf{K}_1^{-1} \begin{bmatrix} \mathbf{H}_0 \\ \mathbf{H}_1 \end{bmatrix}^T \begin{bmatrix} \mathbf{T}_0 \\ \mathbf{T}_1 \end{bmatrix} \tag{14}$$

where

$$\mathbf{K}_1 = \begin{bmatrix} \mathbf{H}_0 \\ \mathbf{H}_1 \end{bmatrix}^T \begin{bmatrix} \mathbf{H}_0 \\ \mathbf{H}_1 \end{bmatrix}. \tag{15}$$

For sequential learning, we have to express $\beta^{(1)}$ as a function of $\beta^{(0)}$, $\mathbf{K}_1$, $\mathbf{H}_1$, and $\mathbf{T}_1$, and not as a function of the data set $\aleph_0$. Now $\mathbf{K}_1$ can be written as

$$\mathbf{K}_1 = \begin{bmatrix} \mathbf{H}_0^T & \mathbf{H}_1^T \end{bmatrix} \begin{bmatrix} \mathbf{H}_0 \\ \mathbf{H}_1 \end{bmatrix} = \mathbf{K}_0 + \mathbf{H}_1^T\mathbf{H}_1 \tag{16}$$

and

$$\begin{bmatrix} \mathbf{H}_0 \\ \mathbf{H}_1 \end{bmatrix}^T \begin{bmatrix} \mathbf{T}_0 \\ \mathbf{T}_1 \end{bmatrix} = \mathbf{H}_0^T\mathbf{T}_0 + \mathbf{H}_1^T\mathbf{T}_1 = \mathbf{K}_0\beta^{(0)} + \mathbf{H}_1^T\mathbf{T}_1 = \left(\mathbf{K}_1 - \mathbf{H}_1^T\mathbf{H}_1\right)\beta^{(0)} + \mathbf{H}_1^T\mathbf{T}_1. \tag{17}$$

Combining (14) and (17), $\beta^{(1)}$ is given by

$$\beta^{(1)} = \mathbf{K}_1^{-1}\left(\left(\mathbf{K}_1 - \mathbf{H}_1^T\mathbf{H}_1\right)\beta^{(0)} + \mathbf{H}_1^T\mathbf{T}_1\right) = \beta^{(0)} + \mathbf{K}_1^{-1}\mathbf{H}_1^T\left(\mathbf{T}_1 - \mathbf{H}_1\beta^{(0)}\right) \tag{18}$$

where $\mathbf{K}_1$ is given by

$$\mathbf{K}_1 = \mathbf{K}_0 + \mathbf{H}_1^T\mathbf{H}_1. \tag{19}$$

Generalizing the previous arguments, as new data arrive, a recursive algorithm for updating the least-squares solution, which is similar to the recursive least-squares algorithm [31], can be written as follows. When the $(k+1)$th chunk of data

$$\aleph_{k+1} = \{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=\left(\sum_{j=0}^{k} N_j\right)+1}^{\sum_{j=0}^{k+1} N_j}$$

is received, where $k \geq 0$ and $N_{k+1}$ denotes the number of observations in the $(k+1)$th chunk, we have

$$\mathbf{K}_{k+1} = \mathbf{K}_k + \mathbf{H}_{k+1}^T\mathbf{H}_{k+1}, \qquad \beta^{(k+1)} = \beta^{(k)} + \mathbf{K}_{k+1}^{-1}\mathbf{H}_{k+1}^T\left(\mathbf{T}_{k+1} - \mathbf{H}_{k+1}\beta^{(k)}\right) \tag{20}$$

where

$$\mathbf{H}_{k+1} = \begin{bmatrix} G\left(\mathbf{a}_1, b_1, \mathbf{x}_{\left(\sum_{j=0}^{k} N_j\right)+1}\right) & \cdots & G\left(\mathbf{a}_{\tilde{N}}, b_{\tilde{N}}, \mathbf{x}_{\left(\sum_{j=0}^{k} N_j\right)+1}\right) \\ \vdots & \cdots & \vdots \\ G\left(\mathbf{a}_1, b_1, \mathbf{x}_{\sum_{j=0}^{k+1} N_j}\right) & \cdots & G\left(\mathbf{a}_{\tilde{N}}, b_{\tilde{N}}, \mathbf{x}_{\sum_{j=0}^{k+1} N_j}\right) \end{bmatrix}_{N_{k+1} \times \tilde{N}} \tag{21}$$

and

$$\mathbf{T}_{k+1} = \begin{bmatrix} \mathbf{t}_{\left(\sum_{j=0}^{k} N_j\right)+1}^T \\ \vdots \\ \mathbf{t}_{\sum_{j=0}^{k+1} N_j}^T \end{bmatrix}_{N_{k+1} \times m} \tag{22}$$

both hold. Since $\mathbf{K}_{k+1}^{-1}$ rather than $\mathbf{K}_{k+1}$ is used to compute $\beta^{(k+1)}$ from $\beta^{(k)}$ in (20), the update formula for $\mathbf{K}_{k+1}^{-1}$ is derived using the Woodbury formula [32]

$$\mathbf{K}_{k+1}^{-1} = \left(\mathbf{K}_k + \mathbf{H}_{k+1}^T\mathbf{H}_{k+1}\right)^{-1} = \mathbf{K}_k^{-1} - \mathbf{K}_k^{-1}\mathbf{H}_{k+1}^T\left(\mathbf{I} + \mathbf{H}_{k+1}\mathbf{K}_k^{-1}\mathbf{H}_{k+1}^T\right)^{-1}\mathbf{H}_{k+1}\mathbf{K}_k^{-1}. \tag{23}$$

Let $\mathbf{P}_{k+1} = \mathbf{K}_{k+1}^{-1}$; then the equations for updating $\beta^{(k+1)}$ can be written as

$$\mathbf{P}_{k+1} = \mathbf{P}_k - \mathbf{P}_k\mathbf{H}_{k+1}^T\left(\mathbf{I} + \mathbf{H}_{k+1}\mathbf{P}_k\mathbf{H}_{k+1}^T\right)^{-1}\mathbf{H}_{k+1}\mathbf{P}_k, \qquad \beta^{(k+1)} = \beta^{(k)} + \mathbf{P}_{k+1}\mathbf{H}_{k+1}^T\left(\mathbf{T}_{k+1} - \mathbf{H}_{k+1}\beta^{(k)}\right). \tag{24}$$

Equation (24) gives the recursive formula for $\beta^{(k+1)}$.

Remark 1: From (24), it can be seen that the sequential implementation of the least-squares solution (10) is similar to the recursive least-squares (RLS) algorithm in [31]. Hence, all the convergence results of RLS can be applied here.

Remark 2: From the derivation of OS-ELM, it can be seen that OS-ELM and ELM can achieve the same learning performance (training error and generalization accuracy) when $\mathrm{rank}(\mathbf{H}) = \tilde{N}$. In order to make $\mathrm{rank}(\mathbf{H}_0) = \tilde{N}$ and $\mathrm{rank}(\mathbf{H}) = \tilde{N}$, the number of initialization data $N_0$ should not be less than the hidden node number $\tilde{N}$.
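The chunk update in (23)–(24) translates directly into a few matrix operations. The sketch below (Python/NumPy; the function name and argument layout are illustrative assumptions, not code from the paper) applies one such update for the $(k+1)$th chunk, propagating the pair $(\mathbf{P}_k, \beta^{(k)})$.

    import numpy as np

    def oselm_update(P, beta, H_chunk, T_chunk):
        """One OS-ELM step, Eqs. (23)-(24).
        P: (n_hidden, n_hidden) current K_k^{-1}; beta: (n_hidden, m) current weights;
        H_chunk: (N_chunk, n_hidden) partial hidden-layer output; T_chunk: (N_chunk, m) targets."""
        S = np.eye(H_chunk.shape[0]) + H_chunk @ P @ H_chunk.T        # I + H P H^T
        P_new = P - P @ H_chunk.T @ np.linalg.solve(S, H_chunk @ P)   # Woodbury-style update, Eq. (24)
        beta_new = beta + P_new @ H_chunk.T @ (T_chunk - H_chunk @ beta)
        return P_new, beta_new

Note that only the small $N_{k+1} \times N_{k+1}$ system $S$ is solved per chunk; the past data never reappear, which is exactly what makes the algorithm sequential.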
Now, the OS-ELM can be summarized as follows.

Proposed OS-ELM Algorithm: First, select the type of node (additive or RBF), the corresponding activation function $g$, and the hidden node number $\tilde{N}$. The data $\aleph = \{(\mathbf{x}_i, \mathbf{t}_i) \mid \mathbf{x}_i \in \mathbf{R}^n, \mathbf{t}_i \in \mathbf{R}^m, i = 1, 2, \ldots\}$ arrive sequentially.

OS-ELM consists of two phases, namely an initialization phase and a sequential learning phase. In the initialization phase, the appropriate matrix $\mathbf{H}_0$ is filled up for use in the learning phase. The number of data required to fill up $\mathbf{H}_0$ should be at least equal to the number of hidden nodes. According to Theorem II.1, $\mathrm{rank}(\mathbf{H}_0) = \tilde{N}$ if the first $\tilde{N}$ training data are distinct. For example, if there are ten hidden nodes, ten training samples are enough. If the first $\tilde{N}$ training data are not distinct, more training data may be required. However, in most cases the number of training data $N_0$ required in the initialization phase can be equal or close to $\tilde{N}$. Following the initialization phase, the learning phase commences either on a one-by-one or chunk-by-chunk (with fixed or varying size) basis as desired. Once a data item is used, it is discarded and not used anymore.

Step 1) Initialization Phase: Initialize the learning using a small chunk of initial training data $\aleph_0 = \{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=1}^{N_0}$ from the given training set $\aleph$, with $N_0 \geq \tilde{N}$.
a) Assign random input weights $\mathbf{a}_i$ and biases $b_i$ (for additive hidden nodes) or centers $\mathbf{a}_i$ and impact factors $b_i$ (for RBF hidden nodes), $i = 1, \ldots, \tilde{N}$.
b) Calculate the initial hidden layer output matrix

$$\mathbf{H}_0 = \begin{bmatrix} G(\mathbf{a}_1, b_1, \mathbf{x}_1) & \cdots & G(\mathbf{a}_{\tilde{N}}, b_{\tilde{N}}, \mathbf{x}_1) \\ \vdots & \cdots & \vdots \\ G(\mathbf{a}_1, b_1, \mathbf{x}_{N_0}) & \cdots & G(\mathbf{a}_{\tilde{N}}, b_{\tilde{N}}, \mathbf{x}_{N_0}) \end{bmatrix}_{N_0 \times \tilde{N}}. \tag{25}$$

c) Estimate the initial output weight $\beta^{(0)} = \mathbf{P}_0\mathbf{H}_0^T\mathbf{T}_0$, where $\mathbf{P}_0 = (\mathbf{H}_0^T\mathbf{H}_0)^{-1}$ and $\mathbf{T}_0 = [\mathbf{t}_1, \ldots, \mathbf{t}_{N_0}]^T$.
d) Set $k = 0$.

Step 2) Sequential Learning Phase: Present the $(k+1)$th chunk of new observations

$$\aleph_{k+1} = \{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=\left(\sum_{j=0}^{k} N_j\right)+1}^{\sum_{j=0}^{k+1} N_j}$$

where $N_{k+1}$ denotes the number of observations in the $(k+1)$th chunk.
a) Calculate the partial hidden layer output matrix $\mathbf{H}_{k+1}$ for the $(k+1)$th chunk of data $\aleph_{k+1}$

$$\mathbf{H}_{k+1} = \begin{bmatrix} G\left(\mathbf{a}_1, b_1, \mathbf{x}_{\left(\sum_{j=0}^{k} N_j\right)+1}\right) & \cdots & G\left(\mathbf{a}_{\tilde{N}}, b_{\tilde{N}}, \mathbf{x}_{\left(\sum_{j=0}^{k} N_j\right)+1}\right) \\ \vdots & \cdots & \vdots \\ G\left(\mathbf{a}_1, b_1, \mathbf{x}_{\sum_{j=0}^{k+1} N_j}\right) & \cdots & G\left(\mathbf{a}_{\tilde{N}}, b_{\tilde{N}}, \mathbf{x}_{\sum_{j=0}^{k+1} N_j}\right) \end{bmatrix}_{N_{k+1} \times \tilde{N}}. \tag{26}$$

b) Set $\mathbf{T}_{k+1} = \left[\mathbf{t}_{\left(\sum_{j=0}^{k} N_j\right)+1}, \ldots, \mathbf{t}_{\sum_{j=0}^{k+1} N_j}\right]^T$.
c) Calculate the output weight $\beta^{(k+1)}$

$$\mathbf{P}_{k+1} = \mathbf{P}_k - \mathbf{P}_k\mathbf{H}_{k+1}^T\left(\mathbf{I} + \mathbf{H}_{k+1}\mathbf{P}_k\mathbf{H}_{k+1}^T\right)^{-1}\mathbf{H}_{k+1}\mathbf{P}_k, \qquad \beta^{(k+1)} = \beta^{(k)} + \mathbf{P}_{k+1}\mathbf{H}_{k+1}^T\left(\mathbf{T}_{k+1} - \mathbf{H}_{k+1}\beta^{(k)}\right). \tag{27}$$

d) Set $k = k + 1$. Go to Step 2).

Remark 3: The chunk size does not need to be constant, i.e., the number of training observations $N_{k+1}$ in the $(k+1)$th chunk does not need to be the same as $N_k$. When the training data are received one-by-one instead of chunk-by-chunk ($N_{k+1} = 1$), (27) has the following simple format (Sherman–Morrison formula [32]):

$$\mathbf{P}_{k+1} = \mathbf{P}_k - \frac{\mathbf{P}_k\mathbf{h}_{k+1}\mathbf{h}_{k+1}^T\mathbf{P}_k}{1 + \mathbf{h}_{k+1}^T\mathbf{P}_k\mathbf{h}_{k+1}}, \qquad \beta^{(k+1)} = \beta^{(k)} + \mathbf{P}_{k+1}\mathbf{h}_{k+1}\left(\mathbf{t}_{k+1}^T - \mathbf{h}_{k+1}^T\beta^{(k)}\right) \tag{28}$$

where $\mathbf{h}_{k+1}$ is the hidden layer output (column) vector for the newly arrived observation, which is the case analyzed and demonstrated in [33].

Remark 4: In order to handle the case where $\mathbf{H}_0^T\mathbf{H}_0$ and/or $\mathbf{I} + \mathbf{H}_{k+1}\mathbf{P}_k\mathbf{H}_{k+1}^T$ are singular or near singular, and to make OS-ELM more robust, $\mathbf{P}_0$ and $\mathbf{P}_{k+1}$ can be calculated using SVD, as done in all our implementations of OS-ELM.

Remark 5: If $N_0 = N$, then OS-ELM becomes the batch ELM. Thus, batch ELM can be considered as a special case of OS-ELM when all the training observations are present in one learning iteration.
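Putting the two phases together, a minimal end-to-end sketch of the procedure might look as follows (Python/NumPy; the sigmoid activation, the use of a pseudoinverse for robustness in the spirit of Remark 4, and all names are assumptions made for illustration, not the authors' implementation).

    import numpy as np

    def sigmoid_hidden(X, A, b):
        # Hidden-layer output for additive sigmoid nodes, cf. Eq. (2)
        return 1.0 / (1.0 + np.exp(-(X @ A.T + b)))

    def oselm_fit(chunks, n_hidden, n_in, rng=np.random.default_rng(0)):
        """chunks: iterable of (X_chunk, T_chunk); the first chunk plays the role of the
        initialization set and should contain at least n_hidden distinct samples."""
        A = rng.uniform(-1.0, 1.0, size=(n_hidden, n_in))    # random hidden parameters
        b = rng.uniform(-1.0, 1.0, size=n_hidden)
        chunks = iter(chunks)

        # Step 1) Initialization phase: beta^(0) = P0 H0^T T0, P0 = (H0^T H0)^{-1}
        X0, T0 = next(chunks)
        H0 = sigmoid_hidden(X0, A, b)
        P = np.linalg.pinv(H0.T @ H0)                         # pinv instead of inv (cf. Remark 4)
        beta = P @ H0.T @ T0

        # Step 2) Sequential learning phase: Eqs. (26)-(27) for every new chunk
        for Xk, Tk in chunks:
            Hk = sigmoid_hidden(Xk, A, b)
            S = np.eye(Hk.shape[0]) + Hk @ P @ Hk.T
            P = P - P @ Hk.T @ np.linalg.solve(S, Hk @ P)
            beta = beta + P @ Hk.T @ (Tk - Hk @ beta)
        return A, b, beta

Once trained, predictions are simply sigmoid_hidden(X, A, b) @ beta, and the chunks that have already been learned are never needed again.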
IV. COMPARISON OF OS-ELM WITH OTHER SEQUENTIAL LEARNING ALGORITHMS

In this section, the similarities and differences between OS-ELM and some of the other well-known sequential learning algorithms are presented.

A. One-by-One Versus Chunk-by-Chunk Learning Modes

In real applications, the data presented to the learning algorithm may arrive one-by-one or chunk-by-chunk, where the chunk size may vary. Sequential learning algorithms like SGBP [11], RAN [14], RANEKF [15], MRAN [16], [17], GAP-RBF [18], GGAP-RBF [19], and OS-ELM can be used in the one-by-one mode. SGBP can in principle work chunk-by-chunk, but difficulty may arise as its learning speed may be too slow to complete learning on one chunk before the next one arrives. However, SGBP works well in the one-by-one learning mode. OS-ELM works well for one-by-one and also for chunk-by-chunk learning modes. Further, the chunk size can vary and need not be fixed.

B. Selection of Parameters

The control parameters used in the sequential learning algorithms RAN, RANEKF, MRAN, GAP-RBF, and GGAP-RBF include distance parameters and an impact factor adjustment parameter. Besides these, MRAN uses some growing and pruning parameters. GAP-RBF and GGAP-RBF need an estimate of the input sampling distributions or the ranges of the sampling areas. (Refer to [19] for details.)

For SGBP, the algorithm control parameters are the network size, learning rate, and momentum constant, and they need to be adjusted depending on each problem. The only control parameter to be selected for OS-ELM is the size of the network.

TABLE I
SPECIFICATION OF BENCHMARK DATA SETS

C. Activation Function Types

RAN, RANEKF, MRAN, GAP-RBF, and GGAP-RBF use only RBF nodes, and SGBP uses only additive hidden nodes. However, OS-ELM can work with both additive and RBF hidden nodes. Unlike SGBP, which is gradient-descent-based and can only work with differentiable activation functions, OS-ELM can work with nondifferentiable activation functions as well. In OS-ELM, feedforward networks with additive hidden nodes and RBF networks have been unified in the sense that they can be trained sequentially with the same learning algorithm.

V. PERFORMANCE EVALUATION OF OS-ELM

The performance of OS-ELM is evaluated on the benchmark problems described in Table I, which include three regression applications (auto-MPG, abalone, California housing) [24], three classification applications (image segment, satellite image, DNA) [24], and one time-series prediction application [25]. OS-ELM is first compared with other popular sequential learning algorithms such as SGBP, RAN, RANEKF, MRAN, GAP-RBF, and GGAP-RBF in the one-by-one learning mode, and then the performance evaluation of OS-ELM in the chunk-by-chunk learning mode is conducted. All the simulations have been conducted in the MATLAB 6.5 environment running on an ordinary PC with a 3.0-GHz CPU. Both the Gaussian RBF activation function and the sigmoidal additive activation function have been used in the simulations of ELM³ and OS-ELM. The Gaussian RBF activation function is also used in the simulations of RAN, RANEKF, MRAN, GAP-RBF, and GGAP-RBF, and the sigmoidal activation function is used in SGBP. In our simulations, the input and output attributes of the regression applications are normalized into the range [0, 1], while the input attributes of the classification applications are normalized into the range [-1, 1]. For both ELM and OS-ELM with additive hidden nodes, the input weights and biases are randomly chosen from the range [-1, 1]. For both ELM and OS-ELM with RBF hidden nodes, the centers are randomly chosen from the range [-1, 1]. The impact width $b$ is chosen from the range [0.2, 4.2] for all problems except for OS-ELM in the image segment and DNA cases. For these two cases, in order to avoid singularity of $\mathbf{H}^T\mathbf{H}$, the ranges are [3, 11] and [20, 60], respectively.

³Source codes and some references of ELM can be found at www.ntu.edu.sg/home/egbhuang/
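The normalization described above is a simple per-attribute min–max rescaling. A small helper (Python/NumPy; the function and variable names, including the raw data placeholders, are assumptions for illustration) could look like this.

    import numpy as np

    def minmax_scale(X, lo, hi):
        """Rescale each attribute (column) of X into the interval [lo, hi]."""
        xmin, xmax = X.min(axis=0), X.max(axis=0)
        return lo + (hi - lo) * (X - xmin) / (xmax - xmin)

    # regression inputs/outputs into [0, 1]; classification inputs into [-1, 1]
    # X_reg = minmax_scale(X_reg_raw, 0.0, 1.0)
    # X_cls = minmax_scale(X_cls_raw, -1.0, 1.0)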
A. Model Selection

The estimation of the optimal architecture of the network and the optimal learning parameters of the learning algorithm is called model selection in the literature. It is problem specific and has to be predetermined.

For OS-ELM, only the optimal number of hidden units needs to be determined. SGBP requires determination of the optimal number of hidden units, the learning rate, and the momentum constant. For RAN, RANEKF, MRAN, GAP-RBF, and GGAP-RBF, control parameters including the distance parameters and the impact factor adjustment parameter have to be determined. For MRAN, the growing and pruning thresholds need to be determined. For GGAP-RBF, the input sampling distribution has to be estimated.

For the sake of simplicity, we mainly discuss the procedure of selecting the optimal number of hidden nodes for the proposed OS-ELM algorithm and SGBP. The procedure makes use of the training and validation set method. The training data set is separated into two nonoverlapping subsets: one for training and the other for validation. The optimal number of hidden units is selected as the one which results in the lowest validation error. An example of model selection of the optimal hidden unit number for OS-ELM with sigmoidal activation function [OS-ELM (sigmoid)] and SGBP is shown in Fig. 1 for the auto-MPG case. In that figure, the top two curves correspond to the training and validation errors (averaged over 50 trials) for SGBP and the bottom two curves are for OS-ELM (sigmoid). As observed from Fig. 1, the lowest validation error is achieved when the number of hidden nodes of OS-ELM (sigmoid) and SGBP is within the ranges [15, 35] and [13, 24], respectively. Therefore, one can choose the optimal hidden unit numbers for OS-ELM and SGBP (in the auto-MPG case) from these ranges. It can also be seen that the RMS error curves for OS-ELM are quite smooth compared to those of SGBP, which implies that OS-ELM is less sensitive to the network size. Fig. 2 details such behavior in a single trial for OS-ELM with 25 hidden units and for SGBP with 13 hidden units. In addition, the number of training data used for initialization has been taken somewhat larger than the network size for the regression, classification, and time-series problems.

In our paper, the optimal number of hidden units for OS-ELM and SGBP has been selected in this way for all benchmark problems. The optimal learning parameters for SGBP and the optimal control parameters for the RAN-based sequential algorithms are also selected.
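The validation-based selection just described amounts to a sweep over candidate hidden-unit counts. A hedged sketch follows (Python/NumPy; oselm_fit refers to the illustrative routine given after Remark 5, and the candidate grid and RMSE metric are assumptions, not the authors' exact protocol).

    import numpy as np

    def select_hidden_units(X_tr, T_tr, X_val, T_val, candidates):
        """Pick the hidden-node count with the lowest validation RMSE."""
        best_n, best_rmse = None, np.inf
        for n in candidates:
            A, b, beta = oselm_fit([(X_tr, T_tr)], n_hidden=n, n_in=X_tr.shape[1])
            pred = (1.0 / (1.0 + np.exp(-(X_val @ A.T + b)))) @ beta
            rmse = np.sqrt(np.mean((pred - T_val) ** 2))
            if rmse < best_rmse:
                best_n, best_rmse = n, rmse
        return best_n

    # e.g., sweep candidate sizes in steps of five
    # best = select_hidden_units(X_train, T_train, X_valid, T_valid, range(5, 55, 5))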
B. Performance Evaluation of OS-ELM: One-by-One Case

Fig. 1. Model selection for auto-MPG case.

Fig. 2. Learning evolution for auto-MPG case.

We first evaluate and compare the performance of the proposed OS-ELM with other one-by-one learning mode algorithms: SGBP, RAN, RANEKF, MRAN, GAP-RBF, and GGAP-RBF. For each problem, the results are averaged over 50 trials. The average training time, the average training and testing RMSE for the regression and time-series prediction applications, and the average training and testing classification rates for the classification problems are shown.

1) Regression Problems: Three benchmark problems have been studied here, namely: auto-MPG, abalone, and California housing [24]. The auto-MPG problem is to predict the fuel consumption (miles per gallon) of different models of cars. The abalone problem is the estimation of the age of abalone from physical measurements, and the California housing problem is to predict the median California housing price based on information collected from all the block groups in California in the 1990 census. For all the problems studied here, the training and testing data are randomly selected for each trial.

Table II summarizes the results for the regression problems in terms of the training time, training RMSE, testing RMSE, and the number of hidden units for each algorithm. The number of hidden units for OS-ELM (sigmoid), OS-ELM (RBF), and SGBP was determined based on the model selection procedure, while for RAN, RANEKF, MRAN, and GGAP-RBF it is generated automatically by the algorithms.

TABLE II
COMPARISON BETWEEN OS-ELM AND OTHER SEQUENTIAL ALGORITHMS ON REGRESSION APPLICATIONS

TABLE III
COMPARISON BETWEEN OS-ELM AND OTHER SEQUENTIAL ALGORITHMS ON CLASSIFICATION APPLICATIONS

As observed from Table II, the performance of OS-ELM (sigmoid) and OS-ELM (RBF) is similar, except that OS-ELM (RBF) requires about twice the training time taken by OS-ELM (sigmoid). Comparing with the other algorithms, we can see that the training time taken by both OS-ELMs and SGBP is much less than that of RAN, RANEKF, MRAN, GAP-RBF, and GGAP-RBF. However, out of all the learning algorithms, the OS-ELMs obtained the lowest testing root-mean-square error (RMSE).

2) Classification Problems: For the classification studies, three benchmark problems have been considered, namely: image segmentation, satellite image, and DNA [24]. The image segmentation problem consists of a database of images drawn randomly from seven outdoor images and consists of 2310 regions of 3 x 3 pixels. The aim is to classify each region into one of seven categories, namely: brick facing, sky, foliage, cement, window, path, and grass, using 19 attributes extracted from each square region. Training and testing data sets are randomly drawn from the database.

The satellite image problem consists of a database generated from a Landsat multispectral scanner. One frame of Landsat multispectral scanner imagery consists of four digital images of the same scene in four different spectral bands. The database is a (tiny) subarea of a scene, consisting of 82 x 100 pixels. Each datum in the database corresponds to a region of 3 x 3 pixels. The aim is to classify the central pixel in a region into one of six categories, namely: red soil, cotton crop, grey soil, damp grey soil, soil with vegetation stubble, and very damp grey soil, using 36 spectral values for each region. The training and test sets are fixed according to [24], but the order of the training set is randomly shuffled for each trial.

The database "Primate splice-junction gene sequences (DNA) with associated imperfect domain theory" is known as the DNA problem. Splice junctions are points on a DNA sequence at which "superfluous" DNA is removed during the process of protein creation in higher organisms. The aim of the DNA problem is, given a sequence of DNA, to recognize the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out). This consists of three subtasks: recognizing exon/intron boundaries (referred to as EI sites), intron/exon boundaries (IE sites), and neither (N sites). A given sequence of DNA consists of 60 elements (called "nucleotides" or "base-pairs"). The symbolic variables representing the nucleotides were replaced by three binary indicator variables, thus resulting in 180 binary attributes.

TABLE IV
COMPARISON BETWEEN OS-ELM AND OTHER SEQUENTIAL ALGORITHMS ON MACKEY–GLASS TIME-SERIES APPLICATION

TABLE V
PERFORMANCE COMPARISON OF ELM AND OS-ELM ON REGRESSION APPLICATIONS

The training and test sets are also fixed according to [24], but the order of the training set is randomly shuffled for each trial. During our simulations, RAN and RANEKF produced a large number of hidden nodes for these problems, which resulted in system memory overflow or large training time. It was also found that, for these cases, it is difficult to estimate the input sampling distribution for the GGAP-RBF algorithm. For the satellite image and DNA problems, no study has been done with GAP-RBF due to the complexity of estimating the input sampling ranges of higher dimensional and binary input cases. Thus, the results of some RAN-based algorithms could not be provided for these cases. As observed from Table III, OS-ELM achieves the best generalization performance with extremely fast learning speed compared with MRAN. Although SGBP completes training at the fastest speed in these cases, its generalization performance is much worse than that of OS-ELM.

3) Time-Series Prediction Problem: The need for time-series prediction arises in many real-world problems such as detecting arrhythmia in heartbeats, stock market indices, etc. One of the classical benchmark problems in the literature is the chaotic Mackey–Glass differential delay equation given by Mackey and Glass [25]

$$\frac{dx(t)}{dt} = -b\,x(t) + \frac{a\,x(t-\tau)}{1 + x^{10}(t-\tau)} \tag{29}$$

for $a = 0.2$, $b = 0.1$, and $\tau = 17$. Integrating the equation over the time interval $[t, t+\Delta t]$ by the trapezoidal rule yields

$$x(t+\Delta t) = \frac{2 - b\Delta t}{2 + b\Delta t}\,x(t) + \frac{a\Delta t}{2 + b\Delta t}\left[\frac{x(t+\Delta t-\tau)}{1 + x^{10}(t+\Delta t-\tau)} + \frac{x(t-\tau)}{1 + x^{10}(t-\tau)}\right]. \tag{30}$$

The time series is generated under the condition $x(t) = 0.3$ for $0 \leq t \leq \tau$ and is predicted 50 sample steps ahead using the four past samples $x(n)$, $x(n-6)$, $x(n-12)$, and $x(n-18)$. Hence, the $n$th input–output instance is

$$\left[x(n),\, x(n-6),\, x(n-12),\, x(n-18)\right] \rightarrow x(n+50).$$

In this simulation, the training observations are taken from $t = 1$ to $4000$ and the testing observations from $t = 4001$ to $4500$.
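A sketch of how such a data set can be generated follows (Python/NumPy; the step size, warm-up handling, and function names are assumptions consistent with the description above, not code taken from the paper).

    import numpy as np

    def mackey_glass(n_steps, a=0.2, b=0.1, tau=17, x0=0.3, dt=1.0):
        """Integrate Eq. (29) with the trapezoidal-rule recursion of Eq. (30)."""
        x = np.full(n_steps + tau + 1, x0)           # x(t) = 0.3 for 0 <= t <= tau
        f = lambda v: v / (1.0 + v ** 10)
        for t in range(tau, n_steps + tau):
            x[t + 1] = ((2 - b * dt) / (2 + b * dt)) * x[t] \
                       + (a * dt / (2 + b * dt)) * (f(x[t + 1 - tau]) + f(x[t - tau]))
        return x[tau:]                                # drop the constant warm-up segment

    def make_samples(x, lags=(0, 6, 12, 18), horizon=50):
        """Build [x(n), x(n-6), x(n-12), x(n-18)] -> x(n+50) pairs."""
        start, stop = max(lags), len(x) - horizon
        X = np.column_stack([x[start - l:stop - l] for l in lags])
        T = x[start + horizon:stop + horizon].reshape(-1, 1)
        return X, T

The resulting (X, T) pairs can then be split into the training and testing segments described above and fed to OS-ELM either one-by-one or in chunks.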
Performance comparison results are given in Table IV. For this problem, SGBP is excluded because the model selection procedure indicated that it is unsuitable for this application: the validation errors remained high even after trying a wide range of hidden unit numbers. As observed from Table IV, the OS-ELMs achieve the lowest training and testing RMSE as well as the lowest training time.

TABLE VI
PERFORMANCE COMPARISON OF ELM AND OS-ELM ON CLASSIFICATION APPLICATIONS

TABLE VII
PERFORMANCE COMPARISON OF ELM AND OS-ELM ON MACKEY–GLASS TIME SERIES-APPLICATION

C. Performance Evaluation of OS-ELM: Chunk-by-Chunk

Tables V–VII show the performance comparison of ELM and OS-ELM implemented in the chunk-by-chunk learning mode. For the chunk-by-chunk learning mode, we have considered both a fixed chunk size of 20 as well as a randomly varying chunk size between 10 and 30, indicated by [10, 30] in these tables. For the purpose of comparison, results for chunk size 1 and for a chunk size equal to the entire training set (the original batch ELM) have also been presented.

For the regression problems, it can be seen from Table V that the accuracies obtained by ELM and OS-ELM when implemented in different chunk size modes using different activation functions are nearly the same. This is consistent with the earlier theoretical analysis in Section III. As far as the training time is concerned, the sequential operation of one-by-one takes the most time, followed by the chunks of [10, 30] and 20-by-20, and finally the batch mode. This implies that the one-by-one sequential mode takes the longest time while the batch mode takes the shortest, and any chunk mode operation falls in between. If the chunk size is large, the time approaches that taken for batch mode operation.

Table VI shows similar results for the classification problems. The classification accuracies for ELM and OS-ELM using different activation functions and training in the different modes are similar. It can also be seen that the training time of OS-ELM reduces with an increase in the chunk size. For example, in the case of the satellite image problem, the training time reduces from 300 s (using the one-by-one learning mode) to 20 s (using the twenty-by-twenty learning mode). One interesting exception to this is the DNA problem. For this problem, the training time in the chunk modes of twenty-by-twenty and [10, 30] is even smaller than in batch mode. The reason for this is that, for the batch mode operation, the training data set is large and needs more RAM space; when this space exceeds a limit (for example, the cache limit), the operations slow down. This has been verified by actual experiments executed on higher performance computers.

Table VII shows the comparison results for the time-series prediction problem. The prediction accuracies for ELM and OS-ELM using different activation functions and training in the different modes are close to each other. It should also be noted that the batch mode takes more time than the chunk mode, which may be due to the same reasons as seen in the DNA problem.

In summary, the comparison results indicate that OS-ELM can be implemented to suit the way the data arrive without sacrificing accuracy.

VI. CONCLUSION

In this paper, a fast and accurate online sequential learning algorithm (OS-ELM) has been developed for SLFNs with both additive and RBF hidden nodes in a unified way. The activation functions for additive nodes can be any bounded nonconstant piecewise continuous functions, and the activation functions for RBF nodes can be any integrable piecewise continuous functions. Also, the algorithm can handle data arriving one-by-one or chunk-by-chunk with varying chunk size. Apart from selecting the number of hidden nodes, no other control parameter has to be chosen. The performance of OS-ELM has been compared with other well-known sequential learning algorithms on real-world benchmark regression, classification, and time-series problems. The results indicate that OS-ELM produces better generalization performance with lower training time. Under the mild condition $\mathrm{rank}(\mathbf{H}) = \tilde{N}$, the generalization performance of OS-ELM approaches that of the batch ELM.

ACKNOWLEDGMENT

The authors would like to thank the Associate Editor and the anonymous reviewers for their invaluable suggestions which have been incorporated to improve the quality of the paper.

REFERENCES

[1] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, pp. 318–362, 1986.
[2] S. Ferrari and R. F. Stengel, "Smooth function approximation using neural networks," IEEE Trans. Neural Netw., vol. 16, no. 1, pp. 24–38, Jan. 2005.
[3] C. Xiang, S. Q. Ding, and T. H. Lee, "Geometrical interpretation and architecture selection of MLP," IEEE Trans. Neural Netw., vol. 16, no. 1, pp. 84–96, Jan. 2005.
[4] G.-B. Huang and H. A. Babri, "Upper bounds on the number of hidden neurons in feedforward networks with arbitrary bounded nonlinear activation functions," IEEE Trans. Neural Netw., vol. 9, no. 1, pp. 224–229, Jan. 1998.
[5] G.-B. Huang, Y.-Q. Chen, and H. A. Babri, "Classification ability of single hidden layer feedforward neural networks," IEEE Trans. Neural Netw., vol. 11, no. 3, pp. 799–801, May 2000.
[6] K. Z. Mao and G.-B. Huang, "Neuron selection for RBF neural network classifier based on data structure preserving criterion," IEEE Trans. Neural Netw., vol. 16, no. 6, pp. 1531–1540, Nov. 2005.
[7] J. Park and I. W. Sandberg, "Universal approximation using radial-basis-function networks," Neural Comput., vol. 3, pp. 246–257, 1991.
[8] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken, "Multilayer feedforward networks with a nonpolynomial activation function can approximate any function," Neural Netw., vol. 6, pp. 861–867, 1993.
[9] T.-Y. Kwok and D.-Y. Yeung, "Objective functions for training new hidden units in constructive neural networks," IEEE Trans. Neural Netw., vol. 8, no. 5, pp. 1131–1148, Sep. 1997.
[10] R. Meir and V. E. Maiorov, "On the optimality of neural-network approximation using incremental algorithms," IEEE Trans. Neural Netw., vol. 11, no. 2, pp. 323–337, Mar. 2000.
[11] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, "Efficient backprop," Lecture Notes Comput. Sci., vol. 1524, pp. 9–50, 1998.
[12] L. S. Ngia, J. Sjöberg, and M. Viberg, "Adaptive neural nets filter using a recursive Levenberg–Marquardt search direction," in Proc. Asilomar Conf. Signals, Syst., Comput., Nov. 1998, vol. 1–4, pp. 697–701.
[13] V. S. Asirvadam, S. F. McLoone, and G. W. Irwin, "Parallel and separable recursive Levenberg–Marquardt training algorithm," in Proc. 12th IEEE Workshop Neural Netw. Signal Process., Sep. 2002, no. 4–6, pp. 129–138.
[14] J. Platt, "A resource-allocating network for function interpolation," Neural Comput., vol. 3, pp. 213–225, 1991.
[15] V. Kadirkamanathan and M. Niranjan, "A function estimation approach to sequential learning with neural networks," Neural Comput., vol. 5, pp. 954–975, 1993.
[16] L. Yingwei, N. Sundararajan, and P. Saratchandran, "A sequential learning scheme for function approximation using minimal radial basis function (RBF) neural networks," Neural Comput., vol. 9, pp. 461–478, 1997.
[17] L. Yingwei, N. Sundararajan, and P. Saratchandran, "Performance evaluation of a sequential minimal radial basis function (RBF) neural network learning algorithm," IEEE Trans. Neural Netw., vol. 9, no. 2, pp. 308–318, Mar. 1998.
[18] G.-B. Huang, P. Saratchandran, and N. Sundararajan, "An efficient sequential learning algorithm for growing and pruning RBF (GAP-RBF) networks," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 6, pp. 2284–2292, Nov. 2004.
[19] G.-B. Huang, P. Saratchandran, and N. Sundararajan, "A generalized growing and pruning RBF (GGAP-RBF) neural network for function approximation," IEEE Trans. Neural Netw., vol. 16, no. 1, pp. 57–67, Jan. 2005.
[20] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: A new learning scheme of feedforward neural networks," in Proc. Int. Joint Conf. Neural Netw. (IJCNN 2004), Budapest, Hungary, Jul. 25–29, 2004, vol. 2, pp. 985–990.
[21] G.-B. Huang and C.-K. Siew, "Extreme learning machine: RBF network case," in Proc. 8th Int. Conf. Control, Autom., Robot., Vis. (ICARCV 2004), Kunming, China, Dec. 6–9, 2004, vol. 2, pp. 1029–1036.
[22] G.-B. Huang, Q.-Y. Zhu, K. Z. Mao, C.-K. Siew, P. Saratchandran, and N. Sundararajan, "Can threshold networks be trained directly?," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 53, no. 3, pp. 187–191, Mar. 2006.
[23] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Real-time learning capability of neural networks," IEEE Trans. Neural Netw., vol. 17, no. 4, pp. 863–878, Jul. 2006.
[24] C. Blake and C. Merz, UCI Repository of Machine Learning Databases, Dept. Inf. Comput. Sci., Univ. California, Irvine, CA, 1998 [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html
[25] M. C. Mackey and L. Glass, "Oscillation and chaos in physiological control systems," Science, vol. 197, pp. 287–289, 1977.
[26] G.-B. Huang, "Learning capability and storage capacity of two-hidden-layer feedforward networks," IEEE Trans. Neural Netw., vol. 14, no. 2, pp. 274–281, Mar. 2003.
[27] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: Theory and applications," Neurocomput., 2006, to be published.
[28] S. Tamura and M. Tateishi, "Capabilities of a four-layered feedforward neural network: Four layers versus three," IEEE Trans. Neural Netw., vol. 8, no. 2, pp. 251–255, Mar. 1997.
[29] C. R. Rao and S. K. Mitra, Generalized Inverse of Matrices and its Applications. New York: Wiley, 1971.
[30] G.-B. Huang, L. Chen, and C.-K. Siew, "Universal approximation using incremental constructive feedforward networks with random hidden nodes," IEEE Trans. Neural Netw., vol. 17, no. 4, pp. 879–892, Jul. 2006.
[31] E. K. P. Chong and S. H. Żak, An Introduction to Optimization. New York: Wiley, 2001.
[32] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore, MD: The Johns Hopkins Univ. Press, 1996.
[33] G.-B. Huang, N.-Y. Liang, H.-J. Rong, P. Saratchandran, and N. Sundararajan, "On-line sequential extreme learning machine," presented at the IASTED Int. Conf. Comput. Intell. (CI 2005), Calgary, AB, Canada, Jul. 4–6, 2005.

Nan-Ying Liang received the B.Eng. degree in biomedical engineering from Jilin University, Changchun, China, in July 2002. Currently, she is working toward the Ph.D. degree at the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore.
Her research interests include neural networks, biomedical signal processing, and brain–computer interfaces.

Guang-Bin Huang (M'98–SM'04) received the B.Sc. degree in applied mathematics and the M.Eng. degree in computer engineering from Northeastern University, Shenyang, P. R. China, in 1991 and 1994, respectively, and the Ph.D. degree in electrical engineering from Nanyang Technological University, Singapore, in 1999. During the undergraduate period, he also concurrently studied in the Wireless Communication Department at Northeastern University.
From June 1998 to May 2001, he worked as a Research Fellow in the Singapore Institute of Manufacturing Technology (formerly known as Gintic Institute of Manufacturing Technology), Singapore, where he led/implemented several key industrial projects. Since May 2001, he has been working as an Assistant Professor at the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. His current research interests include machine learning, bioinformatics, and networking.
Dr. Huang is an Associate Editor of Neurocomputing and the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS.

P. Saratchandran (M'87–SM'96) received the Ph.D. degree in the area of control engineering from Oxford University, London, U.K.
He is an Associate Professor with Nanyang Technological University, Singapore. He has several publications in refereed journals and conferences and has authored four books titled Fully Tuned RBF Networks for Flight Control (Kluwer: Norwell, MA, 2002), Radial Basis Function Neural Networks with Sequential Learning (World Scientific: Singapore, 1999), Parallel Architectures for Artificial Neural Networks (IEEE Press: Piscataway, NJ, 1998), and Parallel Implementations of Backpropagation Neural Networks (World Scientific: Singapore, 1996). His interests are in neural networks, bioinformatics, and adaptive control.
Dr. Saratchandran is an Editor for Neural Parallel and Scientific Computations. He is listed in the Marquis Who's Who in the World and in the Leaders in the World, International Biographics Centre, Cambridge, U.K.

N. Sundararajan (S'73–M'74–SM'84–F'96) received the B.E. degree in electrical engineering with first-class honors from the University of Madras, Chepauk, Chennai, India, in 1966, the M.Tech. degree from the Indian Institute of Technology, Madras, India, in 1968, and the Ph.D. degree in electrical engineering from the University of Illinois at Urbana-Champaign, Urbana, in 1971.
From 1972 to 1991, he was working in the Indian Space Research Organization, Trivandrum, India, starting as a Control System Designer and rising to Director, Launch Vehicle Design Group, contributing to the design and development of the Indian satellite launch vehicles. He has also worked as an NRC Research Associate at NASA-Ames, Moffett Field, CA, in 1974 and as a Senior Research Associate at NASA-Langley, Hampton, VA, in 1981–86. Since February 1991, he has been working at the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, first as an Associate Professor (from 1991 to August 1999) and presently as a Professor. He was the "Prof. I. G. Sarma Memorial ARDB Professor" (an endowed visiting professorship) from November 2002 to February 2003 at the School of Computer Science and Automation, Indian Institute of Science, Bangalore, India. He has published more than 130 papers and four books titled Fully Tuned Radial Basis Function Neural Networks for Flight Control (Kluwer: Norwell, MA, 2001), Radial Basis Function Neural Networks with Sequential Learning (World Scientific: Singapore, 1999), Parallel Architectures for Artificial Neural Networks (IEEE Press: Piscataway, NJ, 1998), and Parallel Implementations of Backpropagation Neural Networks (World Scientific: Singapore, 1996). He is listed in Marquis Who's Who in Science and Engineering and in Men of Achievement, International Biographical Center, Cambridge, U.K. His research interests are in the areas of aerospace control, neural networks, and parallel implementations of neural networks.
Dr. Sundararajan is an Associate Fellow of the American Institute of Aeronautics and Astronautics (AIAA) and also a Fellow of the Institution of Engineers, Singapore (IES). He was an Associate Editor for the IEEE TRANSACTIONS ON CONTROL SYSTEMS TECHNOLOGY, the IFAC Journal on Control Engineering Practice (CEP), the IEEE ROBOTICS AND AUTOMATION MAGAZINE, and Control-Theory and Advanced Technology (C-TAT), Japan. He was also a member of the Board of Governors (BoG) of the IEEE Control Systems Society (CSS) for 2005. He has contributed as a program committee member in a number of international conferences and was the General Chairman for the 6th International Conference on Automation, Robotics, Control, and Computer Vision (ICARCV 2000) held in Singapore in December 2000.
