To cite this article: I. Kanellopoulos & G. G. Wilkinson (1997) Strategies and best
practice for neural network image classification, International Journal of Remote
Sensing, 18:4, 711-725, DOI: 10.1080/014311697218719
Int. J. Remote Sensing, 1997, vol. 18, no. 4, 711-725

Strategies and best practice for neural network image classification
1. Introduction
Artificial neural networks first began to be used for the classification of remotely sensed imagery around 1988, with the first journal papers appearing one to two years later (for example, Key et al. 1989, Benediktsson et al. 1990, Lee et al. 1990). Since that time, the number of reports of experimental tests of neural network classifiers in peer-reviewed journals has grown significantly, subjectively appearing to be at an exponential rate. Moreover, it is now rare for conferences devoted to remote sensing not to contain special sessions devoted to neural networks. Whilst the rapidly growing interest in the use of neural networks in remote sensing indicates a widespread and healthy interest in the exploration of new techniques, it is also evident that progress is hampered by lack of information on proven methodologies and implementation techniques.
At the Joint Research Centre, Ispra, Italy, we have been actively investigating the use of neural networks for remotely sensed image classification for over five years. During that time we have explored many different types of networks, used many different data sets, and investigated a number of hybrid architectures and systems. Some of the results of this work have been reported in earlier journal papers, at conferences, and in our own technical reports. However, we have not so far attempted to extract the key findings from this considerable body of experimental work and to present them in one article, with the aim of making the experience easily accessible to future researchers. This paper attempts to do precisely this, and can be considered the culmination of a number of experiments leading towards the goal of high accuracy image classification. The material presented herein should not be regarded as a comprehensive review of neural networks in remote sensing throughout the world: such a review has already been performed elsewhere (Paola and Schowengerdt 1995). Our aim here is to stress some of the more interesting findings of our own research, and to make some recommendations for strategies and `best practice' in using neural networks in remote sensing from our point of view.
The development of `best practice' in the software field now has considerable importance. There has slowly emerged a growing recognition that in many fields of technology it is important to develop, as early as possible, standardized procedures which are known to work well. The use of artificial neural networks to classify satellite imagery is no exception, and we hope that this paper will make a first contribution to the development of best practice in this area. We very much hope that, by so doing, others will have the benefit of starting from a higher level of experience which will help them to make faster developments in the years to come.
We shall tackle various issues of best practice within separate sections below which reflect the main areas in which recommendations can be made.
This `weighted sum' is then transformed by the node `activation function' (usually a sigmoid or hyperbolic tangent) to produce the node output:

    o_j = 1 / (1 + exp(-net_j + θ_j))    [sigmoid]    (2)

    o_j = m tanh(k(net_j))    [hyperbolic tangent]    (3)

where θ_j, m, and k are constants.
Weights are updated during training with the generalized delta rule:

    Δw_ji(n+1) = η(δ_j o_i) + α Δw_ji(n)    (4)

where Δw_ji(n+1) is the change of the weight connecting nodes i and j, in two successive layers, at the (n+1)th iteration, δ_j is the rate of change of error with respect to the output from node j, η is the learning rate, and α a momentum term. Further details about these networks can be found in Atkinson and Tatnall (1997).
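Equations (2)-(4) can be written directly in code. The following is a minimal sketch, with parameter values (eta, alpha, theta_j, m, k) chosen only for illustration; symbol names follow the text.

```python
import numpy as np

def sigmoid(net_j, theta_j=0.0):
    # Equation (2): o_j = 1 / (1 + exp(-net_j + theta_j))
    return 1.0 / (1.0 + np.exp(-net_j + theta_j))

def tanh_activation(net_j, m=1.0, k=1.0):
    # Equation (3): o_j = m * tanh(k * net_j)
    return m * np.tanh(k * net_j)

def delta_rule_update(prev_delta_w, delta_j, o_i, eta=0.1, alpha=0.9):
    # Equation (4): Delta w_ji(n+1) = eta * (delta_j * o_i) + alpha * Delta w_ji(n)
    # eta is the learning rate, alpha the momentum term.
    return eta * (delta_j * o_i) + alpha * prev_delta_w

print(sigmoid(0.0))  # 0.5, the midpoint of the sigmoid's 0-1 range
```

The momentum term alpha * Δw_ji(n) reuses the previous weight change, which smooths the trajectory through weight space; this is the parameter the later section on chaotic training behaviour suggests adjusting.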
Although many users of neural networks take the training parameters and activation functions as `givens', it is important to realise that the values and form of these parameters and functions respectively have important consequences both for the way in which input data should be pre-processed and for the stability and efficiency of the network training.
This means that the network input values (that is, the feature values from the satellite imagery, typically the digital radiances in each spectral channel) should be centred and scaled to the order of magnitude of the activation function's range (instead, for example, of lying within the range 0-255 of most multispectral satellite images). This ensures that the values propagated to the network nodes do not cause early saturation effects. Failure to perform such a normalization causes learning to `stall' at an error level which is too high (figure 1). More details on such an approach can be found in Fogelman Soulie (1991).
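A minimal sketch of this normalization follows; the image values are synthetic stand-ins for real 8-bit digital radiances.

```python
import numpy as np

# Synthetic 8-bit multispectral data: 1000 pixels, 4 spectral bands,
# values in 0-255 as in most multispectral satellite imagery.
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(1000, 4)).astype(float)

# Centre each band and scale to unit standard deviation so the inputs
# sit in the working range of a sigmoid/tanh activation rather than 0-255,
# avoiding early saturation of the nodes.
band_mean = raw.mean(axis=0)
band_std = raw.std(axis=0)
net_inputs = (raw - band_mean) / band_std
```

The statistics are computed per band because each spectral channel can have a very different dynamic range.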
Figure 1. Stalling of network learning when inputs are not normalized. Solid line: 8-class problem; dashed line: 16-class problem.
Figure 2. Manifestation of chaos in a training sequence with pixels from Landsat imagery. The network error varies considerably due to differences in floating point rounding errors between the different computers and to differences between single and double precision arithmetic.
To escape this sensitivity to differences between computers and between single and double precision arithmetic, it is necessary to shift the training process into a non-chaotic regime. This is most easily done by changing the learning rate and momentum parameters which appear in the delta rule. Although general guidelines cannot be given, a change of an order of magnitude appears to be a good starting point.
The output vector is dependent on the weights, and so the minimization takes place over the entire weight space (that is, x would represent the set of all weights). Algorithms that perform non-linear optimization include the gradient descent procedure, conjugate gradient methods and second-order methods such as the quasi-Newton method (Watrous 1987, Press et al. 1988).
The gradient descent method is an iterative optimization procedure which minimizes a function f(x) by moving in the direction of the local downhill gradient, -∇f(x). Conjugate gradient methods compute new directions of search at each step in such a way that the new direction is conjugate to the previous gradient. These are more efficient than the gradient descent algorithm. Quasi-Newton methods make use of the second derivative of the cost function, which gives information about the curvature of the error surface and may result in more rapid convergence.
The basic back-propagation algorithm performs a gradient descent with a fixed step size. The step size (that is, the learning rate η), though, may be changed while the training process progresses. The main problem with gradient descent is its slow convergence: as it gets close to the solution it progresses slowly.
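The convergence-speed contrast can be illustrated on a toy problem (this is an illustration only, not the paper's image-classification experiment): fixed-step gradient descent and a conjugate gradient method minimizing a small quadratic f(x) = 0.5 x'Ax - b'x, whose gradient is Ax - b.

```python
import numpy as np

A = np.array([[3.0, 0.0], [0.0, 1.0]])  # symmetric positive definite
b = np.array([1.0, 1.0])

def gradient_descent(eta=0.25, tol=1e-8, max_steps=10000):
    x, steps = np.zeros(2), 0
    while np.linalg.norm(A @ x - b) >= tol and steps < max_steps:
        x = x - eta * (A @ x - b)   # fixed step in the downhill direction
        steps += 1
    return x, steps

def conjugate_gradient(tol=1e-8, max_steps=10000):
    # Linear conjugate gradient; on a quadratic, the Polak-Ribiere and
    # Fletcher-Reeves updates reduce to this residual-based formula.
    x = np.zeros(2)
    r = b - A @ x        # residual (negative gradient)
    p = r.copy()         # first search direction
    steps = 0
    while np.linalg.norm(r) >= tol and steps < max_steps:
        alpha = (r @ r) / (p @ A @ p)
        x = x + alpha * p
        r_new = r - alpha * (A @ p)
        p = r_new + ((r_new @ r_new) / (r @ r)) * p  # new conjugate direction
        r = r_new
        steps += 1
    return x, steps

x_gd, n_gd = gradient_descent()
x_cg, n_cg = conjugate_gradient()
print(n_gd, n_cg)  # CG finishes in at most 2 steps in 2 dimensions
```

Gradient descent crawls along the shallow axis of the quadratic, while conjugate gradient reaches the minimum of an n-dimensional quadratic in at most n steps; on the non-quadratic error surfaces of real networks this advantage is only approximate, as the 20-class results below show.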
To assess the performance of the different optimization techniques we conducted experiments using multitemporal SPOT HRV imagery with large land cover variability. Figures 3 and 4 show the evolution of the network error with the number of iterations for the three optimization methods for 20 and eight land cover classes respectively. From figure 3 it can be seen that in classifying the image into 20 land cover classes, the conjugate gradient (Polak-Ribière algorithm) and the quasi-Newton methods both fail to converge. On the other hand, all three techniques converged when the imagery was classified into eight land cover classes (figure 4). The failure of these optimization methods to converge may be attributed to the complexity of the 20 land cover classes and to the fact that both methods are quite `greedy', that is, they go `downhill' as fast as they can and may fall into local minima. A further point is that for both these methods the weights of the network are updated after the presentation of all the patterns (the `off-line' back-propagation approach). That is, they first accumulate the gradient information from all the patterns and then update the weights. On the other hand, gradient descent updates the weights after
Figure 3. Training sequence for the 20 land cover class problem using three different optimization techniques. Note that the conjugate gradient and quasi-Newton methods fail to converge.
Figure 4. Training sequence for the eight land cover class problem using three different optimization techniques.
the presentation of each pattern to the network (the `on-line' back-propagation approach). Therefore, modifications to the weights are more frequent and the network is also able to escape unfavourable local minima (Fogelman Soulie 1991).
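The two update regimes can be sketched on a toy linear unit (the data values below are hypothetical, chosen only for illustration):

```python
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5]])  # 4 patterns
t = np.array([1.0, 0.0, 1.0, 0.5])                              # targets
eta = 0.1

def sse(w):
    # Sum of squared errors for a linear unit with weights w.
    return float(((X @ w - t) ** 2).sum())

def offline_epoch(w):
    # `Off-line' (batch): accumulate the gradient over ALL patterns,
    # then update the weights once.
    grad = np.zeros_like(w)
    for x, target in zip(X, t):
        grad += (x @ w - target) * x
    return w - eta * grad

def online_epoch(w):
    # `On-line': update the weights immediately after EACH pattern,
    # giving more frequent (and noisier) steps.
    for x, target in zip(X, t):
        w = w - eta * (x @ w - target) * x
    return w

w0 = np.zeros(2)
w_off, w_on = offline_epoch(w0), online_epoch(w0)
```

One epoch of either scheme reduces the error here, but the resulting weight vectors differ: the on-line path depends on the order in which patterns are presented, which is the source of the noise that helps it escape unfavourable local minima.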
For the eight-class problem, the conjugate gradient method required 300 iterations to reach a network error of 0.32 (94.9 per cent overall classification accuracy on the verification data set). The gradient descent algorithm needed 760 iterations to obtain a classification accuracy of 97 per cent on the same data set with a network error of 0.1. Finally, the quasi-Newton method converged after 360 iterations to a network error of 0.1 with an accuracy of 97.4 per cent. From these results we can see clearly that the conjugate gradient method converges much faster than the other methods. Although it appears that the quasi-Newton method is faster than the gradient descent method (fewer iterations), in practice it is slower since each iteration is computationally more intensive.
For remotely sensed data, we have found it useful to make the number of nodes roughly equal to two to three times the total number of classes. If only one hidden layer is used, we believe the number of nodes should be equal to the higher of the two figures derived from the heuristics stated above. Clearly, if this does not yield an accurate classification result, the network should be slowly expanded for successive training runs until better results are achieved. In general, we have found single hidden layer networks to be suitable for most classification problems, though once the number of classes gets near 20, it appears from our experience in remote sensing that additional flexibility is required, as provided by a two hidden layer network (Kanellopoulos et al. 1992); but this is clearly dependent on the complexity of the data.
Table 1 summarizes the neural network architectures that we have used so far in some of our experiments which resulted in the best performance (that is, best overall classification accuracy on a test set). The size of the best network in each case is consistent with the heuristics.
However, we would caution that it is not possible to rely on such heuristics alone and that each classification problem needs to be carefully examined in its own right. It is important always to check classification performance during training, and to verify that the accuracies achieved with both test and training data are sufficient, that is, to ensure that the classifier generalizes well to new/unseen data. One possible approach for finding good architectures is simply to train a large number of networks with different architectures in parallel. This is only practical in a realistic time scale with special-purpose hardware, such as the Siemens SYNAPSE-1 parallel neuro-computer, which we have recently begun evaluating in the remote sensing context.
Table 1. Best-performing neural network architectures (columns: data description, number of input features, number of output land cover classes, neural network architecture).
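The architecture heuristics above can be written down as a small helper; the exact rules are our reading of the text, so treat the returned figures as starting points for successive training runs rather than fixed prescriptions.

```python
def suggest_hidden_layers(n_classes):
    # A single hidden layer suits most classification problems; near 20
    # classes, a second hidden layer provides the additional flexibility
    # mentioned in the text.
    return 1 if n_classes < 20 else 2

def suggest_hidden_nodes(n_classes):
    # Hidden nodes roughly two to three times the number of classes;
    # with a single hidden layer, start from the higher figure.
    return 3 * n_classes

print(suggest_hidden_layers(8), suggest_hidden_nodes(8))   # 1 24
print(suggest_hidden_layers(20), suggest_hidden_nodes(20)) # 2 60
```

If the resulting network does not classify accurately, expand it slowly for successive training runs, checking generalization on held-out data each time.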
In some cases we have found extra features to reduce overall classification performance besides lengthening training time. In most cases, however, the addition of features is found to be beneficial. The use of additional features is, of course, appropriate for any kind of classifier, not just for neural networks. What is most interesting about neural networks, however, is that they do not require that the features follow any parametric model, unlike statistical classifiers. The neural network approach can, therefore, be seen as a more flexible classification method which makes it easier to incorporate additional features from multiple sources.
In our experimental work we have enhanced spectral feature sets derived from optical/infra-red satellite imagery (from LANDSAT TM and SPOT HRV) using the following:

(i) texture features derived from the imagery;
(ii) SAR backscattering intensity from ERS-1 and 2;
(iii) texture features derived from SAR imagery;
(iv) features derived from ancillary GIS data sets.
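Because no parametric model is assumed, mixed-source features of this kind can simply be concatenated into one input vector per pixel. A minimal sketch, with all arrays as synthetic placeholders for the sources listed above:

```python
import numpy as np

H, W = 4, 4  # a tiny scene for illustration
rng = np.random.default_rng(0)
optical = rng.uniform(0, 255, size=(H, W, 6))   # optical/infra-red bands
sar = rng.uniform(0, 1, size=(H, W, 1))         # SAR backscattering intensity
texture = rng.uniform(0, 1, size=(H, W, 2))     # texture measures
gis = rng.uniform(0, 2000, size=(H, W, 1))      # ancillary GIS layer, e.g. elevation

# One feature vector per pixel, ready for the usual centring/scaling
# before being fed to the network.
features = np.concatenate([optical, sar, texture, gis], axis=-1)
X = features.reshape(-1, features.shape[-1])    # shape (H*W, 10)
```

The per-band normalization described earlier matters even more here, since the sources differ in dynamic range by orders of magnitude.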
It does not serve our purpose to describe all of our experiments on these feature set enhancements here, though a few interesting observations can be made. Firstly, the use of SAR features as additional neural network inputs alongside optical and infra-red channels generally has a beneficial effect on the overall classification of landscapes, which is not unexpected. However, whilst the use of the radar signal enhances the accuracy of most classes, some can be less accurately classified than with optical/infra-red data alone. This has been observed both in classifying broad land cover classes (Wilkinson et al. 1994) and in classifying forested areas into biodiversity classes (Wilkinson et al. 1995a), with neural networks in both cases. This suggests that strategies which combine the results of neural network classifications made (a) with optical/infra-red data alone and (b) with multi-source imagery may be fruitful, though this has not so far been investigated. The use of texture features in neural networks has been shown to give enhanced classification results both in our own work (Kanellopoulos et al. 1994) and in that of others (for example, Augusteijn et al. 1995).
Interestingly, although the use of additional features is viewed primarily as a way of increasing accuracy by adding net information, in some cases there can be significant benefits in terms of training efficiency. In one particular experiment conducted with SPOT data over an agricultural test site in southern France, we found that the addition of a single ancillary feature (terrain height derived from a digital terrain model) reduced training time for a 20-class problem by a factor of 2 whilst at the same time yielding an accuracy improvement of 4 per cent. In this case the altitude association of certain classes helped the network to learn at a very early stage how to make class separations.
Figure 5. Functional link network which generates higher-order terms from an initial feature set. Such networks were found to improve overall classification accuracy and reduce training time significantly.
Besides differences in the geometrical form of the class separation surfaces between different types of classifiers, it is also important to note that different neural networks (of the same type) also yield different class separation surfaces, depending on the architectures and starting weight sets of the networks concerned.

Given that different classifiers use different geometrical forms to separate classes, and that the resulting classification accuracies can differ significantly between models for the same classes, it is appropriate to devise strategies to combine classifiers with the aim of improving overall classification performance. There are several ways of doing this.
One approach is to combine classifiers built on different mathematical models. This can be done using a simple scheme as shown in figure 7. In this approach a multi-layer perceptron and a maximum-likelihood classifier are trained separately to classify samples from an image. These two classifiers are then tested with a set of independent samples. A second training set is built up from those samples about which the two initial classifiers do not agree. This new training set is then used to train a second multi-layer perceptron.
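The construction of the second training set can be sketched as follows; the label arrays here are random placeholders standing in for the outputs of a trained multi-layer perceptron and a maximum-likelihood classifier.

```python
import numpy as np

rng = np.random.default_rng(1)
X_indep = rng.normal(size=(100, 4))        # independent test samples (4 features)
mlp_pred = rng.integers(0, 3, size=100)    # placeholder MLP class labels
ml_pred = rng.integers(0, 3, size=100)     # placeholder max-likelihood labels
true_labels = rng.integers(0, 3, size=100) # reference labels for the samples

# Samples on which the two classifiers disagree become the training set
# for the second (`difficult cases') multi-layer perceptron.
disagree = mlp_pred != ml_pred
X_second, y_second = X_indep[disagree], true_labels[disagree]
```

At classification time the same test is applied: pixels on which the two first-stage models agree keep the agreed label, while the disagreement pixels are routed to the second network.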
The purpose of this approach is to be able to apply both classifier models to the data in the classification stage and to highlight samples for which the two models disagree. These `difficult' pixels are then passed to the second neural network, which has been specially trained to deal with such cases. A neural network is used for the `difficult cases' since they are unlikely to fall into a distribution which can be modelled well by a statistical classifier. In tests performed within the last one to two years at the JRC, we have found that this approach may significantly increase overall classification accuracy. For example, in one experiment an increase of approximately 12 per cent was achieved compared with using the neural network or maximum likelihood models alone, on a problem which involved 16 distinct land cover classes (Wilkinson et al. 1995b).
5. Discussion
In this paper we have attempted to draw together the findings of a wide range of experiments on neural network classification we have conducted over five years. Whilst our results have not covered all aspects of neural network image classification in remote sensing, they do point to a number of fruitful strategies and implementation techniques which could contribute to the development of a body of best practice recommendations. Table 2 lists some of the main recommendations for effective and
Table 2. Recommendations and strategies for `best practice' in neural network image classification.

Number  Recommendation/strategy
1       Preprocess input data and scale according to the form of the activation function used.
2       Apply geometrical arguments and heuristics to set network architectures.
3       Recognize the effects of chaos in network training and take steps to avoid it.
efficient use of neural networks which have emerged from our work and which have been discussed in this paper. We very much hope that others will be able to enhance the collective knowledge in the field by improving on our recommendations and adding to them in due course. Overall, it can be stated with some confidence that the wide experience gained in the use of neural networks for image classification now makes it possible to use them routinely in operational projects. The neural network technique is now undergoing trials at the JRC in the context of the operational Monitoring Agriculture by Remote Sensing (MARS) Project, and is also being evaluated in the context of mapping projects for the European Union's statistical office `Eurostat'. It can be expected that the use of neural networks will expand rapidly in the coming years, and that they will form an important tool in operational remote sensing.
Acknowledgments
The authors are grateful to past and present colleagues of the Joint Research Centre who have contributed both directly and indirectly to the experimental work and findings reported in this paper. In particular we should like to thank Drs Freddy Fierens, Paul Rosin, Ron Schoenmakers, Aristide Varfis, and Alessandra Chiuderi, and also Joachim Hill, Wolfgang Mehl, Jacques Megier, Walter Di Carlo, Alice Bernard, Stefania Goffredo, and Karen Fullerton. We should also like to thank Professor Zhengkai Liu and Suzanne Furby, scientific visitors to the JRC, who through many fruitful discussions have contributed to our understanding of neural networks and their relation to other classification techniques.
References
Atkinson, P. M., and Tatnall, A. R. L., 1997, Neural networks in remote sensing. International Journal of Remote Sensing, 18, 699-709 (this issue).
Augusteijn, M. F., Clemens, L. E., and Shaw, K. A., 1995, Performance evaluation of texture measures for ground cover identification in satellite images by means of a neural network classifier. IEEE Transactions on Geoscience and Remote Sensing, 33, 616-626.
Benediktsson, J. A., Swain, P. H., and Ersoy, O. K., 1990, Neural network approaches versus statistical methods in classification of multisource remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 28, 540-552.
Bischof, H., Schneider, W., and Pinz, A. J., 1992, Multispectral classification of Landsat images using neural networks. IEEE Transactions on Geoscience and Remote Sensing, 30, 482-490.
Civco, D. L., 1993, Artificial neural networks for land-cover classification and mapping.
Rogova, G., 1994, Combining the results of several neural network classifiers. Neural Networks, 7, 777-781.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J., 1986, Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, edited by D. E. Rumelhart, J. L. McClelland, and the PDP Research Group (Cambridge, Massachusetts: MIT Press), pp. 318-362.
Serpico, S. B., and Roli, F., 1995, Classification of multisensor remote-sensing images by structured neural networks. IEEE Transactions on Geoscience and Remote Sensing,