
This article was downloaded by: [NWFP University of Engineering & Technology - Peshawar]
On: 20 June 2014, At: 00:37
Publisher: Taylor & Francis
Informa Ltd Registered in England and Wales Registered Number: 1072954
Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH,
UK

International Journal of Remote Sensing
Publication details, including instructions for authors and subscription information:
http://www.tandfonline.com/loi/tres20

Strategies and best practice for neural network image classification
I. Kanellopoulos & G. G. Wilkinson
Published online: 25 Nov 2010.

To cite this article: I. Kanellopoulos & G. G. Wilkinson (1997) Strategies and best
practice for neural network image classification, International Journal of Remote
Sensing, 18:4, 711-725, DOI: 10.1080/014311697218719

To link to this article: http://dx.doi.org/10.1080/014311697218719

Int. J. Remote Sensing, 1997, vol. 18, no. 4, 711–725

Strategies and best practice for neural network image classification

I. KANELLOPOULOS and G. G. WILKINSON


Space Applications Institute, Joint Research Centre, European Commission, 21020 Ispra, Varese, Italy

(Received 18 January 1996; in final form 20 June 1996)

Abstract. This paper examines a number of experimental investigations of neural networks used for the classification of remotely sensed satellite imagery at the Joint Research Centre over a period of five years, and attempts to draw some conclusions about `best practice' techniques to optimize network training and overall classification performance. The paper examines best practice in such areas as: network architecture selection; use of optimization algorithms; scaling of input data; avoidance of chaos effects; use of enhanced feature sets; and use of hybrid classifier methods. It concludes that a vast body of accumulated experience is now available, and that neural networks can be used reliably and with much confidence for routine operational requirements in remote sensing.

1. Introduction
Artificial neural networks first began to be used for the classification of remotely sensed imagery around 1988, with the first journal papers appearing one to two years later (for example, Key et al. 1989, Benediktsson et al. 1990, Lee et al. 1990). Since that time, the number of reports of experimental tests of neural network classifiers in peer-reviewed journals has grown significantly, subjectively appearing to be at an exponential rate. Moreover, it is now rare for conferences devoted to remote sensing not to contain special sessions devoted to neural networks. Whilst the rapidly growing interest in the use of neural networks in remote sensing indicates a widespread and healthy interest in the exploration of new techniques, it is also evident that progress is hampered by lack of information on proven methodologies and implementation techniques.
At the Joint Research Centre, Ispra, Italy, we have been actively investigating the use of neural networks for remotely sensed image classification for over five years. During that time we have explored many different types of networks, used many different data sets, and investigated a number of hybrid architectures and systems. Some of the results of this work have been reported in earlier journal papers, at conferences, and in our own technical reports. However, we have not attempted so far to extract the key findings from this considerable body of experimental work and to present it in one article with the aim of making the experience easily accessible to future researchers. This paper attempts to do precisely this, and can be considered to be the culmination of a number of experiments leading towards the goal of high accuracy image classification. The material presented herein should not be regarded as a comprehensive review of neural networks in remote sensing throughout the world: such a review has already been performed elsewhere (Paola and Schowengerdt 1995). Our aim here is to stress some of the more interesting findings of our own research, and to make some recommendations for strategies and `best practice' in using neural networks in remote sensing from our point of view.
0143–1161/97 $12.00 © 1997 Taylor & Francis Ltd

The development of `best practice' in the software field now has a considerable importance. Slowly there has emerged a growing recognition that in many fields of technology it is important as early as possible to develop standardized procedures which are known to work well. The use of artificial neural networks to classify satellite imagery is no exception to this, and we hope that this paper will make a first contribution to the development of best practice in this area. We very much hope that by so doing, others will have the benefit of starting from a higher level of experience which will help them to make faster developments in the years to come. We shall tackle various issues of best practice within separate sections below which reflect the main areas in which recommendations can be made.

2. Network architecture and training issues
The most commonly-used neural network model for image classification in remote sensing is the multi-layer perceptron trained by the back-propagation algorithm (Rumelhart et al. 1986). The input to a node in such a network is the weighted sum of the outputs from the layer below, that is,

net_j = Σ_i w_ji o_i    (1)

This `weighted sum' is then transformed by the node `activation function' (usually a sigmoid or hyperbolic tangent) to produce the node output:

o_j = 1 / (1 + exp(−net_j + θ_j))    [sigmoid]    (2)

o_j = m tanh(k net_j)    [hyperbolic tangent]    (3)

where θ_j, m, and k are constants.
Weights are updated during training with the generalized delta rule:

Δw_ji(n+1) = η(δ_j o_i) + αΔw_ji(n)    (4)

where Δw_ji(n+1) is the change of a weight connecting nodes i and j, in two successive layers, at the (n+1)th iteration, δ_j is the rate of change of error with respect to the output from node j, η is the learning rate, and α a momentum term. Further details about these networks can be found in Atkinson and Tatnall (1997).
Although many users of neural networks take the training parameters and activation functions as `givens', it is important to realise that the values and form of these parameters and functions respectively have important consequences both for the way in which input data should be pre-processed and for the stability and efficiency of the network training.
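As a concrete illustration, equations (1), (2) and (4) for a single layer can be sketched in NumPy. This is a minimal sketch for clarity only; the function and variable names are our own, and it omits the back-propagation of error between layers:

```python
import numpy as np

def sigmoid(net, theta=0.0):
    # Equation (2): node output for weighted-sum input net and bias theta.
    return 1.0 / (1.0 + np.exp(-net + theta))

def forward(o_below, W):
    # Equation (1): net_j = sum_i w_ji * o_i for every node j in a layer,
    # followed by the sigmoid activation of equation (2).
    return sigmoid(W @ o_below)

def delta_rule_update(W, dW_prev, delta_j, o_i, eta=0.1, alpha=0.9):
    # Equation (4): learning term eta*(delta_j o_i) plus momentum alpha*dW_prev.
    dW = eta * np.outer(delta_j, o_i) + alpha * dW_prev
    return W + dW, dW
```

Here `delta_j` would come from back-propagating the output error, and the momentum term reuses the previous weight change `dW_prev`.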

2.1. Input feature preprocessing/scaling


The node activation functions used in multi-layer perceptrons, as above, are essentially non-linear and have asymptotic behaviour. In practice they cause individual nodes in the network to behave like non-linear signal amplifiers. Ideally, to ensure that a network learns efficiently how to classify, it is important that input values are scaled so that the learning process (that is, iterative weight adjustment) stays within the numerical range in which a percentage change in the weighted sum input value net_j is reflected in a similar percentage change in the node output value o_j. This should happen unless the inputs are a long way outside their normal range, in which case the output signal should saturate. This requirement for efficient learning means that the network input values, that is the feature values from the satellite imagery (typically the digital radiances in each spectral channel), should be centred and scaled to the order of magnitude of the activation function's range (instead, for example, of being within the range 0-255 of most multispectral satellite images). This ensures that the values propagated to the network nodes do not cause early saturation effects. Failure to perform such a normalization causes learning to `stall' at an error level which is too high (figure 1). More details on such an approach can be found in Fogelman Soulie (1991).
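The centring and scaling described above can be applied per spectral channel before training. A simple sketch (our own, assuming the features arrive as an array of pixels by channels):

```python
import numpy as np

def normalize_features(X):
    """Centre each input feature (spectral channel) and scale it to unit
    standard deviation, so that values stay within the sensitive range of
    the sigmoid/tanh activations rather than the raw 0-255 digital counts."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # guard against constant channels
    return (X - mean) / std, mean, std
```

The same mean and standard deviation computed on the training pixels should also be applied to the verification data.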

2.2. Chaos effects
A further consequence of the non-linearity of neural network activation functions is that they are susceptible to falling into `chaotic' regimes (Van der Maas et al. 1990). Although such behaviour is not fully understood, it can happen with networks used to classify satellite imagery. Chaotic systems can be recognized by the fact that small changes in inputs lead to very large changes in output. (This is the so-called `butterfly effect'; that a small butterfly flapping its wings could cause a modification in local atmospheric behaviour which could eventually cause a tornado.) In computer models chaos is seen, for example, when small changes such as rounding errors in calculations generate significantly different results.
In some of our early experiments on the classification of both Landsat TM and SPOT HRV multispectral imagery we found chaotic behaviour during network training. This manifested itself as significant differences in training sequences run on different computers which had different ways of dealing with rounding of floating point numbers, even when the network architecture, starting weights, and input data were identical (figure 2). Although we have found that chaos is not encountered frequently, it is important to be aware of its potential occurrence. If chaotic behaviour is recognized, for example, because network error remains at a high level during training and also large differences are seen on different kinds of computer or between training runs using single and double precision arithmetic, it is necessary to shift the training process into a non-chaotic regime. This is most easily done by changing the learning rate and momentum parameters which appear in the delta rule. Although general guidelines cannot be given, a change of an order of magnitude appears to be a good starting point.

Figure 1. Stalling of network learning when inputs are not normalized (solid line: 8-class problem; dashed line: 16-class problem).

Figure 2. Manifestation of chaos in a training sequence with pixels from Landsat imagery. The network error varies considerably due to differences in floating point rounding errors between the different computers and for differences between single and double precision arithmetic.
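As a sketch of this recommendation, the training run can simply be repeated with the learning rate and momentum reduced by an order of magnitude each time; `train_run` is a hypothetical stand-in for the user's own training procedure, returning the final network error:

```python
def escape_chaotic_regime(train_run, eta=0.5, alpha=0.9,
                          error_target=0.2, max_tries=4):
    """Re-run training with the learning rate and momentum reduced by an
    order of magnitude each attempt, until the final network error drops
    below a target. `train_run(eta, alpha)` is a stand-in for the user's
    own training procedure and must return the final error."""
    for _ in range(max_tries):
        error = train_run(eta, alpha)
        if error < error_target:
            return eta, alpha, error
        eta, alpha = eta / 10.0, alpha / 10.0  # shift to a non-chaotic regime
    return eta, alpha, error
```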

2.3. Optimization techniques
Learning in neural networks involves adjusting the connection weights so that the difference between the network output and the desired output is decreased. This in turn involves minimization of a cost function f(x) in a multi-dimensional network error space. This is an optimization problem. If the cost function f(x) is a non-linear function of x, the problem is one of non-linear optimization. Usually, the mean squared error cost function is used, which is expressed in terms of the network's output vector and the desired output vector for all input patterns. The network's output vector is dependent on the weights and so the minimization takes place over the entire weight space (that is, x would represent the set of all weights). Algorithms that perform non-linear optimization include the gradient descent procedure, conjugate gradient methods and second-order methods such as the quasi-Newton method (Watrous 1987, Press et al. 1988).
The gradient descent method is an iterative optimization procedure which minimizes a function f(x) by moving in the direction of the local downhill gradient, −∇f(x). Conjugate gradient methods compute new directions of search at each step in such a way that the new direction is conjugate to the previous gradient. These are more efficient than the gradient descent algorithm. Quasi-Newton methods make use of the second derivative of the cost function, which gives information about the curvature of the error surface and may result in more rapid convergence.
The basic back-propagation algorithm performs a gradient descent with a fixed step size. The step size (that is, learning rate η), though, may be changed while the training process progresses. The main problem with the gradient descent is its slow convergence, since as it gets close to the solution it progresses slowly.
To assess the performance of the different optimization techniques we conducted experiments using multitemporal SPOT HRV imagery with large land cover variability. Figures 3 and 4 show the evolution of the network error with the number of iterations for the three optimization methods for 20 and eight land cover classes respectively. From figure 3 it can be seen that in classifying the image into 20 land cover classes, the conjugate gradient (Polak–Ribière algorithm) and the quasi-Newton methods both fail to converge. On the other hand all three techniques converged when the imagery was classified into eight land cover classes (figure 4). The failure of these optimization methods to converge may be attributed to the complexity of the 20 land cover classes and to the fact that both methods are quite `greedy', that is, they go `downhill' as fast as they can and may fall into local minima. A further point is that for both these methods the weights of the network are updated after the presentation of all the patterns (the `off-line' back-propagation approach). That is, they first accumulate the gradient information from all the patterns and then the weights are updated. On the other hand, gradient descent updates the weights after the presentation of each pattern to the network (the `on-line' back-propagation approach). Therefore, modifications to the weights are more frequent and also the network is able to escape unfavourable local minima (Fogelman Soulie 1991).

Figure 3. Training sequence for the 20 land cover class problem using three different optimization techniques. Note that the conjugate gradient and quasi-Newton methods fail to converge.

Figure 4. Training sequence for the eight land cover class problem using three different optimization techniques.

For the eight-class problem, the conjugate gradient method required 300 iterations to reach a network error of 0.32 (94.9 per cent overall classification accuracy on the verification data set). The gradient descent algorithm needed 760 iterations to obtain a classification accuracy of 97 per cent on the same data set with a network error of 0.1. Finally, the quasi-Newton method after 360 iterations converged to a network error of 0.1 with an accuracy of 97.4 per cent. From these results we can see clearly that the conjugate gradient method converges much faster than the other methods. Although it appears that the quasi-Newton method is faster than the gradient descent method (fewer iterations), in practice it is slower since each iteration is computationally more intensive.
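The difference between the two updating schemes can be made concrete for a single weight layer. In this sketch (ours, with fixed error terms for brevity) the `off-line' version accumulates the gradient over all patterns before one update, while the `on-line' version updates after each pattern; in real training the error terms δ_j would be recomputed from the current weights after every on-line step, which is what lets the network escape unfavourable local minima:

```python
import numpy as np

def offline_epoch(W, patterns, deltas, eta):
    # 'Off-line': accumulate the gradient term over all patterns,
    # then apply a single weight update.
    total = np.zeros_like(W)
    for o_i, d_j in zip(patterns, deltas):
        total += np.outer(d_j, o_i)
    return W + eta * total

def online_epoch(W, patterns, deltas, eta):
    # 'On-line': update the weights after every pattern presentation.
    for o_i, d_j in zip(patterns, deltas):
        W = W + eta * np.outer(d_j, o_i)
    return W
```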

2.4. Network architectures
The number of hidden layers and the number of nodes in a hidden layer required for a particular classification problem are not easy to deduce. The neural network architecture which gives the best results for a particular problem can only be determined experimentally, and this can be a lengthy process, especially for large classification tasks. This is often seen as an objection to neural network methods. However, some geometrical arguments can be used to derive heuristics to set approximate network sizes (Lippmann 1987). Although it is not strictly accurate, each node in a multi-layer perceptron can be viewed as a system which combines inputs in a `quasi-linear' way and in so doing defines hyper-surfaces in feature space which, when combined with a decision rule or process, can be used to separate hyper-regions and, thus, classes. To define a network size which is appropriate for a given classification problem, it is necessary to examine the total number of input features and the number of output classes. Ideally, the first hidden layer of a network with two hidden layers should contain two to three times the number of inputs such that a sufficient number of hyper-planes can be `formed' to define hyper-regions. For example, in a two-dimensional feature space it would be useful to have more than two hyper-planes, and perhaps as many as four, to be able to define a small hyper-region which corresponds to a particular class. Our experience has shown that we should choose the number of nodes to be at least equal to double the number of inputs, and perhaps four times as many to be safe. Likewise the final hidden layer effectively combines hyper-planes or hyper-regions from the previous layer to form sub-regions defining each class. To allow two or three regions per class, as often employed in statistical classification of remotely sensed data, we have found it useful to make the number of nodes roughly equal to two to three times the total number of classes. If only one hidden layer is used, we believe the number of nodes should be equal to the higher of the two figures derived by the heuristics stated above. Clearly if this does not yield an accurate classification result, the network should be slowly expanded for successive training runs until better results are achieved. In general, we have found single hidden layer networks to be suitable for most classification problems, though once the number of classes gets near 20, it appears from our experience in remote sensing that additional flexibility is required as provided by a two hidden layer network (Kanellopoulos et al. 1992), but this is clearly dependent on the complexity of the data.
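The sizing heuristics above can be summarized in a small helper function; this is only a paraphrase of the rules of thumb, not a substitute for experimental verification:

```python
def suggest_architecture(n_inputs, n_classes):
    """Rule-of-thumb layer sizes: first hidden layer 2-3x (up to 4x) the
    number of inputs, second hidden layer 2-3x the number of classes; a
    single hidden layer takes the larger of the two figures; two hidden
    layers once the number of classes approaches 20."""
    hidden1 = 3 * n_inputs   # mid-point of the 2x-4x input guideline
    hidden2 = 3 * n_classes  # two to three regions per class
    if n_classes >= 20:
        return (n_inputs, hidden1, hidden2, n_classes)
    return (n_inputs, max(hidden1, hidden2), n_classes)
```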
Table 1 summarizes the neural network architectures that we have used so far in some of our experiments which resulted in the best performance (that is, best overall classification accuracy on a test set). The size of the best network in each case is consistent with the heuristics.
However, we would caution that it is not possible to rely fully on such heuristics and that each classification problem needs to be carefully examined in its own right. It is important always to check classification performance during training, and to verify that the accuracies achieved with both test and training data are sufficient, that is, to ensure that the classifier generalizes well to new/unseen data. One possible approach for finding good architectures is simply to train a large number of networks with different architectures in parallel. This is only practical in a realistic time scale with special purpose hardware, such as the Siemens SYNAPSE-1 parallel neuro-computer, which we have recently begun evaluating in the remote sensing context.

3. Use of feature enhancements to improve speed/performance
Although the training speed and overall performance of neural networks can be improved by using appropriate network architectures and good optimization procedures in the learning algorithm, it is also possible to achieve improvements by deliberately enhancing the features which are input to the network. There are two ways to do this: (a) using additional features which provide extra information, either extracted from the image itself or from ancillary data sets; and (b) providing `higher order' terms derived directly from the original feature set. We have investigated both approaches in our experiments.

3.1. Use of additional features
The use of extra features often has a beneficial effect on classification performance, so long as the features provide additional useful information. It is not guaranteed that the use of extra features will always increase accuracy, however, since such features increase the dimensionality of the feature space and the complexity of the network, which can make training more difficult. In cases where there is considerable redundancy between the new features and the original ones, it is possible for the

Table 1. Some empirical results on best MLP network architectures.

Data description | Input features | Output land cover classes | Network architecture
France, Ardèche Departement, agricultural area, two dates, SPOT HRV imagery | 6 (2 × 3 SPOT HRV channels) | 20 | 2 hidden layers, 17 and 53 nodes per hidden layer
France, Loir et Cher Departement, agricultural area, single date, SPOT HRV imagery | 3 SPOT HRV channels | 7 | 1 hidden layer, 15 nodes
France, Loir et Cher Departement, agricultural area, two dates, SPOT HRV imagery | 6 (2 × 3 SPOT HRV channels) | 7 | 1 hidden layer, 29 nodes
Portugal, Lisbon/River Tejo valley, very mixed land use, Landsat TM imagery | 6 Landsat TM channels | 16 | 1 hidden layer, 28 nodes
Portugal, Lisbon/River Tejo valley, very mixed land use, Landsat TM and ERS-1 SAR imagery | 6 Landsat TM channels plus ERS-1 SAR backscatter intensity channel | 9 | 1 hidden layer, 17 nodes
Portugal, Lisbon/River Tejo valley, Landsat TM data and textural features from ERS-1 SAR data | 6 Landsat TM channels and 3 SAR textural features | 16 | 1 hidden layer, 35 nodes
Portugal, Lisbon/River Tejo valley, Landsat TM data and textural features from ERS-1 SAR data | 6 Landsat TM channels and 4 SAR textural features | 16 | 1 hidden layer, 34 nodes

extra features to reduce overall classification performance, besides lengthening training time. In most cases, however, the addition of features is found to be beneficial.
The use of additional features is, of course, appropriate for any kind of classifier, not just for neural networks. What is most interesting about neural networks, however, is that they do not require that the features follow any parametric model, unlike statistical classifiers. The neural network approach can, therefore, be seen as a more flexible classification method which makes it easier to incorporate additional features from multiple sources.
In our experimental work we have enhanced spectral feature sets derived from optical/infra-red satellite imagery (from Landsat TM and SPOT HRV) using the following:
(i) texture features derived from the imagery;
(ii) SAR backscattering intensity from ERS-1 and 2;
(iii) texture features derived from SAR imagery;
(iv) features derived from ancillary GIS data sets.
It does not serve our purpose to describe all of our experiments on these feature set enhancements here, though a few interesting observations can be made. Firstly, the use of SAR features as additional neural network inputs alongside optical and infra-red channels generally has a beneficial effect on the overall classification of landscapes, which is not unexpected. However, whilst the use of the radar signal enhances the accuracy of most classes, some can be less accurately classified than with optical/infra-red data alone. This has been observed both in classifying broad land cover classes (Wilkinson et al. 1994) and in classifying forested areas into biodiversity classes (Wilkinson et al. 1995a), with neural networks in both cases. This suggests that strategies which combine the results of neural network classifications made (a) with optical/infra-red data alone and (b) with multi-source imagery may be fruitful, though this has not so far been investigated. The use of texture features in neural networks has been shown to give enhanced classification results both in our own work (Kanellopoulos et al. 1994) and that of others (for example, Augusteijn et al. 1995).
Interestingly, although the use of additional features is viewed primarily as a way of increasing accuracy by adding net information, in some cases there can be significant benefits in terms of training efficiency. In one particular experiment conducted with SPOT data over an agricultural test site in southern France, we found that the addition of a single ancillary feature, terrain height derived from a digital terrain model, reduced training time for a 20-class problem by a factor of 2 whilst at the same time yielding an accuracy improvement of 4 per cent. In this case the altitude association of certain classes helped the network to learn at a very early stage how to make class separations.

3.2. Use of higher order terms
Another tactic to enhance feature sets fed into neural network classifiers is to generate so-called higher order terms from the initial feature set. This is the basis of the functional link network (Klassen and Pao 1988, Pao 1989). The functional link net is an extension of the multi-layer perceptron concept in which an additional processing module is included (the functional link unit) which generates higher order terms from the initial input features (figure 5). These higher order terms are usually cross-products (for example x1x2, x2x3, ... for 2nd order) derived from an initial feature set {x1, x2, x3, ...}. The functional link unit can be used to generate terms of various higher orders. The inclusion of terms of progressively higher order can be controlled according to the total network error: if training is not progressing as well as expected, it can be automatically re-initiated with terms of the next order up.
We have tested the use of the functional link network as a means of improving both accuracy and training time. In a typical experiment we found that training time could be reduced by roughly a factor of 4, whilst total classification accuracy could be increased marginally (Wilkinson et al. 1993). The use of cross-product terms derived from channel radiances is clearly vindicated, though the physical and mathematical reasons for this need further exploration. Nevertheless, as a strategy to improve performance and efficiency, the use of higher order term networks seems to be justified.
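The cross-product generation of the functional link unit can be sketched as follows (our own minimal version; it enumerates all distinct cross-products up to a given order and appends them to the original feature vector):

```python
from itertools import combinations
import numpy as np

def functional_link_terms(x, order=2):
    """Augment a feature vector {x1, x2, ...} with higher-order
    cross-product terms (x1*x2, x1*x3, ... for order 2, and so on),
    in the spirit of the functional link network."""
    x = np.asarray(x, dtype=float)
    terms = [x]
    for k in range(2, order + 1):
        terms.append(np.array([np.prod(c) for c in combinations(x, k)]))
    return np.concatenate(terms)
```

If training stalls, the same routine can simply be re-run with `order` raised by one, mirroring the automatic re-initiation described above.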

Figure 5. Functional link network which generates higher order terms from an initial feature set. Such networks were found to improve overall classification accuracy and reduce training time significantly.

4. Use of mixed/hybrid neural network classifiers
In a considerable number of experimental tests, neural network classification of satellite imagery has been compared with classification by more conventional methods (for example, Benediktsson et al. 1990, Bischof et al. 1992, Downey et al. 1992, Civco 1993, Foody 1995, Serpico and Roli 1995, Zhuang et al. 1995). In general neural network methods have been found to perform well in such studies. However, a particularly interesting and frequently overlooked aspect of such comparisons is that there are usually significant differences between the performance of the classifiers for individual classes: with some classes the neural network approach provides a much higher accuracy than the conventional methods, and with others the opposite is found. This effect results from the very different mathematical models underlying the different types of classifier and from the way they divide feature space. One of the aims of our work in the last two to three years has been to try to understand such differences.
Visualizations and explorations of feature space are relatively revealing on this issue (Paola and Schowengerdt 1994, Fierens et al. 1994). Statistical approaches, such as the maximum likelihood classifier, divide feature space into regions formed by intersecting ellipsoids, these being the multivariate equiprobability surfaces. Multi-layer perceptrons, however, divide feature space according to a completely different mathematical approach. The combination of the weighted sum inputs to nodes and the sigmoid or hyperbolic tangent activation function can result in relatively complex division of feature space, though with some quasi-planar class separation surfaces (figure 6). Such behaviour can be understood by examining the computational geometry involved (Gibson and Cowan 1990). Apart from the differences in the geometrical form of the class separation surfaces between different types of classifiers, it is also important to note that different neural networks (of the same type) also yield different class separation surfaces, depending on the architectures and starting weight sets of the networks concerned.
Given that different classifiers use different geometrical forms to separate classes and that the resulting classification accuracies can differ significantly between models for the same classes, it is then appropriate to devise strategies to combine classifiers with the aim of improving overall classification performance. There are several ways of doing this.

Figure 6. Separation of two-dimensional feature space into classes by a multi-layer perceptron network. The x-direction represents the radiance in Landsat TM channel 1, the y-direction represents the radiance in TM channel 4. Note the geometrical form and quasi-linear borders of some class regions.

4.1. Multiple neural network methods
One approach, involving only the neural approach, is to train (using the same data) multiple networks which have different random starting weights or different architectures. In performing classification, each network is then used in parallel and a majority voting strategy is applied, in which the class with the highest number of votes from all the networks is taken as the class to be assigned to the sample. More complex versions of this strategy can be utilized, such as a combination process based on the Dempster–Shafer theory of evidence (Rogova 1994).
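The plain majority vote can be sketched as follows, where `nets` is any sequence of trained classifiers, each returning a class label for a sample (the names are ours, not from the experiments):

```python
from collections import Counter

def majority_vote(nets, sample):
    """Classify `sample` with each trained network and return the class
    with the highest number of votes; ties are broken by the order in
    which classes first appear among the votes."""
    votes = [net(sample) for net in nets]
    return Counter(votes).most_common(1)[0][0]
```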

4.2. Combined neural/non-neural methods
A technique which is potentially more powerful is to take combinations of more than one type of classifier, for example the maximum likelihood method and the multi-layer perceptron, gaining the advantage of integrating very different mathematical models. This can be done using a simple scheme as shown in figure 7. In this approach a multi-layer perceptron and a maximum-likelihood classifier are trained separately to classify samples from an image. These two classifiers are tested with a set of independent samples. A second training set is then built up from those samples about which the two initial classifiers do not agree. This new training set is then used to train a second multi-layer perceptron.
The purpose of this approach is to apply both classifier models to the data in the classification stage and to highlight samples for which the two models disagree. These 'difficult' pixels are then passed to the second neural network, which has been specially trained to deal with such cases. A neural network is used for the 'difficult' cases since they are unlikely to fall into a distribution which can be modelled well by a statistical classifier. In tests performed within the last one to two years at the JRC, we have found that this approach may significantly increase overall classification accuracy. For example, in one experiment an increase of approximately 12 per cent was achieved, compared with using the neural network or maximum likelihood models alone, on a problem which involved 16 distinct land cover classes (Wilkinson et al. 1995b).
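The two-stage scheme of figure 7 can be sketched as follows. This is a minimal illustration, not the authors' implementation: scikit-learn's QuadraticDiscriminantAnalysis stands in for the Gaussian maximum-likelihood classifier, MLPClassifier for the perceptrons, and the data is synthetic:

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Toy stand-in for image data: 600 pixels, 2 spectral bands, 2 classes.
X = rng.normal(size=(600, 2))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0.3).astype(int)
X_train, y_train, X_test = X[:400], y[:400], X[400:]

# Stage 1: train the two primary classifiers separately on the same samples.
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X_train, y_train)
mlc = QuadraticDiscriminantAnalysis().fit(X_train, y_train)  # Gaussian ML stand-in

# Stage 2: build a second training set from the samples on which the two
# classifiers disagree, and train an 'arbiter' network on those hard cases.
disagree = mlp.predict(X_train) != mlc.predict(X_train)
arbiter = None
if len(np.unique(y_train[disagree])) > 1:
    arbiter = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                            random_state=1).fit(X_train[disagree],
                                                y_train[disagree])

# Classification stage: keep labels the two models agree on and route the
# 'difficult' pixels to the specially trained second network.
p1, p2 = mlp.predict(X_test), mlc.predict(X_test)
final = p1.copy()
hard = p1 != p2
if arbiter is not None and hard.any():
    final[hard] = arbiter.predict(X_test[hard])
```

In the paper's setting the independent test set of the second stage would be separate labelled samples rather than a slice of the training data; the structure of the routing logic is the same.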

5. Discussion
In this paper we have attempted to draw together the findings of a wide range of experiments on neural network classification which we have conducted over five years. Whilst our results have not covered all aspects of neural network image classification in remote sensing, they do point to a number of fruitful strategies and implementation techniques which could contribute to the development of a body of best practice recommendations. Table 2 lists some of the main recommendations for effective and efficient use of neural networks which have emerged from our work and which have been discussed in this paper.

Figure 7. Strategy for combination of maximum-likelihood classifier and multi-layer perceptron neural networks.

Table 2. Recommendations and strategies for 'best practice' in neural network image classification.

Number  Recommendation/strategy

1  Preprocess input data and scale according to the form of the activation function used.
2  Apply geometrical arguments and heuristics to set network architectures.
3  Recognize the effects of chaos in network training and take steps to avoid it.
4  Adopt a fast optimization technique in training, e.g., conjugate gradient, when thematic classes are well separated or not very mixed.
5  Use derived higher-order terms through functional link net feature set expansion (avoids computing extra features from imagery and often yields significant benefits in terms of training time and overall accuracy).
6  Use additional features from multiple sources alongside basic pixel radiance information (improves accuracy in most cases and can reduce net training time significantly in some cases).
7  Integrate neural networks with conventional classifiers using simple strategies to take advantage of the significantly different underlying mathematical models.
8  Use multiple networks and voting strategies whenever possible as an alternative to (7).

We very much hope that others will be able to enhance the collective knowledge in the field by improving on our recommendations and adding to them in due course. Overall, it can be stated with some confidence that the wide experience gained in the use of neural networks for image classification now makes it possible to use them routinely in operational projects. The neural network technique is now undergoing trials at the JRC in the context of the operational Monitoring Agriculture by Remote Sensing (MARS) Project, and is also being evaluated in the context of mapping projects for the European Union's statistical office 'Eurostat'. It can be expected that the use of neural networks will expand rapidly in the coming years, and that they will form an important tool in operational remote sensing.
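Recommendation 1 of Table 2 (scale inputs to match the activation function) can be illustrated with a simple per-band rescaling. The function name and the target ranges, [0, 1] for sigmoid units and [-1, 1] for tanh units, are a common rule of thumb assumed here rather than a prescription from the paper:

```python
def scale_bands(pixels, activation="sigmoid"):
    """Rescale raw band values, band by band, to suit the activation range.

    pixels: list of per-pixel band tuples (e.g., raw 0-255 TM counts).
    Sigmoid units expect inputs roughly in [0, 1], tanh units in [-1, 1].
    """
    lo, hi = (0.0, 1.0) if activation == "sigmoid" else (-1.0, 1.0)
    bands = list(zip(*pixels))            # transpose: one sequence per band
    mins = [min(b) for b in bands]
    maxs = [max(b) for b in bands]
    scaled = []
    for px in pixels:
        scaled.append(tuple(
            lo + (hi - lo) * (v - mn) / (mx - mn) if mx > mn else lo
            for v, mn, mx in zip(px, mins, maxs)))
    return scaled

# Three pixels with two bands each.
print(scale_bands([(0, 10), (128, 20), (255, 30)])[0])  # (0.0, 0.0)
```

In practice the minima and maxima would be taken from the full training image (or fixed sensor ranges) so that training and classification data are scaled consistently.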

Acknowledgments
The authors are grateful to past and present colleagues of the Joint Research Centre who have contributed both directly and indirectly to the experimental work and findings reported in this paper. In particular we should like to thank Drs Freddy Fierens, Paul Rosin, Ron Schoenmakers, Aristide Varfis, and Alessandra Chiuderi, and also Joachim Hill, Wolfgang Mehl, Jacques Megier, Walter Di Carlo, Alice Bernard, Stefania Goffredo, and Karen Fullerton. We should also like to thank Professor Zhengkai Liu and Suzanne Furby, scientific visitors to the JRC, who through many fruitful discussions have contributed to our understanding of neural networks and their relation to other classification techniques.

References
Atkinson, P. M., and Tatnall, A. R. L., 1997, Neural networks in remote sensing. International Journal of Remote Sensing, 18, 699–709 (this issue).
Augusteijn, M. F., Clemens, L. E., and Shaw, K. A., 1995, Performance evaluation of texture measures for ground cover identification in satellite images by means of a neural network classifier. I.E.E.E. Transactions on Geoscience and Remote Sensing, 33, 616–626.
Benediktsson, J. A., Swain, P. H., and Ersoy, O. K., 1990, Neural network approaches versus statistical methods in classification of multisource remote sensing data. I.E.E.E. Transactions on Geoscience and Remote Sensing, 28, 540–552.
Bischof, H., Schneider, W., and Pinz, A. J., 1992, Multispectral classification of Landsat images using neural networks. I.E.E.E. Transactions on Geoscience and Remote Sensing, 30, 482–490.
Civco, D. L., 1993, Artificial neural networks for land-cover classification and mapping. International Journal of Geographical Information Systems, 7, 173–186.
Downey, I. D., Power, C. H., Kanellopoulos, I., and Wilkinson, G. G., 1992, A performance comparison of Landsat TM land cover classification based on neural network techniques and traditional maximum likelihood and minimum distance algorithms. Proceedings 1992 Annual Conference of the Remote Sensing Society: From Research to Operation, Dundee, Scotland, 15–17 September (Nottingham: Remote Sensing Society), pp. 518–528.
Fierens, F., Kanellopoulos, I., Wilkinson, G. G., and Mégier, J., 1994, Comparison and visualization of feature space behaviour of statistical and neural classifiers of satellite imagery. Proceedings of the International Geoscience and Remote Sensing Symposium (IGARSS '94), Pasadena, California, 8–12 August, Vol. 4 (Piscataway, NJ: I.E.E.E. Press), pp. 1880–1882.
Fogelman Soulie, F., 1991, Neural network architectures and algorithms: a perspective. Proceedings of the 1991 International Conference on Artificial Neural Networks (ICANN-91), Espoo, Finland, 24–28 June, Vol. 1 (Amsterdam: North Holland), pp. 605–615.
Foody, G. M., 1995, Land cover classification by an artificial neural network with ancillary information. International Journal of Geographical Information Systems, 9, 527–542.
Gibson, G. J., and Cowan, C. F. N., 1990, On the decision regions of multilayer perceptrons. Proceedings of the I.E.E.E., 78, 1590–1594.
Kanellopoulos, I., Varfis, A., Wilkinson, G. G., and Mégier, J., 1992, Land-cover discrimination in SPOT HRV imagery using an artificial neural network: a 20 class experiment. International Journal of Remote Sensing, 13, 917–924.
Kanellopoulos, I., Wilkinson, G. G., and Chiuderi, A., 1994, Land cover mapping using combined Landsat TM imagery and textural features from ERS-1 Synthetic Aperture Radar imagery. Image and Signal Processing in Remote Sensing, Proceedings SPIE 2315 (Bellingham, Washington: SPIE), pp. 332–341.
Key, J., Maslanic, A., and Schweiger, A. J., 1989, Classification of merged AVHRR and SMMR arctic data with neural networks. Photogrammetric Engineering and Remote Sensing, 55, 1331–1338.
Klassen, M. S., and Pao, Y.-H., 1988, Characteristics of the functional-link net: a higher order delta rule net. Proceedings of the 2nd Annual International Conference on Neural Networks, San Diego, California, June, Vol. 1 (Piscataway, NJ: I.E.E.E. Press), pp. 507–513.
Lee, J., Weger, R. C., Sengupta, S. K., and Welch, R. M., 1990, A neural network approach to cloud classification. I.E.E.E. Transactions on Geoscience and Remote Sensing, 28, 846–855.
Lippmann, R. P., 1987, An introduction to computing with neural nets. I.E.E.E. ASSP Magazine, 2, 4–22.
Pao, Y.-H., 1989, Adaptive Pattern Recognition and Neural Networks (Reading, Massachusetts: Addison-Wesley).
Paola, J. D., and Schowengerdt, R. A., 1994, Comparisons of neural networks to standard techniques for image classification and correlation. Proceedings of the International Geoscience and Remote Sensing Symposium (IGARSS '94), Pasadena, California, 8–12 August, Vol. 3 (Piscataway, NJ: I.E.E.E. Press), pp. 1404–1406.
Paola, J. D., and Schowengerdt, R. A., 1995, A review and analysis of back-propagation neural networks for classification of remotely-sensed multi-spectral imagery. International Journal of Remote Sensing, 16, 3033–3058.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T., 1988, Numerical Recipes in C (Cambridge: Cambridge University Press).
Rogova, G., 1994, Combining the results of several neural network classifiers. Neural Networks, 7, 777–781.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J., 1986, Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, edited by D. E. Rumelhart, J. L. McClelland, and the PDP Research Group (Cambridge, Massachusetts: MIT Press), pp. 318–362.
Serpico, S. B., and Roli, F., 1995, Classification of multisensor remote-sensing images by structured neural networks. I.E.E.E. Transactions on Geoscience and Remote Sensing, 33, 562–578.
Van Der Maas, H. L. J., Verschure, P. F. M. J., and Molenaar, P. C. M., 1990, A note on chaotic behaviour in simple neural networks. Neural Networks, 3, 119–122.
Watrous, R., 1987, Learning algorithms for connectionist networks: applied gradient methods of non-linear optimization. Technical Report MS-CIS-87-51, University of Pennsylvania, Philadelphia, U.S.A.
Wilkinson, G. G., Kanellopoulos, I., Liu, Z. K., and Folving, S., 1993, Integrated land cover mapping from satellite imagery using artificial neural networks. Ground Sensing, Proceedings SPIE 1941 (Bellingham, Washington: SPIE), pp. 68–75.
Wilkinson, G. G., Kanellopoulos, I., Mehl, W., and Hill, J., 1994, Land cover mapping using combined Landsat Thematic Mapper imagery and ERS-1 Synthetic Aperture Radar imagery. Proceedings of the Pecora 12 Symposium: Land Information from Space-Based Systems, Sioux Falls, USA, 24–26 August 1993 (Bethesda, Maryland: American Society for Photogrammetry and Remote Sensing), pp. 151–158.
Wilkinson, G. G., Folving, S., Kanellopoulos, I., McCormick, N., Fullerton, K., and Megier, J., 1995a, Forest mapping from multi-source satellite data using neural network classifiers: an experiment in Portugal. Remote Sensing Reviews, 12, 83–106.
Wilkinson, G. G., Fierens, F., and Kanellopoulos, I., 1995b, Integration of neural and statistical approaches in spatial data classification. Geographical Systems, 2, 1–20.
Zhuang, X., Engel, B. A., Xiong, X., and Johannsen, C. J., 1995, Analysis of classification results of remotely sensed data and evaluation of classification algorithms. Photogrammetric Engineering and Remote Sensing, 61, 427–433.
