Land-Use Classification of Remotely Sensed Data Using Kohonen Self-organizing Feature Map Neural Networks
The use of Kohonen Self-organizing Feature Map (KSOFM, or feature map) neural networks for land-use/land-cover classification from remotely sensed data is presented. Unlike the traditional multi-layer neural networks, the KSOFM is a two-layer network that creates class representation by self-organizing the connection weights from the input patterns to the output layer. A test of the algorithm is conducted by classifying a Landsat Thematic Mapper (TM) scene into seven land-use/land-cover types, benchmarked against the maximum-likelihood method and the Back Propagation (BP) network. The network outperforms the maximum-likelihood method for per-pixel classification when four spectral bands are used. A further increase in classification accuracy is achieved when neighborhood pixels are incorporated. A similar accuracy is obtained using the BP networks for classifications both with and without neighborhood information. The feature map network has the advantage of faster learning but the drawback of slower classification. Learning by the feature map is affected by a number of factors such as the network size, the partitioning of codebooks, the available training samples, and the selection of the learning rate. The feature map size controls the accuracy at which class borders are formed, and a large map may be used to obtain accurate class representation. It is concluded that the feature map method is a viable alternative for land-use classification of remotely sensed data.
Artificial neural networks have been widely studied for the land-use classification of remotely sensed data (e.g., Heermann and Khazenie, 1992; Bischof et al., 1992; Civco, 1993), and are now accepted alternatives to statistical classification techniques (Paola and Schowengerdt, 1997). The non-parametric neural network classifiers have numerous advantages over the statistical methods, such as no assumption about the probabilistic models of data, the ability to generalize in noisy environments, and the ability to learn complex patterns. Therefore, neural networks may perform well in cases where data are strongly non-Gaussian, such as classification that incorporates textural measures (e.g., Lee et al., 1990; Augusteijn et al., 1995), and multi-source data classification (e.g., Benediktsson et al., 1990; Gong, 1996; Bruzzone et al., 1997).
This paper reports on the use of the Kohonen Self-Organizing Feature Map (KSOFM) for land-use/land-cover classification. The algorithm is often described within the context of artificial neural networks in textbooks. Basically, the feature map neural network is a vector quantizer which creates class representation on a two-dimensional map by self-organizing the connection weights from a series of input patterns to output nodes. Figure 1 depicts the network configuration. Each node in the output layer is fully connected to its adjacent ones and to the input signals. The weight vectors are adjusted so that the density function of codebook clusters approximates the probability density function of the input vectors. This is referred to as topographic representation (Bose and Liang, 1996). The algorithm is biologically motivated, as maps of sensory surfaces are found in many parts of the brain (Kohonen, 1982). The algorithm has been applied to many pattern recognition tasks such as speech recognition (Kohonen, 1988), robotics (Graf and LaLonde, 1988), and image compression (Nasrabadi and Feng, 1988).
A number of studies are also found in remote sensing classification. Orlando et al. (1990) used the method to classify four ground-cover classes (three types of sea-ice and one shadow class) from a radar image. They found that the KSOFM performed nearly as well as the multi-layer perceptron and the Gaussian classifiers when networks contained at least 20 nodes in either one- or two-dimensional configurations. A study by Hara et al. (1994) for cloud classification from SAR images concluded that the method yielded results comparable to those of the Learning Vector Quantization and Migrating Means methods but had the advantage of classifying data with complex texture. Other applications of the feature map algorithm in remote sensing include feature selection (Iivarinen, 1994) and data preprocessing for neural network classification (Yoshida and Omatu, 1994).
The current study is focused on the implementation of the algorithm, such as the network design and training. A Landsat Thematic Mapper scene is used to classify seven land-use/land-cover types. Classification using the feature map method is benchmarked against the maximum-likelihood statistical classifier and the Back Propagation (BP) neural networks for both pixel and window classification (i.e., classification that uses the neighborhood pixels). All the image processing and classification work is carried out using an image processing system developed in the MS-Windows environment.
C.Y. Ji was with the Institute of Remote Sensing and GIS Applications, Peking University, Beijing 100871, China. He can presently be contacted c/o Xinning Jia, Asian Development Bank, P.O. Box 789, 0980 Manila, The Philippines (jiaxinning@adb.org).
PHOTOGRAMMETRIC ENGINEERING & REMOTE SENSING
Photogrammetric Engineering & Remote Sensing Vol. 66, No. 12, December 2000, pp. 1451-1460.
0099-1112/00/6612-1451$3.00/0 © 2000 American Society for Photogrammetry and Remote Sensing
The LVQ1 update rules are

$$m_c(t+1) = m_c(t) + \alpha(t)\,[x_i(t) - m_c(t)] \quad \text{if } x \text{ is correctly classified}$$

$$m_c(t+1) = m_c(t) - \alpha(t)\,[x_i(t) - m_c(t)] \quad \text{if } x \text{ is incorrectly classified}$$

and the weight vectors of all other neurons ($i \neq c$) remain unchanged.
LVQ2 uses a "window" with a specified width to locate a sample point that lies on the map between two adjacent neurons, and the weights of both neurons are updated at the same time. The effect is to push the decision border to coincide with the Bayesian decision border. The vector X is defined to lie in the window if

$$\min\left(\frac{d_i}{d_j}, \frac{d_j}{d_i}\right) > \frac{1 - w}{1 + w} \qquad (4)$$

where $d_i$ is the relative distance between X and $m_i$, and $w$ is the width of the window. The weights of the two neurons are updated only if x falls inside the window.
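The LVQ1 rule and the window test of Equation 4 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the NumPy array layout (one codebook vector per row), and the assumption that both distances are strictly positive are not taken from the paper.

```python
import numpy as np

def lvq1_update(codebook, neuron_labels, x, x_label, alpha):
    """Apply one LVQ1 step in place and return the index of the nearest neuron.

    The nearest codebook vector moves toward x when its label matches
    x_label and away from x otherwise; all other vectors stay unchanged.
    """
    dists = np.linalg.norm(codebook - x, axis=1)
    c = int(np.argmin(dists))
    sign = 1.0 if neuron_labels[c] == x_label else -1.0
    codebook[c] += sign * alpha * (x - codebook[c])
    return c

def in_window(d_i, d_j, w):
    """Window test of Equation 4 (assumes d_i > 0 and d_j > 0)."""
    return min(d_i / d_j, d_j / d_i) > (1.0 - w) / (1.0 + w)
```

With a window width of 0.2, for example, a sample counts as inside the window only when the ratio of its two nearest distances exceeds 0.8/1.2, i.e., about 0.67.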
Figure 1. A typical configuration of the KSOFM. The input layer receives the spectral bands (Band 1, Band 2, Band 3, ...).
Details of the algorithm are found in Kohonen (1988; 1989) and Muller and Reinhardt (1990). The task is to determine a D-dimensional vector of "synaptic coefficients" $m_n$ so that each neuron n is "responsible" for a range of input vectors for which the distance $\lVert x - m_n \rVert$ takes on the smallest value (Muller and Reinhardt, 1990). A neighborhood is defined around each neuron so that the neurons within the neighborhood are updated at each learning step, and this neighborhood often starts with a large radius and shrinks gradually with time.
Let $X = \{x_1, x_2, \ldots, x_n\}$ be an observation input vector, where n indicates the dimension of vector X (e.g., the number of spectral bands). Training of the network is accomplished in two stages: coarse tune and fine tune. The coarse tune of the map is as follows:
Step 1: For each neuron, the synaptic coefficients are randomized to real numbers within the range of 0.0 to 255.0 (i.e., the dynamic range of digital numbers).
Step 2: Feed the network with an input vector X; the distance of vector X to each neuron j is computed as $\lVert X - m_j \rVert$.
For LVQ2, the update applies if $C_i$ is the nearest class but x belongs to $C_j \neq C_i$, where $C_j$ is the runner-up class. In all other cases, the weight vectors remain unchanged. LVQ2 tends to overcorrect, and LVQ3 was therefore developed, with a minor modification, weighted by a factor $\epsilon$, made when $C_i$, $C_j$, and x belong to the same class. Applicable values of $\epsilon$ can be chosen between 0.1 and 0.5.
Classification is conducted using the nearest-neighbor principle. For each input vector, the Euclidean distances to all the neurons are calculated, and the class identity of the input is that of the neuron at minimum distance. A classification is rejected if the minimum distance is greater than a threshold, which is set to the largest distance of the training samples of one class to all the neurons labeled to that class.
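The nearest-neighbor classification with rejection described above can be sketched as follows. The sketch takes one possible reading of the threshold rule (the maximum, over all training samples of a class and all neurons labeled with that class, of the sample-to-neuron distance); the function names, the array layout, and the reject code of -1 are assumptions made for illustration.

```python
import numpy as np

def rejection_thresholds(codebook, neuron_labels, train_x, train_y):
    """Per-class threshold: largest distance from any training sample of a
    class to any neuron labeled with that class (one reading of the rule)."""
    thresholds = {}
    for cls in np.unique(neuron_labels):
        neurons = codebook[neuron_labels == cls]
        samples = train_x[train_y == cls]
        if len(samples) == 0:
            continue  # no training data for this class: no threshold
        # Pairwise sample-to-neuron Euclidean distances.
        d = np.linalg.norm(samples[:, None, :] - neurons[None, :, :], axis=2)
        thresholds[int(cls)] = float(d.max())
    return thresholds

def classify(codebook, neuron_labels, thresholds, x, reject=-1):
    """Label x with the class of its nearest neuron, or reject it."""
    d = np.linalg.norm(codebook - x, axis=1)
    winner = int(np.argmin(d))
    cls = int(neuron_labels[winner])
    if d[winner] > thresholds.get(cls, np.inf):
        return reject
    return cls
```

A pixel far from every neuron, such as a cover type absent from the training set, is then left unclassified rather than forced into the nearest class.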
Data Acquisition and Preparation
A TM scene acquired on 16 October 1995 over a northern suburban area of Beijing was used in this study (path/row: 32/123). An extract of 1024 rows by 1024 columns was taken, which corresponds to an area of approximately 944 km². Figure 2 shows the band 5 (mid-infrared) image of the extract. The area covers the northern part of the Haidian District and the southern part of Changping County. Major land-use types of the area are urban areas, orchards, forests, and arable land. The area has undergone a dramatic change in land-use patterns due to rapid urban expansion in recent years, and the image has been used previously in an urban expansion study for the period of 1991 to 1995 (Ji et al., 1996). The latest land resource inventory was completed in 1991, and the land-use map was compiled at a scale of 1:50,000 at the district level from the original detailed survey maps at 1:10,000 scale. This map was used for sample selection and classification accuracy assessment. To reduce the data volume, a principal components transform was applied to bands 1, 2, and 3, and the first principal component accounted for nearly 97 percent of the total variation. This component was then used together with bands 4, 5, and 7 for classification. The classification system used in the previous urban expansion study was adopted for the current study, and seven land-use/land-cover classes in the study area were under consideration (Table 1). Two sets of sample data were compiled according to the land-use maps and the information gathered during the ground visits. The training data were picked (see
Step 3: The neuron c that has the minimum distance to the input vector X is chosen, and the synaptic coefficients are updated as

$$m_j(t+1) = m_j(t) + \alpha(t)\,[X(t) - m_j(t)] \quad \text{for } j \in N_c(t)$$

with all other weight vectors left unchanged. $N_c(t)$ is the set of nodes within the neighborhood radius of the chosen node at time t, and $\alpha(t)$ is a scalar-valued "adaptation gain" with $0.0 < \alpha(t) < 1.0$.
Step 4: Feed in new inputs and loop Steps 2 and 3 until the network converges.
Step 5: Feed in vectors with known class and label each neuron by majority voting.
The learning rate $\alpha$ must decrease slowly with time: $\alpha(t) = \alpha(t-1) \times a$, where a is a constant decreasing factor smaller than 1. The initial value of $\alpha$ can be chosen from 0.5 to 0.9.
The fine tune of the map is accomplished by Learning Vector Quantization (LVQ1 to LVQ3) (Kohonen, 1990). The converged network is provided with samples of known class identities, and the weight vectors are updated accordingly; with LVQ1, the neuron nearest to a sample is moved toward the sample when the sample is correctly classified and away from it when it is incorrectly classified.
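The coarse tune (Steps 1 to 5) can be sketched in Python as below. Several details are assumptions made only for illustration and are not taken from the paper: the square (Chebyshev) neighborhood on the map grid, the linear shrinking of its radius, the per-epoch decay of the learning rate, the stopping rule based on a minimum learning rate, and all function and parameter names.

```python
import numpy as np

def coarse_tune(samples, rows, cols, alpha0=0.9, decay=0.99,
                alpha_min=0.01, radius0=None, seed=0):
    """Coarse tune of a rows x cols feature map; returns the weight matrix."""
    rng = np.random.default_rng(seed)
    dim = samples.shape[1]
    # Step 1: random weights within the dynamic range of digital numbers.
    weights = rng.uniform(0.0, 255.0, size=(rows * cols, dim))
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    radius0 = max(rows, cols) - 1 if radius0 is None else radius0
    n_epochs = max(1, int(np.ceil(np.log(alpha_min / alpha0) / np.log(decay))))
    alpha, epoch = alpha0, 0
    while alpha >= alpha_min:  # Step 4: loop until the gain is small
        # Assumed schedule: neighborhood radius shrinks linearly with time.
        radius = radius0 * max(0.0, 1.0 - epoch / n_epochs)
        for x in samples[rng.permutation(len(samples))]:
            # Step 2: distances from x to every neuron; pick the winner.
            winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
            # Step 3: update every neuron inside the winner's neighborhood.
            hood = np.max(np.abs(grid - grid[winner]), axis=1) <= radius
            weights[hood] += alpha * (x - weights[hood])
        alpha *= decay  # the gain decreases slowly with time
        epoch += 1
    return weights

def label_neurons(weights, samples, labels, n_classes):
    """Step 5: label each neuron by majority vote (-1 if it never wins)."""
    votes = np.zeros((len(weights), n_classes), dtype=int)
    for x, y in zip(samples, labels):
        votes[int(np.argmin(np.linalg.norm(weights - x, axis=1))), y] += 1
    return np.where(votes.sum(axis=1) > 0, votes.argmax(axis=1), -1)
```

Because every update is a convex combination of the old weight and the sample (the gain stays below 1), the weights remain inside the 0 to 255 range throughout training.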
Figure 2. Band 5 image showing the study area.
Heermann and Khazenie, 1992) to ensure that each sample was unique and should only belong to one class.
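The principal components step used for data reduction in this section can be illustrated with a short Python sketch. It assumes the bands are held in a NumPy array of shape (bands, rows, columns) and uses a covariance-based transform; the function name and the array layout are illustrative assumptions, and the software actually used in the study is not specified.

```python
import numpy as np

def first_principal_component(bands):
    """Return the first principal component image and the fraction of the
    total variance it explains, for an array shaped (n_bands, rows, cols)."""
    n_bands, rows, cols = bands.shape
    pixels = bands.reshape(n_bands, -1).astype(float)
    centered = pixels - pixels.mean(axis=1, keepdims=True)
    cov = np.cov(centered)                  # n_bands x n_bands covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    pc1 = eigvecs[:, -1] @ centered         # project onto the leading axis
    return pc1.reshape(rows, cols), eigvals[-1] / eigvals.sum()
```

Applied to the three visible bands, the returned fraction corresponds to the share of total variation carried by the first component.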
Network Selection and Training
The Feature Map Size
As a two-layer network, the feature map size is the only concern when designing the network. The size of the feature map, i.e., the number of codebooks to use, can be interpreted as the overall resolution at which the linear piecewise decision boundaries are formed. It is expected that large-sized maps will produce better class separation because this would allow a more accurate linear piecewise approximation to the decision boundary of Bayesian limits. However, larger-sized maps require more computation time for network learning. It is also likely that the global ordering of the feature map may be difficult to achieve for larger maps. An experiment was conducted to examine the impact of the feature map size on the network learning performance. The network size ranged from 25 up to 900. Figure 3 shows the relationship of the labeling accuracy against the map size using the samples of four spectral bands. The accuracy shown in the figure was obtained after fine-tuning for each different feature map, which is described in the
next section. One can see that the network learning improved as the size increased, and the increase was fast at the beginning and gradually stabilized as the network size became larger. The 25- by 25-neuron network was large enough, and further expansion of the network size was not necessary, and this map was therefore used for classification. As the size of the feature map increases, the improvement in terms of sample labeling accuracy is non-uniform across classes. For easily separable classes, the improvement is rather marginal. For instance, horticultural land does not show much improvement and it is well separated even when the map size is small. For classes for which the data distributions substantially overlap (e.g., forests and orchards), the improvement is much more evident. This suggests that large-sized feature maps may have to be used when dealing with data that are strongly overlapping.
Training of the Feature Map
A coarse tune forms the global ordering of weights in D-dimension, and partitioning of codebooks among the classes is accomplished once the coarse tune is completed. The mathematics behind the ordering of weights and formation of class representations is complicated, and a detailed study can be found in Bose and Liang (1996). Ripley (1996) argued that it is unclear how many codebook vectors should be selected for each class, because the number needed depends as much on how well they are employed as on the proportions of the class. Preliminary studies show that the partitioning of codebooks is affected by the sequence of the samples fed to the network, the number of samples drawn from each cover class, and how the data are distributed. Two forms of codebook partitioning were employed in the experiment. One was to partition the codebooks evenly among the classes. By alternating the samples among the classes at each learning step, the codebooks were nearly evenly divided among the classes. However, as the probability density functions differ from class to class, classes with higher variations will be portrayed less accurately, because the number needed should in theory be greater than for those with little variation. Therefore, this type of partitioning was discarded from further study. The other is to partition the codebooks so that the number allocated to one particular class is the number that is needed for that class to form accurate class boundaries. To achieve this, the following approach was taken. The proportion of samples for each class was calculated, and a pattern was defined by which the number of samples that should be fed to the network consecutively from each class was determined, according to the proportions. By feeding the samples to the network according to the specified pattern, partitioning of codebooks is accomplished so that the number of codebooks assigned to each class is the result of a "winner-take-all" process. Figure 4 shows a feature map with 15 by 15 neurons partitioned by the latter form. It can be seen from Figure 4 that similar patterns in the input space are mapped onto adjacent neurons on the feature map. Apparently, a cluster of three neurons assigned to forests on the lower-left corner is cut off from the main block of neurons. This is not a "parcellation" caused by using a small neighborhood during the coarse tune process as mentioned by Kohonen (1990), because the initial neighborhood used is 14. In fact, these neurons are allocated to deciduous woodlands because they have a higher spectral reflectance in the near-infrared band than do coniferous woodlands, and the spectral reflectance is lower than that of horticultural land but higher than that of orchards.

TABLE 1. THE NUMBER OF PIXELS IN THE TRAINING DATA SET AND THE TEST DATA SET.

Class ID  Land-use/cover       Descriptions                                                                        Training Set  Test Set
1         Arable Land          mainly winter wheat and paddy fields, and bare soil                                 8216          8660
2         Barren Land          quarries, bare rocks, and cleared land for property development                     1145          390
3         Forest               coniferous forest in the mountainous areas, deciduous woods elsewhere, and nurseries  8849          3295
4         Orchard              mainly apple and peach orchards                                                     2320          2029
5         Urban Features       rural village clusters, roads, and continuous urban fabric                          9652          6865
6         Horticulture         vegetable fields and greenhouse bases                                               657           793
7         Open Water Surface   lakes, canals, and fish ponds                                                       4021          1268
Total                                                                                                              34833         23300

Figure 3. Relationship between the map size and the labeling accuracy (horizontal axis: number of neurons, 0 to 1000; vertical axis ticks: 86 to 93).

The learning parameters concerned are the learning rate $\alpha$ and the decreasing factor for $\alpha$. Coarse tune is easily accomplished, and the tuning is insensitive to the initial value of $\alpha$, provided that $\alpha$ decreases slowly with time. For the 25 by 25 map, an initial value of 1.0 is used and the decay factor is set to 0.99. Convergence of the network (when $\alpha$ drops below 0.01) is reached after 1644 epochs. Once the coarse tune is completed, the samples are classified and the error rate is checked. This error rate serves as the starting point for the fine adjustment procedure.
For the fine tune process, the first step is to apply LVQ1. This is optional, but it often turns out to be efficient in correcting the weights that might be over tuned or less tuned by the coarse tune process. A small sensible value of $\alpha$ should be used. Three rounds of LVQ1 are carried out by setting the learning rate to 0.01. LVQ2/3 is then applied, and the width of the window is set to 20 percent as suggested by Kohonen (1990). Varying the width has shown partial improvement in labeling accuracy for certain classes, while other classes deteriorate. This is clearly caused by the difference in the number of samples falling inside the window. Therefore, the learning rate $\alpha$ is varied in order to reduce this effect. For a large bulk of training samples, a relatively larger value of $\alpha$ is used, while a much smaller value is used for classes with a smaller quantity of samples.
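The proportional feeding pattern used for codebook partitioning can be sketched as follows. The paper does not give the exact pattern, so the cycle length (`block`), the rounding, the minimum of one sample per class per cycle, and both function names are hypothetical choices made for illustration.

```python
import numpy as np

def feeding_pattern(class_counts, block=20):
    """Number of consecutive samples to feed from each class per cycle,
    proportional to the class share of the training set (at least 1)."""
    total = sum(class_counts.values())
    return {c: max(1, round(block * n / total)) for c, n in class_counts.items()}

def feed_order(labels, pattern, seed=0):
    """Order of sample indices obtained by cycling through the classes and
    taking pattern[c] samples from class c until every sample is used once."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    pools = {c: list(rng.permutation(np.flatnonzero(labels == c)))
             for c in pattern}
    order = []
    while any(pools.values()):
        for c, k in pattern.items():
            take, pools[c] = pools[c][:k], pools[c][k:]
            order.extend(int(i) for i in take)
    return order
```

Under such a scheme, classes with more training samples win more consecutive presentations per cycle, so the "winner-take-all" competition tends to allocate more codebooks to them than even alternation would.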
Figure 4. A 15 by 15 feature map labeled for the seven land-use classes.
It is observed that, once LVQ2 is applied, the use of LVQ3 may smooth out the decision border. Therefore, LVQ3 is no longer used, and LVQ2 is used repeatedly. To ensure that tuning is properly carried out, the sample labeling accuracy is evaluated by forming a confusion matrix after each complete round of tuning. If one particular class is over-tuned due to an inadequate value of $\alpha$ being used, the results are discarded and another run of LVQ2 is carried out. At the beginning of the fine tuning process, a relatively large value of $\alpha$ could be used (e.g., 0.3) for LVQ2 without hampering the overall adjustment of weights. As this process continues, smaller values may be applied. Tuning is continued until no more increase in labeling accuracy is obtained. Six rounds of iterative tuning are conducted, and the network is said to have fully converged.
Inclusion of Textural Information
Studies (e.g., Lee et al., 1990; Bischof et al., 1992; Augusteijn et al., 1995) show that neural networks are able to extract textural information directly from the neighborhood of a pixel and use it to enhance classification performance, and no explicit definition of textural measures is required. In the current study, the neighborhood is defined with a window of varying sizes, i.e., 3 by 3, 5 by 5, and 7 by 7, respectively. The center of the window is placed at the same image coordinates as used before. Smaller feature maps are used for training the samples taken by the window sizes of 5 by 5 and of 7 by 7 (20 by 20 and 18 by 18, respectively) as an effort to reduce the computation time. This is based on the assumption that the added neighborhood information helps classes separate; therefore, larger feature maps are not necessarily required. All feature maps are able to learn the samples much faster, as the samples are well separated when coarse tune is terminated. Three rounds of fine tuning are carried out for each of the networks. The time used for training the network increases substantially (see Table 3). With the neighborhood information added, the increase in sample labeling accuracy is seen across classes.
Benchmarking Studies
Two other classification methods are used for benchmarking studies: the maximum-likelihood method and the Back-Propagation neural network. The maximum-likelihood classification is implemented to classify 12 classes by splitting arable land into three subclasses (bright bare soil, dark bare soil, and vegetated fields), forests into three subclasses (coniferous, deciduous, and shadow), and urban features into rural villages and metropolitan areas. The distribution of each of these classes is assumed to be Gaussian. The results are then amalgamated into seven land-use classes to align with the results from other classifications.
Three-layer BP neural networks (with one hidden layer) are used for the comparative study. Coarse Coding (Bischof et al., 1992) is used to map digital values before they are used as input to the networks. As described by Bischof et al. (1992), the number of coarse coding units does not have a profound impact on network learning behavior, but coding does allow a relatively large value of learning rate to be used in order to quickly reduce the network error. Table 2 presents the network architectures for the classifications. Ten units are used for training the network when four bands are used. Fewer units are used when textural information is incorporated to reduce the learning time required for the networks to converge. The output layer has seven neurons which correspond to the seven land-use classes. The size of the hidden layer can be a crucial question in network design. For each network, the number of hidden units is varied and the network with the best performance in terms of the final network error is used for classification.

TABLE 2. CONFIGURATIONS FOR THE FOUR BP NETWORKS.

                              Network Configuration
Data            Coding      Input       Hidden      Output
Four Bands      10 units    40 units    18 units    7 units
Window 3 by 3   7 units     252 units   20 units    7 units
Window 5 by 5   4 units     400 units   15 units    7 units
Window 7 by 7   4 units     784 units   15 units    7 units

The same training data set is used to train the BP networks. Connection weights are updated after each sample is fed to the network. The initial learning rate is set to 0.5, and the momentum term is set to 0.2. An initial run of 30 epochs is conducted without changing the learning rate and the momentum term. Modifications of these parameters are then made by examining the dynamics of error as suggested by Heermann and Khazenie (1992). The mean square error (MSE) is calculated every five epochs, and the MSE decay as a function of time is shown in Figure 5. The learning rate may have to be reduced to a smaller value before the learning is terminated in order to achieve good separation between the samples that are difficult to learn (typically, the final learning rate is reduced to 0.001). The sharp turning points in Figure 5 correspond to the abrupt changes in the learning rate. The networks with neighborhood information defined by 5 by 5 and 7 by 7 windows are able to converge to the desired error tolerance term (which is set to 0.01), and the network with a 3 by 3 window finishes close to convergence with the final error of 0.0134. Table 3 presents the training time and classification time for both networks.

Figure 5. The learning curve of the BP networks (horizontal axis: number of epochs).

TABLE 3. TRAINING AND CLASSIFICATION TIME FOR BOTH NEURAL NETWORKS RUNNING ON A PENTIUM PC.

                                   Training
Method / Data      Coarse Tune     Fine Tune       Total          Classification
Feature Map
  Four Bands       17.33 minutes   3.6 minutes     0.349 hour     217 seconds
  Window 3 by 3    37.2 minutes    10.8 minutes    0.8 hour       220 seconds
  Window 5 by 5    1.45 hours      22.3 minutes    1.82 hours     2.3 hours
  Window 7 by 7    1.083 hours     1.05 hours      2.133 hours    2.87 hours
BP Method
  Four Bands                                       0.546 hour     91 seconds
  Window 3 by 3                                    4.4 hours      722 seconds
  Window 5 by 5                                    5.9 hours      763 seconds
  Window 7 by 7                                    6.3 hours      1980 seconds

Results
Table 4 presents the classification accuracy using the maximum-likelihood classifier, the feature map method, and the BP algorithm. The classification accuracy for each individual class is calculated as the correctly classified pixels divided by the total number of pixels in the test set. Confusion matrices for
classifications using a window size of 5 by 5 for BP and KSOFM are presented in Tables 5 and 6, respectively. Others are not presented here in order to reduce the length of the paper. All confusion matrices are standardized before the accuracy is evaluated. With the use of four-band data, the feature map network produced an overall accuracy of 89.7 percent, 2 percent higher than that of the maximum-likelihood classifier at 87.7 percent. The increase in classification accuracy is mainly attributed to better identification of arable land, urban features, and water bodies, because the data distributions of these classes are far from normal. The standard deviation of barren land is the highest in all spectral bands. Although many outliers in the training data are deleted, the assumption of Gaussian distribution has still resulted in a great deal of misclassification by the maximum-likelihood method. For orchards, which have a near-normal distribution, the Kohonen map method has also produced better classification results. In fact, the subdivision of the arable land class reduced the internal variability, forcing many orchard pixels to be assigned to arable land by the maximum-likelihood method. The feature map network produced better results because the decision boundaries are formed more accurately in a piecewise manner by using a cluster (or clusters) of neurons. In a sense, the map method may be able to achieve results as accurate as those of the maximum-likelihood method when the data distribution is normal. However, this requires that the map be well trained using the data available. For classes with fewer samples, it is possible that the weights are not well tuned to favor these classes. This is in fact the case with Class 6, where the maximum-likelihood method scored higher classification accuracy than did the feature map method (the data distribution of Class 6 is near-normal). Figure 6 shows a segment of the classified images. It is evident that all three classifiers have produced noisy results, although the results generated by the two neural networks are slightly less noisy than those of the maximum-likelihood method.
By incorporating the neighborhood information, significant increases in overall classification accuracy are achieved. The improvement is seen across classes, partly due to the homogeneity of the test sites chosen. Forests are much better identified as the confusion with arable land and with orchards is reduced. Urban areas are also better classified, particularly in the metropolitan areas where fewer pixels are misclassified as water bodies. For open water surfaces, the identification is also improved.
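The window classification referred to above feeds the classifier with all pixels in an M by M neighborhood rather than a single pixel. A minimal Python sketch of building such an input vector is given below; the (bands, rows, columns) array layout, the function name, and the edge-replication padding at the image border are assumptions made for illustration, since the paper does not state how border pixels are handled.

```python
import numpy as np

def window_features(image, row, col, size):
    """Flattened size x size neighborhood, over all bands, centered on
    (row, col). `image` is shaped (bands, rows, cols)."""
    half = size // 2
    # Replicate edge pixels so that border windows stay full sized.
    padded = np.pad(image, ((0, 0), (half, half), (half, half)), mode="edge")
    patch = padded[:, row:row + size, col:col + size]
    return patch.reshape(-1).astype(float)
```

With four input layers, a 3 by 3 window yields 36 values per sample, a 5 by 5 window 100, and a 7 by 7 window 196, which is why training time grows with window size.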
TABLE 4. INDIVIDUAL LAND-USE/LAND-COVER CATEGORY ACCURACY AND OVERALL ACCURACY FOR ALL THE CLASSIFICATIONS. KEY: MLC - MAXIMUM-LIKELIHOOD CLASSIFIER; KH - KSOFM; BP - BACK PROPAGATION; tMxM - TEXTURAL CLASSIFICATION DEFINED WITH A WINDOW OF SIZE M BY M.

Percentage Correct (%)
          MLC     KH4     BP4     KHt3x3  BPt3x3  KHt5x5  BPt5x5  KHt7x7  BPt7x7
Class 1   86.7    91.0    92.5    96.2    96.0    96.6    96.1    96.5    97.0
Class 2   91.5    91.8    86.7    92.6    96.4    98.5    93.6    95.6    93.1
Class 3   92.8    90.6    91.8    94.4    97.4    94.3    97.7    94.8    99.7
Class 4   81.2    89.3    85.5    95.3    91.4    96.0    94.2    97.0    93.7
Class 5   87.6    89.4    90.6    95.9    95.5    95.8    93.8    97.2    95.4
Class 6   95.6    91.8    93.3    93.3    94.8    91.7    92.8    91.2    90.0
Class 7   78.0    83.8    85.3    94.2    95.1    95.4    94.2    93.5    96.3
Overall   87.65   89.67   89.38   94.54   95.22   95.47   94.63   95.12   95.04

TABLE 5. CONFUSION MATRIX FOR CLASSIFICATION USING A 5 BY 5 WINDOW BY THE BP ALGORITHM. (Only the row margins and the overall accuracy are legible.)

Ground Truth Class      1      2      3      4      5      6      7
Omission Error (%)      3.9    6.4    2.3    5.8    6.2    7.2    5.8
Row Total (pixels)      8660   390    3295   2029   6865   793    1268
Overall Accuracy: 94.63%

TABLE 6. CONFUSION MATRIX FOR CLASSIFICATION USING A 5 BY 5 WINDOW BY KSOFM.

Figure 6. Extracts of the classified images. (1) Original band 5 image. (2) MLC. (3) KH-four bands. (4) BP-four bands. (5) KH-Window 3 by 3. (6) BP-Window 3 by 3. (7) KH-Window 5 by 5. (8) BP-Window 5 by 5. (9) KH-Window 7 by 7. (10) BP-Window 7 by 7.

It is evident from Figure 6 that classification with textural information is much more noise free. The results indicate that the feature map network is able to utilize the neighborhood information to enhance classification performance without explicitly defining textural measures. The neural network trained with the Back Propagation algorithm achieved an overall accuracy similar to that of the feature map networks when four spectral bands are used. The major differences between them in the classified results are the identification of Class 3 and Class 4. The feature map method has identified Class 4 more accurately than has the BP network, while the BP network has identified Class 3 more accurately. Class 4 is mostly confused with Class 3, because the spectral signature of orchards is very similar to that of the deciduous woodlands. Although there are differences between different runs of the BP network (sometimes the difference can be
huge) due to the randomizing of the initial weights and the selection of learning rate, the same effect on the separation of the two mentioned classes is observed. This may be attributed to the use of the Coarse Coding procedure, which is designed to favor generalization, because similar numbers will produce similar patterns which are provided to the network. The separation between the two classes may be better achieved if coding is done using bit patterns of digital numbers (e.g., Heermann and Khazenie, 1992). In this case, similar DNs will be mapped completely differently; therefore, it would be easier for the network to learn similar patterns. For the feature map algorithm, when coarse tune is accomplished, the separation is also not well achieved for the two classes. The use of fine tune procedures, particularly LVQ2, is able to form accurate decision boundaries between them. One can see that the identification of both of the classes is well achieved and balanced.
For all the window classifications, again the same level of accuracy is obtained by both neural networks. The differences are still in the two classes mentioned above. Over-learning of the BP networks is observed when the 7 by 7 window classification is conducted. The network error is able to decrease continuously (the final MSE is 0.00745 at epoch 120), but the overall classification accuracy is decreased by more than half of a percent. Clearly, as the window size increases and the samples become unique, the BP network loses its ability to generalize, resulting in poor classification performance.
Paola and Schowengerdt (1997) use the majority filtering technique to compare single-pixel classification against window classification. They found that majority filtering can simply smooth out fine details of classification. In the current study, it is found that the window classification has the same effect, though not as dramatic. The fine details such as the internal variability (i.e., noise) within the land-use/land-cover parcels are smoothed out, resulting in improved classification. However, other details such as linear features (e.g., canals, ditches, and roads) are either eliminated or misclassified. In addition, the land-use parcels appear to be distorted around the edges. The problem is more evident when the window size becomes larger. This is because these areas are where the mixed pixels are located, while the samples chosen are from more homogeneous areas. Moreover, the generalization ability of both types of networks is weakened due to the uniqueness of the samples, as mentioned earlier. It can be seen from Figure 6 that the feature map has misclassified some of the forest areas, while the BP network has misclassified many land-use parcel edges as barren land.
Window sizes of 3 by 3 and/or 5 by 5 for incorporating neighborhood information seem to be suitable choices because, beyond that, no further improvement is seen, and the distortion of classification around the edges has just begun to emerge. Furthermore, network training time can be significantly reduced (refer to Table 4). Even if small windows are used, training samples may have to be selected in areas along the edges so that the mixed samples are represented. The classification results also suggest that, even when the window size is quite large, single-pixel classifications can still be seen. This indicates that the neural networks are better techniques than majority filtering, by which fine details of classification are largely removed, especially when a filter size of 7 by 7 is used and the details can barely be seen.
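To make the window classification concrete, the following is a minimal sketch of how neighborhood pixels might be stacked into one input vector per pixel. It is an illustration only, not the procedure used in this study: the function name `window_features`, the random test scene, and the choice to drop border pixels are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def window_features(image, size):
    """Stack each pixel's size-by-size neighborhood into one feature vector.

    image: array of shape (rows, cols, bands).
    Returns an array of shape (rows - size + 1, cols - size + 1,
    bands * size * size). Border pixels without a full window are dropped
    (an assumption; other border treatments are possible).
    """
    # Windows over the two spatial axes; the result has shape
    # (rows - size + 1, cols - size + 1, bands, size, size).
    windows = sliding_window_view(image, (size, size), axis=(0, 1))
    # Flatten (bands, size, size) into a single feature axis.
    return windows.reshape(windows.shape[0], windows.shape[1], -1)

# Hypothetical 4-band scene filled with random digital numbers.
rng = np.random.default_rng(0)
scene = rng.integers(0, 256, size=(20, 30, 4), dtype=np.uint8)

features_3 = window_features(scene, 3)  # 3 * 3 * 4 = 36 channels per pixel
features_7 = window_features(scene, 7)  # 7 * 7 * 4 = 196 channels per pixel
```

With four bands, a 7 by 7 window yields 196 channels per pixel, which is why both training and classification costs grow quickly with window size.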
Comparison with the BP Method
The KSOFM and BP neural networks differ in a number of ways. The most notable difference is their network architecture. The former requires only two layers: one input layer and one output layer. The network design is relatively simple, provided that the output layer is large enough to form accurate class representation. For the latter, three layers (or more if more than one hidden layer is used) are usually needed, and the size of the hidden layer can have a profound impact on network performance. The dilemma of the hidden layer size is well addressed in the literature (Paola and Schowengerdt, 1997). An applicable size for a particular problem may have to be determined by trial and error. In this study, it was found that increasing the size of the hidden layer helps improve class separation, especially when fewer data are used for classification. A more subtle difference between the two networks lies in the learning mechanism, i.e., the decision-making process. The feature map network creates decision boundaries in a piecewise-linear manner. To achieve the same level of accuracy as that of the BP network, a fairly large map has to be employed.
Even so, linear boundaries that demarcate classes may not be able to portray the data distributions very accurately. The BP network, on the other hand, due to the complexity of its network architecture and its powerful learning mechanism, forms class borders in an arbitrary non-linear fashion (with two hidden layers, the network is able to mimic distributions that are disjoint (Lippmann, 1987)); hence, it may be more robust and potentially capable of achieving better classification results. This is reflected in the network training, because the BP networks are able to learn the samples with higher labeling accuracy, although over-learning by the networks has to be controlled by a proper convergence term. The feature map, however, is not able to further reduce the confusion among the samples when training is terminated, but it offers a good degree of compromise among the samples by forming linear decision borders. Both types of networks perform well as compared to statistical classifiers such as the maximum-likelihood method. The inconsistency in training by the BP network is noted by many researchers. For the feature map, the same phenomenon is seen because a different set of weights is obtained by varying the learning rate. However, the same final labeled map may be obtained even if a different learning rate is used (the labeled maps may differ only in orientation). For the BP network, convergence controls the generalization ability, and networks that are trained too well may not function well on the entire image. For the feature map, the over-correcting effect of LVQ2 is not observed, and the experiment conducted shows that classification accuracy continuously increases as fine tuning continues. In this respect, the feature map is more consistent than the BP network. Computation-wise, the feature map is a faster learner but a slow classifier.
On the other hand, the BP algorithm suffers from the fact that learning is time-consuming (Heermann and Khazenie, 1992; Tzeng et al., 1994), but classification by the BP network is fast. It is evident from Table 3 that training of the feature map is usually 2 to 3 times faster than training the BP network. When more channels are used for training, the advantage of using the feature map network is even more pronounced, because the training of a large volume of data is much slower for the BP method, although fewer epochs are usually required. It should be mentioned that the BP network training is controlled by many factors such as the hidden layer size, the learning rate selection, the convergence term, and so on. The BP learning process can be significantly accelerated by reducing the hidden layer size and by using fewer nodes for coarse coding. Therefore, the figures in Table 3 should only serve as a relative comparison between the two networks. However, it appears that the BP network may need more neurons in the hidden layer to form accurate class borders in order to achieve the same level of classification as that of the feature map, and the increased size of the hidden layer prolongs the learning phase significantly. The enhanced performance using more hidden neurons could be due to the use of coarse coding of the input data, because the network has a huge number of input neurons (for instance, coding for 7 by 7 textural input results in 980 input nodes when using five units per DN); therefore, the network may require more hidden neurons to accommodate the input in order to effectively reduce the network error. For the classification using a 3 by 3 window and the BP network, a network with a 252 by 15 by 7 topology has produced an overall classification accuracy of merely 93.74 percent, 1.5 percent lower than the accuracy presented in Table 4. For the classification phase, however, the BP networks are much faster than the feature map networks.
For the feature map, finding the neuron to which an input vector belongs among a vast number of nodes (e.g., a 25 by 25 map results in 625 neurons) is a very time-consuming process. The method proposed by Hardin and Thomson (1992) has been implemented to speed up this process. Still, the time required for classification is tremendous, especially when there are 196 channels of data (a window size of 7 by 7 for the classification).
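The sizes quoted above and the cost of the classification phase can be illustrated with a short sketch. It performs an exhaustive nearest-codebook search, which is the baseline that faster methods aim to improve on; it does not reproduce the Hardin and Thomson (1992) method. The function names, the random codebooks and pixels, and the labels are hypothetical.

```python
import numpy as np

def input_sizes(window, bands, units_per_dn):
    """Return (channels, coarse-coded input nodes) for a square window."""
    channels = window * window * bands
    return channels, channels * units_per_dn

def classify_with_map(pixels, codebooks, codebook_labels):
    """Assign each pixel the class label of its nearest codebook vector.

    pixels: (n, d) float array; codebooks: (m, d) float array;
    codebook_labels: (m,) integer class label per neuron.
    """
    # Squared Euclidean distance from every pixel to every codebook,
    # expanded as |x|^2 - 2 x.m + |m|^2, giving an (n, m) matrix.
    d2 = (np.sum(pixels ** 2, axis=1, keepdims=True)
          - 2.0 * pixels @ codebooks.T
          + np.sum(codebooks ** 2, axis=1))
    winners = np.argmin(d2, axis=1)  # index of the best-matching neuron
    return codebook_labels[winners]

# 7 by 7 window, four bands, five coarse-coding units per DN.
channels, coded_inputs = input_sizes(window=7, bands=4, units_per_dn=5)

# Hypothetical 25 by 25 map with random codebooks and seven class labels.
rng = np.random.default_rng(0)
neurons = 25 * 25
codebooks = rng.random((neurons, channels))
codebook_labels = rng.integers(0, 7, size=neurons)
pixels = rng.random((1000, channels))
predicted = classify_with_map(pixels, codebooks, codebook_labels)

# Exhaustive search costs about n * m * d multiply-adds.
multiply_adds = pixels.shape[0] * neurons * channels
```

Under these assumptions each pixel requires 625 × 196 = 122,500 multiply-adds, so the cost of the exhaustive search grows in direct proportion to both the map size and the number of channels.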
Acknowledgment
The author wishes to acknowledge the China National Natural Science Foundation for sponsoring this research through Project No. 49601014.
Training of the Feature Map Network
Classification by the feature map neural network is affected by at least the following factors. First is the partition of codebooks. Second is the size of the feature map to use. Third, tuning of the feature map also contributes a certain degree of influence, and a fourth influence is the number of samples selected for each class. Partitioning of codebooks among classes determines the number of neurons allocated to each class. This process can be controlled, to some extent, by using non-equal numbers of training samples for each class, and by the sequence in which the samples are provided to the network. The experiments conducted in this study show that, by alternating samples from each class according to a predefined pattern, the codebooks are partitioned in a near-optimal fashion. The size of the feature map should be large enough to form accurate class representations. When classifying four-band data, the 10 by 10 feature map has produced an overall accuracy of merely 87 percent, 2 percent lower than that of the 25 by 25 feature map. At small sizes, it is observed that, no matter how well the feature maps are partitioned and tuned, the overall labeling accuracy cannot reach the level produced by larger maps. For two similar classes, it is often the case that one class is better tuned than the other, leaving many patterns mixed which would otherwise be separated by larger networks. The number of samples selected for each class affects the network learning at various stages, such as the partition of the feature map and the fine-tuning process. Classes with large numbers of training samples tend to occupy more codebooks and therefore are more accurately discriminated. Fewer samples will result in fewer codebooks being activated during training; class separation may therefore be underachieved. Civco (1993) stated that the selection of training samples is important for the BP network learning.
For the feature map network, sample selection is equally important, and the samples may have to be pre-processed before the network training begins. For instance, redundant samples may be deleted, and samples may be balanced among the classes through a certain procedure. The use of large feature maps comes at the cost of computation time. From the map size of 10 by 10 to 25 by 25, the time used for coarse tuning has increased about 20 times (training of the 10 by 10 feature map took less than one minute). A more challenging situation might arise when a large number of ground-cover types are to be classified. For example, classification of agricultural crops for a fairly large coverage may involve dozens of crop types. In this case, a large quantity of codebooks would be required, and the training time would increase substantially. Methods for speeding up the training phase, such as the "Quick Reaction" approach (Monnerjahn, 1994), are therefore necessary.
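The idea of alternating samples from each class before presenting them to the network can be sketched as follows. The predefined pattern used in this study is not specified in this section, so the simple round-robin order below is an assumption, as are the function name `interleave_by_class` and the toy class names and samples.

```python
from itertools import zip_longest

def interleave_by_class(samples_by_class, order):
    """Yield (label, sample) pairs, cycling through the classes in `order`.

    samples_by_class: dict mapping a class label to a list of samples.
    Classes with fewer samples drop out of the cycle once exhausted
    (an assumption; the study's actual pattern is not given here).
    """
    missing = object()  # marks positions where a class has run out
    columns = [samples_by_class[label] for label in order]
    for row in zip_longest(*columns, fillvalue=missing):
        for label, sample in zip(order, row):
            if sample is not missing:
                yield label, sample

# Hypothetical training samples for three classes of unequal size.
training = {
    "forest": ["f1", "f2", "f3"],
    "water": ["w1"],
    "urban": ["u1", "u2"],
}
sequence = list(interleave_by_class(training, ["forest", "water", "urban"]))
```

Here the network would see forest, water, and urban samples in turn rather than one class at a time, while classes with more samples still contribute more presentations overall.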
References
Augusteijn, M.F., L.E. Clemens, and K.A. Shaw, 1995. Performance of texture measures for ground cover identification in satellite images by means of a neural network classifier, IEEE Transactions on Geoscience and Remote Sensing, 33(3):616-626.
Benediktsson, J.A., P.H. Swain, and O.K. Ersoy, 1990. Neural network approaches versus statistical methods in classification of multisource remote sensing data, IEEE Transactions on Geoscience and Remote Sensing, 28(4):540-552.
Bischof, H., W. Schneider, and A.J. Pinz, 1992. Multispectral classification of Landsat images using neural networks, IEEE Transactions on Geoscience and Remote Sensing, 30(3):482-490.
Bose, N.K., and P. Liang, 1996. Neural Networks Fundamentals with Graphs, Algorithms, and Applications, McGraw-Hill, Inc., New York, N.Y., ISBN 0-07-006618-3, p. 361, p. 367.
Bruzzone, L., C. Conese, F. Maselli, and F. Roli, 1997. Multisource classification of complex rural areas by statistical and neural network approaches, Photogrammetric Engineering & Remote Sensing, 63(5):523-533.
Civco, D.L., 1993. Artificial neural networks for land-cover classification and mapping, Int. J. Geographical Information Systems, 7(2):173-186.
Gong, P., 1996. Integrated analysis of spatial data from multiple sources: Using evidential reasoning and artificial neural network techniques for geological mapping, Photogrammetric Engineering & Remote Sensing, 62(5):513-523.
Graf, D.H., and W.R. LaLonde, 1988. A neural controller for collision-free movement of general robot manipulators, Proc. IEEE Int. Conf. on Neural Networks, ICNN-88, 24-27 July, San Diego, California, 1-77-1-84.
Hara, Y., R.G. Atkins, S.H. Yueh, R.T. Shin, and J.A. Kong, 1994. Application of neural networks to radar image classification, IEEE Transactions on Geoscience and Remote Sensing, 32:100-109.
Hardin, P.J., and C.N. Thomson, 1992. Fast nearest neighbor classification methods for multi-spectral imagery, The Professional Geographer, 44(2):191-201.
Heermann, P.D., and N. Khazenie, 1992. Classification of multispectral remote sensing data using a back-propagation neural network, IEEE Transactions on Geoscience and Remote Sensing, 30(1):81-88.
Iivarinen, J., K. Valkealahti, A. Visa, and O. Simula, 1994. Feature Selection with Self-organizing Feature Map, Proceedings of the International Conference on Artificial Neural Networks, [dates of conference], Sorrento, Italy, 1:334-337.
Ji, C.Y., D.F. Sun, and S. Wang, 1996. Dynamic Land-use Change Detection using TM Images, Technical Report 11 to the State Land Administration of China, The State Land Administration, Beijing, 65 p.
Kim, J.W., J.S. Ahn, C.S. Kim, H.Y. Hwang, and S.W. Cho, 1994. Multiple self-organizing neural networks with the reduced input dimension, Proceedings of the International Conference on Artificial Neural Networks, 26-29 May, Sorrento, Italy, 1:310-313.
Kohonen, T., 1982. Self-organizing formation of topologically correct feature maps, Biological Cybernetics, 43:59-69.
Kohonen, T., 1988. The 'neural' phonetic typewriter, Computer, 21:11-22.
Kohonen, T., 1990. The self-organizing map, Proceedings of the IEEE, 78(9):1464-1480.
Kohonen, T., G. Barna, and R. Chrisley, 1988. Statistical pattern recognition with neural networks: Benchmarking studies, Proceedings of International Joint Conference on Neural Networks, IJCNN-88, 24-27 July, San Diego, California, 1-61-1-68.
Kohonen, T., T. Kangas, J. Laaksonen, and K. Torkkola, 1992. LVQ PAK. The Learning Vector Quantization Program Package, Version 2.1, Laboratory of Computer & Information Science, Helsinki University of Technology, Helsinki, Finland.
Conclusions
The results of this study show that the KSOFM algorithm is capable of achieving higher classification accuracy than the maximum-likelihood method. There are no significant differences between the classification performance of the feature map networks and that of the Back-Propagation networks. The feature map neural network is also able to utilize textural information to enhance classification performance. The size of the feature map and the partition of codebooks affect class representation. Large feature maps may have to be used to achieve more accurate class separation, and the algorithm may therefore encounter difficulties when a large number of classes are to be represented. Nevertheless, the feature map method is a viable alternative for the classification of remotely sensed data.
PHOTOGRAMMETRIC ENGINEERING & REMOTE SENSING
Lee, J., R.C. Weger, S.K. Sengupta, and R. Welch, 1990. A neural network approach to cloud classification, IEEE Transactions on Geoscience and Remote Sensing, 28(5):846-855.
Lippmann, R.P., 1987. An introduction to computing with neural networks, IEEE ASSP Magazine, (April):4-22.
Liu, Z.K., and J.Y. Xiao, 1992. Multiple Kohonen networks and their application to remote sensing pattern recognition, Proceedings of the 6th China National Conference on Image and Graphics, July, Beijing, China, pp. 428-431.
Monnerjahn, J., 1994. Speeding-up self-organizing maps: The quick reaction, Proceedings of the International Conference on Artificial Neural Networks, 26-29 May, Sorrento, Italy, 1:326-329.
Muller, B., and J. Reinhardt, 1990. Neural Networks: An Introduction, Springer-Verlag, New York, N.Y., pp. 245-248.
Nasrabadi, N.M., and Y. Feng, 1988. Vector quantization of images based upon the Kohonen self-organizing feature maps, Proceedings of the IEEE International Conference on Neural Networks, ICNN-88, 24-27 May, San Diego, California, 1-101-1-108.
Orlando, R., R. Mann, and S. Haykin, 1990. Radar classification of sea-ice using traditional and neural classifiers, Proceedings of International Joint Conference on Neural Networks, IJCNN-90, 15-19 January, Washington, D.C., 11-263-11-266.
Paola, Justin D., and R.A. Schowengerdt, 1997. The effect of neural network structure on a multispectral land-use/land-cover classification, Photogrammetric Engineering & Remote Sensing, 63(5):535-544.
Ripley, B.D., 1996. Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, United Kingdom, ISBN 0-521-46086-7, p. 206, p. 323.
Tzeng, Y.C., K.S. Chen, W.L. Kao, and A.K. Fung, 1994. A dynamic learning neural network for remote sensing applications, IEEE Transactions on Geoscience and Remote Sensing, 32(5):1096-1102.
Yoshida, T., and S. Omatu, 1994. Neural network approach to land cover mapping, IEEE Transactions on Geoscience and Remote Sensing, 32(5):1103-1109.
(Received 25 January 1999; revised and accepted 06 December 1999)