
Remote Sensing of Environment 81 (2002) 290–299
www.elsevier.com/locate/rse

Pixel- and site-based calibration and validation methods for evaluating supervised classification of remotely sensed data
Douglas M. Muchoney*, Alan H. Strahler
Center for Remote Sensing/Department of Geography, Boston University, 675 Commonwealth Avenue, Boston, MA 02215, USA
Received 29 March 2001; received in revised form 21 December 2001; accepted 22 December 2001

Abstract

The characteristics of calibration and validation data, especially sample size, distribution, thematic labeling, and representativeness, are important to supervised classification algorithms, as is their use in ascribing accuracy statements to supervised classification results. While random and stratified random splitting of site data into calibration (training) and validation (testing) subsets has often been based on pixel sampling, the resulting accuracy statements may be overly optimistic and biased due to spatial autocorrelation. This is especially true for decision tree and neural network algorithms, which tend to learn examples well even when just a single pixel from a site is presented in the learning phase. Therefore, polygon- or site-based sampling (using groups of contiguous pixels) for calibration and validation may provide better estimates of prediction accuracies. This paper presents the results of pixel- versus polygon-based calibration and validation of a supervised Gaussian ARTMAP neural network algorithm. Both techniques were applied to (1) a multitemporal AVHRR vegetation index data set and (2) a multitemporal data set of NDVI, temperature, and additional temporal metric and physical data to map vegetation and land cover (VLC) based on a regional network of sites in Central America. Our results indicate that although pixel-based accuracy assessment may overstate accuracy because of spatial autocorrelation, polygon-based assessments are also problematic. The Gaussian ARTMAP algorithm was used to provide insight into the problems of site heterogeneity, per-class accuracy, sampling rate, sampling representativeness, and the generalization properties of sites and classes, as related to pixel- and polygon-based sampling. © 2002 Elsevier Science Inc. All rights reserved.

1. Introduction

1.1. Background

Supervised classification requires site data to calibrate (train) and validate (test) classification algorithms. Calibration data, also known as training data or exemplars, are examples presented to supervised classification algorithms that are to be recognized or learned. Validation or testing data are, by contrast, an independent set of data used to assess the performance or accuracy of the classification algorithm. In the remote sensing context, calibration and validation data are generally sites, that is, areas on the ground defined as points, pixels, groups of pixels, or polygons.

* Corresponding author. Present address: Conservation International, Center for Applied Biodiversity Science, 1919 M Street NW, Washington, DC 20036, USA. Tel.: +1-202-912-1801; fax: +1-202-912-0773. E-mail address: d.muchoney@conservation.org (D.M. Muchoney).

The characteristics of calibration and validation data are important to supervised classification algorithms, especially neural networks. The characteristics of training data influence category generation and what categories represent (Foody, 1995; Labovitz, 1986; Maren, 1990). For neural networks, training data control the generation of new internal categories given novel inputs. In general, the fewer internal categories or neurons that are required, the better the generalization characteristics (Dryer, 1993). As the number of internal categories increases, classification accuracy at first increases but then begins to drop off, as categories are formed that overfit to specific examples (Muchoney & Williamson, 2001). Given the cost of generating reliable site data for calibration and validation, the site data requirements of the classification algorithm are an important consideration. These requirements include the data label or labels, quality, sample size and representativeness, data presentation, and validation and accuracy reporting. Data labels are the a priori class name, parameters, or variables ascribed to a site. Site data labeling can be as

0034-4257/02/$ - see front matter © 2002 Elsevier Science Inc. All rights reserved.
PII: S0034-4257(02)00006-8


simple as defining a categorical classification label or ascribing any number of biogeophysical parameters to a site. In the case of classification variables, the classification system can be based on land cover, vegetation type, habitat, bioclimate, physiognomy, or other criteria (Muchoney, Strahler, Hodges, & Locastro, 1999). The ability to discriminate a class label is, in turn, based on the remotely sensed data. Site data quality is primarily affected by errors in location and labeling. Decision tree classification algorithms are tolerant of noisy training data and have been used to remove noisy or dirty training data (Brodley & Friedl, 1996). Artificial neural networks (ANNs), in general, have been found to be robust to training site heterogeneity (Paola & Schowengerdt, 1995). The sample size can be considered either as the full set of pixels or as the set of sites, with a site defined as a group of contiguous pixels that is treated as a single sampling unit. This distinction is critical to understanding calibration and validation of supervised classification algorithms. Site data can be as small as a single pixel or include many pixels. The maximum-likelihood classification algorithm requires substantial training data, in the neighborhood of at least 10–30 times the number of features per class (Foody, 1995; Mather, 1987; Piper, 1992). For ANNs, there seems to be promise of using smaller training sets (Hepner, Logan, Ritter, & Bryant, 1990). The back-propagation ANN can require extensive training, while adaptive resonance theory (ART) algorithms are designed to be stable enough to preserve significant past learning while still allowing new information to be incorporated in the neural network structure as such information appears in the data input stream (Strahler et al., 1999).
Increasing pixel training size has been found to substantially increase the accuracy of four neural network architectures (Foody & Arora, 1997), although beyond a certain size, the rate of increase tends to decline (Zhuang, Engel, Lozano-Garcia, Fernandez, & Johannsen, 1994). To correctly predict an output class, training data must be representative of classes and their subclasses and be of sufficient number to allow pattern recognition to occur. Subclasses of a primary class such as needleleaf forest might include different floristic associations or geographical variants of wide-ranging species. The fundamental questions are how well the sites represent these subclasses and whether a site from one subclass is effectively equivalent to one from another, that is, whether examples from each subclass are needed to predict another subclass. For many neural network and decision tree classification algorithms, back-classification, which is calibrating and validating on the same data, results in up to 100% prediction. An important issue, therefore, is the use of independent data to calibrate and validate classification algorithm performance. The use of pixel-based splits for calibration and validation is problematic for supervised classification algorithms. When pixel-based splitting is used to create training and testing subsets, the algorithm is presented with one or more examples (pixels) from each site. This has been shown to result in higher agreement than when whole sites are split into calibration and validation subsets (Friedl et al., 2000), raising concerns about calibration and validation data independence and spatial autocorrelation. While reporting on prediction accuracy for pixel-based validation using confusion or contingency matrices is straightforward, there are two possibilities for site-based validation: (1) pixel prediction can be reported and (2) site-level prediction in terms of correct pixels per site can be quantified. The latter would provide an indicator of whether a site was correctly predicted overall but would also require that some threshold of correctness be identified. For example, if over 80% of a site's pixels were correctly predicted, the site could be determined to be correctly predicted. This logic would also lead to further qualitative assessments of prediction, introducing the concept of whether some misclassifications are more egregious than others. For example, a misclassification of 30% of a pine savanna class to grassland may be more acceptable than its classification as broadleaf forest. The implications of land cover misclassification for global land surface models are described by DeFries and Los (1999).

1.2. Research need

The purpose of this research is to support the development by Boston University of global land cover and land cover change products at 1000 m and 0.25° resolutions, produced quarterly using an annual temporal sequence of multispectral and multiresolution MODIS data (Justice et al., 1998; Strahler et al., 1999). The authors have recently published results on the application of neural network and decision tree classification algorithms for mapping the vegetation and land cover (VLC) of Central America (Muchoney & Williamson, 2001; Muchoney et al., 2000).
Evaluation of these classification algorithms using several remote sensing data sets has shown that they produce comparable results that are consistently superior to those produced by Bayesian (maximum-likelihood) classification algorithms (Friedl & Brodley, 1998; Friedl, Brodley, & Strahler, 1999; Gopal, Woodcock, & Strahler, 1999; Muchoney et al., 2000). The classification algorithm used in producing the quarterly MODIS land cover map will, therefore, be either a univariate decision tree, a supervised neural network (Gopal et al., 1999), or a hybrid of the two. The land cover classification that will be employed is the 17-class IGBP classification (Belward & Loveland, 1995).

1.3. Problem formulation

There are two primary problems associated with defining a statistically valid independent estimate of accuracy: spatial autocorrelation and representative sampling. Related issues are sampling cost and population/sample characteristics. The problem of spatial autocorrelation is well understood (e.g. Campbell, 1981) and recent studies have


elucidated this problem as it relates to calibration and validation (Friedl et al., 2000). Because pixels (samples) drawn from the same site are autocorrelated, they cannot be considered independent. Using the same data sets as this research, Friedl et al. (2000) found strong clustering in the NDVI data within sites, indicating substantial autocorrelation at the scale of the site data. The same applies to sites as samples. The autocorrelation function defines the distance by which two sample polygons must be separated for them to be independent. Therefore, when comparing pixel- and polygon-based calibration and validation, it is inappropriate to assume that all polygon sampling provides independent accuracy assessment. Sampling should be representative of the overall population, and the simplest way to ensure representative sampling is to use random or systematic sampling. This method, employed in the validation of the global IGBP land cover map (Belward, Estes, & Kline, 1999), is however prohibitively costly. Directed sampling for representativeness is also problematic. Barring intensive sampling or complete enumeration, there is no formal way of knowing whether a distribution of samples is indeed representative of the overall population (Campbell, 1981). Calibration and validation sites are typically delineated to be as homogeneous as possible. Pixels within a site are defined specifically to be homogeneous and are therefore correlated. The dilemma is that while (spatially auto)correlated sites are not independent, spatially independent sites belonging to the same general class are most probably uncorrelated and represent uncorrelated subpopulations of the general class. For global land cover mapping, the hope would be that distant things such as deciduous needleleaf forests in Canada and Russia would be more related than other classes in closer proximity.
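As a rough illustration of how the spatial autocorrelation discussed above might be quantified, the sketch below computes Moran's I, a standard autocorrelation statistic that is not part of this study's method, for values observed at point locations. The function name `morans_i`, the binary distance-band weighting, and the small synthetic grid are all assumptions made for illustration.

```python
import numpy as np

def morans_i(values, coords, max_dist):
    """Moran's I with binary weights: w_ij = 1 when 0 < distance(i, j) <= max_dist."""
    values = np.asarray(values, dtype=float)
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances between all locations.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    weights = ((dist > 0) & (dist <= max_dist)).astype(float)
    z = values - values.mean()
    numerator = (weights * np.outer(z, z)).sum()
    denominator = (z ** 2).sum()
    return (len(values) / weights.sum()) * (numerator / denominator)

# Hypothetical 4 x 4 grid of pixel centers (row, column), 1 unit apart.
coords = np.array([(r, c) for r in range(4) for c in range(4)], dtype=float)

# A smooth east-west gradient: neighboring pixels resemble each other.
gradient = coords[:, 1]
# A checkerboard: every neighbor differs, the opposite extreme.
checker = coords.sum(axis=1) % 2

i_gradient = morans_i(gradient, coords, max_dist=1.0)
i_checker = morans_i(checker, coords, max_dist=1.0)
print(i_gradient, i_checker)
```

Values near +1 indicate that nearby samples are similar, which is the situation that makes pixels drawn from one site poor substitutes for independent validation data. Repeating the calculation for increasing `max_dist` gives a crude view of the separation at which the statistic approaches zero.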
The problem is how to generate samples for calibration and validation that are not autocorrelated and yet are representative of the population and subpopulations. A related question is whether the poorer accuracy of polygon versus pixel calibration/validation necessarily means that polygon calibration/validation provides a more correct indication of the real accuracy of a classified map than pixel-based calibration/validation.

1.4. Objectives

To address the problem of pixel- versus site-based calibration/validation and to further explore the functioning of Gaussian ARTMAP, two tests were performed. First, supervised classification based on the Gaussian ARTMAP algorithm was used to map the VLC of Central America using 12 months of monthly composited AVHRR NDVI data for 1995. Second, derived temperature, NDVI, and temporal metric and environmental data were used as additional features for both pixel- and site-based calibration and validation of supervised Gaussian ARTMAP. The latter data set comprised 42 features of NDVI,

temperature, and temporal metric and topographic data. The additional features were not used in a feature selection context but were used to provide a higher dimension data set on which to test pixel- and polygon-based calibration and validation. The purpose of this paper is to report on the comparative studies of accuracy derived from the use of pixel and polygon calibration and validation data. The Gaussian ARTMAP neural network classification algorithm (Williamson, 1996, 1997) provides additional insight into the problem of simultaneously developing accuracy assessment procedures that address both spatial autocorrelation and the fact that using spatially distant calibration and validation sites needs to include subpopulations that may or may not be correlated in feature space.

2. Study area and data

2.1. Study area

The dimensions of the study area are 1169 × 1813 1-km AVHRR pixels bounded by 6 9 N latitude and 77.22 93 W longitude (Fig. 1). The regional site comprises southern Mexico, Guatemala, Belize, Honduras, El Salvador, Nicaragua, Costa Rica, and Panama. Central America includes a diverse array of natural and human-modified landscapes including broadleaf evergreen, deciduous and semideciduous forests, pine savanna and woodlands, swamp and mangrove forests, herbaceous wetlands, and agricultural types. As such, it provides an excellent test for regional classification.

2.2. Site data

The 428 sites were distributed among 23 VLC classes and extracted from Landsat Thematic Mapper (TM), Système pour l'Observation de la Terre-Haute Résolution Visible (SPOT-HRV), NOAA AVHRR, and plot data. The criteria for selecting sites were that they be at least 2 × 2 km and that site polygons be delineated within larger patches of classes, with at least a 1-km buffer from the polygon boundary to the patch boundary. This was to ensure that misregistration and mislocation of the AVHRR data and the coreferenced TM or HRV data would not cause a site to represent land cover outside of the patch. Representative sampling was attempted by acquiring samples based on knowledge of the primary landscape units among the 23 VLC classes. Using ancillary plot and map data, teams of analysts interpreted TM and HRV data to ascribe a set of variables that described vegetation horizontal and vertical structure, leaf phenology and morphology, and site variables (see Muchoney et al., 1999, 2000). Three class labels were also defined: IGBP class, physiognomic formation, and the 23-class VLC system specific to Central America. Although the IGBP classification system (Belward & Loveland, 1995) is the standard


Fig. 1. Study area and test site locations.

for Boston University's global land cover products, other classification systems are being evaluated. The original 428 data sites were reanalyzed for quality and restricted to sites where Landsat TM and SPOT-HRV data were available for use in site variable definition. This reduced the areal and numeric sampling from 428 to 344 sites, although the new site database is most probably more accurate and free of labeling errors. The location and distribution of these sites are provided in Fig. 1 and Table 1.

Table 1
Site distribution by vegetation and land cover class

VLC class  Class name                        Sites  Pixels (km²)  Mean site area (km²)
1          Needleleaf evergreen forest           9           216                  24.0
2          Broadleaf evergreen forest          108          2538                  23.5
3          Broadleaf deciduous forest            6            76                  12.7
4          Mixed forest                         14           408                  29.1
5          Swamp forest                         10           211                  21.1
6          Mangrove                             15            81                   5.4
7          Evergreen broadleaf woodland         29           845                  29.1
8          Evergreen needleleaf woodland         9           220                  24.4
9          Mixed woodland                        4            50                  12.5
10         Evergreen broadleaf savanna           6           191                  31.8
11         Evergreen needleleaf savanna          6           286                  47.7
12         Mixed tree savanna                    7           129                  18.4
13         Evergreen broadleaf scrub/shrub       4           171                  42.8
14         Cactus/thorn scrub                    2            34                  17.0
15         Swamp scrub/shrub                     4            29                   7.3
16         Perennial graminoid grassland         8            82                  10.3
17         Herbaceous wetland                    8            92                  11.5
18         Grassland pasture                    11           281                  25.5
19         Cropland                             34           797                  23.4
20         Agriculture complex                  16           341                  21.3
21         Bareground/sparse vegetation          2             8                   4.0
22         Urban-industrial                     19           116                   6.1
23         Water                                13           488                  37.5
           Sums/means                          344          7690                  22.4

2.3. Remotely sensed and physical data

The remotely sensed data used in this study are monthly composited AVHRR NDVI and surface temperature data developed from 10-day composite data for the months of


February 1995 to January 1996 provided by the US Geological Survey EROS Data Center (USGS-EDC). Monthly compositing using maximum NDVI reduces cloud, topographic effects, and scan angle dependence of radiance (Duggin et al., 1982). It also tends to remove extreme off-nadir pixels (Eidenshink & Faundeen, 1994; Holben, 1986). Surface temperature was generated by USGS-EDC based on a split-window technique using 10-day maximum-value composites of Channels 3 and 4 AVHRR data for 1995. These were composited to maximum monthly values. The premise for using the temperature data was that they might provide additional discriminatory variables (features) for classification. Temporal metrics are indices developed from multitemporal data that are assumed to have some relevance to the temporal dynamics of vegetation and ecosystems, reflecting biological and land use practices associated with vegetation phenology and cropping systems (Reed et al., 1994). For the monthly composited NDVI data, these metrics consisted of mean annual NDVI, minimum monthly NDVI, maximum monthly NDVI, range of monthly NDVI, and month of maximum and minimum NDVI. The temperature metrics were mean annual temperature and annual temperature range, minimum and maximum monthly temperature, and the corresponding months in which the maximum and minimum temperatures occurred. The first six principal components were generated for the 12 months of NDVI and used as additional features in the analysis. Additional environmental variables consisted of a digital elevation model and derivative slope, aspect, and elevation data. The digital terrain model used to generate the elevation zones is the USGS-EDC 30-arc sec data. Slope gradient and slope aspect were derived from the elevation data. The raster DEM was converted to a triangulated irregular network (TIN). From the TIN structure, slope, aspect, and elevation classes were generated. Slope classes represented eight classes in slope degree units.
Slope aspect classes correspond to 9 directional degree units. Table 2 describes the remote sensing and environmental data used in this analysis.

Table 2
Remote sensing/environmental data variables

Feature class         Feature                        Variable name  Data type    Data range  Data units  Source
NDVI                  12 monthly NDVI means          c1_x–c12_x     Interval     0–255       index       NDVI
Temporal NDVI         Maximum monthly NDVI           x_max          Interval     0–255       index       NDVI
                      Minimum monthly NDVI           x_min          Interval     0–255       index       NDVI
                      Month of maximum NDVI          mo_max         Categorical  1–12        months      NDVI
                      Month of minimum NDVI          mo_min         Categorical  1–12        months      NDVI
                      Mean annual NDVI               xndvi          Interval     0–255       index       NDVI
                      NDVI range                     x_range        Interval     0–100       index       NDVI
                      NDVI principal components      PCA_1–6        Interval     0–255       index       DEM
Topography            Slope                          SLOPE          Ratio        0–25        degrees     DEM
                      Aspect                         ASP            Interval     0–255       degrees     DEM
                      Elevation                      ELEV           Ratio        0–255       meters      DEM
Temperature           12 monthly temperature means   t1_x–t12_x     Interval     0–255       Kelvin      AVHRR
Temporal temperature  Maximum monthly temperature    tmax           Interval     0–255       index       AVHRR
                      Minimum monthly temperature    tmin           Interval     0–255       index       AVHRR
                      Month of maximum temperature   mtmax          Categorical  1–12        months      AVHRR
                      Month of minimum temperature   mtmin          Categorical  1–12        months      AVHRR

3. Research approach

The research consisted of two stages. (1) Supervised classification of just the monthly composited NDVI data was performed based on both random pixel and polygon splits. (2) The full set of 42 features, including the monthly NDVI data, was classified using the same calibration/validation splits. Since earlier research indicated that classification results were similar for the algorithms of interest to MODIS land cover, only the Gaussian ARTMAP algorithm was employed in this study.

3.1. Gaussian ARTMAP

Gaussian ARTMAP is an adaptation of ART that uses Gaussian distributions to define the ART category choice functions. For each category, there is an associated a priori probability and a mean and variance in each input dimension. ART categories (hidden units or F2 nodes) learn predictions or mappings to output classes during training. The Gaussian ART activation function evaluates the probability that an input belongs to a category's distribution as well as the category's a priori probability. The match function is based on how well the input fits the category's distribution, which is normalized to unit height. The likely class prediction is based on these activations (Williamson, 1996, 1997). For Gaussian ARTMAP, high vigilance (ρ) means that more internal F2 categories are created by the network to match input data to output categories. The categories are then less broad in the feature space. Gaussian ARTMAP accommodates choice and distributed learning. In choice learning, the maximally activated category is chosen. That is, the chosen category's match function satisfies the vigilance criterion, and the category


resonates and learns the prediction. In distributed learning, each category is assigned credit based on its proportion of the net activation of all categories whose match function satisfies the vigilance criterion. When Gaussian ART is extended to Gaussian ARTMAP, the prediction of an output class during testing is akin to picking the class with the highest net probability. This is similar to methods of potential and radial basis functions. The activations of all categories sharing the same prediction are summed to yield the most probable class prediction (Muchoney & Williamson, 2001). In a supervised mode, Gaussian ARTMAP chooses the output class with the highest conditional probability and also provides an indication of the strength of the relationship with the test data. Gaussian ARTMAP represents the input data density with separable Gaussian distributions, with the number and inclusivity of the distributions a function of the vigilance parameter. Gaussian ARTMAP most efficiently represents data that are uncorrelated across dimensions and represents the mean, variance, and the a priori probabilities of the category (Williamson, 1996). Calibrating (training) Gaussian ARTMAP only requires estimating category means, covariances, and a priori probabilities from the training data. Like the maximum-likelihood classification algorithm, Gaussian ARTMAP is based on the assumption that, for each category, the input channels are normally distributed. However, because multiple categories can map to a single output class, there is no assumption that the class conditional distributions are normal. Rather, the assumption is that they are mixtures of normal distributions (Williamson, 1996, 1997). In sum, Gaussian ARTMAP generates as many internal F2 categories as are needed to match input data to output classes. This differs from the back-propagation neural network algorithms, where the number of categories must be prespecified.

3.2. Calibration and validation subsets

We performed two types of calibration and validation: pixel based and site based. For the pixel type, we randomly selected 80% of the pixels as the calibration subset, and the remaining 20% of the pixels served as the validation subset. For the site-based calibration/validation, we used stratified random sampling by class to randomly select 80% of the sites into a calibration subset, while the remaining 20% of the sites served as the validation subset. To ensure that our results were not dependent on a single randomization, we performed five random splits for both pixels and sites. Repeating the selection procedure five times provided five sets of calibration and validation pixels and five sets of sites that were used to generate an average accuracy for both pixels and sites.

3.3. Supervised classification of monthly NDVI data

Supervised classification was based on using 80/20 calibration/validation splits for both pixels and polygons.
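The two splitting schemes described in Section 3.2 can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the synthetic site layout, and the choice to round each class's calibration share down while keeping at least one calibration site are all assumptions, since the text does not say how classes with very few sites were handled.

```python
import numpy as np

def pixel_split(n_pixels, rng, cal_fraction=0.8):
    """Pixel-based split: each pixel is assigned independently of its site."""
    order = rng.permutation(n_pixels)
    n_cal = int(np.floor(cal_fraction * n_pixels))
    is_cal = np.zeros(n_pixels, dtype=bool)
    is_cal[order[:n_cal]] = True
    return is_cal

def site_split(site_ids, site_classes, rng, cal_fraction=0.8):
    """Site-based split, stratified by class: whole sites go to one subset.

    site_ids     -- site identifier of every pixel
    site_classes -- dict mapping site identifier to class label
    """
    cal_sites = []
    for label in sorted(set(site_classes.values())):
        sites = np.array(sorted(s for s, c in site_classes.items() if c == label))
        rng.shuffle(sites)
        # Assumption: round down, but keep at least one calibration site.
        n_cal = max(1, int(np.floor(cal_fraction * len(sites))))
        cal_sites.extend(sites[:n_cal].tolist())
    return np.isin(site_ids, cal_sites)

# Hypothetical data: 10 sites in 2 classes, 5 pixels per site.
rng = np.random.default_rng(0)
site_classes = {site: site % 2 for site in range(10)}
site_ids = np.repeat(np.arange(10), 5)

# Five random splits of each kind, as in the study design.
pixel_masks = [pixel_split(len(site_ids), rng) for _ in range(5)]
site_masks = [site_split(site_ids, site_classes, rng) for _ in range(5)]

# Under the site split, no site contributes pixels to both subsets.
mixed_sites = sum(
    len(set(mask[site_ids == site].tolist())) > 1
    for mask in site_masks for site in range(10)
)
print("sites split across subsets (site-based):", mixed_sites)
```

Under the pixel split, most sites end up with pixels on both sides of the split, which is the source of the dependence between calibration and validation data discussed in Section 1.3.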

For both the polygon- and pixel-based calibration/validation splits, pixel counts were employed in the accuracy assessment. The classification system used was the VLC system developed by the Central America Vegetation Working Group of the PROARCA-CAPAS program (Muchoney et al., 2000). This classification system was developed because of its applicability to natural resource characterization in Central America. In the second stage of the supervised classification, 12 months of composited NDVI data, surface temperature, temporal metrics relating to temperature and NDVI, and environmental data were used as additional features for classification. Additional features have been found to provide additional information to classification algorithms and thus improve classification accuracy (DeFries, Hansen, & Townshend, 1995). Although the purpose of this study was not feature selection, use of these additional features provided a higher dimensional data set.
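To make the additional features concrete, the sketch below derives monthly maximum-value composites and the NDVI temporal metrics named in Section 2.3 from a per-pixel time series. It is an illustration under stated assumptions: the function names are hypothetical, three 10-day composites per month are assumed, and the example profile simply reuses the monthly means of F2 category 66 from Table 4.

```python
import numpy as np

def monthly_max_composite(ten_day):
    """Maximum-value compositing of 10-day data into monthly values.

    ten_day -- array of shape (n_pixels, 36); three composites per month
               is an assumption made for this sketch.
    """
    ten_day = np.asarray(ten_day, dtype=float)
    return ten_day.reshape(ten_day.shape[0], 12, 3).max(axis=2)

def temporal_metrics(monthly):
    """Per-pixel temporal metrics from an (n_pixels, 12) monthly series."""
    monthly = np.asarray(monthly, dtype=float)
    return {
        "mean": monthly.mean(axis=1),
        "max": monthly.max(axis=1),
        "min": monthly.min(axis=1),
        "range": monthly.max(axis=1) - monthly.min(axis=1),
        # Months are numbered 1-12; ties resolve to the earliest month.
        "month_of_max": monthly.argmax(axis=1) + 1,
        "month_of_min": monthly.argmin(axis=1) + 1,
    }

# Example profile: monthly NDVI means of F2 category 66 (Table 4).
profile = np.array([[182.1, 176.4, 171.9, 174.6, 179.0, 180.0,
                     180.2, 176.8, 176.4, 177.7, 176.8, 182.1]])
metrics = temporal_metrics(profile)

# Compositing check on a synthetic series 0, 1, ..., 35.
composited = monthly_max_composite(np.arange(36).reshape(1, 36))
print(metrics["month_of_max"], metrics["month_of_min"], composited[0, :3])
```

The same `temporal_metrics` function would apply unchanged to the monthly temperature series.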

4. Results

4.1. Supervised classification

Table 3 summarizes the results of supervised classification using the STEP-Vegetation classification system based on the 12-month NDVI and the full 42 features for both pixel and polygon calibration/validation splits averaged over

Table 3
Supervised Gaussian ARTMAP classification results (pixel-based percentage agreement)

           Pixel                       Polygon
VLC class  12 features  42 features   12 features  42 features
1                93.18        88.64         41.43        35.71
2                90.16        96.26         89.10        94.87
3                31.25        62.50          0.00         0.00
4                80.49        86.59         42.86        89.29
5                32.56        65.12          2.38         2.38
6                 0.00        41.18          0.00        12.50
7                45.56        65.09         18.75        11.25
8                31.82        84.09          8.70        43.48
9                30.00        10.00          0.00         0.00
10               56.41        89.74          2.17         0.00
11               98.28        96.55         63.53        74.12
12               57.69        73.08         17.86         3.57
13               80.00        97.14         82.61        84.78
14              100.00       100.00         20.00        10.00
15               16.67        66.67         62.50         0.00
16               29.41        58.82          0.00         0.00
17               15.79        31.58          0.00         0.00
18               61.40        77.19         19.15         4.26
19               74.38        88.75         49.15        72.32
20               55.07        81.16          2.11        16.84
21                0.00        50.00          0.00         0.00
22               62.50        79.17         54.90        74.51
23               86.49        83.78         94.59        94.59
Overall          70.85        84.35         48.83        55.14


Fig. 2. Central America vegetation and land cover - 1995.

the five iterations. Fig. 2 is the classification result based on using all features and calibration data, the assumption being that classification accuracy was already characterized and that using all calibration data would provide the best overall map of VLC. This map was not postprocessed and represents the actual Gaussian ARTMAP classification output.

Fig. 3. Relationship of sites to nodes by class.

Table 4
Example class F2 category statistics: monthly NDVI mean (S.D.)

Month  Class 5, F2 66  Class 14, F2 197  Class 14, F2 207
1      182.1 (9.3)     134.0 (8.7)       134.4 (25.4)
2      176.4 (19.3)    130.4 (11.3)      132.1 (25.8)
3      171.9 (10.7)    138.2 (22.9)      138.5 (72.1)
4      174.6 (9.8)     142.7 (18.1)      142.7 (63.9)
5      179.0 (8.6)     154.8 (40.3)      161.8 (42.9)
6      180.0 (12.8)    162.8 (27.3)      168.8 (33.0)
7      180.2 (10.3)    169.3 (20.0)      173.1 (27.9)
8      176.8 (13.1)    169.4 (19.9)      173.1 (27.9)
9      176.4 (19.1)    165.8 (15.2)      170.3 (25.9)
10     177.7 (11.7)    165.0 (17.4)      170.3 (26.1)
11     176.8 (9.7)     157.9 (19.7)      162.9 (33.3)
12     182.1 (9.3)     147.5 (21.6)      151.7 (34.5)
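The match and activation functions described in Section 3.1 can be illustrated with the category statistics in Table 4. The sketch below is a simplified reading of that description, not the published Gaussian ARTMAP implementation: it covers prediction only (no learning), treats input dimensions as independent, drops constant factors from the Gaussian density, and assumes equal a priori probabilities because the table does not report them. The names `match`, `activation`, and `predict` are hypothetical.

```python
import numpy as np

# Monthly NDVI means and standard deviations copied from Table 4.
# Equal priors are an assumption; the table does not report them.
CATEGORIES = [
    {"f2": 66, "label": 5, "prior": 1 / 3,
     "mean": np.array([182.1, 176.4, 171.9, 174.6, 179.0, 180.0,
                       180.2, 176.8, 176.4, 177.7, 176.8, 182.1]),
     "sd": np.array([9.3, 19.3, 10.7, 9.8, 8.6, 12.8,
                     10.3, 13.1, 19.1, 11.7, 9.7, 9.3])},
    {"f2": 197, "label": 14, "prior": 1 / 3,
     "mean": np.array([134.0, 130.4, 138.2, 142.7, 154.8, 162.8,
                       169.3, 169.4, 165.8, 165.0, 157.9, 147.5]),
     "sd": np.array([8.7, 11.3, 22.9, 18.1, 40.3, 27.3,
                     20.0, 19.9, 15.2, 17.4, 19.7, 21.6])},
    {"f2": 207, "label": 14, "prior": 1 / 3,
     "mean": np.array([134.4, 132.1, 138.5, 142.7, 161.8, 168.8,
                       173.1, 173.1, 170.3, 170.3, 162.9, 151.7]),
     "sd": np.array([25.4, 25.8, 72.1, 63.9, 42.9, 33.0,
                     27.9, 27.9, 25.9, 26.1, 33.3, 34.5])},
]

def match(x, mean, sd):
    """Gaussian match normalized to unit height (1.0 when x equals the mean)."""
    z = (np.asarray(x, dtype=float) - mean) / sd
    return float(np.exp(-0.5 * np.sum(z ** 2)))

def activation(x, mean, sd, prior):
    """Prior-weighted separable Gaussian density, constant factors dropped."""
    return prior * match(x, mean, sd) / float(np.prod(sd))

def predict(x, categories, vigilance=0.0):
    """Sum the activations of categories that pass vigilance, by class."""
    totals = {}
    for cat in categories:
        if match(x, cat["mean"], cat["sd"]) > vigilance:
            act = activation(x, cat["mean"], cat["sd"], cat["prior"])
            totals[cat["label"]] = totals.get(cat["label"], 0.0) + act
    best = max(totals, key=totals.get) if totals else None
    return best, totals

# Two test profiles: the mean profiles of F2 categories 66 and 197.
class_for_66, _ = predict(CATEGORIES[0]["mean"], CATEGORIES)
class_for_197, _ = predict(CATEGORIES[1]["mean"], CATEGORIES)
print(class_for_66, class_for_197)
```

Because both Class 14 categories contribute to the same class total, an input near either one is assigned to Class 14, which is how several F2 categories can jointly represent one non-Gaussian class.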

The results indicate that using the full feature space improved overall agreement by 19% in relative terms for pixel-based calibration and validation (from 70.85% to 84.35%) and by 13% for polygon calibration/validation (from 48.83% to 55.14%). There were trade-offs between improvement and degradation of accuracy for specific classes, indicating that supervised classification, which relies on a priori classification systems, is still problematic. Even for the less demanding test of pixel-based validation, some classes do not map well, although the association of the error rate with the number of samples indicates that a higher sampling rate might improve the mapping of these poorly mapped classes. Because it is difficult to identify a large number of samples for small, lesser-represented, irregularly shaped, and linear vegetation such as mangrove (Class 6), it may be necessary to drop such classes from a regional mapping scheme or to use a double-sampling approach to characterize them. Pixel-based training and testing produced substantially higher accuracy than site-based training and testing. This is most probably because pixel splits are correlated, providing sufficient information to make good predictions, and because subclasses of land cover are sufficiently different as to require stratification of these subclasses to produce better site-level predictions. The use of additional features improved the accuracy of both the pixel- and polygon-based approaches. This improvement was proportionally similar for both the pixel- and site-based classifications.
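The per-class percentage agreement reported in Table 3, and the site-level alternative raised in Section 1.1 (a site counts as correct when more than 80% of its pixels are correctly predicted), can be sketched as follows. The function names and the tiny synthetic validation set are assumptions for illustration; this study itself reported pixel counts only.

```python
import numpy as np

def per_class_agreement(y_true, y_pred):
    """Percentage of validation pixels of each class predicted correctly."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return {
        int(label): 100.0 * float(np.mean(y_pred[y_true == label] == label))
        for label in np.unique(y_true)
    }

def site_level_agreement(y_true, y_pred, site_ids, threshold=0.8):
    """Mark a site correct when more than `threshold` of its pixels are correct."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    site_ids = np.asarray(site_ids)
    correct = y_true == y_pred
    return {
        int(site): bool(correct[site_ids == site].mean() > threshold)
        for site in np.unique(site_ids)
    }

# Hypothetical validation set: three sites of five pixels each.
site_ids = [0] * 5 + [1] * 5 + [2] * 5
y_true = [1] * 10 + [2] * 5
y_pred = [1, 1, 1, 1, 1,   1, 1, 1, 2, 2,   2, 2, 2, 2, 2]

by_class = per_class_agreement(y_true, y_pred)
by_site = site_level_agreement(y_true, y_pred, site_ids)
overall = 100.0 * float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
print(by_class, by_site, round(overall, 2))
```

In this example, class 1 reaches 80% pixel agreement, yet only one of its two sites passes the site-level threshold, showing how the two reporting conventions can diverge.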

Fig. 4. Example F2 category predictions.


4.2. Gaussian ARTMAP classification

Fig. 3 presents the relationship of the number of sites to the number of F2 nodes that were created for each class. The implication is that while some classes may be represented by a few nodes or categories, others require more nodes to be predicted and mapped. Classes that can be represented by fewer categories generalize well. As for sites, some sites require one or more nodes, indicating that they are either unique subtypes or heterogeneous, respectively. An important characteristic of the Gaussian ARTMAP algorithm is its ability to predict and map the specific F2 nodes associated with a given class. As an example, Table 4 provides the Gaussian means and standard deviations for three F2 categories associated with Classes 5 (swamp forest) and 14 (cactus/thorn scrub). Fig. 4 is a mapped prediction of these categories. In the case of the cactus/thorn scrub class, only two F2 categories were needed to characterize it in feature space. Of the eight F2 categories representing swamp forest, the category predicted here is associated solely with the swamp forests of Punta Sal, Honduras and Cabo de Tres Puntos, Guatemala. The implications of this research for calibrating and validating supervised classification of VLC are that pixel-based calibration/validation tends to overestimate classification accuracy because the (neural network) classification algorithms are able to predict outputs based on minimal examples (calibration data). Developing a large enough sample of sites for site-based calibration and validation that is also representative of subpopulations of VLC types is also problematic. This research provides a quantitative assessment of the nature of the difference in pixel versus polygon calibration and validation for a regional study area.
A finding of this study is that while pixel-based calibration and validation may tend to overestimate accuracy due to spatial autocorrelation and the learning efficiency of the Gaussian ARTMAP algorithm, polygon calibration/validation may underestimate accuracy unless calibration and validation are performed at the subpopulation level. That is, it is necessary to sample all subpopulations of a particular class. This is difficult to achieve because of the lack of knowledge of subpopulation distribution, the cost of generating a sufficient sample size at the subpopulation level, and the fact that subpopulations can be very small, in some cases singular entities. The true accuracy most likely falls somewhere between the two accuracies.

Gaussian ARTMAP does provide insight into the problem. It segments the landscape into subpopulations (separable F2 categories) based on site data. In this case, the site data were defined to be representative of the VLC of Central America. For individual sites, the number and characteristics of the F2 categories provide an indication of site homogeneity/heterogeneity. For all sites, it provides the number of F2 categories needed to represent a class, which gives an indication of the generalization properties of sites and classes. Gaussian ARTMAP also provides information on the number of sites that are required to sample the population. At a certain level, additional sites would no longer manifest themselves in additional F2 categories. Therefore, sampling would be redundant. This guideline could be used to specify the minimum number of sites that are needed, minimizing sampling costs. Finally, mapping the F2 categories allows them to be evaluated as to their distribution on the landscape.
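The redundancy guideline can be sketched as an accumulation curve: the cumulative number of distinct F2 categories as sites are added, with a plateau suggesting that further sites add no new subpopulations. The function names, the `patience` threshold, and the toy category sets are assumptions made for illustration. The study does not specify a stopping rule.

```python
def category_accumulation(site_categories):
    """Cumulative count of distinct F2 categories as sites are added in order."""
    seen = set()
    curve = []
    for cats in site_categories:
        seen.update(cats)
        curve.append(len(seen))
    return curve

def sites_until_plateau(curve, patience=3):
    """Number of sites after which `patience` further sites add no new category.

    Returns None if the curve never stays flat for `patience` sites.
    """
    for i in range(len(curve) - patience):
        if curve[i + patience] == curve[i]:
            return i + 1
    return None

# Hypothetical F2 categories associated with each of seven sites of one class.
site_categories = [{"a"}, {"a", "b"}, {"c"}, {"b"}, {"a", "c"}, {"b"}, {"c"}]

curve = category_accumulation(site_categories)
print(curve)                       # [1, 2, 3, 3, 3, 3, 3]
print(sites_until_plateau(curve))  # 3
```

The curve depends on the order in which sites are added, so in practice it would be reasonable to average it over several random orderings before reading off a minimum number of sites.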

Acknowledgments

This research was supported by NASA (Contract NAS5-31369) and the Proyecto Ambiental de Centroamerica (PROARCA-CAPAS), the Comision de la Centroamericana de Ambiente y Desarrollo (CCAD), and The Nature Conservancy. The authors further wish to acknowledge the support of Jordan Borak, Huaying Chi, Mark Friedl, Sucharita Gopal, John Hodges, Douglas McIver, and Curtis Woodcock of the Boston University Department of Geography/Center for Remote Sensing, Jess Brown of USGS-EDC, and Jim Williamson of the Boston University Center for Cognitive and Neural Systems.

