
REMOTE SENS. ENVIRON. 37:35-46 (1991)

A Review of Assessing the Accuracy of Classifications of Remotely Sensed Data

Russell G. Congalton
Department of Forestry and Resource Management, University of California, Berkeley

This paper reviews the necessary considerations and available techniques for assessing the accuracy of remotely sensed data. Included in this review are the classification system, the sampling scheme, the sample size, spatial autocorrelation, and the assessment techniques. All analysis is based on the use of an error matrix or contingency table. Example matrices and results of the analysis are presented. Future trends, including the need for assessment of other spatial data, are also discussed.

INTRODUCTION

With the advent of more advanced digital satellite remote sensing techniques, the necessity of performing an accuracy assessment has received renewed interest. This is not to say that accuracy assessment is unimportant for the more traditional remote sensing techniques. However, given the complexity of digital classification, there is more of a need to assess the reliability of the results. Traditionally, the accuracy of photointerpretation has been accepted as correct without any confirmation. In fact, digital classifications are often assessed with reference to photointerpretation. An obvious assumption made here is that the photointerpretation is 100% correct. This assumption is rarely valid and can lead to a rather poor and unfair assessment of the digital classification (Biging and Congalton, 1989).

Therefore, it is essential that researchers and users of remotely sensed data have a strong knowledge of both the factors that need to be considered and the techniques used in performing any accuracy assessment. Failure to know these techniques and considerations can severely limit one's ability to effectively use remotely sensed data. The objective of this paper is to provide a review of the appropriate analysis techniques and a discussion of the factors that must be considered when performing any accuracy assessment. Many analysis techniques have been published in the literature; however, I believe that it will be helpful to many novice and established users of remotely sensed data to have all the standard techniques summarized in a single paper. In addition, it is important to understand the analysis techniques in order to fully realize the importance of the various other considerations for accuracy assessment discussed in this paper.

Address correspondence to R. G. Congalton, 145 Mulford Hall, Department of Forestry and Resource Management, University of California, Berkeley, CA 94720.
Received 15 October 1990; revised 14 April 1991.

TECHNIQUES

Until recently, the idea of assessing the classification accuracy of remotely sensed data was treated


more as an afterthought than as an integral part of any project. In fact, as recently as the early 1980s many studies would simply report a single number to express the accuracy of a classification. In many of these cases the accuracy reported was what is called non-site-specific accuracy. In a non-site-specific accuracy assessment, locational accuracy is completely ignored. In other words, only total amounts of a category are considered without regard for the location. If all the errors balance out, a non-site-specific accuracy assessment will yield very high but misleading results. In addition, most assessments were conducted using the same data set as was used to train the classifier. This training and testing on the same data set also results in overestimates of classification accuracy.

Once these problems were recognized, many more site-specific accuracy assessments were performed using an independent data set. For these assessments, the most common way to represent the classification accuracy of remotely sensed data is in the form of an error matrix. Using an error matrix to represent accuracy has been recommended by many researchers and should be adopted as the standard reporting convention. The reasons for choosing the error matrix as the standard are clearly demonstrated in this paper.

An error matrix is a square array of numbers set out in rows and columns which express the number of sample units (i.e., pixels, clusters of pixels, or polygons) assigned to a particular category relative to the actual category as verified on the ground (Table 1). The columns usually represent the reference data while the rows indicate the classification generated from the remotely sensed data. An error matrix is a very effective way to represent accuracy in that the accuracies of each category are plainly described along with both the errors of inclusion (commission errors) and errors of exclusion (omission errors) present in the classification.

Descriptive Techniques

The error matrix can then be used as a starting point for a series of descriptive and analytical statistical techniques. Perhaps the simplest descriptive statistic is overall accuracy, which is computed by dividing the total correct (i.e., the sum of the major diagonal) by the total number of pixels in the error matrix. In addition, accuracies of individual categories can be computed in a similar manner. However, this case is a little more complex in that one has a choice of dividing the number of correct pixels in that category by either the total number of pixels in the corresponding row or the corresponding column. Traditionally, the total number of correct pixels in a category is divided by the total number of pixels of that category as derived from the reference data (i.e., the column total). This accuracy measure indicates the probability of a reference pixel being correctly classified and is really a measure of omission error.

Table 1. An Example Error Matrix

                     Reference Data
               D      C     BA     SB   row total
      D       65      4     22     24     115        Land Cover Categories
      C        6     81      5      8     100        D  = deciduous
      BA       0     11     85     19     115        C  = conifer
      SB       4      7      3     90     104        BA = barren
  column      75    103    115    141     434        SB = shrub
  total

OVERALL ACCURACY = 321/434 = 74%

PRODUCER'S ACCURACY           USER'S ACCURACY
D  = 65/75  = 87%             D  = 65/115 = 57%
C  = 81/103 = 79%             C  = 81/100 = 81%
BA = 85/115 = 74%             BA = 85/115 = 74%
SB = 90/141 = 64%             SB = 90/104 = 87%
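The measures tabulated above are simple arithmetic on the matrix. The following sketch (my illustration, not part of the original paper) reproduces the Table 1 computations in plain Python:

```python
# Error matrix from Table 1: rows = classified data, columns = reference
# data, category order D, C, BA, SB.
matrix = [
    [65,  4, 22, 24],   # D  = deciduous
    [ 6, 81,  5,  8],   # C  = conifer
    [ 0, 11, 85, 19],   # BA = barren
    [ 4,  7,  3, 90],   # SB = shrub
]
n = len(matrix)

row_totals = [sum(row) for row in matrix]
col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
total = sum(row_totals)                                   # 434
diagonal = sum(matrix[i][i] for i in range(n))            # 321

overall = diagonal / total                                # 321/434
# Producer's accuracy: correct pixels over the column (reference) total.
producers = [matrix[i][i] / col_totals[i] for i in range(n)]
# User's accuracy: correct pixels over the row (classified) total.
users = [matrix[i][i] / row_totals[i] for i in range(n)]

print(f"overall = {overall:.0%}")              # 74%
print(f"producer's (D) = {producers[0]:.0%}")  # 65/75  = 87%
print(f"user's (D)     = {users[0]:.0%}")      # 65/115 = 57%
```

The same few lines apply to any square error matrix, whatever the number of categories.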

This accuracy measure is often called "producer's accuracy" because the producer of the classification is interested in how well a certain area can be classified. On the other hand, if the total number of correct pixels in a category is divided by the total number of pixels that were classified in that category, then this result is a measure of commission error. This measure, called "user's accuracy" or reliability, is indicative of the probability that a pixel classified on the map/image actually represents that category on the ground (Story and Congalton, 1986).

A very simple example quickly shows the advantages of considering overall accuracy, "producer's accuracy," and "user's accuracy." The error matrix shown in Table 1 indicates an overall map accuracy of 74%. However, suppose we are most interested in the ability to classify deciduous forests. We can calculate a "producer's accuracy" for this category by dividing the total number of correct pixels in the deciduous category (65) by the total number of deciduous pixels as indicated by the reference data (75). This division results in a "producer's accuracy" of 87%, which is quite good. If we stopped here, one might conclude that, although this classification has an overall accuracy that is only fair (74%), it is adequate for the deciduous category. Making such a conclusion could be a very serious mistake. A quick calculation of the "user's accuracy," computed by dividing the total number of correct pixels in the deciduous category (65) by the total number of pixels classified as deciduous (115), reveals a value of 57%. In other words, although 87% of the deciduous areas have been correctly identified as deciduous, only 57% of the areas called deciduous are actually deciduous. A more careful look at the error matrix reveals that there is significant confusion in discriminating deciduous from barren and shrub. Therefore, although the producer of this map can claim that 87% of the time an area that was deciduous was identified as such, a user of this map will find that only 57% of the time will an area he visits that the map says is deciduous actually be deciduous.

Analytical Techniques

In addition to these descriptive techniques, an error matrix is an appropriate beginning for many analytical statistical techniques. This is especially true of the discrete multivariate techniques. Starting with Congalton et al. (1983), discrete multivariate techniques have been used for performing statistical tests on the classification accuracy of digital remotely sensed data. Since that time many others have adopted these techniques as the standard accuracy assessment tools (e.g., Rosenfield and Fitzpatrick-Lins, 1986; Hudson and Ramm, 1987; Campbell, 1987). Discrete multivariate techniques are appropriate because remotely sensed data are discrete rather than continuous. The data are also binomially or multinomially distributed rather than normally distributed. Therefore, many common normal theory statistical techniques do not apply. The following example presented in Tables 2-9 demonstrates the power of these discrete multivariate techniques. The example begins with three error matrices and presents the results of the analysis techniques.

Table 2 presents the error matrices generated from using three different classification algorithms to map a small area of Berkeley and Oakland, California surrounding the University of California campus from SPOT satellite data. The three classification algorithms used included a traditional supervised approach, a traditional unsupervised approach, and a modified approach that combines the supervised and unsupervised classifications together to maximize the advantages of each (Chuvieco and Congalton, 1988). The classification was a simple one using only four categories: forest (F), industrial (I), urban (U), and water (W). All three classifications were performed by a single analyst. In addition, Table 3 presents the error matrix generated for the same area using only the modified classification approach by a second analyst. Each analyst was responsible for performing an accuracy assessment. Therefore, different numbers of samples and different sample locations were selected by each.

Table 2. Error Matrices for the Three Classification Approaches from Analyst #1

Supervised Approach
                   Reference Data
                   F      I      U      W
             F    68      7      3      0
Classified   I    12    112     15     10     Overall Accuracy =
Data         U     3      9     89      0     325/391 = 83%
             W     0      2      5     56

Unsupervised Approach
                   F      I      U      W
             F    60     11      3      4
Classified   I    15    102     14      8     Overall Accuracy =
Data         U     6     13     90      2     304/391 = 78%
             W     2      4      5     52

Modified Approach
                   F      I      U      W
             F    75      6      1      0
Classified   I     4    116     11      3     Overall Accuracy =
Data         U     3      7     96      2     348/391 = 89%
             W     1      1      4     61

Table 3. Error Matrix for the Modified Classification Approach from Analyst #2

Modified Approach -- Analyst #2
                   Reference Data
                   F     AG      U      W
             F    35      6      1      0
Classified  AG     3     82      5     10     Overall Accuracy =
Data         U     4      2     54      0     208/246 = 85%
             W     0      5      2     37

The next analytical step is to "normalize" or standardize the error matrices. This technique uses an iterative proportional fitting procedure which forces each row and column in the matrix to sum to one. In this way, differences in sample sizes used to generate the matrices are eliminated and, therefore, individual cell values within the matrix are directly comparable. In addition, because as part of the iterative process the rows and columns are totaled (i.e., marginals), the resulting normalized matrix is more indicative of the off-diagonal cell values (i.e., the errors of omission and commission). In other words, all the values in the matrix are iteratively balanced by row and column, thereby incorporating information from that row and column into each individual cell value. This process then changes the cell values along the major diagonal of the matrix (correct classifications), and therefore a normalized overall accuracy can be computed for each matrix by summing the major diagonal and dividing by the total of the entire matrix. Consequently, one could argue that the normalized accuracy is a better representation of accuracy than is the overall accuracy computed from the original matrix because it contains information about the off-diagonal cell values. Table 4 presents the normalized matrices from the same three classification algorithms for analyst #1 generated using a computer program called MARGFIT (marginal fitting). Table 5 presents the normalized matrix for the modified approach performed by analyst #2.

Table 4. Normalized Error Matrices for the Three Classification Approaches from Analyst #1

Supervised Approach
                      Reference Data
                      F       I       U       W
             F     0.8652  0.0940  0.0331  0.0073
Classified   I     0.0845  0.7547  0.0784  0.0824   Normalized Accuracy =
Data         U     0.0435  0.1171  0.8319  0.0072   3.3549/4 = 84%
             W     0.0069  0.0342  0.0567  0.9031

Unsupervised Approach
                      F       I       U       W
             F     0.7734  0.1256  0.0387  0.0622
Classified   I     0.1242  0.7014  0.1006  0.0824   Normalized Accuracy =
Data         U     0.0656  0.1163  0.7094  0.0273   3.1022/4 = 78%
             W     0.0369  0.0567  0.0702  0.8370

Modified Approach
                      F       I       U       W
             F     0.9080  0.0687  0.0152  0.0076
Classified   I     0.0372  0.8460  0.0801  0.0366   Normalized Accuracy =
Data         U     0.0370  0.0697  0.8598  0.0334   3.5362/4 = 88%
             W     0.0178  0.0156  0.0450  0.9224

Table 5. Normalized Error Matrix for the Modified Approach from Analyst #2

Modified Approach -- Analyst #2
                      Reference Data
                      F      AG       U       W
             F     0.8519  0.1090  0.0287  0.0113
Classified  AG     0.0464  0.7641  0.0581  0.1313   Normalized Accuracy =
Data         U     0.0897  0.0348  0.8655  0.0094   3.3295/4 = 83%
             W     0.0120  0.0921  0.0477  0.8480

In addition to computing a normalized accuracy, the normalized matrix can also be used to directly compare cell values between matrices. For example, we may be interested in comparing the accuracy each analyst obtained for the forest category using the modified classification approach. From the original matrices we can see that analyst #1 classified 75 sample units correctly while analyst #2 classified 35 correctly. Neither of these numbers means much because they are not directly comparable due to the differences in the number of samples used to generate the error matrix by each analyst. Instead, these numbers would need to be converted into percent so that a comparison could be made. Here another problem arises: Do we divide the total correct by the row total (user's accuracy) or by the column total (producer's accuracy)? We could calculate both and compare the results, or we could use the cell value in the normalized matrix.

Because of the iterative proportional fitting routine, each cell value in the matrix has been balanced by the other values in its corresponding row and column. This balancing has the effect of incorporating producer's and user's accuracies together. Also, since each row and column adds to 1, an individual cell value can quickly be converted to a percentage by multiplying by 100. Therefore, the normalization process provides a convenient way of comparing individual cell values between error matrices regardless of the number of samples used to derive the matrix.

Another discrete multivariate technique of use in accuracy assessment is called KAPPA (Cohen, 1960). The result of performing a KAPPA analysis is a KHAT statistic (an estimate of KAPPA), which is another measure of agreement or accuracy. The KHAT statistic is computed as

\hat{K} = \frac{N \sum_{i=1}^{r} x_{ii} - \sum_{i=1}^{r} (x_{i+} \cdot x_{+i})}{N^{2} - \sum_{i=1}^{r} (x_{i+} \cdot x_{+i})},

where r is the number of rows in the matrix, x_ii is the number of observations in row i and column i, x_i+ and x_+i are the marginal totals of row i and column i, respectively, and N is the total number of observations (Bishop et al., 1975). The KHAT equation is published in this paper to clear up some confusion caused by a typographical error in Congalton et al. (1983), who originally proposed the use of this statistic for remotely sensed data. Since that time, numerous papers have been published recommending this technique. The equations for computing the variance of the KHAT statistic and the standard normal deviate can be found in Congalton et al. (1983), Rosenfield and Fitzpatrick-Lins (1986), and Hudson and Ramm (1987), to list just a few. It should be noted that the KHAT equation assumes a multinomial sampling model and that the variance is derived using the Delta method.

Table 6. A Comparison of the Three Accuracy Measures for the Three Classification Approaches

Classification            Overall      KHAT     Normalized
Algorithm                 Accuracy   Accuracy    Accuracy
Supervised approach          84%        77%        83%
Unsupervised approach        78%        70%        78%
Modified approach            88%        85%        89%

Table 6 provides a comparison of the overall accuracy, the normalized accuracy, and the KHAT statistic for the three classification algorithms used by analyst #1. In this particular example, all three measures of accuracy agree about the relative ranking of the results. However, it is possible for these rankings to disagree simply because each measure incorporates various levels of information from the error matrix into its computations. Overall accuracy only incorporates the major diagonal and excludes the omission and commission errors. As already described, normalized accuracy directly includes the off-diagonal elements (omission and commission errors) because of the iterative proportional fitting procedure. As shown in the KHAT equation, KHAT accuracy indirectly incorporates the off-diagonal elements as a product of the row and column marginals. Therefore, depending on the amount of error included in the matrix, these three measures may not agree. It is not possible to give clearcut rules as to when each measure should be used. Each accuracy measure incorporates different information about the error matrix and therefore must be examined as different computations attempting to explain the error. My experience has shown that if the error matrix tends to have a great many off-diagonal cell values with zeros in them, then the normalized results tend to disagree with the overall and Kappa results. Many zeros occur in a matrix when an insufficient sample has been taken or when the classification is exceptionally good. Because of the iterative proportional fitting routine, these zeros tend to take on positive values in the normalization process, showing that some error could be expected. The normalization process then tends to reduce the accuracy because of these positive values in the off-diagonal cells. If a large number of off-diagonal cells do not contain zeros, then the results of the three measures tend to agree. There are also times when the Kappa measure will disagree with the other two measures. Because of the ease of computing all three measures (software is available from the author) and because each measure reflects different information contained within the error matrix, I recommend an analysis such as the one performed here to glean as much information from the error matrix as possible.

Table 7. Results of the KAPPA Analysis Test of Significance for Individual Error Matrices

Test of Significance of Each Error Matrix
Classification Algorithm    KHAT Statistic    Z Statistic    Result(a)
Supervised approach             .7687            29.41          S(b)
Unsupervised approach           .6956            24.04          S
Modified approach               .8501            39.23          S
(a) At the 95% confidence level.
(b) S = significant.

Table 8. Results of KAPPA Analysis for Comparison between Error Matrices for Analyst #1

Test of Significant Differences between Error Matrices
Comparison                      Z Statistic    Result(a)
Supervised vs. unsupervised        1.8753         NS(b)
Supervised vs. modified            2.3968         S
Unsupervised vs. modified          4.2741         S
(a) At the 95% confidence level.
(b) S = significant, NS = not significant.

Table 9. Results of KAPPA Analysis for Comparison between Modified Approach for Analyst #1 vs. Analyst #2

Test of Significant Differences between Error Matrices
Comparison                         Z Statistic    Result(a)
Modified #1 vs. modified #2           1.6774         NS(b)
(a) At the 95% confidence level.
(b) NS = not significant.

In addition to being a third measure of accuracy, KAPPA is also a powerful technique in its ability to provide information about a single matrix as well as to statistically compare matrices. Table 7 presents the results of the KAPPA analysis to test the significance of each matrix alone. In other words, this test determines whether the results presented in the error matrix are significantly better than a random result (i.e., the null hypothesis: KHAT = 0). Table 8 presents the results of the KAPPA analysis that compares the error matrices two at a time to determine if they are significantly different. This test is based on the standard normal deviate and the fact that, although remotely sensed data are discrete, the KHAT statistic is asymptotically normally distributed. A quick look at Table 8 shows why this test is so important. Despite the overall accuracy of the supervised approach being 6% higher than the unsupervised approach (84% - 78% = 6%), the results of the KAPPA analysis show that these two approaches are not significantly different. Therefore, given the choice of only these two approaches, one should use the easier, quicker, or more efficient approach because the accuracy will not be the deciding factor. Similar results are presented in Table 9 comparing the modified classification approach for analyst #1 with analyst #2.

In addition to the discrete multivariate techniques just presented, other techniques for assessing the accuracy of remotely sensed data have also been suggested. Rosenfield (1981) proposed the use of analysis of variance techniques for accuracy assessment. However, violation of the normal theory assumption and independence assumption when applying this technique to remotely sensed data has severely limited its application. Aronoff (1985) suggested the use of a minimum accuracy value as an index of classification accuracy. This approach is based on the binomial distribution of the data and is therefore very appropriate for remotely sensed data. The major disadvantage of the Aronoff approach is that it is limited to a single overall accuracy value rather than using the entire error matrix. However, it is useful in that this index does express statistically the uncertainty involved in any accuracy assessment. Finally, Skidmore and Turner (1989) have begun work on techniques for assessing error as it accumulates through many spatial layers of information in a GIS, including remotely sensed data. These techniques have included using a line sampling method for accuracy assessment as well as probability theory to accumulate error from layer to layer. It is in this area of error analysis that much new work needs to be performed.

CONSIDERATIONS

Along with the actual analysis techniques, there are many other considerations to note when performing an accuracy assessment. In reality, the techniques are of little value if these other factors are not considered because a critical assumption of all the analysis described above is that the error matrix is truly representative of the entire classification. If the matrix is improperly generated, then all the analysis is meaningless. Therefore, the

following factors must be considered: ground data collection, classification scheme, spatial autocorrelation, sample size, and sampling scheme. Each of these factors provides essential information for the assessment, and failure to consider even one of them could lead to serious shortcomings in the assessment process.

Ground Data Collection

It is obvious that in order to adequately assess the accuracy of the remotely sensed classification, accurate ground, or reference, data must be collected. However, the accuracy of the ground data is rarely known, nor is the level of effort needed to collect the appropriate data clearly understood. Depending on the level of detail in the classification (i.e., the classification scheme), collecting reference data can be a very difficult task. For example, in a simple classification scheme the required level of detail may be only to distinguish residential from commercial areas. Collecting reference data may be as simple as obtaining a county zoning map. However, a more complex forest classification scheme may involve collecting reference data for not only species of tree, but size class and crown closure as well. Size class involves measuring the diameters of trees, and therefore a great many trees may have to be measured to estimate the size class for each pixel. Crown closure is even more difficult to measure. Therefore, in this case, collecting accurate reference data can be difficult.

A traditional solution to this problem has been for the producer and user of the classification to assume that some reference data set is correct. For example, the results of some photointerpretation or aerial reconnaissance may be used as the reference data. However, errors in the interpretation would then be blamed on the digital classification, thereby wrongly lowering the digital classification accuracy. It is exactly this problem that has caused the lack of acceptance of digital satellite data for many applications. Although no reference data set may be completely accurate, it is important that the reference data have high accuracy or else it is not a fair assessment. Therefore, it is critical that the ground or reference data collection be carefully considered in any accuracy assessment. Much work is yet to be done to determine the proper level of effort and collection techniques necessary to provide this vital information.

Classification Scheme

When planning a project involving remotely sensed data, it is very important that sufficient effort be given to the classification scheme to be used. In many instances, this scheme is an existing one such as the Anderson classification system (Anderson et al., 1976). In other cases, the classification scheme is dictated by the objectives of the project or by the specifications of the contract. In all situations a few simple guidelines should be followed. First of all, any classification scheme should be mutually exclusive and totally exhaustive. In other words, any area to be classified should fall into one and only one category or class. In addition, every area should be included in the classification. Finally, if possible, it is very advantageous to use a classification scheme that is hierarchical in nature. If such a scheme is used, certain categories within the classification scheme can be collapsed to form more general categories. This ability is especially important when trying to meet predetermined accuracy standards. Two or more detailed categories of lower than the minimum required accuracy may need to be grouped together (collapsed) to form a more general category that exceeds the minimum accuracy requirement. For example, it may be impossible to separate interior live oak from canyon live oak. Therefore, these two categories may have to be collapsed to form a live oak category to meet the required accuracy standard.

Because the classification scheme is so important, no work should begin on the remotely sensed data until the scheme has been thoroughly reviewed and as many problems as possible identified. It is especially helpful if the categories in the scheme can be logically explained. The difference between Douglas fir and Ponderosa pine is easy to understand; however, the difference between Density Class 3 (50-70% crown closure) and Density Class 4 (>70% crown closure) may not be. In fact, many times these classes are rather artificial, and one can expect to find confusion between a forest stand with a crown closure of 67% that belongs in Class 3 and a stand of 73% that belongs in Class 4. Sometimes there is little that can be done about the artificial delineations in the classification scheme; other times the scheme can be modified to better represent natural breaks. However, failure to try to understand the classification

scheme from the very beginning will certainly result in a great loss of time and much frustration in the end.

Spatial Autocorrelation

Spatial autocorrelation is said to occur when the presence, absence, or degree of a certain characteristic affects the presence, absence, or degree of the same characteristic in neighboring units (Cliff and Ord, 1973). This condition is particularly important in accuracy assessment if an error in a certain location can be found to positively or negatively influence errors in surrounding locations (Campbell, 1981). Work by Congalton (1988a) on Landsat MSS data from three areas of varying spatial diversity (i.e., an agriculture, a range, and a forest site) showed a positive influence as much as 30 pixels (over 1 mile) away. These results are explainable in an agricultural environment, where field sizes are large and a typical misclassification would be to make an error in labeling the entire field. However, these results are more surprising for the rangeland and forested sites. Surely these results should affect the sample size and especially the sampling scheme used in accuracy assessment, especially in the way this autocorrelation affects the assumption of sample independence. This autocorrelation may then be responsible for periodicity in the data that could affect the results of any type of systematic sample. In addition, the size of the cluster used in cluster sampling would also be affected, because each new pixel would not be contributing independent information.

Sample Size

Sample size is another important consideration when assessing the accuracy of remotely sensed data. Each sample point collected is expensive, and therefore sample size must be kept to a minimum; yet it is critical to maintain a large enough sample size so that any analysis performed is statistically valid. Of all the considerations discussed in this paper, the most has probably been written about sample size. Many researchers, notably van Genderen and Lock (1977), Hay (1979), Hord and Brooner (1976), Rosenfield et al. (1982), and Congalton (1988b), have published equations and guidelines for choosing the appropriate sample size. The majority of researchers have used an equation based on the binomial distribution or the normal approximation to the binomial distribution to compute the required sample size. These techniques are statistically sound for computing the sample size needed to compute the overall accuracy of a classification or even the overall accuracy of a single category. The equations are based on the proportion of correctly classified samples (pixels, clusters, or polygons) and on some allowable error. However, these techniques were not designed to choose a sample size for filling in an error matrix. In the case of an error matrix, it is not simply a matter of correct or incorrect. It is a matter of which error or, in other words, which categories are being confused. Sufficient samples must be acquired to be able to adequately represent this confusion. Therefore, the use of these techniques for determining the sample size for an error matrix is not appropriate. Fitzpatrick-Lins (1981) used the normal approximation equation to compute the sample size for assessing a land use/land cover map of Tampa, Florida. The results of the computation showed that 319 samples needed to be taken for a classification with an expected accuracy of 85% and an allowable error of 4%. She ended up taking 354 samples and filling in an error matrix that had 30 categories in it (i.e., a matrix of 30 rows x 30 columns, or 900 possible cells). Although this sample size is sufficient for computing overall accuracy, it is obviously much too small to be represented in a matrix. Only 35 of the 900 cells had a value greater than zero. Other researchers have used the equation to compute the sample size for each category. Although resulting in a larger sample, the equation still does not account for the confusion between categories.

Because of the large number of pixels in a remotely sensed image, traditional thinking about sampling does not often apply. Even a one-half percent sample of a single Thematic Mapper scene can be over 300,000 pixels. Not all assessments are performed on a per pixel basis, but the same relative argument holds true if the sample unit is a cluster of pixels or a polygon. Therefore, practical considerations more often dictate the sample size selection. A balance between what is statistically sound and what is practically attainable must be found. It has been my experience that a good rule of thumb seems to be collecting a minimum of 50 samples for each vegetation or land use category in the error matrix. If the area is especially large (i.e., more than a million acres) or the classification has a large number of vegetation or land use categories (i.e., more than 12 categories), the minimum number of samples should be increased to 75 or 100 samples per category.
44 Congalton
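The normal-approximation computation used by Fitzpatrick-Lins (1981) can be sketched as follows. This is an illustrative reconstruction, not code from the paper; the function name is mine, and the choice of a standard normal deviate Z = 2 (roughly 95% confidence) is an assumption that happens to reproduce the published figure of 319 samples.

```python
import math

def binomial_sample_size(expected_accuracy, allowable_error, z=2.0):
    """Normal approximation to the binomial: n = Z^2 * p * (1 - p) / E^2,
    where p is the expected proportion correct and E the allowable error."""
    p = expected_accuracy
    n = (z ** 2) * p * (1.0 - p) / allowable_error ** 2
    return math.ceil(n)  # round up to a whole number of samples

# Tampa, Florida example: 85% expected accuracy, 4% allowable error.
print(binomial_sample_size(0.85, 0.04))  # 319
```

Note that this figure says nothing about how those samples spread across the cells of a 30 × 30 error matrix, which is exactly the objection raised above.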

If the area is especially large (i.e., more than a million acres), or the classification has a large number of vegetation or land use categories (i.e., more than 12 categories), the minimum number of samples should be increased to 75 or 100 samples per category. The number of samples for each category can also be adjusted based on the relative importance of that category within the objectives of the mapping, or on the inherent variability within each of the categories. Sometimes it is better to concentrate the sampling on the categories of interest, increasing their number of samples while reducing the number of samples taken in the less important categories. It may also be useful to take fewer samples in categories that show little variability, such as water or forest plantations, and to increase the sampling in categories that are more variable, such as uneven-aged forests or riparian areas. Again, the object here is to balance the statistical recommendations for obtaining an adequate sample to generate an appropriate error matrix against the time, cost, and practical limitations associated with any viable remote sensing project.
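One way to implement the kind of adjustment described above is to weight a fixed sampling budget by category importance and variability. The scheme below is purely illustrative: the function name, the 1-3 rating scales, and the multiplicative weighting are my assumptions, not a procedure from the paper.

```python
def allocate_samples(budget, categories, minimum=50):
    """Split a total sample budget across categories in proportion to
    importance x variability, never dropping below a per-category minimum."""
    weights = {name: imp * var for name, (imp, var) in categories.items()}
    total = sum(weights.values())
    # Note: the per-category minimum can push the grand total above the budget.
    return {name: max(minimum, round(budget * w / total))
            for name, w in weights.items()}

# Hypothetical categories rated (importance, variability) on 1-3 scales.
categories = {
    "water": (1, 1),               # stable, low-priority
    "forest plantation": (1, 1),   # stable, low-priority
    "uneven-aged forest": (3, 3),  # variable, high-priority
    "riparian": (3, 3),            # variable, high-priority
}
print(allocate_samples(400, categories))
```

The 50-sample floor matches the rule of thumb given earlier, while categories judged both important and variable receive proportionally more of the budget.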
Sampling Scheme

In addition to the considerations already discussed, the sampling scheme is an important part of any accuracy assessment. Selection of the proper scheme is absolutely critical to generating an error matrix that is representative of the entire classified image. A poor choice of sampling scheme can introduce significant biases into the error matrix, which may over- or underestimate the true accuracy. In addition, use of the proper sampling scheme may be essential, depending on the analysis techniques to be applied to the error matrix.

Many researchers have expressed opinions about the proper sampling scheme to use (e.g., Hord and Brooner, 1976; Ginevan, 1979; Rhode, 1978; Fitzpatrick-Lins, 1981). These opinions vary greatly among researchers and include everything from simple random sampling to stratified systematic unaligned sampling. Despite all these opinions, very little work has actually been performed in this area. Congalton (1988b) performed sampling simulations on three spatially diverse areas and concluded that in all cases simple random sampling without replacement and stratified random sampling provided satisfactory results. Despite the nice statistical properties of simple random sampling, this scheme is not always practical to apply. Simple random sampling tends to undersample small but possibly very important areas unless the sample size is significantly increased. For this reason, stratified random sampling, in which a minimum number of samples is selected from each stratum (i.e., category), is recommended. Even stratified random sampling can be somewhat impractical, because ground information for the accuracy assessment must be collected at random locations on the ground. The problems with random locations are that they can be in places with very difficult access, and that they can be selected only after the classification has been performed. This limits the accuracy assessment data to being collected late in the project instead of in conjunction with the training data collection, thereby increasing the costs of the project. In addition, in some projects the time between the project beginning and the accuracy assessment may be so long as to cause temporal problems in collecting ground reference data. In other words, the ground may change (i.e., the forest may be harvested) between the time the project is started and the accuracy assessment is begun.

Therefore, some systematic approach would certainly help make this ground collection effort more efficient by making it easier to locate the points on the ground and by allowing data to be collected simultaneously for training and assessment. However, the results of Congalton (1988a) showed that periodicity in the errors, as measured by the autocorrelation analysis, could make the use of systematic sampling risky for accuracy assessment. Therefore, perhaps some combination of random and systematic sampling would provide the best balance between statistical validity and practical application. Such a system might employ systematic sampling to collect some assessment data early in a project, while random sampling within strata would be used after the classification is completed to assure that enough samples were collected for each category and to minimize any periodicity in the data.
In addition to the sampling schemes already discussed, cluster sampling has also been frequently used in assessing the accuracy of remotely sensed data, especially to collect information on many pixels very quickly. However, cluster sampling must be used intelligently. Simply using very large clusters is not a valid method of collecting data, because the pixels in a cluster are not independent of one another and each adds very little information to the cluster. Congalton (1988b) recommended that clusters no larger than 10 pixels, and certainly no larger than 25 pixels, be used, because of the lack of information added by each pixel beyond these cluster sizes.

Finally, some analytic techniques assume that certain sampling schemes were used to obtain the data. For example, use of the Kappa analysis assumes a multinomial sampling model. Only simple random sampling completely satisfies this assumption. The effect of using one of the other sampling schemes discussed here is unknown. An interesting project would be to test the effect on the Kappa analysis of using a sampling scheme other than simple random sampling. If the effect is found to be small, then the scheme may be appropriate to use within the conditions discussed above. If the effect is found to be large, then that sampling scheme should not be used to perform Kappa analysis. To conclude that some sampling schemes can be used for descriptive techniques and others for analytical techniques seems impractical. Accuracy assessment is expensive, and no one is going to collect data for only descriptive use. Eventually, someone will use that matrix for some analytical technique.
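Because the Kappa analysis mentioned above operates directly on the error matrix, a compact sketch of the KHAT computation may be useful. The formula follows Cohen (1960) as discussed by Hudson and Ramm (1987), both cited in the references; the matrix values are invented for illustration.

```python
def khat(matrix):
    """KHAT estimate of kappa from a square error matrix
    (rows = map categories, columns = reference categories)."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(k))  # diagonal (agreement) total
    chance = sum(sum(matrix[i]) * sum(row[i] for row in matrix)
                 for i in range(k))  # sum of (row total x column total)
    return (n * observed - chance) / (n * n - chance)

# Hypothetical 3-category error matrix (n = 279 samples).
m = [[65, 4, 22],
     [6, 81, 5],
     [0, 11, 85]]
print(round(khat(m), 3))  # 0.741
```

For this matrix the overall accuracy is 231/279, about 0.83; the lower KHAT value reflects the removal of the agreement expected by chance.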
CONCLUSIONS

This paper has reviewed the factors and techniques to be considered when assessing the accuracy of classifications of remotely sensed data. The work has really just begun. The factors discussed here are certainly not fully understood. The basic issues of sample size and sampling scheme have not been resolved. Spatial autocorrelation analysis has rarely been applied to any study. Exactly what constitutes ground or reference data, and the level of effort needed to collect it, must be studied. Research needs to continue in order to balance what is statistically valid with what is practically applicable. This need becomes increasingly important as techniques are developed to use remotely sensed data over large regional and global domains. What is valid and practical over a small area may not apply to regional or global projects. Up to now, the little experience we have has been on relatively small remote sensing projects. However, there is a need to use remote sensing for much larger projects such as monitoring global warming, deforestation, and environmental degradation. We do not know all the problems that will arise when dealing with such large areas. Certainly, the techniques described must be extended and refined to better meet these assessment needs. It is critical that this work and the use of quantitative analysis of remotely sensed data continue. We have suffered too long because of the oversell of the technology and the underutilization of quantitative analysis early in the digital remote sensing era. Papers such as Meyer and Werth (1990), which state that digital remote sensing is not a viable tool for most resource applications, continue to demonstrate the problems we have created by not quantitatively documenting our work. We must put aside the days of a casual assessment of our classifications. "It looks good" is not a valid accuracy statement. A classification is not complete until it has been assessed. Then and only then can the decisions made based on that information have any validity.

In addition, we must not forget that remotely sensed data is just a small subset of the spatial data currently being used in geographic information systems (GIS). The techniques and considerations discussed here need to be applied to all spatial data. Techniques developed for other spatial data need to be tested for use with remotely sensed data. The work has just begun, and if we are going to use spatial data to help us make decisions, and we should, then we must know about the accuracy of this information.

The author would like to thank Greg Biging and Craig Olson for their helpful reviews of this paper. Thanks also to the two anonymous reviewers whose comments significantly improved this manuscript.

REFERENCES

Anderson, J. R., Hardy, E. E., Roach, J. T., and Witmer, R. E. (1976), A land use and land cover classification system for use with remote sensor data, U.S. Geol. Survey Prof. Paper 964, 28 pp.

Aronoff, S. (1985), The minimum accuracy value as an index of classification accuracy, Photogramm. Eng. Remote Sens. 51(1):99-111.
Biging, G., and Congalton, R. (1989), Advances in forest inventory using advanced digital imagery, in Proceedings of Global Natural Resource Monitoring and Assessments: Preparing for the 21st Century, Venice, Italy, September, Vol. 3, pp. 1241-1249.

Bishop, Y., Fienberg, S., and Holland, P. (1975), Discrete Multivariate Analysis: Theory and Practice, MIT Press, Cambridge, MA, 575 pp.

Campbell, J. (1981), Spatial autocorrelation effects upon the accuracy of supervised classification of land cover, Photogramm. Eng. Remote Sens. 47(3):355-363.

Campbell, J. (1987), Introduction to Remote Sensing, Guilford Press, New York, 551 pp.

Chuvieco, E., and Congalton, R. (1988), Using cluster analysis to improve the selection of training statistics in classifying remotely sensed data, Photogramm. Eng. Remote Sens. 54(9):1275-1281.

Cliff, A. D., and Ord, J. K. (1973), Spatial Autocorrelation, Pion, London, 178 pp.

Cohen, J. (1960), A coefficient of agreement for nominal scales, Educ. Psychol. Measurement 20(1):37-46.

Congalton, R. G. (1988a), Using spatial autocorrelation analysis to explore errors in maps generated from remotely sensed data, Photogramm. Eng. Remote Sens. 54(5):587-592.

Congalton, R. G. (1988b), A comparison of sampling schemes used in generating error matrices for assessing the accuracy of maps generated from remotely sensed data, Photogramm. Eng. Remote Sens. 54(5):593-600.

Congalton, R. G., Oderwald, R. G., and Mead, R. A. (1983), Assessing Landsat classification accuracy using discrete multivariate statistical techniques, Photogramm. Eng. Remote Sens. 49(12):1671-1678.

Fitzpatrick-Lins, K. (1981), Comparison of sampling procedures and data analysis for a land-use and land-cover map, Photogramm. Eng. Remote Sens. 47(3):343-351.

Ginevan, M. E. (1979), Testing land-use map accuracy: another look, Photogramm. Eng. Remote Sens. 45(10):1371-1377.

Hay, A. M. (1979), Sampling designs to test land-use map accuracy, Photogramm. Eng. Remote Sens. 45(4):529-533.

Hord, R. M., and Brooner, W. (1976), Land use map accuracy criteria, Photogramm. Eng. Remote Sens. 42(5):671-677.

Hudson, W., and Ramm, C. (1987), Correct formulation of the kappa coefficient of agreement, Photogramm. Eng. Remote Sens. 53(4):421-422.

Meyer, M., and Werth, L. (1990), Satellite data: management panacea or potential problem?, J. Forestry 88(9):10-13.

Rhode, W. G. (1978), Digital image analysis techniques for natural resource inventory, in National Computer Conference Proceedings, pp. 43-106.

Rosenfield, G. (1981), Analysis of variance of thematic mapping experiment data, Photogramm. Eng. Remote Sens. 47(12):1685-1692.

Rosenfield, G., and Fitzpatrick-Lins, K. (1986), A coefficient of agreement as a measure of thematic classification accuracy, Photogramm. Eng. Remote Sens. 52(2):223-227.

Rosenfield, G. H., Fitzpatrick-Lins, K., and Ling, H. (1982), Sampling for thematic map accuracy testing, Photogramm. Eng. Remote Sens. 48(1):131-137.

Skidmore, A., and Turner, B. (1989), Assessing the accuracy of resource inventory maps, in Proceedings of Global Natural Resource Monitoring and Assessments: Preparing for the 21st Century, Venice, Italy, September, Vol. 2, pp. 524-535.

Story, M., and Congalton, R. (1986), Accuracy assessment: a user's perspective, Photogramm. Eng. Remote Sens. 52(3):397-399.

van Genderen, J. L., and Lock, B. F. (1977), Testing land use map accuracy, Photogramm. Eng. Remote Sens. 43(9):1135-1137.
