DATA QUALITY IN GIS
When using a GIS to analyse spatial data, there is sometimes a tendency to assume that all data, both locational and
attribute, are completely accurate. This of course is never the case in reality. Whilst some steps can be taken to
reduce the impact of certain types of error, they can never be completely eliminated. Generally speaking, the greater
the degree of error in the data, the less reliable are the results of analyses based upon that data. This is sometimes
referred to as GIGO (Garbage In Garbage Out). There is obviously a need to be aware of the limitations of the data
and the implications this may have for subsequent analyses.
We will begin by looking at some of the terminology used to describe data quality in GIS. We will then look at some
sources of error in GIS data, before looking at how errors can be modelled. Following that we will look at the role of
metadata.
TERMINOLOGY
A specialised vocabulary is used to describe data quality in GIS. We will begin with a quick review of some of the
more important concepts.
The terms data quality and error are used in a fairly loose, but common sense, sort of way. Data quality refers to
how good the data are. An error is a departure from the correct data value. Data containing a lot of errors are
obviously poor in quality.
A distinction is usually made between accuracy and precision. Accuracy is the extent to which a measured data
value approaches its true value. No dataset is 100 per cent accurate. Accuracy could be quantified using tolerance
bands - i.e. the distance between two points might be given as 173 metres plus or minus 2 metres. These bands are
generally expressed in probabilistic terms (i.e. 173 metres plus or minus 2 metres with 95 per cent confidence).
Precision refers to the recorded level of detail. A distance recorded as 173.345 metres is more precise than if it is
recorded as 173 metres. However, it is quite possible for data to be accurate (within a certain tolerance) without
being precise. It is also possible to be precise without being accurate. Indeed, data recorded with a high degree of
precision may give a misleading impression of accuracy. Data should not be recorded with a higher degree of
precision than their known accuracy. [1]
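The following short sketch (plain Python, with entirely hypothetical values) illustrates this last point: the recorded figure is rounded so that its precision does not overstate the stated tolerance.

```python
# Report a measurement at a precision consistent with its known accuracy.
# The distance and tolerance below are hypothetical values.
measured_distance = 173.34527   # metres, as returned by the instrument
tolerance = 2.0                 # metres, at 95 per cent confidence

# Recording all five decimal places would imply far more accuracy than we have,
# so round to the nearest metre, which is still finer than the +/- 2 m band.
reported = round(measured_distance)
print(f"{reported} m +/- {tolerance:.0f} m (95% confidence)")   # 173 m +/- 2 m (95% confidence)
```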
The term bias is used to refer to a consistent error. For example, if a map was accidentally moved during digitising,
all points digitised after the move will be displaced relative to those digitised before the move in a systematic manner
(i.e. by a fixed amount in a certain direction). As another example, all data values may be truncated by the software,
resulting in a lower degree of precision.
The above terms apply to both attribute and locational data. The terms resolution and generalisation refer only to the
locational data. Resolution refers to the size of the smallest features captured in the data. In raster mode this is a
function of the size of the raster cells. For example, if each cell covers an area of 20 metres by 20 metres on the
ground, then features smaller than this (e.g. free standing trees) will not be captured. If digitising in vector mode, the
resolution will be a function of the scale of the source map.
Generalisation refers to the degree of simplification when drawing a map. Maps are models of the real world rather
than miniaturisations - i.e. in order to display certain features clearly, cartographers have to eliminate various details
which would only tend to clutter the map. For example, lines with many twists and turns may be straightened out;
features which would be difficult to see at small scale if represented by polygons are represented as point features;
features which might be difficult to see if drawn at their true scale are exaggerated (e.g. the width of roads); etc.

[1] Precision is sometimes defined in a different manner to refer to the repeatability of measurements. Burrough and
McDonnell suggest accuracy defines the relationship of the measured data value to the true data value and can be
expressed statistically using the standard error. Precision defines the spread of values around the mean and can be
expressed as a standard deviation. This concept of precision is sometimes referred to as observational variance.
Currency introduces a time dimension and refers to the extent to which the data have gone past their 'sell by' date.
Administrative boundaries tend to exhibit a high degree of geographical inertia, but they are sometimes revised from
time to time. Other features may change location on a more frequent basis: rivers may follow a different channel
after flooding; roads may be straightened; the boundaries between different types of vegetation cover may change as
a result of deforestation or natural ecological processes; and so forth. The attribute data associated with spatial
features will also change with time. The metadata (see below) associated with a particular set of data should specify
the date of data capture.
Other data quality considerations include completeness, compatibility, consistency and applicability. Completeness
refers to the degree to which data are missing - i.e. a complete set of data covers the study area and time period in its
entirety. Sample data are by definition incomplete, so the main issue is the extent to which they provide a reliable
indication of the complete set of data.
The term compatibility indicates that it is reasonable to use two data sets together. Maps digitised from sources at
different scales may be incompatible. For example, although GIS provides the technology for overlaying coverages
digitised from maps at 1:10,000 and 1:250,000 scales, this would not be a very useful exercise due to differences in
accuracy, precision and generalisation.
To ensure compatibility, data sets should be developed using the same methods of data capture, storage,
manipulation and editing (collectively referred to as consistency). Inconsistencies may occur within a data set if the
data were digitised by different people or from different sources (e.g. different map sheets, possibly surveyed at
different times).
The term applicability refers to the suitability of a particular data set for a particular purpose. For example, attribute
data may become outdated and therefore unsuitable for modelling that particular attribute a few years later,
especially if the attribute is likely to have changed in the interim.
SOURCES OF ERROR
Data errors may originate from a large number of different sources. Identifying possible sources of error and taking
steps to reduce errors is largely a matter of common sense. The following therefore is only intended to provide an
indication of possible error sources rather than a comprehensive list of all possible errors.
Inaccuracies may arise with regard to space, time or attribute. Spatial inaccuracies arise if the co-ordinates used to
identify the location of an entity (i.e. point, line or polygon) or a data point used to interpolate field data are
measured or recorded incorrectly. Attribute errors arise if the attribute data for objects or the data values for sample
points used to interpolate a field are measured or recorded incorrectly. As noted above, attribute data values and
locational characteristics are likely to change over time, so it is good practice to record the date and time when the
data were collected. Inaccuracies may also arise in the recorded time, although for most types of GIS application
(except possibly for systems in which there is rapid change - e.g. weather systems) temporal errors are less critical
than the other types of error. They are therefore not considered further here.
Inaccuracies may occur at all stages in a GIS analysis. The following identifies some of the sources of error at each
stage.
Data Input Errors
The data for entry into a GIS may contain measurement inaccuracies. These may be primary or secondary. Primary
data acquisition errors occur during data capture or measurement. For example, if digitising data from a printed
map, the printed map may contain errors which will naturally be retained after conversion to a digital format.
Attribute data sources may also contain errors arising from problems with measurement instruments, sample bias,
errors in recording, coding errors, etc. Some measurement methods (e.g. surveying) are obviously more likely to be
accurate than others (e.g. interpretation of an air photo). Further errors, referred to as secondary data acquisition
errors, may be introduced subsequently during the process of entering the data into the GIS - e.g. digitising errors,
typing errors, etc.
Locational Data
The capture of locational data for entities (e.g. by digitising from a paper map) can result in numerous errors. ESRI
suggests a useful checklist of objectives when capturing data in vector mode:
1. All entities that should have been entered are present.
2. No extra entities have been digitised.
3. The entities are in the right place and are of the correct shape and size.
4. All entities that are supposed to be connected to each other are.
5. All polygons have only a single label point to identify them.
6. All entities are within the outside boundary identified with registration marks.
This provides a good indication of the types of problem that might arise. Entities (i.e. points, lines, polygons) may
simply be overlooked when digitising, or may be entered more than once. An arc missing between two nodes may
result in two polygons being captured as a single polygon. An arc inadvertently digitised twice may result in a sliver
line. Vertices inaccurately digitised may result in lines having the wrong shape or, if the vertex in question is a node,
may result in a dangling node. The dangling node may either undershoot its correct location, resulting in a gap, or
it may overshoot its intended location, resulting in a cul-de-sac (and an intersection not identified as a node).
Vertices digitised in the wrong sequence may result in weird polygons or a polygonal knot. If digitising polygons
then it is obviously important to have the correct number of label points in the correct locations. Too few label
points will result in some polygons not having associated attribute data, whilst too many label points may result in a
polygon having the wrong attribute data associated with it.
Digitising errors do not necessarily indicate a lack of accuracy when digitising points. They may also arise if the
snapping tolerance is incorrectly set. For example, dangling nodes frequently arise if the snapping tolerance is set too
low. However, if the snapping tolerance is set too high, nodes may be snapped to the wrong points. Apart from
causing lines to have the wrong shape, this could result in topological inconsistencies.
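The effect of the snapping tolerance can be illustrated with a small sketch (plain Python, hypothetical co-ordinates, and a simplified nearest-node rule rather than any particular package's algorithm):

```python
import math

def snap(endpoint, nodes, tolerance):
    """Snap an arc endpoint to the nearest existing node if it lies within
    the snapping tolerance; otherwise leave it where it was digitised."""
    nearest = min(nodes, key=lambda node: math.dist(endpoint, node))
    return nearest if math.dist(endpoint, nearest) <= tolerance else endpoint

existing_nodes = [(100.0, 200.0), (103.0, 200.0)]   # hypothetical junctions
digitised_end = (100.2, 200.1)   # meant to join the node at (100.0, 200.0)

print(snap(digitised_end, existing_nodes, tolerance=0.05))  # too low: no snap -> dangling node
print(snap(digitised_end, existing_nodes, tolerance=0.5))   # reasonable: snaps to (100.0, 200.0)

# With an excessive tolerance even a point digitised well away from its intended
# junction - e.g. (101.8, 200.0), which is nearer to (103.0, 200.0) - still gets
# snapped somewhere, and it may be the wrong node entirely.
print(snap((101.8, 200.0), existing_nodes, tolerance=5.0))  # -> (103.0, 200.0)
```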
If the data are topologically encoded, then the digitising software can run a number of checks to identify potential
problems. For example, the software can check how many line segments enter each node. If only one line segment
enters a node then it can be identified as a dangling node. If two line segments enter a node then it is referred to as a
pseudo node. Both situations can be flagged as potential problems. However, dangling nodes may reflect genuine
cul-de-sacs in a road system, or the sources of tributaries in a river system; whilst a pseudo node may identify a
polygon completely enclosed within another polygon (e.g. a lake) or a change in attribute along a line (e.g. single
lane road to dual carriageway). The first situation is sometimes referred to as a spatial pseudo node and the second
as an attribute pseudo node.
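A check of this kind can be sketched in a few lines (plain Python, hypothetical arc list; real GIS software works on the full topological data structure). It counts how many arcs meet at each node and flags dangling and pseudo nodes for inspection.

```python
from collections import Counter

# Each arc is recorded by its from-node and to-node identifiers.
# Hypothetical data: A and E are dangling nodes, D is a pseudo node.
arcs = [("A", "B"), ("B", "C"), ("B", "D"), ("D", "C"), ("C", "E")]

degree = Counter()
for from_node, to_node in arcs:
    degree[from_node] += 1
    degree[to_node] += 1

for node, n_arcs in sorted(degree.items()):
    if n_arcs == 1:
        print(f"{node}: dangling node (undershoot/overshoot, or a genuine cul-de-sac?)")
    elif n_arcs == 2:
        print(f"{node}: pseudo node (an island polygon, or an attribute change along the line?)")
```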
Attribute Data
Errors in the attribute data may be caused either by primary or secondary data acquisition errors. Primary data
acquisition errors occur during measurement. Most secondary data acquisition errors are simply a result of typing
mistakes. For example, numbers may be entered wrongly or names may be spelt wrongly. Spelling mistakes in a field
used to join the attribute table to the spatial features may result in those features not being associated with the correct
attribute data. Missing attribute data (for whatever reason) will also cause fairly obvious problems.
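A simple pre-join check catches many of these problems. The sketch below (plain Python, hypothetical identifiers) compares the key field in the attribute table with the identifiers attached to the spatial features and lists the values that will fail to match.

```python
# Keys attached to the digitised features and keys in the attribute table.
# Hypothetical values; "Co. Galwy" (misspelling) and "Co. Laois " (trailing
# space) would silently fail an exact join.
feature_keys = {"Co. Carlow", "Co. Galway", "Co. Laois"}
attribute_keys = {"Co. Carlow", "Co. Galwy", "Co. Laois "}

unmatched_features = feature_keys - attribute_keys
unmatched_attributes = attribute_keys - feature_keys

print("Features with no attribute record:", sorted(unmatched_features))
print("Attribute records with no feature:", sorted(unmatched_attributes))
```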
Data Processing Errors
Further errors may be introduced during data processing. For example, if converting data from raster to vector mode,
vector mode lines which should be straight may take on a stepped appearance. There are various smoothing
algorithms which may be used to smooth out angular lines, but there is no way of knowing whether the smoothed
lines are actually any more accurate - the net effect of smoothing the lines may be to introduce further errors by
making them artificially smooth. Vector to raster conversions may result in topological errors being introduced or
even in the creation or loss of small polygons. Raster coverages created from the same vector coverage will tend to
vary depending upon relatively arbitrary decisions about cell size, the orientation of the raster and the location of the
origin.
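The sensitivity to cell size and origin can be seen from a very small sketch (plain Python, hypothetical co-ordinates, with a single point-to-cell assignment standing in for a full vector to raster conversion): the same point falls into different cells depending on those choices.

```python
import math

def cell_index(x, y, origin_x, origin_y, cell_size):
    """Return the (column, row) of the raster cell containing the point."""
    return (math.floor((x - origin_x) / cell_size),
            math.floor((y - origin_y) / cell_size))

point = (137.0, 249.0)   # hypothetical co-ordinates in metres

print(cell_index(*point, origin_x=0.0,  origin_y=0.0,  cell_size=20.0))  # (6, 12)
print(cell_index(*point, origin_x=10.0, origin_y=10.0, cell_size=20.0))  # (6, 11)
print(cell_index(*point, origin_x=0.0,  origin_y=0.0,  cell_size=25.0))  # (5, 9)
```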
Interpolation of data values in a continuous field from sample points will result in different values depending upon
the choice of method of interpolation and other decisions made with regard to the parameters used. The number of
sample points will also have a fairly obvious influence upon the reliability of the resulting estimates. When analysing
field data it is therefore necessary to bear in mind that the estimated data values are not necessarily accurate.
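The dependence on the method and its parameters can be illustrated with inverse distance weighting. The sketch below (plain Python, hypothetical sample points) estimates a value at the same location with two different distance-decay exponents and obtains two different answers.

```python
import math

def idw(x, y, samples, power=2.0):
    """Inverse distance weighted estimate at (x, y) from (x, y, value) samples."""
    num = den = 0.0
    for sx, sy, value in samples:
        d = math.dist((x, y), (sx, sy))
        if d == 0:
            return value            # exactly on a sample point
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# Hypothetical elevation samples (x, y, value in metres).
samples = [(0, 0, 100.0), (10, 0, 120.0), (0, 10, 90.0), (10, 10, 150.0)]

print(round(idw(3, 4, samples, power=1.0), 1))
print(round(idw(3, 4, samples, power=3.0), 1))   # same point, different parameter, different estimate
```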
It is important to realise that computers may introduce errors when processing data due to limitations placed upon
the precision of numbers arising from the way in which they are stored in a computer. When working with numbers
requiring a large number of significant digits, calculations done by computers may result in a high degree of
inaccuracy. This problem is becoming less serious with the availability of 32-bit and 64-bit machines, provided that
the software has been programmed to take advantage of the extra precision. If you are working with high precision
numbers you should confirm that both the computer and the software can support the degree of precision required
because the implications may be more serious than simply rounding numbers to a smaller number of significant
digits.
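A quick way to see the problem is to hold a large co-ordinate in single precision. The sketch below (using numpy, assumed to be available) stores the same easting as a 32-bit and as a 64-bit floating point number; in single precision a one centimetre shift is rounded away entirely.

```python
import numpy as np

easting = 654321.87   # hypothetical easting in metres
shift = 0.01          # a one centimetre adjustment

# float32 carries only about seven significant digits, so for a co-ordinate of
# this magnitude the centimetre shift is lost; float64 preserves it.
print(np.float32(easting) + np.float32(shift) - np.float32(easting))   # 0.0
print(np.float64(easting) + np.float64(shift) - np.float64(easting))   # ~0.01
```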
Finally, use errors may arise from simply using inappropriate tools for a particular type of analysis.
Data Display Errors
The display of data may also introduce errors. For example, the display of raster data on a vector mode device (e.g. a
plotter) or the display of vector data on a raster device (e.g. a monitor or a printer) will generally introduce other
small inaccuracies due to the need to round off during the conversion from one mode to the other. These errors can
probably be ignored for practical purposes, but they serve as a reminder that errors can creep in at all stages in a GIS
analysis.
MODELLING DATA ERRORS
Apart from recognising that errors are likely and then taking whatever steps one can to minimise them, what else can
be done? The treatment of errors in GIS has received relatively little attention, especially in commercial applications
software, but there have been some tentative steps towards using quantitative measures of errors to provide some
indication of the reliability of data in a GIS.
Attribute Errors
Measurement errors in the attribute data can be modelled using conventional statistical techniques. For example, if
the measurement errors can be assumed to be normally distributed with a mean of zero, then one can calculate the
standard error and use it to place confidence limits on the data values.
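As a minimal sketch (plain Python, hypothetical measurements), the standard error of the mean can be used to attach approximate 95 per cent confidence limits to a measured attribute value, assuming normally distributed, unbiased errors.

```python
import statistics

# Hypothetical repeated measurements of the same attribute (e.g. soil pH).
measurements = [6.8, 7.1, 6.9, 7.0, 7.2, 6.7, 7.0, 6.9]

mean = statistics.mean(measurements)
standard_error = statistics.stdev(measurements) / len(measurements) ** 0.5

# Roughly 95 per cent confidence limits lie about 1.96 standard errors
# either side of the mean.
lower, upper = mean - 1.96 * standard_error, mean + 1.96 * standard_error
print(f"{mean:.2f} (95% confidence limits {lower:.2f} to {upper:.2f})")
```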
If the attribute data refer to sample points, then it may be necessary to interpolate the values of intervening points.
Kriging provides an estimate of the variance of the interpolated values.
If the attribute data are non-numerical categorical data (e.g. landuse types) then it may be possible to calculate a
misclassification matrix (also known as a confusion matrix or an error matrix). The rows in this matrix represent
the various categories as measured and the columns represent the correct categories. For example, the rows may
represent the categories in a landuse classification based on an analysis of satellite images, while the columns may
correspond to the categories in a classification based on ground truthing. The data values in the matrix would
indicate the number of cells in a raster image falling into each combination of categories. Once the table is
constructed, it is a simple matter to calculate what percentage of cells in each landuse category would be correctly
classified from the satellite imagery. These calculations may then be applied to other images.
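The calculation is straightforward. The sketch below (plain Python, hypothetical cell counts) uses rows for the categories assigned from the satellite image and columns for the ground-truth categories, and reports the overall accuracy together with the percentage of each true category that was correctly classified.

```python
# Hypothetical misclassification (confusion) matrix: rows = category assigned
# from the satellite image, columns = correct category from ground truthing.
categories = ["forest", "arable", "urban"]
matrix = [
    [42,  3,  1],   # classified as forest
    [ 5, 55,  4],   # classified as arable
    [ 1,  2, 37],   # classified as urban
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(categories)))
print(f"overall accuracy: {100 * correct / total:.1f}%")

# Percentage of the cells in each true (ground-truth) category that the
# imagery classified correctly (the column-wise accuracy).
for j, name in enumerate(categories):
    column_total = sum(matrix[i][j] for i in range(len(categories)))
    print(f"{name}: {100 * matrix[j][j] / column_total:.1f}% correctly classified")
```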
Positional Error Models
Positional error models represent an attempt to place confidence bands around locational features. It is assumed that
if the x co-ordinate of a point (or vertex) was measured repeatedly then the observed x co-ordinates would have a
Normal (i.e. Gaussian) distribution with an expected value (or mean) corresponding to the true value. 68 per cent of
the observed co-ordinates would be within one standard error of the mean, and 90 per cent would be within 1.65
standard errors. Having established the standard error for one point by experiment, the expected error associated
with other points could be expressed using probabilistic confidence bands.
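These percentages are easy to verify by simulation. The sketch below (Python's random module, hypothetical standard error) draws repeated measurements of a single x co-ordinate and counts how many fall within 1 and 1.65 standard errors of the true value.

```python
import random

random.seed(1)
true_x = 500.0          # hypothetical true co-ordinate
standard_error = 0.8    # hypothetical standard error in metres
n = 100_000

observations = [random.gauss(true_x, standard_error) for _ in range(n)]

within_1se = sum(abs(x - true_x) <= 1.0 * standard_error for x in observations) / n
within_1_65se = sum(abs(x - true_x) <= 1.65 * standard_error for x in observations) / n

print(f"within 1 standard error:     {within_1se:.1%}")     # roughly 68%
print(f"within 1.65 standard errors: {within_1_65se:.1%}")  # roughly 90%
```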
There are a number of assumptions implicit in the choice of a Normal distribution to model measurement errors (e.g.
it is assumed that the errors are unbiased; it is assumed that the probabilities associated with errors of differing
magnitude form a continuum, etc.). If these assumptions are unrealistic, then other statistical distributions may be
preferable. However, the basic approach is much the same.
Point Data
When recording the spatial location of a point we record two co-ordinates (i.e. x and y). If the errors associated with
each are assumed to be normally distributed, then the probability distribution will be a bell-shaped surface, declining
at the same rate in all directions. [2] The standard error of this surface is called the circular standard error (CSE).
39.35 per cent of all points can be expected to lie within a circle with a radius of 1 CSE centred on the mean. 90 per
cent of the points should be within 2.146 CSE. One way to define the accuracy of a map is to specify a circular map
accuracy standard (e.g. 2.146 CSE, meaning that 90 per cent of all the observed data points will be within this
distance of their true locations).

[2] This can be thought of as a bell-shaped curve rotated through 360 degrees around its vertical axis.
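These figures follow from the circular Normal model: the distance of an observed point from its true location then has a Rayleigh distribution, so the proportion of points within a radius of k CSE is 1 - exp(-k^2/2). A short check (plain Python) reproduces the 39.35 per cent and 2.146 CSE values quoted above.

```python
import math

def proportion_within(k):
    """Proportion of points within k circular standard errors of the true
    location under the circular Normal (Rayleigh distance) model."""
    return 1.0 - math.exp(-0.5 * k * k)

print(f"{proportion_within(1.0):.4f}")     # 0.3935 -> 39.35% within 1 CSE
print(f"{proportion_within(2.146):.4f}")   # about 0.90 -> 90% within 2.146 CSE

# Radius needed to capture 90 per cent of points, in CSE units:
print(f"{math.sqrt(-2.0 * math.log(1 - 0.90)):.3f}")   # about 2.146
```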
Line Data
The true location of each point on a line can be thought of as lying within a band on either side of a digitised line,
where the width of the band reflects the standard error. These bands are sometimes referred to as epsilon bands. The
original epsilon band was hypothesised as rectangular in cross-section, but a Normal distribution is more frequently
assumed. However, given that each point in a line is not independent of the points which precede and follow it, it
seems plausible that the digitised points forming a sequence will tend to be displaced in the same direction (i.e. there
will be a bias, and the true distribution of the errors will be skewed). Some investigators have suggested the cross-
sectional probability distribution should be bimodal.
Polygon Data
Similar principles apply to polygons. Information on the width of the epsilon bands can be used to place confidence
limits on point-in-polygon tests. As can be seen from the diagram, instead of points being classified as either inside
or outside a polygon, they can now be classed as being definitely in, definitely out, possibly in, possibly out or
ambiguous.
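A rough sketch of this classification is given below, assuming the shapely library is available, a single epsilon value for the whole boundary, and leaving out the ambiguous case for brevity.

```python
from shapely.geometry import Point, Polygon

def classify(point, polygon, epsilon):
    """Classify a point against a polygon whose boundary is only known to
    within an epsilon band (a single band width is assumed here)."""
    distance_to_boundary = polygon.exterior.distance(point)
    if polygon.contains(point):
        return "definitely in" if distance_to_boundary > epsilon else "possibly in"
    return "definitely out" if distance_to_boundary > epsilon else "possibly out"

# Hypothetical polygon and band width.
parcel = Polygon([(0, 0), (100, 0), (100, 60), (0, 60)])
epsilon = 5.0

for pt in [Point(50, 30), Point(50, 58), Point(50, 63), Point(50, 90)]:
    print(pt, classify(pt, parcel, epsilon))
```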
A Monte Carlo approach can be used to model errors. This basically involves adding a random noise factor (which
can be positive or negative) to each co-ordinate for each point before performing whatever GIS operation you need
to do. The results are saved, and the whole process is repeated a large number of times (e.g. 100 times). The
accumulated results can be used to calculate confidence limits for numerical answers or to draw confidence bands
around features on output maps. The main problem with a Monte Carlo approach is that it requires a lot of computer
resources.
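A minimal Monte Carlo sketch along these lines (plain Python, with a hypothetical polygon and error level, and an area calculation standing in for the GIS operation) perturbs each vertex with Normally distributed noise, recomputes the area each time, and reports an approximate 90 per cent confidence interval.

```python
import random

def shoelace_area(vertices):
    """Area of a simple polygon given its vertices in order (shoelace formula)."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

random.seed(42)
polygon = [(0.0, 0.0), (100.0, 0.0), (100.0, 60.0), (0.0, 60.0)]   # hypothetical parcel
sigma = 1.5      # hypothetical standard error of each co-ordinate, in metres
n_runs = 1000

areas = []
for _ in range(n_runs):
    noisy = [(x + random.gauss(0, sigma), y + random.gauss(0, sigma)) for x, y in polygon]
    areas.append(shoelace_area(noisy))

areas.sort()
lower, upper = areas[int(0.05 * n_runs)], areas[int(0.95 * n_runs)]
print(f"true area 6000, 90% interval roughly {lower:.0f} to {upper:.0f} square metres")
```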
Burrough and McDonnell (Chapter 10) discuss a similar approach for evaluating the effects of measurement errors in
numerical models. They also discuss the statistical theory of error propagation. [3] Whilst mathematically more
challenging, this provides a computationally more efficient means of achieving the same objectives. The main
conclusion from their review of several case studies is that even relatively small measurement errors can have a
much greater impact than one might imagine. There is therefore an obvious need to develop better methods for
assessing data quality and its implications.
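By way of illustration (and not Burrough and McDonnell's own worked examples), the first-order error propagation formula for a product of two independently measured quantities gives a standard error directly, with no repeated simulation; the values below are hypothetical.

```python
import math

# Hypothetical field measurements: width and length of a parcel, each with
# an independent standard error.
width, sigma_w = 100.0, 1.5     # metres
length, sigma_l = 60.0, 1.5     # metres

area = width * length

# First-order (Gaussian) error propagation for a product of independent
# measurements: the relative variances add.
sigma_area = area * math.sqrt((sigma_w / width) ** 2 + (sigma_l / length) ** 2)

print(f"area = {area:.0f} +/- {sigma_area:.0f} square metres (1 standard error)")
```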
METADATA
The reliability of a particular set of data is dependent upon the uses to which it is put. Data which are completely
inappropriate in one context may be totally adequate in a different context (or vice versa). Data quality is therefore to
some extent a relative concept dependent upon the context. The emphasis has therefore tended to switch away from
simply trying to make the data as error free as possible to providing potential users with the information which they
require to make an informed decision about the adequacy of the data for a particular purpose. This information is
referred to as metadata.
Data exchange format: Data storage format.
Data summary: Data sources, areal coverage, classification used, date collected, scale, etc.
Lineage: Agency of origin, method of data collection, primary survey techniques, digitising method. Dates updated. Processing history: co-ordinate transformations, data model translations, attribute transformations.
Co-ordinate system: Type of co-ordinate system. Map projection parameters.
Spatial data model: Specification of primitive spatial objects. Topological data stored.
Feature coding system: Definition of feature codes and classification system.
Classification completeness: Documentation on the extent of usage of classification system.
Geographical coverage: Overall extent. Detailed specification of coverage if not complete.
Positional accuracy: Statistics on co-ordinate errors.
Attribute accuracy: Statistics on attribute errors.
Topological accuracy: Methods of topology validation employed.
Graphical representation: Graphical symbolism for each feature class. Text fonts for annotation.
Metadata is data about data. In the GIS context, each set of data should be accompanied by metadata explaining not
only what it contains but also how and when it was collected, together with details relating to its quality. The table above (from
Jones) indicates the type of information the metadata might include. There are now a number of international
standards for metadata (e.g. OGC, ISO).
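In practice this information is usually recorded in a structured, machine-readable form. The sketch below (plain Python, entirely hypothetical values) stores a minimal metadata record using headings from the table above; a production system would follow one of the formal standards (e.g. ISO 19115) rather than an ad hoc dictionary.

```python
import json

# A minimal, ad hoc metadata record using headings from the table above.
# All values are hypothetical.
metadata = {
    "data_summary": {"source": "1:10,000 topographic sheets", "date_collected": "1998-05"},
    "lineage": {"capture_method": "manual digitising", "processing": ["co-ordinate transformation"]},
    "coordinate_system": {"type": "projected", "projection": "Transverse Mercator"},
    "geographical_coverage": "complete for the study area",
    "positional_accuracy": {"rmse_metres": 4.2},
    "attribute_accuracy": {"overall_classification_accuracy": 0.89},
}

print(json.dumps(metadata, indent=2))
```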

[3] The term error propagation refers to the cumulative effect of errors upon the final results of the analysis.
