
Feature Layer Similarity Analysis

Overview

In order to process feature layers for predictive analysis, it is advantageous to remove
layers that function as homologs (duplicate layers). This concept is much more complex
than it appears, because spatial information can be represented in many forms and at
many resolutions and precisions.

The Problem

Having multiple layers that fundamentally capture the same characteristics in space (or
spatially linked characteristics) provides a harmonic reinforcement to the output of
any analytical process that operates over the similar layers. This reinforcement of
output signals can falsely elevate output values in areas of redundant
“signals” while masking subtle features in discrete and distinct areas.

Similarity of layers is composed of multiple concepts:


1. Spatial overlap: In order for 2 layers to be similar, they must cover a common area
   of space.
2. Distribution: If the spatial distribution of the data is dissimilar between the 2
   layers, they are not truly similar.
3. Generalization: If the spatial data is of differing geometry types or resolutions,
   the layers may still be similar.

Each of these concepts can be elaborated along the 2 primary facets of similarity that
matter to predictive analysis. The first facet is the similarity of the layers as a whole.
If the analysis is performed using a grid for the analytical resolution of the output,
then the second facet of similarity is based on the resolution of the grid and the
similarity of the layers at the cellular level of the grid.

Spatial Overlap

If the layers do not overlap at all, they cannot be similar; however, they may map the
same category of data in disjoint areas. The mapping of similar categorical data in
disjoint areas cannot be resolved by geometric means and is not addressed any
further in this document.

If the layers share a common area, but each layer's extent also has some portion that
is distinct to itself and not shared with the other layer, then the layers can only be
similar within the area they have in common. The distinct areas may, however, hold only
a small portion of the data contained within each layer, so the entirety of the layers
must still be evaluated, unless a threshold percentage (representative sample) can be
determined relative to the whole layer that would make layer similarity impossible at
the finer level of per-feature analysis.

If the layers have spatial domains such that the extent of one layer is entirely within the
extent of the other, then it is possible that one layer is a proper subset of the other in
terms of spatial similarity. This provides different opportunities for comparison:

1. If the spatially smaller layer is similar to the larger layer overall (i.e. the actual
   data in the larger layer is predominantly contained within the extent of the smaller
   layer), then the layers are wholly similar.
2. If a significant enough proportion of the data in the spatially larger layer lies
   outside the extent of the smaller layer to preclude similarity at the overall level,
   then the layers cannot be wholly similar.
3. If the spatially smaller layer is similar to the larger layer within the bounds of the
   spatially smaller layer, then the layers are locally similar at the extent of the
   smaller layer; in other words, the smaller layer is a similar subset of the larger
   layer. This may permit the exclusion of one of the layers for the constrained region.

For grid-based analysis, if the common extent of the layers is greater than or
equal to the extent of the grid, then local similarity is all that matters. Similarly,
if the common extent of the layers intersects the extent of the grid, similarity may be
considered within the extent of the grid only, irrespective of the extents of the layers
themselves.
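The extent cases above (disjoint, partially overlapping, one extent inside the other) can be told apart with simple bounding-box arithmetic. The following is a minimal sketch only; the `Extent` type and the `classify_overlap` function are hypothetical names introduced for illustration, and extents are assumed to be axis-aligned rectangles in a shared coordinate system.

```python
from typing import NamedTuple, Optional


class Extent(NamedTuple):
    """Axis-aligned bounding box of a layer (hypothetical helper type)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @property
    def area(self) -> float:
        return max(0.0, self.xmax - self.xmin) * max(0.0, self.ymax - self.ymin)


def intersection(a: Extent, b: Extent) -> Optional[Extent]:
    """Return the common extent of a and b, or None if they share no area."""
    xmin, ymin = max(a.xmin, b.xmin), max(a.ymin, b.ymin)
    xmax, ymax = min(a.xmax, b.xmax), min(a.ymax, b.ymax)
    if xmin >= xmax or ymin >= ymax:
        return None
    return Extent(xmin, ymin, xmax, ymax)


def classify_overlap(a: Extent, b: Extent) -> str:
    """Classify the extent relationship as 'disjoint', 'contained' or 'partial'."""
    common = intersection(a, b)
    if common is None:
        return "disjoint"
    if common == a or common == b:
        # The common extent equals one of the inputs, so that extent
        # lies entirely within the other.
        return "contained"
    return "partial"
```

The same `intersection` helper could be applied to the common extent and the grid extent to restrict the comparison to the analysis grid.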

Distribution

Given that the layers of interest have a common extent, it is possible that the layers are
not similar in the distribution of the features within them. This is the concept of the
spatial distribution of features, where the count of features is potentially different
between the layers, but the distribution of the features is similar in space.

If the features of one layer are all proximal to features in the other layer, the layers
may be similar; however, if a significant proportion of the features in a layer are
disjoint from any feature in the other layer, the layers cannot be wholly similar. The
definition of proximal is based upon the resolution of the data itself, whereas the
significant proportion is based on a more subjective target threshold.
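As a rough illustration of this proximity test, the sketch below measures what fraction of the features in one layer have at least one feature of the other layer nearby. It assumes each feature has been reduced to a representative (x, y) point and that `max_dist` stands in for the data-resolution-based definition of proximal; both are simplifications, and `proximal_fraction` is a hypothetical name.

```python
from math import hypot


def proximal_fraction(layer, other, max_dist):
    """Fraction of points in `layer` with at least one point of `other`
    within `max_dist`. Layers are lists of (x, y) tuples."""
    if not layer:
        return 0.0
    hits = sum(
        1
        for (x, y) in layer
        if any(hypot(x - ox, y - oy) <= max_dist for (ox, oy) in other)
    )
    return hits / len(layer)
```

The result can differ by direction, so it would be evaluated once per layer and each value compared against the chosen target threshold.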

In the grid-based analytical model, distribution is the concept that differs most
from its counterpart at the global level of layer similarity. In the grid-based approach,
the distribution of geometries within the layers is based upon cellular membership, so
layers that are generally dissimilar may be identical at the cellular level, depending on
the resolution and extent of the analysis grid.
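Cellular membership can be sketched by mapping each feature to the grid cell that contains it and comparing the sets of occupied cells. This is an illustration under assumptions: features are treated as (x, y) points, the grid is square-celled with a given origin, and the Jaccard ratio used to compare the cell sets is a choice made here, not a measure taken from this document.

```python
from math import floor


def occupied_cells(points, origin, cell_size):
    """Set of (col, row) grid cells that contain at least one point."""
    ox, oy = origin
    return {
        (floor((x - ox) / cell_size), floor((y - oy) / cell_size))
        for (x, y) in points
    }


def cell_similarity(points_a, points_b, origin, cell_size):
    """Jaccard overlap of occupied cells (assumed measure): shared cells
    divided by cells occupied by either layer."""
    a = occupied_cells(points_a, origin, cell_size)
    b = occupied_cells(points_b, origin, cell_size)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

With points at (1, 1) and (12, 3) in one layer and (8, 9), (19, 9) and (15, 2) in the other, the layers occupy identical cells on a grid with 10-unit cells but share no cells on a grid with 5-unit cells, which shows how strongly the result depends on grid resolution.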

Generalization

Spatially linked or spatially identical features are often represented in different
layers, either at varying resolutions or to serve different purposes.
As an example, buildings may be represented as points in one layer and as polygon
boundaries in another to provide different data resolutions. At an even
coarser resolution, a cluster of buildings may be represented as a single point.
As a second example, rivers may be managed as high-resolution polygons
following the waterline along the course in one layer, and as coarse polygons
representing the maximum flood line in another layer. These layers represent the same
basic data at differing resolutions or for different purposes.

Determining similarity for these types of relationships is the core challenge of this
process. It is accomplished at the feature level and requires at least n*m feature
comparisons, where n and m are the number of records in each of the 2 respective layers
to be compared.

The methodology for comparison is based upon the geometry types of the
corresponding layers, which may be any combination of:
Point - simple 2D locations, e.g. latitude/longitude
Line - 2D linear features represented as a group of points connected by line segments
Polygon - 2D areas defined by a line that forms the boundary of the area, and any
holes within that area

Comparison methodologies

Point to Point
Comparing 2 point layers requires a threshold distance (radius) for linkage and a
percentage linkage to define similarity. Each point in the first layer is compared
to each point in the second layer by distance, and a count of the feature pairs that are
closer than the threshold distance is built in the process. If this count is greater
than the percentage threshold of the features in each layer, then the layers are
similar. If the count is greater than the threshold for only one layer, then the
similarity is directional, i.e. A is similar to B but B is not similar to A.

$$\sum_{x_a \in A} \sum_{x_b \in B} \begin{cases} 1 & \text{if } D(x_a, x_b) < D_t \\ 0 & \text{otherwise} \end{cases}$$
Note that the count of linked pairs may be greater than the total number of points
in either layer, if multiple points link to multiple points. If this behavior is not
desired, a secondary methodology can be added to stop after the first link is
found. If completeness is desired, the secondary methodology would increase
runtime to a worst-case of 2(n*m).
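The pair count and the per-layer threshold test can be sketched as follows. This is a minimal illustration, assuming points are (x, y) tuples in a planar coordinate system with Euclidean distance; the function names and the returned pair of flags are hypothetical.

```python
from math import hypot


def count_linked_pairs(layer_a, layer_b, dist_threshold):
    """Count every (a, b) pair closer than dist_threshold (n*m comparisons)."""
    return sum(
        1
        for (ax, ay) in layer_a
        for (bx, by) in layer_b
        if hypot(ax - bx, ay - by) < dist_threshold
    )


def point_layer_similarity(layer_a, layer_b, dist_threshold, pct_threshold):
    """Return (exceeds_for_a, exceeds_for_b).

    Each flag says whether the linked-pair count exceeds pct_threshold
    (a fraction between 0 and 1) of that layer's feature count. Both True
    means the layers are similar; exactly one True means the similarity
    is directional.
    """
    pairs = count_linked_pairs(layer_a, layer_b, dist_threshold)
    return (
        pairs > pct_threshold * len(layer_a),
        pairs > pct_threshold * len(layer_b),
    )
```

Because every pair is enumerated, a dense cluster in one layer can push the count past the size of either layer, which is the behavior the note above warns about.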

Point to Line
Comparing points to lines requires a threshold distance (radius) for linkage and a
percentage to define similarity. Each point is compared to each line by
distance, and a count of the feature pairs that are closer than the threshold distance is
built in the process. If this count is greater than the percentage threshold of the
features in each layer, then the layers are similar.

$$\sum_{p_a \in A} \sum_{l_b \in B} \begin{cases} 1 & \text{if } D(p_a, l_b) < D_t \\ 0 & \text{otherwise} \end{cases}$$
A single point may link to multiple lines, and a line may link to multiple points.
If either behavior is not desired, secondary methodologies can be added, at
increased runtimes (2(n*m) worst case).
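The distance from a point to a line is the smallest distance from the point to any of the line's segments. The sketch below shows one way to compute it and to build the linked-pair count; it assumes planar coordinates, lines given as lists of at least two (x, y) vertices, and hypothetical function names.

```python
from math import hypot


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return hypot(px - ax, py - ay)  # degenerate segment
    # Project p onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))


def point_line_distance(p, line):
    """Shortest distance from p to a polyline (list of vertices)."""
    return min(
        point_segment_distance(p, line[i], line[i + 1])
        for i in range(len(line) - 1)
    )


def count_point_line_links(points, lines, dist_threshold):
    """Count every (point, line) pair closer than dist_threshold."""
    return sum(
        1
        for p in points
        for line in lines
        if point_line_distance(p, line) < dist_threshold
    )
```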

Point to Polygon
Comparing points to polygons is a two-fold process. If a point is inside a polygon, it
is linked by inclusion; if a point is not inside, but is within the threshold distance
of the polygon, it is linked by distance.

This methodology requires a threshold distance and 2 percentage thresholds. The first
percentage threshold is the percentage of points that must be contained by polygons for
the layers to be defined as similar. The second percentage threshold is the percentage
of points that must be either contained by, or within the threshold distance of,
polygons for the layers to be defined as similar.

Currently, there is no metric that compares the proportion of polygons that are
near to points. If that is desirable, it can be added as a secondary methodology at
no additional runtime cost.
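The two linkage types can be sketched with a ray-casting containment test and a distance-to-boundary test. This is an illustration under assumptions: polygons are lists of (x, y) vertices describing the outer ring only (holes are ignored here), coordinates are planar, and all function names are hypothetical. The two returned rates correspond to the two percentage thresholds.

```python
from math import hypot


def _segment_distance(p, a, b):
    """Shortest distance from point p to the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))


def contains(polygon, p):
    """Ray-casting point-in-polygon test (outer ring only)."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def boundary_distance(polygon, p):
    """Shortest distance from p to the polygon's boundary."""
    n = len(polygon)
    return min(
        _segment_distance(p, polygon[i], polygon[(i + 1) % n]) for i in range(n)
    )


def point_polygon_link_rates(points, polygons, dist_threshold):
    """Return (contained_rate, contained_or_near_rate) over all points."""
    if not points:
        return 0.0, 0.0
    contained = near = 0
    for p in points:
        if any(contains(poly, p) for poly in polygons):
            contained += 1  # linked by inclusion
        elif any(boundary_distance(poly, p) < dist_threshold for poly in polygons):
            near += 1  # linked by distance
    n = len(points)
    return contained / n, (contained + near) / n
```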

Line to Line
Comparing lines to lines requires a distance threshold, a length proportion
similarity threshold and a proportion of features to classify as similar. The
methodology compares each line in one layer to every line in the other by
buffering the first line by the threshold distance and comparing the length of the
intersection of the second line with that buffer to the full length of the second
line. If this proportion is greater than the threshold proportion, the lines are
considered similar. If the proportion is less than the threshold, then the
reverse is performed, buffering the second line and comparing against the length of the
first line.

This methodology is partially incomplete at present, as it does not account for
fragmented (broken) lines. Also, shortened lines may be given directional
preference depending on the order of comparison. A bidirectional compare could be
performed for each feature pair to ensure that both intersecting proportions meet
the threshold constraint.
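The buffer-and-intersect comparison can be approximated without a geometry library by cutting one line into short pieces and summing the lengths of the pieces that fall within the threshold distance of the other line. The sketch below uses that approximation in place of a true buffer intersection; the `step` parameter, the function names and the `bidirectional` flag are assumptions made for illustration, and accuracy depends on `step` being small relative to the threshold distance.

```python
from math import ceil, hypot


def _segment_distance(p, a, b):
    """Shortest distance from point p to the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))


def _line_distance(p, line):
    """Shortest distance from p to a polyline."""
    return min(_segment_distance(p, a, b) for a, b in zip(line, line[1:]))


def fraction_within(line, other, dist_threshold, step=0.1):
    """Approximate share of `line`'s length lying within dist_threshold
    of `other`, i.e. inside a buffer of `other`."""
    total = within = 0.0
    for (x1, y1), (x2, y2) in zip(line, line[1:]):
        seg_len = hypot(x2 - x1, y2 - y1)
        pieces = max(1, ceil(seg_len / step))
        piece_len = seg_len / pieces
        for i in range(pieces):
            t = (i + 0.5) / pieces  # midpoint of the piece
            mid = (x1 + t * (x2 - x1), y1 + t * (y2 - y1))
            if _line_distance(mid, other) <= dist_threshold:
                within += piece_len
        total += seg_len
    return within / total if total else 0.0


def lines_similar(first, second, dist_threshold, length_threshold,
                  bidirectional=False):
    """Forward test, then the reverse; both must pass if bidirectional."""
    forward = fraction_within(second, first, dist_threshold) > length_threshold
    reverse = fraction_within(first, second, dist_threshold) > length_threshold
    return (forward and reverse) if bidirectional else (forward or reverse)
```

A short line lying alongside part of a long line passes the one-directional test but fails the bidirectional one, which illustrates the directional preference for shortened lines noted above.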

Line to Polygon
To compare lines to polygons, each polygon is reduced to its perimeter line
and treated from there as a line. For details on this comparison, see
the Line to Line section above.

Polygon to Polygon
Comparing polygon layers requires the definition of a proportion overlap
threshold and a percentage of features to classify as similar. The comparison is
performed by comparing each feature in one layer to every feature in the other.
The area of the intersection of the 2 polygons is compared to the area of each of
the polygons. If the ratio of the intersection to the original polygon is greater than
the threshold proportion for both polygons, the polygons are considered similar.
If the total count of similar polygon pairs is greater than the threshold proportion,
the layers are considered similar.
In this methodology, it is possible for the count of similar pairs to exceed the
count of features in either layer, as all possible pairs are enumerated. If this is not
acceptable, a secondary methodology can be employed to stop compares after the
first match.
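The overlap test needs the area of the intersection of two polygons. The sketch below computes it with Sutherland-Hodgman clipping, which is only valid when the clipping polygon is convex; that restriction, the vertex-list representation without holes, and all function names are assumptions made for illustration. General polygons would need a full polygon-clipping routine.

```python
def signed_area(poly):
    """Shoelace formula; positive for counter-clockwise vertex order."""
    n = len(poly)
    return sum(
        poly[i][0] * poly[(i + 1) % n][1] - poly[(i + 1) % n][0] * poly[i][1]
        for i in range(n)
    ) / 2.0


def area(poly):
    return abs(signed_area(poly))


def clip_convex(subject, clip):
    """Sutherland-Hodgman: the part of `subject` inside convex `clip`."""
    if signed_area(clip) < 0:
        clip = clip[::-1]  # the inside test below expects counter-clockwise

    def inside(p, a, b):
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p, q, a, b):
        # Point where segment p-q crosses the infinite line through a and b.
        d1 = (q[0] - p[0], q[1] - p[1])
        d2 = (b[0] - a[0], b[1] - a[1])
        denom = d1[0] * d2[1] - d1[1] * d2[0]
        t = ((a[0] - p[0]) * d2[1] - (a[1] - p[1]) * d2[0]) / denom
        return (p[0] + t * d1[0], p[1] + t * d1[1])

    output = list(subject)
    for i in range(len(clip)):
        a, b = clip[i], clip[(i + 1) % len(clip)]
        current, output = output, []
        if not current:
            break
        prev = current[-1]
        for cur in current:
            if inside(cur, a, b):
                if not inside(prev, a, b):
                    output.append(intersect(prev, cur, a, b))
                output.append(cur)
            elif inside(prev, a, b):
                output.append(intersect(prev, cur, a, b))
            prev = cur
    return output


def polygons_similar(p, q, overlap_threshold):
    """True if the intersection covers more than overlap_threshold of both."""
    inter = area(clip_convex(p, q))
    return inter / area(p) > overlap_threshold and inter / area(q) > overlap_threshold


def count_similar_pairs(layer_a, layer_b, overlap_threshold):
    """Enumerate all pairs, so the count may exceed either layer's size."""
    return sum(
        1 for p in layer_a for q in layer_b
        if polygons_similar(p, q, overlap_threshold)
    )
```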

Determining Appropriate Threshold Values

This set of methodologies requires threshold values to be defined for all calculations.
A cursory review of related literature turned up no established values, so an
experimental approach to determining these values should be taken. Additionally, the
intended use or the source of the data may alter the desired value for any threshold.

Uses For Similarity Measures

The determination of layer similarity comes into play across the GIS field. In general,
most agencies using GIS software acquire data from a myriad of sources, which often
provide layers representing the same ground features at different resolutions or in
different projections. The tabular attributes often differ between these sources, with
little hope for textual correlation across the feature layers; this leaves geometric
comparison as the only alternative to manual examination.

While the approach defined in this document cannot contend with gross spatial
dissimilarities caused by massive misalignments or projection distortions, it can detect
similarities between layers mapping similar features over a common area. Further
investigation into possible improvements in these capabilities should prove worthwhile
in other areas. For the purposes of grid-based analysis, cell-based similarity approaches
will most likely provide the best results, possibly by adapting some of the methodologies
described here to work at the cell level.
