
Data Generalization in GIS

Sometimes GIS data contains more detail or spatial information than is needed for the scale of the map
being prepared. Generalization is the method used in GIS to reduce detail in data. For example, a
small-scale map of the United States does not need detailed coastlines, and a map of California does not
need to show every road in the state.

Generalization can be achieved by removing detail, such as showing only major roads, or showing only the
boundary of a state instead of all its counties. In GIS, generalization is also used to smooth out lines,
removing small details such as the nooks and crannies of a coastline or the meanderings of a stream.
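
For illustration only, the following minimal Python sketch uses the Shapely library, whose simplify method applies Douglas-Peucker-style line simplification; the coordinates and tolerance are invented, and the printed lengths show how smoothing a line also changes its measured length (the accuracy issue noted below).

from shapely.geometry import LineString

# An invented, wiggly stream-like line (coordinates in metres).
stream = LineString([(0, 0), (5, 3), (9, 1), (14, 6), (18, 4), (25, 9), (30, 7)])

# Drop vertices that deviate from the overall trend by less than the tolerance.
simplified = stream.simplify(tolerance=3.0, preserve_topology=True)

print(len(stream.coords), "->", len(simplified.coords), "vertices kept")
print(f"length before: {stream.length:.1f} m, after: {simplified.length:.1f} m")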

Since detail about a geographic feature is simplified during generalization, generalized data is less
spatially accurate. Those using generalized data to calculate length, perimeter, or area will incur errors
in the calculations.

GIS and automated generalization

As GIS developed from about the late 1960s onward, the need for automatic, algorithmic generalization
techniques became clear. Ideally, agencies responsible for collecting and maintaining spatial data should
try to keep only one canonical representation of a given feature, at the highest possible level of detail.
That way there is only one record to update when that feature changes in the real world.[1] From this
large-scale data, it should ideally be possible, through automated generalization, to produce maps and
other data products at any scale required. The alternative is to maintain separate databases each at the
scale required for a given set of mapping projects, each of which requires attention when something
changes in the real world.

Several broad approaches to generalization were developed around this time:

The representation-oriented view focuses on the representation of data on different scales, which is
related to the field of Multi-Representation Databases (MRDB).[citation needed]

The process-oriented view focuses on the process of generalization.[citation needed]

The ladder approach is a stepwise generalization, in which each derived dataset is based on the
database of the next larger scale.[citation needed]

The star approach derives the data at all scales from a single (large-scale) database.

The International Cartographic Association defines Cartographic Generalisation as "the selection and
simplified representation of detail appropriate to the scale and/or the purpose of a map" (ICA 1967).

More generally, the objective of generalisation is to supply information at a level of content and detail
that corresponds to the information needed for correct geographical reasoning.



Generalisation inputs are:

The needs
The geographical data: density, distribution, size, diversity etc.
The readability rules
The means: time, money, technique etc.

Characteristics of generalisation are:

 An extreme reduction compared to reality. Example: at a scale of 1:25,000, linear dimensions are
reduced 25,000-fold, so the image area is 25,000 × 25,000 = 625 million times smaller than reality.
 A pure photographic reduction of the original scale leads to an illegible map.
 Already at the 1:25,000 scale, many objects of the landscape can no longer be represented.
 At smaller scales, a representation of all objects of the landscape is impossible.
 A complex and unclearly structured reality must be simplified according to the scale of the map.

Aims of Generalisation

Cartographic generalisation is born of the necessity to communicate. As it is not possible to
communicate map information at 1:1 scale, generalisation has many aims.

The following aims can also be considered as generalisation rules.

 Structure: The map content is well structured.

The estimation of map content priorities has to be adapted to the map scale and to the intended
purpose.

The objects have to be classified according to clear and reasonable criteria.

The grouping of objects has to be logical.

 Legend: Expressive and associative symbols constitute the base for clear map communication.

The size and the form of the symbols are adapted to the other symbols and to the reality.

 Generalisation level: The level of generalisation determines the balance between simplification and detail.

A low level of generalisation signifies a high information density and a finely structured map.

A high level of generalisation signifies a low information density and a coarsely structured map.

The level of generalisation varies according to the purpose and to the map scale.

The level of generalisation is carefully defined.

The level of generalisation affects the legend and the symbols.

 Selection of objects: The selection of objects complies with the map purpose.

The selection of objects also complies with the map scale and with the intended purpose.



The objects that are visible in reality (e.g. houses) are complemented with non-visible objects such as
borders or labelling.

 Accuracy of objects: The optimal accuracy of the objects regarding position and form is reached.
However, the visual placement of objects is more important than the geometrical accuracy.

Displacing objects is only needed to improve legibility and clarity.

The symbols of visible objects (in reality) have a high accuracy.

The symbols of non-visible objects (in reality) have a limited accuracy.

Where object displacement is necessary, the neighbouring objects are adapted accordingly.

The accuracy of forms is limited only by the demands of good legibility and the respect of proportions.

Contour lines are not treated as isolated lines, but are adjusted to one another so that the ground
structure is reproduced correctly.

 Reality accuracy: Reality is indeed revised and changed, but it is still represented as truthfully as
possible.

All objects present in the map really exist.

Appropriate legend symbols are assigned to the objects.

Labelling is correctly compiled, written and assigned.

 Legibility of the map elements: The map must be readable without auxiliary means (e.g. a
magnifying glass), even in poor conditions.

Good legibility depends on respecting the minimum graphical dimensions (sizes and distances) of the
symbols.

Minimum graphical dimensions lead to an out-of-scale representation, i.e. to an enlargement of the
dimensions beyond the true scale.

Graphical readability rules support legibility.

 Graphical representation of the objects: Map content is adapted, legible and graphically
convincing.

The legend is credible and exact.

The generalisation of forms and line symbols preserves the most distinctive forms and eliminates the
small and incidental ones.

The quantitative generalisation of scattered objects (e.g. houses) respects the density of the objects in
reality.

The relations and dependencies of objects in reality (e.g. streets, paths, water bodies, contour lines) are
carefully considered.



Generalisation work flow



Data Classification in GIS

Classification is "the process of sorting or arranging entities into groups or categories; on a map, the
process of representing members of a group by the same symbol, usually defined in a legend."
Classification is used in GIS, cartography and remote sensing to generalize complexity in, and extract
meaning from, geographic phenomena and geospatial data. There are different kinds of classifications,
but all will generally involve a classification schema or key, which is a set of criteria (usually based on the
attributes of the individuals) for deciding which individuals go into each class. Changing the classification
of a data set can create a variety of different maps.
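
To make the idea of a classification schema concrete, here is a minimal, hypothetical key written as Python rules; the class names, attribute names, and thresholds are invented for illustration and do not come from any standard.

def classify_parcel(attrs: dict) -> str:
    """Assign an illustrative land-use class from a parcel's attributes."""
    # Each test below is one criterion of the (hypothetical) classification key;
    # the first matching rule decides the class.
    if attrs.get("impervious_pct", 0) > 60:
        return "urban"
    if attrs.get("tree_cover_pct", 0) > 40:
        return "forest"
    if attrs.get("crop_code") is not None:
        return "agriculture"
    return "other"

parcels = [
    {"impervious_pct": 75, "tree_cover_pct": 5, "crop_code": None},
    {"impervious_pct": 2, "tree_cover_pct": 80, "crop_code": None},
    {"impervious_pct": 1, "tree_cover_pct": 10, "crop_code": "maize"},
]
print([classify_parcel(p) for p in parcels])  # ['urban', 'forest', 'agriculture']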


A DEFINITION OF DATA CLASSIFICATION

Data classification is broadly defined as the process of organizing data by relevant categories so that it
may be used and protected more efficiently. On a basic level, the classification process makes data
easier to locate and retrieve. Data classification is of particular importance when it comes to risk
management, compliance, and data security.

Data classification involves tagging data to make it easily searchable and trackable. It also eliminates
multiple duplications of data, which can reduce storage and backup costs while speeding up the search
process. Though the classification process may sound highly technical, it is a topic that should be
understood by your organization’s leadership.

REASONS FOR DATA CLASSIFICATION

Data classification has improved significantly over time. Today, the technology is used for a variety of
purposes, often in support of data security initiatives. But data may be classified for a number of
reasons, including ease of access, maintaining regulatory compliance, and to meet various other
business or personal objectives. In some cases, data classification is a regulatory requirement, as data
must be searchable and retrievable within specified timeframes. For the purposes of data security, data
classification is a useful tactic that facilitates proper security responses based on the type of data being
retrieved, transmitted, or copied.



TYPES OF DATA CLASSIFICATION

Data classification often involves a multitude of tags and labels that define the type of data, its
confidentiality, and its integrity. Availability may also be taken into consideration in data classification
processes. Data’s level of sensitivity is often classified based on varying levels of importance or
confidentiality, which then correlates to the security measures put in place to protect each classification
level.

There are three main types of data classification that are considered industry standards:

Content-based classification inspects and interprets files looking for sensitive information (a minimal
sketch follows this list).

Context-based classification looks at application, location, or creator, among other variables, as indirect
indicators of sensitive information.

User-based classification depends on a manual, end-user selection of each document. It relies on user
knowledge and discretion at creation, edit, review, or dissemination to flag sensitive documents.
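
As a rough sketch of the content-based approach, the Python snippet below scans a document's text for patterns that suggest sensitive information; the two regular expressions are simplified examples, not a complete or authoritative rule set.

import re

SENSITIVE_PATTERNS = {
    "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # e.g. 123-45-6789
    "card_like": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),  # 16-digit card-style number
}

def classify_document(text: str) -> str:
    """Return 'confidential' if any sensitive pattern is found, else 'public'."""
    if any(p.search(text) for p in SENSITIVE_PATTERNS.values()):
        return "confidential"
    return "public"

print(classify_document("Invoice total: 42 USD"))       # public
print(classify_document("Applicant SSN: 123-45-6789"))  # confidential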

Types of Classification Schema

Classification schema can take a number of forms and can be derived using a variety of methods. The
choice of schema type and classification methodology depends largely on the nature of the source data
and the nature of the criteria for putting each individual into a class. Choosing a classification method
can be one of the most difficult decisions an analyst makes: in many cases, multiple schemes are available
that are equally valid but portray very different patterns in analysis and visualization, and information
can even be misrepresented if an inappropriate method is used.

Quantitative Thresholding

The simplest method is to divide the range of values of a single quantitative attribute into ordinal
classes. This is the method usually used for choropleth and isarithmic maps. For example, the incomes of
families in a county could be classified as "high" (>$200,000), "medium" ($40,000-199,999), and "low"
(<$40,000). There are several techniques for developing this type of schema, based on patterns in the
data:

 Manual Interval: use manual classification to define your own classes, manually adding class breaks
and setting class ranges that are appropriate for the data. Alternatively, you can start with one of the
standard classifications and make adjustments as needed.
 Equal Interval: arranges a set of attribute values into groups that span an equal range of values;
the class width is the data range (maximum minus minimum) divided by the chosen number of classes.
This can help show different groups when they are close in size, although this does not often occur in
geographic phenomena.
 Quantile: divides the attribute values equally into a predefined number of classes: the observations
are ranked, and the total number of observations is divided by the number of classes to give the
number of observations in each class. One advantage of this method is that the classes are easy to
compute and each class is equally represented on the map. Ordinal data can easily be classified with
this method, since the class assignment of quantiles is based on ranked data.



 Jenks Natural Breaks: the Jenks Natural Breaks Classification (or Optimization) system is a data
classification method designed to optimize the arrangement of a set of values into "natural"
classes. This is done by seeking to minimize the average deviation from the class mean while
maximizing the deviation from the means of the other groups. The method reduces the variance
within classes and maximizes the variance between classes.
 Geometric Interval: This classification method is used for visualizing continuous data that is not
distributed normally. This method was designed to work on data that contains excessive
duplicate values, e.g., 35% of the features have the same value.

 Standard Deviation: finds the mean value of the observations, then places class breaks above and
below the mean at intervals of 0.25, 0.5, or 1 standard deviation until all the data values are contained
within the classes. This classification method shows how much each feature's attribute value varies from
the mean. Using a diverging color scheme to illustrate these values is useful for emphasizing which
observations are above the mean and which are below it (a minimal sketch of computing class breaks
follows this list).
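
The minimal NumPy sketch below shows how equal-interval, quantile, and standard-deviation breaks could be computed for hypothetical income values; it is a simplified illustration, not any particular GIS package's implementation.

import numpy as np

# Invented attribute values (family incomes in USD).
values = np.array([22_000, 35_000, 41_000, 58_000, 63_000, 87_000,
                   95_000, 120_000, 150_000, 240_000], dtype=float)
n_classes = 3

# Equal interval: split the data range (max - min) into equally wide classes.
equal_breaks = np.linspace(values.min(), values.max(), n_classes + 1)[1:-1]

# Quantile: put roughly the same number of observations in each class.
quantile_breaks = np.quantile(values, [i / n_classes for i in range(1, n_classes)])

# Standard deviation: breaks at the mean and one standard deviation either side.
mean, sd = values.mean(), values.std()
std_breaks = np.array([mean - sd, mean, mean + sd])

# Assign each value a class number (0 = lowest) under the quantile scheme.
classes = np.digitize(values, quantile_breaks)
print(equal_breaks, quantile_breaks, std_breaks, classes, sep="\n")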

Decision Tree

A Decision Tree is an ordered set of questions applied to each individual entity (or each point in space)
to determine the category to which it belongs. The answer to each question either results in a final
choice of category or leads to another more specific question. As these questions branch out into more
possibilities, the diagram takes on the shape of a horizontal tree. The questions can involve a wide range
of attributes and criteria. As a geographic example, the Köppen climate classification system is usually
implemented as a decision tree. Also, the biological species of an organism is usually determined using a
decision tree.

In GIS, decision tree classification schemes are typically implemented by evaluating each question for
the entire study area using relevant data and GIS analysis techniques, such as query and overlay (in raster
or vector). The "answer" to a given question will thus be a set of regions, one for each possible answer,
each of which can be assigned a final class or used as a mask for where to apply the next question. For
example, the Köppen climate classification system can be modeled using 24 raster grids (long-term
mean precipitation and temperature for each month) and Map Algebra.
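
As a hedged sketch of this raster implementation, the NumPy block below applies a drastically simplified, Köppen-like decision tree to two tiny grids of annual mean temperature and precipitation; the thresholds and class names are illustrative only and are not the actual Köppen criteria, which use monthly grids.

import numpy as np

# Tiny illustrative rasters: annual mean temperature (°C) and precipitation (mm).
temp = np.array([[28.0, 24.0, 5.0],
                 [15.0, -3.0, 9.0]])
precip = np.array([[2100.0, 300.0, 600.0],
                   [150.0, 400.0, 900.0]])

# Evaluate each question over the whole study area; each answer is a region
# (a boolean mask) that either gets a final class or masks the next question.
classes = np.full(temp.shape, "temperate", dtype=object)
classes[precip < 250] = "arid"                        # first question
classes[(precip >= 250) & (temp >= 20)] = "tropical"  # asked only where not arid
classes[(precip >= 250) & (temp < 0)] = "polar"
print(classes)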

Multivariate Clustering

Clustering is a classification method that is most commonly used in data mining and remote sensing
image analysis. It is based on the premise that if a set of meaningful categories exists in a phenomenon
(e.g., types of land cover), they should appear as patterns in the characteristics of the phenomena.
Specifically, there should be clusters of individuals that are similar in several attributes while being very
different from other individuals in the same attributes (i.e., the minimal intra-variability, maximal inter-
variability ideal discussed above). For example, if humans are able to intuitively identify different types
of land cover in an aerial photograph by recognizing similarities and differences in color and texture,
then remote sensing software should be able to identify the same patterns in multispectral imagery
data. A good example of multivariate clustering could be if you had collected data on mountain lion
sightings and wanted to better understand their territories. Understanding where and when mountain
lions congregate at different life stages, for example, could assist with designing protected areas that
may help ensure successful breeding.



While analytically identifying the perfect set of clusters in a multivariate dataset is computationally
difficult (NP-Hard), there are a variety of analysis methods and heuristic optimization algorithms for
searching for clusters, such as Lloyd's K-Means Algorithm. The Jenks optimization algorithm discussed
above is essentially k-means performed on a single variable.
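
A minimal sketch, assuming scikit-learn is available: k-means applied to pixel values from a few synthetic "bands" standing in for multispectral imagery; the data are random numbers, so the clusters only demonstrate the mechanics.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Synthetic stand-in for a 3-band, 50 x 50 multispectral scene, flattened so
# that each pixel is one observation with three attributes.
bands = rng.normal(size=(3, 50, 50))
pixels = bands.reshape(3, -1).T          # shape (2500, 3)

# Search for four spectral clusters (e.g. candidate land-cover types).
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(pixels)

# Reshape the cluster labels back into map form for display or further analysis.
cluster_map = labels.reshape(50, 50)
print(np.bincount(labels))               # number of pixels per cluster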

Multi-criteria Evaluation

In many geographic classification schemes, each class is defined by a set of criteria that manifests as a
spatial region. In this case, finding the region corresponding to each class can be implemented as a
multi-criteria evaluation or Suitability analysis procedures. This is done using GIS analysis methods such
as queries, buffers, overlay, and map algebra.

For example, imagine that a GIS analyst is searching for the best site to build a hypothetical waste
management facility, based on certain spatial criteria. Such criteria could be that the imagined facility
needs to be near existing roads, far from wildlife reserves, and far from land use areas zoned as
residential. In order to classify the better areas versus areas that are less than ideal, the analyst could
use multi-criteria evaluation to consider the several variables that would affect where the facility would
ultimately be located. With this method, it is also possible to set weights for each criterion so certain
variables can be considered more strongly than others if needed. For example, in the case of the waste
management facility, if it were more important for the facility to be located far from residential areas
than near existing roads, this could be accounted for in the classification. This method of classification is
especially useful in GIS because it is often necessary to consider multiple spatial criteria when working
with data.
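
A minimal sketch of such a weighted evaluation on raster layers, with invented distance and zoning grids; the criteria, scaling, and weights are arbitrary and only show the mechanics of combining criteria.

import numpy as np

# Invented layers for a tiny 2 x 2 study area.
dist_to_roads = np.array([[200.0, 800.0], [1500.0, 400.0]])       # metres
dist_to_reserves = np.array([[5000.0, 1200.0], [300.0, 7000.0]])  # metres
residential_zone = np.array([[0, 1], [0, 0]])                     # 1 = residential

# Score each criterion on a 0-1 scale, where higher means more suitable.
near_roads = np.clip(1 - dist_to_roads / 2000.0, 0, 1)
far_from_reserves = np.clip(dist_to_reserves / 5000.0, 0, 1)
not_residential = 1 - residential_zone

# Weighted combination: zoning and reserves weigh more than road proximity.
suitability = 0.2 * near_roads + 0.4 * far_from_reserves + 0.4 * not_residential
print(suitability)   # cells with higher scores are better candidate sites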

Index Model

If the final set of classes is ordinal (for example, low-medium-high earthquake hazard potential), then it
can be modeled as an index, a pseudo-measurement of something that cannot directly be measured (in
the above example, hazard potential on a scale of 1-10), typically based on factors that can be
measured. The most common way this is done is that each contributing factor is mapped, with attributes
scaled to a common quantitative scale, then combined using a formula such as Weighted Linear
Combination to produce a final score.
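
The sketch below builds such an index by weighted linear combination from two hypothetical hazard factors already rescaled to a common 0-1 range; the factor names, weights, and 1-10 scaling are invented for illustration.

import numpy as np

# Hypothetical contributing factors on a common 0-1 scale
# (e.g. normalized ground-shaking and liquefaction potential).
shaking = np.array([[0.1, 0.6], [0.9, 0.4]])
liquefaction = np.array([[0.2, 0.8], [0.7, 0.1]])

# Weighted linear combination, then mapped onto a 1-10 pseudo-measurement.
score = 0.6 * shaking + 0.4 * liquefaction
hazard_index = np.round(1 + 9 * score).astype(int)

# An ordinal low/medium/high classification of the index (breaks are arbitrary).
labels = np.select([hazard_index <= 3, hazard_index <= 7],
                   ["low", "medium"], default="high")
print(hazard_index)
print(labels)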
