
CHAPTER 2

GIS DATA MODELS AND DATA SOURCE


A GIS does not store a map in any conventional sense. Instead, it stores the data from which a
desired view can be drawn to suit a particular purpose. The basic data types in a GIS reflect the
data traditionally found on a map, and GIS technology accordingly uses two of them:
Spatial data describes the absolute and relative location of geographic features.
Attribute data describes the characteristics of the spatial features, e.g. names of roads or
forest type. These characteristics can be quantitative and/or qualitative in nature. Attribute
data is often referred to as tabular data.
The coordinate location of a forestry stand would be spatial data, while the characteristics of
that stand, e.g. cover group, dominant species, crown closure and height, would be attribute data.
A GIS integrates the two data types and allows users to derive new data for planning.
Other data types, in particular image and multimedia data, are becoming more prevalent with
changing technology. Depending on the specific content of the data, image data may be considered
either spatial (e.g. photographs, animation, movies) or attribute (e.g. sound, descriptions,
narrations).
Spatial data models are important because the way in which information is represented affects the
type of analysis that can be performed and the types of graphic display that can be obtained. The
two data models common in GIS are the vector data model and the raster data model.

2.1 Spatial Information:

Spatial characteristics of information can be broadly distinguished between:


a) Those that describe where things are, using locations consisting of reference positions,
spatial units and spatial relationships.
b) Those that describe the form of phenomena using qualitative and quantitative description
of shape and structure.
c) Those that describe associations and interaction between different phenomena.

Basic Concepts:
1) All geographic data can be represented by three basic entities:
i) Point
ii) Line
iii) Area (polygon), plus a label saying what it is
E.g.
 An oil well could be represented by a single point consisting of X, Y coordinates.
 Road – represented by a series of X, Y coordinates
 Forest – represented by a set of X, Y coordinates plus the label forest. The label could
be actual name or a special symbol.
2) Layers and coverages of spatial data
A GIS organizes spatial data into layers, also called coverages.
A typical layer holds information belonging to one particular class, e.g. roads, rivers and
vegetation types are different layers.
All the layers or coverages pertaining to an area are referenced to a common projection system.
The layers can be combined with each other in various ways to create new layers that are
functions of the individual layers.

Figure 3 Layers and coverages of spatial data (layers shown: Land use, Settlement, Drainage, Road)


3) Data Model:
In order to represent spatial information and its attributes, a data model is adopted: a set of
logical definitions or rules for characterizing the geographical data. The data model represents
the linkages between the real world domain of geographical data and the computer and GIS
representation of these features. As a result, the data model not only helps in organizing the
real world geographical features into a systematic storage/retrieval mechanism, but also helps in
capturing the user's perception of these features.

2.2 Conceptual Models of Spatial Information:

There are different models – which have influenced the way in which data are organized and
processed within GIS. They are based on Objects, Network and Fields.

Object Based Model:


Object based spatial models emphasize individual phenomena that are to be studied in isolation
or in terms of their relationships with other phenomena. Any phenomenon, however large or small,
may be designated as an object, provided that it can be separated conceptually from neighboring
phenomena. Objects may be composed of other objects, and they may have specific relationships
with other separate objects. An object-based view is appropriate to phenomena that have a
well-defined boundary.

Figure 4 the object based conceptual view

Network Model:
Network based spatial models share some aspects of the object based model in that they often
deal with discrete phenomena. Their essential characteristic, however, is the need to consider
interaction between multiple objects, often along the discrete paths or routes that connect them.
The exact shape of the phenomena is not of much importance. What is important is some measure of
distance and impedance (resistance to interaction) between specified phenomena.

Figure 5 Network based model

 E.g. studies of traffic on roads, analysis of the flow of water, flow of electricity etc.

Field Model:
Field Based model is appropriate for modeling phenomena that are regarded as continuously
variable across some region of space. E.g. concentration of pollutants in the air, temperature of
ground surface, moisture level of soil, elevation of ground etc. Field model may represent either
2 or 3 dimensions depending upon the applications.

2.3 Vector Based Model:

Vector is a data structure used to store spatial data. Vector data is composed of lines or arcs,
defined by beginning and end points, which meet at nodes. The locations of these nodes and the
topological structure are usually stored explicitly. Features are defined by their boundaries
only, and curved lines are represented as a series of connecting arcs. Storing topology explicitly
raises overheads; on the other hand, a vector structure stores only those points which define a
feature, and all space outside these features is 'non-existent'.

A vector based GIS is defined by the vectorial representation of its geographic data. In this
data model, geographic objects are represented explicitly and the thematic aspects are associated
with their spatial characteristics.

There are different ways of organizing this double data base (spatial and thematic). Usually,
vectorial systems are composed of two components: one that manages the spatial data and one that
manages the thematic data. This is called the hybrid organisation system, as it links a
relational data base for the attributes with a topological one for the spatial data. A key element
in this kind of system is the identifier of every object. This identifier is unique to each
object and allows the system to connect both data bases.
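
The role of the shared identifier can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two dictionaries stand in for the spatial and thematic components, and the identifiers, coordinates and attribute values are invented.

```python
# Hypothetical data: identifiers, coordinates and attribute values are invented.

# Spatial component: geometry keyed by a unique feature identifier.
geometries = {
    101: {"type": "point", "coords": [(3.0, 4.0)]},
    102: {"type": "line", "coords": [(0.0, 0.0), (2.0, 1.0), (5.0, 1.5)]},
}

# Thematic component: one attribute record per identifier.
attributes = {
    101: {"name": "Well A", "kind": "oil well"},
    102: {"name": "Main Road", "kind": "road"},
}


def describe(feature_id):
    """Join the two components through the shared identifier."""
    geometry = geometries[feature_id]
    record = attributes[feature_id]
    return {"id": feature_id, **geometry, **record}


feature = describe(102)
print(feature["name"], feature["type"], len(feature["coords"]))
```

Because the identifier is the only link, either component can be stored and managed by a different system as long as both agree on the identifier values.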

The vector representation of an object is an attempt to represent the object as exactly as
possible. The geographical phenomena are represented by three basic entities along with their
attributes.

Point – City – population, no. of school, no. of houses etc.

Line – Road – Type of road, road name etc.

Area – Land use – class, soil type etc.

 The coordinate space is assumed to be continuous, allowing all positions, lengths and
dimensions to be defined precisely.
 The vector data structure represents each geographical feature by a set of coordinates.
 The basic idea is to define a 2D space in which coordinates on the two axes represent features.
Point Features:
 A zero-dimensional abstraction of an object, represented by a single X, Y coordinate. A point
normally represents a geographic feature too small to be displayed as a line or area; for
example, the location of a building on a small-scale map, or the location of a service cover on
a medium-scale map.
 Besides the X, Y coordinate, other data must be stored to indicate what kind of point it is
and what other information is associated with it. Figure 6 shows typical point data stored in a
GIS.

Figure 6 Point Features

Line Features:
 A set of ordered coordinates that represents the shape of geographic features too narrow to
be displayed as an area at the given scale (contours, street centrelines or streams), or linear
features with no area (county boundary lines). A line is synonymous with an arc.
 The simplest line requires the storage of a begin point and an end point (two X, Y coordinates
plus a possible record). An arc, chain or string is a set of n X, Y coordinate pairs describing a
continuous complex line.
 The shorter the line segments and the larger the number of X, Y coordinate pairs, the more
closely the chain approximates a complex curve (Figure 7).

Figure 7 Line Features
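
The effect of adding coordinate pairs can be checked numerically. The sketch below is an illustration only: it approximates a quarter of a unit circle (a curve chosen here for convenience, not taken from the text) with 1, 4 and 16 straight segments and compares the chain length with the true arc length.

```python
import math


def polyline_length(points):
    """Sum of straight segment lengths along an ordered list of (x, y) pairs."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def sample_quarter_circle(n_segments, radius=1.0):
    """Return n_segments + 1 vertices spaced evenly along a quarter circle."""
    step = (math.pi / 2) / n_segments
    return [
        (radius * math.cos(i * step), radius * math.sin(i * step))
        for i in range(n_segments + 1)
    ]


true_length = math.pi / 2  # arc length of a unit quarter circle
for n in (1, 4, 16):
    approx = polyline_length(sample_quarter_circle(n))
    # Number of segments, chain length, and shortfall against the true curve.
    print(n, round(approx, 4), round(true_length - approx, 4))
```

The shortfall shrinks as the number of segments grows, at the cost of storing more coordinate pairs.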

Area Features (Polygon Features):


 A feature used to represent areas. A polygon is defined by the lines that make up its
boundary and a point inside its boundary for identification. Polygons have attributes that
describe the geographic feature they represent.
 The boundary of an area feature separates the interior area from the exterior area.
 An area feature may be isolated or connected to other area features (Figure 8).

Figure 8 Area Features (Polygon Features)

2.4 Raster Based Model:

Raster is a method for the storage, processing and display of spatial data. Each area is divided
into rows and columns, which form a regular grid structure. Each cell must be rectangular in
shape, but not necessarily square. Each cell within this matrix has a location and an attribute
value. The location of each cell is not stored separately; it is implicitly contained within the
ordering of the matrix, unlike a vector structure, which stores topology explicitly. Areas
containing the same attribute value are recognized as such; however, raster structures cannot
identify the boundaries of such areas as polygons.
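
A minimal sketch of how cell location can be implicit in the ordering of the matrix is given below. The grid origin, cell size and land-cover codes are hypothetical, and row 0 is assumed to be the top row of the grid.

```python
def cell_center(row, col, x_min, y_max, cell_width, cell_height):
    """Map a (row, col) index to the coordinates of the cell center.

    Row 0 is assumed to be the top (northernmost) row of the grid.
    """
    x = x_min + (col + 0.5) * cell_width
    y = y_max - (row + 0.5) * cell_height
    return x, y


def cell_index(x, y, x_min, y_max, cell_width, cell_height):
    """Map a coordinate back to the (row, col) of the cell that contains it."""
    col = int((x - x_min) // cell_width)
    row = int((y_max - y) // cell_height)
    return row, col


# Hypothetical 3 x 4 grid of land-cover codes; only the values are stored.
grid = [
    [1, 1, 2, 2],
    [1, 3, 3, 2],
    [4, 4, 3, 2],
]
origin = dict(x_min=500.0, y_max=930.0, cell_width=10.0, cell_height=10.0)

print(cell_center(1, 2, **origin))
row, col = cell_index(525.0, 915.0, **origin)
print(row, col, grid[row][col])
```

Only the grid origin and the cell dimensions are needed to recover any cell's position, which is why a raster file typically stores them once in a header rather than per cell.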

Raster data is an abstraction of the real world where spatial data is expressed as a matrix of cells
or pixels (see figure 9), with spatial position implicit in the ordering of the pixels. With the raster
data model, spatial data is not continuous but divided into discrete units. This makes raster data
particularly suitable for certain types of spatial operation, for example overlays or area
calculations.

Raster structures may lead to increased storage in certain situations, since they store each cell in
the matrix regardless of whether it is a feature or simply 'empty' space.

Grid size and resolution

A pixel is a contraction of the words picture element. The term is commonly used in remote
sensing to describe each unit in an image. In raster GIS the pixel equivalent is usually referred
to as a cell element or grid cell. Pixel/cell refers to the smallest unit of information available
in an image or raster map. On a display device, it is the smallest element that can be
independently assigned attributes such as color.

Raster based spatial models regard space as a tessellation (resembling a mosaic) of cells, each of
which is associated with a record of the classification or identity of the phenomenon that
occupies it. The raster model represents the 2D location of phenomena as a matrix of grid cells.
Since the cells are of fixed size and location, rasters tend to represent natural and human-made
objects in a blocky fashion. The information content of one cell depends upon the size of the
cell: the smaller the cells, the finer the detail with which features can be recorded. This is
called the resolution of the image.
The raster or grid cell model is a relatively simple approach to data representation, both
conceptually and operationally, and hence has been popular since the earliest days of GIS
development.
The simplest raster data structure consists of an array of grid cells. Each grid cell is
referenced by a row and column number and contains a number representing the type or value of the
attribute being mapped. Fig: 9a, 9b and 9c explain the raster model, the raster representation of
location and raster resolution respectively.
In a raster structure a single cell represents a point, a number of neighboring cells strung out
in a given direction represents a line, and an agglomeration (mass) of neighboring cells
represents an area. Fig: 9d shows the raster representation of the discrete features point, line
and area.

Figure 9 Raster Based Model (panels 9a to 9d)

 Since each cell is associated with a value, called the cell value or pixel value, it is very
easy to carry out overlay operations to compare attributes recorded in different layers.
 Each attribute associated with a grid cell can be combined logically or arithmetically with
the attributes in the corresponding cells of other layers to create a new attribute value for the
resulting overlay.
 Transitional areas are poorly represented by the raster-based model.
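
The cell-by-cell combination described above can be sketched with plain Python lists. The layer names and values are hypothetical, and both layers are assumed to share the same grid size and alignment.

```python
def overlay(layer_a, layer_b, combine):
    """Combine two equally sized grids cell by cell with the given function."""
    return [
        [combine(a, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(layer_a, layer_b)
    ]


# Hypothetical layers: 1 = suitable, 0 = unsuitable.
slope_ok = [[1, 1, 0], [0, 1, 1]]
soil_ok = [[1, 0, 0], [1, 1, 0]]
# Hypothetical rainfall and evaporation values in millimetres.
rainfall = [[120, 90, 60], [150, 110, 80]]
evaporation = [[40, 50, 70], [30, 60, 90]]

both_ok = overlay(slope_ok, soil_ok, lambda a, b: a and b)  # logical AND
surplus = overlay(rainfall, evaporation, lambda a, b: a - b)  # arithmetic

print(both_ok)
print(surplus)
```

No geometry has to be intersected: corresponding cells already cover the same ground, so the overlay reduces to one operation per cell.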

The Choice between Raster and Vector Models

The choice between the raster and vector based models depends upon the type of data analysis and
other operations to be carried out for a project. However, there is always scope to convert one
form to the other, i.e. raster to vector or vector to raster.

 The raster method of structuring spatial data requires more memory space than vector data.
 Certain kinds of data manipulation, such as polygon intersection, union, clipping and merging,
are complex in the raster data model as compared to vector.
 However, multi-theme overlay operations are easier in the raster data model.
 Similarly, representation of surfaces is more common in the raster-based model.

Vector Data Model:


Advantages:
 Good and realistic representation of geographic data
 Compact data structure, requiring less storage space
 Topology can be completely described
 Accurate graphic output
Disadvantages:
 Data structure is complex
 Combining several vector polygon layers creates difficulties in handling
 Simulation is difficult because each unit has a different topological form
 Display and plotting are expensive.

Raster Data Model:

Advantages:
 Simple data structure
 The overlay of mapped data with remote sensing data is easy
 Simulation is easy because each spatial unit has the same size and shape
 Good for multilayer overlay.
Disadvantages:
 Data is voluminous and requires large storage space
 Using large cells to reduce data volume loses significant information
 Crude raster maps look blocky and unattractive
 Network linkages are difficult to establish.

2.5 Database and GIS

What is a database?
A database is a large, computerized collection of structured data, held in a storage area capable
of storing large amounts of data.
A database management system (DBMS) is a software package that allows the user to set up, use
and maintain a database. MS Access, for example, is a DBMS for smaller (private) databases.
Just as GIS software (e.g. ArcGIS) allows the user to set up a GIS application, a DBMS offers
generic functionality for database organization and data handling.
Some of the functions of a DBMS are:
▶ It allows concurrent use
▶ It supports storage optimization
▶ It supports data integrity
▶ It supports the use of a data model
▶ It has a query language facility
▶ It includes data backup and recovery
▶ It controls data redundancy
A database can store almost any sort of data. The database design determines which tables will be
present and what sort of columns (attributes) each table will have.

The elements of a vector based GIS are therefore the DBMS (Data Base Management System) for the
attributes and the system that manages the topological data. In some GIS packages, the DBMS is
based on existing software, e.g. dBASE.

Entity-Relation Model
Three elements are considered in this approach: (a) Entities, the relevant objects for the data
base. In a GIS, an entity is any fact that can be located spatially. (b) Attributes, or
characteristics attached to the entities. Each attribute has a limited domain of possible values,
e.g. the quality of a road can be bad, average, good or very good. (c) Relations, or mechanisms
that allow entities to be related. Some examples are: 'located in', 'contained in', 'crossed
with', etc.
DBMS

The data bases used in GIS are most commonly relational. Nevertheless, object oriented data bases
are progressively being incorporated.
Relational data bases
In a relational data base, data is stored in tables where rows represent the objects or entities
and columns the attributes or variables. A data base is usually composed of several tables, and
relations between them are possible through a common identifier that is unique for each entity.
Most relational data bases in GIS have two variables with identifiers: one is unique and
sequential, and can be numeric or alphabetic; the second may be repeated and helps to organize
the attribute table.

The advantages of using this kind of data base are:

o The design is based on a methodology with a strong theoretical basis, which gives confidence
in its capacity to evolve.
o It is very easy to implement, especially in comparison with other models such as
hierarchical, network and object oriented.
o It is very flexible. New tables can be appended easily.
o Finally, many powerful DBMS using this approach include query languages (like SQL), which
makes it easy to include this tool in a GIS. Thus, some commercial GIS packages include a
pre-existing DBMS.
Object Oriented Data Bases
An object oriented data base is based on objects. An object can be defined as an entity with a
location represented by values and by a group of operations. Thus, the advantage in comparison
with relational data bases is that the definition of an object includes not only its attributes
but also the methods or operations that act on it. In addition, objects belong to classes that
can have their own variables, and these classes can belong to super-classes.
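
The idea of storing operations together with attributes, and of classes belonging to super-classes, can be illustrated with ordinary Python classes. This is a sketch of the concept only, not of how any particular object oriented data base is implemented; the class names, attributes and values are hypothetical.

```python
import math


class SpatialObject:
    """Superclass: every object has an identifier and a location."""

    def __init__(self, object_id, x, y):
        self.object_id = object_id
        self.x = x
        self.y = y

    def distance_to(self, other):
        """Operation defined with the object rather than in a separate program."""
        return math.hypot(self.x - other.x, self.y - other.y)


class Well(SpatialObject):
    """Class with its own variable (depth) and its own method."""

    def __init__(self, object_id, x, y, depth_m):
        super().__init__(object_id, x, y)
        self.depth_m = depth_m

    def is_deep(self, threshold_m=100.0):
        return self.depth_m > threshold_m


# Hypothetical objects for illustration.
a = Well("W1", 0.0, 0.0, depth_m=150.0)
b = Well("W2", 3.0, 4.0, depth_m=60.0)
print(a.distance_to(b), a.is_deep(), b.is_deep())
```

A `Well` inherits `distance_to` from its super-class, so the operation is defined once and shared by every class of spatial object.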
Types of Database
1. An attribute database is used to store non-spatial data, e.g. a relational database.
▶ A relational database is the structure used to store the base data.
2. A spatial (geographical) database stores representations of geographic phenomena in the
real world to be used in a GIS environment.
An object based database (geodatabase) is an example of a spatial database.

2.6 Conceptual Model for Non-Spatial Information:

 Non-spatial information, also known as attribute data, is descriptive data that describes
spatial data.
 These data are gathered and assembled into records and files.
 A database is a collection of data that can be shared by different users. It is a group of
records and files that are organized so that there is little or no redundancy.
 A database consists of data in many files. In order to access data from one or more files
easily, it is necessary to have some kind of structure or organization.
 A Data Base Management System (DBMS) is a tool for representing, in a computer, a real world
oriented model of a set of data in a predefined structure and organized manner.
 This high level representation or abstraction is referred to as the conceptual model, which
ensures data linking, data security, sub-setting, querying using logical/arithmetic syntax etc.
 Most commercial DBMS software, like Oracle, dBASE and MS Access, is implemented using one of
three types of data model, namely the hierarchical data structure, the network structure and the
relational structure.
1. Hierarchical Data Model:
 It is a tree-based structure. The tree is composed of nodes; the upper most node is
called a root.
 With the exception of this root, every node is related to a node at higher level called its
parent. The lower level is called child.

Example:

Figure 10 Hierarchical Data Model (a generic tree of Root, Node/Parent and Node/Child levels, and
an example tree with Department at the root and the nodes Job Description, Employee, Education
Required, Background Required, Education and Job History below it)

 This approach is efficient if all desired access paths follow the parent-child linkage.
 However, it requires a relatively inflexible structure, and hence linkage with other
branches of the database is cumbersome. That is why this data base structure is not very common
in GIS, where flexibility is needed.
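
A minimal sketch of a hierarchical structure follows. Each record holds a single link to its parent, so every access path runs along the parent-child linkage. The node names are borrowed from the example figure, but their arrangement here is an assumption.

```python
class Node:
    """A record in a hierarchical structure: one parent, any number of children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path_from_root(self):
        """Follow parent links upward, then report the path root first."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))


# Hypothetical tree; the arrangement of the nodes is an assumption.
root = Node("Department")
employee = Node("Employee", parent=root)
job_history = Node("Job History", parent=employee)

print(job_history.path_from_root())
```

Reaching `Job History` from any other branch means climbing to the common ancestor first, which is the inflexibility noted above.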
2. Network Structure:
 Network structure exists when child in a data relationship has more than one parent.
 An item in such a structure can be linked to any other item.
 It is good for network-based analysis.

Example:

Figure 11 Network Structure (Author 1 and Author 2 linked to Book 1, Book 2 and Book 3)


3. Relational Structure:
 In this case data are organized in 2D tables consisting of rows and columns. The rows are
called records and the columns are called items or fields.
 Such tables are easy to develop and understand.
 Different sets of tables are created within the database and relationships are established
between the tables.
 Because of this, it is easy to create a subset of the data for one user, or to join two tables
to form a larger table for another user.
 The structure can be described mathematically; hence mathematics provides the basis for
extracting some columns from a table and for joining various columns.
 This capability to manipulate relations provides flexibility that is normally not available in
hierarchical and network structures.

Relational Operators:
 Retrieval of data sets from a relational model involves the creation of a new relation, which
is derived from the permanently stored relations.
 There are several relational algebra operators that can be used to search and manipulate
relations.
 These operators are implemented by means of Structured Query Language (SQL) using a number of
commands.
E.g. SELECT "Settlement Name", "wareda name"
FROM Settlement
will create a new table from the Settlement table that consists of only the settlement name and
wareda name.
SELECT *
FROM Settlement
WHERE "wareda name" = 'zero arath'
will create a new table that consists of only those rows of the Settlement table whose wareda
name is zero arath.
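
The two kinds of query can be tried with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the table is created in memory, the rows and the population column are invented, and only the table name and the two column names follow the text.

```python
import sqlite3

# Hypothetical table and rows; only the table and column names follow the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE Settlement ('
    '"settlement name" TEXT, "wareda name" TEXT, population INTEGER)'
)
conn.executemany(
    "INSERT INTO Settlement VALUES (?, ?, ?)",
    [
        ("Settlement A", "zero arath", 1200),
        ("Settlement B", "another wareda", 800),
        ("Settlement C", "zero arath", 450),
    ],
)

# Projection: keep only two columns of the stored relation.
names = conn.execute(
    'SELECT "settlement name", "wareda name" FROM Settlement'
).fetchall()

# Selection: keep only the rows that satisfy a condition.
in_wareda = conn.execute(
    'SELECT * FROM Settlement WHERE "wareda name" = ?', ("zero arath",)
).fetchall()

print(len(names), len(names[0]))
print(sorted(row[0] for row in in_wareda))
conn.close()
```

In both cases the result is itself a table, which is what allows queries to be chained or saved as new data sets.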

Important Features of Relational database:
i) Primary and Foreign Keys
ii) Relational Joins
i) Primary and Foreign Keys:
 The relational approach is used to design database tables.
 Since each table or relation represents a set, it cannot have any rows whose entire contents
are duplicated.
 Secondly, as each row must be different from every other, a value in a single column or a
combination of values in multiple columns can be used to define a primary key for the table,
which allows each row to be uniquely identified.
 This uniqueness allows the primary key to serve as the sole row-level addressing mechanism
in the relational database model.
 A field that stores the key of another table is called a foreign key.

Name    Id     Marks
Id      Sex    Year
(the Id column is marked as the primary key in one table and as the foreign key in the other)

Table 1 Primary and foreign key

ii) Relational join:


 The mechanism for linking data in different tables is called a relational join.
 Values in a column or columns of one table are matched to corresponding values in a column or
columns of a second table.
 Matching is frequently based on the primary key in one table and the foreign key in the second
table.
e.g.
Name      Designation   Employee Id
MARTIN    PROFESSOR     1107
GEORGE    READER        1206
Table 2 Relational join
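
A join on a primary key and a foreign key can be sketched with sqlite3. The employee rows come from Table 2; the second table (salary) and its values are hypothetical and are added only so that there is something to join to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,   -- primary key
        name TEXT,
        designation TEXT
    );
    -- Hypothetical second table; employee_id is its foreign key.
    CREATE TABLE salary (
        employee_id INTEGER REFERENCES employee(employee_id),
        amount INTEGER
    );
    INSERT INTO employee VALUES (1107, 'MARTIN', 'PROFESSOR');
    INSERT INTO employee VALUES (1206, 'GEORGE', 'READER');
    INSERT INTO salary VALUES (1107, 5000);
    INSERT INTO salary VALUES (1206, 4000);
    """
)

# Match the foreign key in salary to the primary key in employee.
rows = conn.execute(
    """
    SELECT e.name, e.designation, s.amount
    FROM employee AS e
    JOIN salary AS s ON s.employee_id = e.employee_id
    ORDER BY e.employee_id
    """
).fetchall()
print(rows)
conn.close()
```

The joined result combines columns from both tables without either table having to store the other's data.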

2.7 Vector based spatial data analysis

GIS technology has developed so fast over the past decade that it is now accepted as an essential
tool for the effective use of geographic information. There are many problems such as soil
erosion, deterioration of the environment, deforestation, population growth, drought conditions
and shortage of drinking water. These are complex issues and require integrated responses. One
difficulty in organizing such integration, e.g. among soil, water and vegetation, has been the
lack of means to link the data in comparable and manageable sets. To overcome these difficulties,
GIS offers entry of many types of data in a single spatial framework and is capable of
collection, compilation, storage, retrieval, analysis, manipulation, display and integration of
environmental, economic and social data in a single system.
It facilitates the following:
 Overlay of data for the purpose of comparison.
 Updating of information to illustrate changes over time.
 Changes of scale for micro-analysis.
 Derivation of non-available data through manipulation of known factors.
 Integration of physical and social science data sets.
 Incorporation of remotely sensed data, such as satellite imagery, for continuous environmental
monitoring.
 Modeling of physical, economic and social processes for the purpose of simulation and prediction.

Both remote sensing and GIS are involved in the analysis of phenomena that have geographic or
spatial significance: GIS technology is ideally suited for the analysis of spatial phenomena, and
remote sensing is the most common source of spatially continuous data.
The information in a map database is normally geographically referenced using a map projection.
The map database (or graphic database) in the GIS contains the geometric description of the map
features. The attribute database contains all the descriptive information related to the map
features. GIS links these two data bases and permits a wide range of integrated queries, searches
and manipulations.

2.8 Introduction to Spatial Data Analysis:

 Geographic analysis allows us to study and understand real world processes by developing
and applying manipulation/analysis criteria and models, and to carry out integrated modeling.
 These criteria illuminate (highlight) underlying trends in geographic data, making new
information available.
 A GIS enhances this process by providing tools which can be combined in a meaningful
sequence to reveal new or previously unidentified relationships within or between data sets,
thus improving the understanding of real world phenomena.
 Spatial analysis is the vital part of GIS and can be done in two ways:
a) Vector based analysis
b) Raster based analysis
In addition to the basic functions related to automated cartography and database management, the
most important use of GIS is its spatial analysis capability. Making maps alone does not justify
the high cost of building a GIS; the same maps may be produced using a simpler cartographic
package. Likewise, if the purpose is to generate tabular output, then simpler database management
software or a statistical package may be a more efficient system. It is spatial analysis that
requires the logical connection between attribute data and map features. This capability makes
GIS a much more powerful and cost-effective tool than automated cartographic packages or database
management systems. Indeed, functions required for performing spatial analysis that are not
available in either cartographic packages or database management systems are commonly implemented
in GIS.

2.9 GIS usage in Spatial Analysis:

i) Query (spatial and aspatial) and generation of new items from the original set
ii) Single layer operation
iii) Multi layer operation
iv) Geometric modelling
v) Network analysis
vi) Raster/Grid analysis

1. Query: Querying in a GIS involves three types of operations:

a) Attribute query, also known as aspatial query
b) Spatial query
c) Generation of new data sets from the original database

a) Attribute Query: It retrieves a data subset from a map by working with its attribute data. Here
the selection is done by asking logical questions,
e.g. ArcView -> Query builder -> Region name = Bale.

b) Spatial Query: refers to the process of retrieving data from a map by working with map
features. Here the selection of features is based on location or spatial relationship, which
requires processing of spatial information,
e.g. - Select the districts through which NH-5 passes.
- Villages falling within five km of the canal.
- Villages where there are no sampling points.
c) Generation of new datasets from the original database: The results of an attribute data query
and a spatial data query can be visually inspected or can be saved as new maps for further
processing,
e.g. select region = Bale
Convert to shape file -> Bale

2. Single Layer Operation:

This includes the following analysis:

 Creation of a buffer zone around selected features.
 The selected map features may be points, lines or polygons.
 A buffer zone is often treated as a protection zone and is used for planning and regulation.
 This analysis becomes important when calculating the impact area around a well location, or
the area along a road affected by construction activities.

Figure 12 Single Layer Operation

3. Multi Layer Operation:

 Here two or more themes are combined to generate a new theme.
 The newly generated theme has the characteristics of both parent layers.

Figure 13 Multi Layer Operation (Layer 1 combined with Layer 2)

4. Geometric modelling:

a) Distance measurement: measuring the straight-line distance between geographic features
b) Calculating area, length and perimeter
c) Geometric buffer: finding the features present within a specified distance of another feature
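
The measurements listed above can be computed directly from vector coordinates. The sketch below assumes planar coordinates and a simple (non-self-intersecting) polygon; the parcel is a hypothetical 4 by 3 rectangle, and the area uses the standard shoelace formula.

```python
import math


def distance(a, b):
    """Straight-line distance between two (x, y) points."""
    return math.dist(a, b)


def perimeter(ring):
    """Length of the closed boundary through the ring's vertices."""
    return sum(distance(a, b) for a, b in zip(ring, ring[1:] + ring[:1]))


def area(ring):
    """Polygon area by the shoelace formula (ring must not self-intersect)."""
    twice_area = sum(
        ax * by - bx * ay for (ax, ay), (bx, by) in zip(ring, ring[1:] + ring[:1])
    )
    return abs(twice_area) / 2.0


# Hypothetical rectangular parcel, 4 units by 3 units.
parcel = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
print(distance(parcel[0], parcel[2]), perimeter(parcel), area(parcel))
```

For geographic (latitude/longitude) coordinates these planar formulas would not apply directly; the data would first need to be in a projected coordinate system.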
5. Network analysis:
Designed specifically for line features having correct topology. Used to solve transportation
problems and for locational analysis, such as:

• FIND BEST ROUTE
Shortest route, traffic-free route, least time and fuel etc.
• FIND CLOSEST FACILITY
School, college, hospital, market place etc.
• FIND SERVICE AREA
Petrol pump, telephone/gas service station etc.
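
A best-route search can be sketched with Dijkstra's algorithm, a standard shortest-path method named here as an illustration; the text does not say which algorithm a given GIS package uses. The road network and its costs are hypothetical, and the cost could equally stand for length, travel time or fuel.

```python
import heapq


def shortest_route(graph, start, goal):
    """Dijkstra's algorithm on a dict of {node: {neighbor: cost}}."""
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph[node].items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []


# Hypothetical road network; costs could be length, travel time or fuel.
roads = {
    "A": {"B": 4.0, "C": 2.0},
    "B": {"A": 4.0, "C": 1.0, "D": 5.0},
    "C": {"A": 2.0, "B": 1.0, "D": 8.0},
    "D": {"B": 5.0, "C": 8.0},
}
print(shortest_route(roads, "A", "D"))
```

The search relies on knowing which lines meet at which nodes, which is why network analysis requires line features with correct topology.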

6. Raster grid analysis:

 Used for surface generation and analysis
 Overlay of many themes can be carried out using arithmetic or weighted overlay techniques
 Modelling is easy

Some analyses are:

• Polygon to Grid
• Point to Grid Surface
• Point Grid to Buffer
• Elevation grid to slope

Figure 14 Raster grid analysis

2.10 Vector Based Spatial Data Analysis:

Vector overlay is based on topological overlay. Map overlay combines the geometry and attributes
of two feature maps to create the output. One of the two maps is called the input map and the
other the overlay map. The first consideration in map overlay is feature type. The input map may
contain points, lines or polygons. The output has the same feature type as the input.

Polygon in polygon overlay:

a. Output is a polygon coverage.
b. Two coverages can be overlaid at a time.
c. Each polygon contains the attributes of both maps.
d. There is no limit on the number of layers that can be combined, since the output of one
overlay can be overlaid with a further coverage.
e. A new feature attribute table (FAT) is created, holding information about each newly created
feature.

Line in polygon overlay

a. Output is a line coverage with additional attributes.
b. No polygon boundaries are copied.
c. New arc-node topology is created.

Point in polygon overlay

a. Output is a point coverage with additional attributes.
b. No new point features are created.
c. No polygon boundaries are copied.
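
A point in polygon overlay can be sketched with the ray-casting test, a common technique chosen here for illustration. The land use polygons and well points are hypothetical. Each output point keeps its own attributes and gains the class of the polygon it falls in, and no new points are created.

```python
def point_in_polygon(point, ring):
    """Ray-casting test: count boundary crossings of a ray going to the right."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        if (y1 > y) != (y2 > y):  # edge spans the ray's y value
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical overlay map (polygons) and input map (points).
land_use = [
    {"class": "forest", "ring": [(0, 0), (5, 0), (5, 5), (0, 5)]},
    {"class": "farmland", "ring": [(5, 0), (10, 0), (10, 5), (5, 5)]},
]
wells = [{"id": 1, "xy": (2, 3)}, {"id": 2, "xy": (7, 1)}, {"id": 3, "xy": (12, 2)}]

# Each output point keeps its own attributes and gains the polygon's class.
output = []
for well in wells:
    match = next(
        (p["class"] for p in land_use if point_in_polygon(well["xy"], p["ring"])),
        None,
    )
    output.append({**well, "class": match})

print(output)
```

Points lying exactly on a polygon boundary are a special case that this simple test does not treat consistently; production software applies explicit rules for them.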

Few examples of vector based spatial analysis are described below:

Erase coverage – Erases the features of a coverage that overlap another coverage. This operation
combines the features of an input theme with the polygons of the overlay theme to produce an
output theme that contains only those features lying outside the overlay theme.

Figure 15 Erase coverage


Merge features
– Used to join two adjacent map features

Figure 16 Merge features


Dissolve – Dissolve polygons having same attribute value

Figure 16 Dissolve polygons having same attribute value

Union – Computes the geometric intersection of two coverages. All features and attributes of both
coverages are preserved.

Figure 17 Union

Intersection: Computes the geometric intersection of two coverages. Only features common in both
are preserved.

Figure 18 Intersection
Clip: Extracts the features of a coverage that overlap another coverage.
CLIP [in cover] [clip cover] [out cover]

Figure 19 Clip

Logical Operators:

Normally vector overlay operations are carried out through logical operators e.g.
AND: common area/Intersection/Clipping operation
OR: Union or Addition
NOT: Reverse
XOR: Minus
Conditional operators are also used for editing the feature attribute table:
EQ: Equal to
NE (#, <>): Not equal to
GE (>=): Greater than or equal to
LE (<=): Less than or equal to
GT (>): Greater than
LT (<): Less than
CN: Containing
NC: Not containing
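
The operators above can be tried out with a small sketch. It rests on two assumptions made only for illustration: map areas are represented as sets of grid cells, so that Python's set operators stand in for the logical operators, and XOR is read as "in one map or the other, but not both". The `select` helper, the `CONDITIONS` table and the sample road records are hypothetical.

```python
import operator

# Cells covered by the input map (a) and by the overlay map (b).
a = {(0, 0), (0, 1), (1, 0), (1, 1)}
b = {(1, 0), (1, 1), (2, 0), (2, 1)}

common = a & b            # AND: area found in both maps
combined = a | b          # OR: area found in either map
only_a = a - b            # a AND NOT b: area of a outside b
either_not_both = a ^ b   # XOR: area in exactly one of the two maps

# Conditional operators, keyed by the codes used in the list above.
CONDITIONS = {
    "EQ": operator.eq,
    "NE": operator.ne,
    "GE": operator.ge,
    "LE": operator.le,
    "GT": operator.gt,
    "LT": operator.lt,
    "CN": lambda value, text: text in value,
    "NC": lambda value, text: text not in value,
}


def select(records, field, code, operand):
    """Return the attribute records for which `field <code> operand` holds."""
    test = CONDITIONS[code]
    return [record for record in records if test(record[field], operand)]


# Hypothetical feature attribute table for a roads layer.
roads = [
    {"name": "Main Street", "lanes": 4},
    {"name": "Mill Road", "lanes": 2},
    {"name": "Main Avenue", "lanes": 2},
]

wide = select(roads, "lanes", "GE", 4)
main = select(roads, "name", "CN", "Main")
```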

Steps for performing geographic analysis:

1) Establish the objectives and criteria for analysis.
2) Prepare data for spatial operations.
3) Perform spatial operations.
4) Prepare data for tabular analysis.
5) Perform tabular operations.
6) Evaluate and interpret the results.
7) Refine the analysis as necessary.

2.11 Data and File Structures

Binary and ASCII Numbers


No matter which spatial data model is used, the concepts must be translated into a set of numbers
stored on a computer. All information stored on a computer in a digital format may be represented
as a series of 0's and 1's. These data are said to be stored in a binary format, because each digit may
contain one of two values, 0 or 1. Binary numbers are in base 2, so each successive column
of a number represents a power of two. We use a similar column convention in our familiar
ten-based (decimal) numbering system. As an example, consider the number 47, which we represent
using two columns. The seven in the first column indicates there are seven units of one. The four in
the tens column indicates there are four units of ten.
Each higher column represents a higher power of ten. The first column represents ones (10^0 = 1), the
next column represents tens (10^1 = 10), the next column hundreds (10^2 = 100), and upward for
successive powers of ten. We add up the values represented in the columns to decipher the number.
Binary numbers are also formed by representing values in columns. In a binary system each
column represents a successively higher power of two (Figure 20). The first (rightmost) column
represents ones (2^0 = 1), the second column (from right) represents twos (2^1 = 2), the third (from right)
represents fours (2^2 = 4), then eights (2^3 = 8), sixteens (2^4 = 16), and upward for successive powers
of two. Thus, the binary number 1001 represents the decimal number 9: a one from the rightmost
column, and an eight from the fourth column.
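
The column arithmetic described above can be checked with a short Python sketch. The function names are hypothetical and chosen only for this illustration; Python's built-in `int(text, 2)` performs the same conversion.

```python
def binary_to_decimal(bits):
    """Add up the powers of two for every column that holds a 1."""
    total = 0
    for position, digit in enumerate(reversed(bits)):
        total += int(digit) * 2 ** position
    return total


def decimal_to_binary(value):
    """Build the binary columns of a nonnegative integer, rightmost first."""
    if value == 0:
        return "0"
    digits = []
    while value > 0:
        digits.append(str(value % 2))   # remainder is the current column
        value //= 2                     # move to the next higher column
    return "".join(reversed(digits))


nine = binary_to_decimal("1001")        # 8 + 0 + 0 + 1
forty_seven = decimal_to_binary(47)     # 32 + 8 + 4 + 2 + 1
```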

Figure 20 Binary and ASCII Numbers

Each digit or column in a binary number is called a bit, and eight columns, or bits, are called a byte.
A byte is a common unit for defining data types and numbers; for example, a data file may be
referred to as containing 4-byte integer numbers. This means each number is represented by 4 bytes
of binary data (or 8 x 4 = 32 bits). Several bytes are required when representing larger numbers. For
example, one byte may be used to represent 256 different values. When a byte is used for
nonnegative integer numbers, then only values from 0 to 255 may be recorded. This will work when
no value exceeds 255, but consider an elevation data layer with values greater than 255. If the
data are not rescaled, then more than one byte of storage is required for each value. Two bytes will
store up to 65,536 different numbers. Terrestrial elevations measured in feet or meters are all below
this value, so two bytes of data are often used to store elevation data. Real numbers such as 12.19 or
865.3 typically require more bytes, and are effectively split, that is, two bytes for the whole part of
the real number, and four bytes for the fractional portion. Binary numbers are often used to
represent codes. Spatial and attribute data may then be represented as text or as standard codes. This
is particularly common when raster or vector data are converted for export or import among
different GIS software systems. For example, ArcGIS, a widely used GIS, produces several export
formats that are in text or binary formats. Idrisi, another popular GIS, supports binary and
alphanumeric raster formats. One of the most common number coding schemes uses ASCII
designators. ASCII stands for the American Standard Code for Information Interchange. ASCII is a
standardized, widespread data format that uses seven bits, or the numbers 0 through 127, to
represent text and other characters. An extended ASCII, or ANSI (American National Standards
Institute) scheme, uses these same codes, plus an extra binary bit to represent numbers between 128
and 255. These codes are then used in many programs, including GIS, particularly for data export
or exchange. ASCII codes allow us to easily and uniformly represent alphanumeric characters
such as letters, punctuation, other characters, and numbers. ASCII converts binary numbers to
alphanumeric characters through an index. Each alphanumeric character corresponds to a specific
number between 0 and 255, which allows any sequence of characters to be represented by a
number. One byte is required to represent each character in extended ASCII coding, so ASCII data
sets are typically much larger than binary data sets. Geographic data in a GIS may use a
combination of binary and ASCII data stored in files. Binary data are typically used for coordinate
information, and ASCII or other codes may be used for attribute data.
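
The size difference between binary and ASCII storage can be seen in a small sketch using Python's standard `struct` module. The elevation values and the helper name `unsigned_range` are hypothetical; the format string `"<4H"` asks for four 2-byte unsigned integers, which is one possible layout and not the format of any particular GIS file.

```python
import struct


def unsigned_range(n_bytes):
    """Smallest and largest nonnegative integers that fit in n_bytes."""
    return 0, 2 ** (8 * n_bytes) - 1


# Hypothetical elevations in meters; all fit in two bytes.
elevations = [1203, 1210, 1198, 1225]

# Binary storage: two bytes per value.
binary_form = struct.pack("<4H", *elevations)

# ASCII storage: one byte per digit, plus one byte per separating space.
ascii_form = " ".join(str(value) for value in elevations).encode("ascii")

# Each character maps to a number through the ASCII index.
codes = [ord(character) for character in "GIS"]
```

Here the four values occupy 8 bytes in binary form and 19 bytes as ASCII text, and the binary bytes unpack to exactly the original integers.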

Pointers and Indexes


Data files may be linked by file pointers, indexes, or other structures. A pointer is an address or
index that connects one file location to another. Pointers are a common way to organize information
within and across multiple files. Figure 21 depicts an example of the use of pointers to organize spatial data.
In the figure, a polygon is composed of a set of lines. Pointers are used to link the set of lines that form each
polygon. There is a pointer from each line to the next line, forming a chain that defines the polygon
boundary. Pointers help by organizing data in such a way as to improve access speed. Unorganized
data would require time-consuming searches each time a polygon boundary was to be identified.
Pointers also allow efficient use of storage space. In our example, each line segment is stored only
once. Several polygons may point to the line segment, as it is typically much more space efficient to
add pointers than to duplicate the line segment.

Shapefiles are a common vector spatial data format
that uses an index to link files. Shapefiles were originally developed by ESRI, Inc., as a way to store
point, line, and polygon features, although they have since been adopted as a common format for
data interchange and analysis. Shapefiles are supported by AutoCAD, QGIS, MapWindow, Manifold,
and most other GIS software that processes vector data. Shapefiles represent layers with a cluster
of files. Each file has the same base name but a different filename extension, indicated by a suffix,
for example, the “.shp” in the filename “boundary.shp.” A transportation data layer stored in
shapefile format might have the base name of roads, with different suffixes for different files:
roads.shp, roads.shx, roads.dbf, roads.prj, etc. The first three files above are all required to represent
a vector data layer using shapefiles. These files are connected using indices, numbers that identify
connections and groupings for various components. The .shp file contains the coordinates that
represent each road, organized by line segments. There is general information for each segment, and
then a list of coordinates and other data for the segment. This is followed by general information for
the next segment, and another list. Since road lengths vary, so will the length of each record (string
of numbers) for each road. Note that adjacent road segments are often near each other in the file, but don't have
to be. When multiple segments connect at a junction, for example, at a crossroad, not all
connections can be sequentially ordered in the list. Segments in the roads.shp file are indexed by
pointers in the roads.shx file. Part of the information stored for a segment is the identifiers of
connecting segments. The roads.shx file contains indices that point to the segment records in the
.shp file, based on these identifiers. This speeds access, because without indexing, the software
would have to search the .shp file each time it needed to find adjacent segments in a road. The
roads.dbf file also uses an index to point to the combined roads in the .shp and .shx files. A group of
segments may be used to form a line, and associated with a set of attributes stored in a .dbf file, for
example, attributes on road name, surface type, or speed limit. By appropriate use of pointers and
indices, largely hidden from the user, this group of three files implements our vector data model.
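
The role of an index over variable-length records can be illustrated with a simplified sketch. It is not the actual shapefile byte layout: the record format (a vertex count followed by coordinate pairs), the function names, and the sample roads are assumptions made for illustration. What it does show is the principle described above: a separate index of offsets and lengths lets the software jump straight to a record instead of scanning the coordinate file, and the attribute table is matched to the records by position.

```python
import struct


def build_layer(segments):
    """Pack variable-length coordinate records plus an index of (offset, length)."""
    data = bytearray()
    index = []
    for coords in segments:
        record = struct.pack("<I", len(coords))        # number of vertices
        for x, y in coords:
            record += struct.pack("<2d", x, y)         # one coordinate pair
        index.append((len(data), len(record)))         # where the record starts
        data += record
    return bytes(data), index


def read_segment(data, index, number):
    """Use the index to read one record without scanning the others."""
    offset, length = index[number]
    record = data[offset:offset + length]
    (count,) = struct.unpack_from("<I", record, 0)
    return [struct.unpack_from("<2d", record, 4 + 16 * i) for i in range(count)]


# Hypothetical road segments of differing lengths, with matching attribute rows.
roads = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(1.0, 0.0), (1.0, 2.0), (2.0, 3.0)],
    [(1.0, 0.0), (3.0, 0.0)],
]
attributes = [
    {"name": "Main Street"},
    {"name": "Mill Road"},
    {"name": "Main Street"},
]

data, index = build_layer(roads)
second = read_segment(data, index, 1)
second_name = attributes[1]["name"]
```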

Figure 21 Pointers and Indexes
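
The polygon example can also be sketched in code. The sketch assumes, for illustration only, that each polygon holds an ordered list of line identifiers rather than line-to-line pointers, and that a negative identifier means the line is traversed in reverse. The shared line (number 2) is stored once and referenced by both polygons. All names and coordinates are hypothetical.

```python
# Each line is stored exactly once, keyed by its identifier.
lines = {
    1: [(0, 0), (2, 0)],
    2: [(2, 0), (2, 2)],   # boundary shared by polygons A and B
    3: [(2, 2), (0, 2)],
    4: [(0, 2), (0, 0)],
    5: [(2, 0), (4, 0)],
    6: [(4, 0), (4, 2)],
    7: [(4, 2), (2, 2)],
}

# Polygons store references to lines, not copies of the coordinates.
polygons = {
    "A": [1, 2, 3, 4],
    "B": [5, 6, 7, -2],    # -2: follow line 2 from its end to its start
}


def boundary(line_ids, lines):
    """Follow the chain of line references to assemble a closed ring."""
    ring = []
    for line_id in line_ids:
        coords = lines[abs(line_id)]
        if line_id < 0:
            coords = coords[::-1]
        # Skip the first vertex when it repeats the previous line's end point.
        ring.extend(coords if not ring else coords[1:])
    return ring


ring_a = boundary(polygons["A"], lines)
ring_b = boundary(polygons["B"], lines)
```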

Because pointers and indices are key elements in organizing the spatial data, altering them
directly will usually cause problems. Typically these indices are created by the software during
processing, and updated as needed when data are added, modified, or analyzed. Pointers may be
visible, for example, the OID columns in the .dbf tables used with shapefiles, but manually
changing the values will often ruin the data layer. You should know the identity and use of pointers
in your data sets, so that you don't change them inadvertently. Pointers, indexing, and multifile
layers are not limited to vector data. Many raster formats store a majority of the cell data in one file,
and additional, linked information in an associated file. You must be careful when transferring a
data layer to include all the associated files. For example, copying the roads.shp and roads.dbf files
to a new location does not copy a usable data layer. The software expects a .shx file; an
incomplete file set is often useless.
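
A simple check before transferring a layer might look like the sketch below. The helper name `missing_parts` is hypothetical, and the list of required extensions follows the three files named above. The check uses lowercase extensions only, which is an assumption; some systems store them in uppercase.

```python
import tempfile
from pathlib import Path

# The three files the text describes as required for a shapefile layer.
REQUIRED = (".shp", ".shx", ".dbf")


def missing_parts(shp_path):
    """List the required companion files that are absent next to shp_path."""
    base = Path(shp_path)
    return [ext for ext in REQUIRED if not base.with_suffix(ext).exists()]


# Demonstration in a temporary folder with empty placeholder files.
with tempfile.TemporaryDirectory() as folder:
    for ext in (".shp", ".dbf"):
        (Path(folder) / f"roads{ext}").touch()
    incomplete = missing_parts(Path(folder) / "roads.shp")   # .shx not copied

    (Path(folder) / "roads.shx").touch()
    complete = missing_parts(Path(folder) / "roads.shp")     # full set present
```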

Data Compression
We often compress spatial data files because they are large. Data compression reduces file size
while maintaining the information contained in the file. Compression algorithms may be “lossless,”
in that all information is maintained during compression, or “lossy,” in that some information is
lost. A lossless compression algorithm will produce an exact copy of the original when it is applied
and then the appropriate decompression algorithm is applied. A lossy algorithm will alter the data
when it is applied and then the appropriate decompression algorithm is applied. Lossy algorithms are
most often used with image data, where substantial degradation still leaves a useful image, and
are uncommonly applied to thematic spatial data, where any data degradation is typically not
tolerated. Data compression is most often applied to discrete raster data, for example, when
representing polygon or area information in a raster GIS. There are redundant data elements in
raster representations of large homogeneous areas. Each raster cell within a homogeneous area will
have the same code as most or all of the adjacent cells. Data compression algorithms remove
much of this redundancy. Run length coding is a common data compression method. This
compression technique is based on recording sequential runs of raster cell values. Each run is
recorded as the value found in the set of adjacent cells and the run length, or number of cells with
the same value. Seven sequential cells of type A might be listed as A7 instead of AAAAAAA.
Thus, seven cells would be represented by two characters. Consider the data recorded in Figure 22, where each
line of raster cells is represented by a set of run-length codes. In general, run length coding reduces
data volume, as shown for the top three rows.
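
Run length coding is easy to sketch in Python. The function names and sample rows below are hypothetical and are not the rows of the figure. The sketch stores each run as a (value, length) pair, and also shows that a row with no repeated neighbors is not reduced by the technique.

```python
from itertools import groupby


def run_length_encode(row):
    """Record each run of identical adjacent cells as (value, run length)."""
    return [(value, len(list(run))) for value, run in groupby(row)]


def run_length_decode(runs):
    """Expand the (value, run length) pairs back into individual cells."""
    return [value for value, count in runs for _ in range(count)]


# A row with long homogeneous runs compresses well.
row = list("AAAAAAABBBAA")
runs = run_length_encode(row)
restored = run_length_decode(runs)

# A row that alternates every cell does not: six cells need six pairs.
alternating = list("ABABAB")
alternating_runs = run_length_encode(alternating)
```

Because decoding reproduces the original row exactly, run length coding is a lossless method.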

Figure 22 Data Compression

2.12 Data Sources
The focus is on reviewing different data input techniques for spatial data. This chapter also
describes data input errors, spatial and attribute, and reviews typical procedures to correct input
errors.

SOURCES OF DATA

As previously identified, two types of data are input into a GIS, spatial and attribute. The data input
process is the operation of encoding both types of data into the GIS database formats. The creation
of a clean digital database is the most important and time-consuming task upon which the usefulness
of the GIS depends. The establishment and maintenance of a robust spatial database is the
cornerstone of a successful GIS implementation. As well, the digital data are the most expensive part
of the GIS. Yet often, not enough attention is given to the quality of the data or the processes by
which they are prepared for automation. The general consensus among the GIS community is that
60 to 80% of the cost incurred during implementation of GIS technology lies in data acquisition,
data compilation and database development.

A wide variety of data sources exist for both spatial and attribute data. The most common general
sources for spatial data are: hard copy maps; aerial photographs; remotely-sensed imagery; point
data samples from surveys; and existing digital data files. Existing hard copy maps, sometimes
referred to as analogue maps, provide the most popular source for any GIS project. Potential users
should be aware that while there are many private sector firms specializing in providing digital data,
federal, provincial and state government agencies are an excellent source of data. Because of the
large costs associated with data capture and input, government departments are often the only
agencies with the financial resources and manpower funding to invest in data compilation. British
Columbia and Alberta government agencies are good examples. Both provincial governments have
defined and implemented province-wide coverage of digital base map data at varying map scales,
e.g. 1:20,000 and 1:250,000. As well, the provincial forestry agencies also provide thematic forest
inventory data in digital format. Federal agencies are also often a good source for base map
information. An inherent advantage of digital data from government agencies is its cost. It is
typically inexpensive. However, this is often offset by the data's accuracy and quality. Thematic
coverages are often not up to date. However, it is important to note that specific characteristics of
government data vary greatly across North America. Attribute data has an even wider variety of
data sources. Any textual or tabular data that can be referenced to a geographic feature, e.g. a point,
line, or area, can be input into a GIS. Attribute data is usually input by manual keying or via a bulk
loading utility of the DBMS software. ASCII format is a de facto standard for the
transfer and conversion of attribute information.
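
As a small illustration of bulk loading attribute data from an ASCII file, the sketch below reads a comma-separated table with Python's standard `csv` module and keys each row by a feature identifier so it can be referenced to a geographic feature. The table contents, the column names and the helper `load_attributes` are hypothetical; a real DBMS loading utility will have its own options and formats.

```python
import csv
import io

# Hypothetical ASCII attribute table for forest stands.
ASCII_TABLE = """stand_id,cover_group,height_m
101,pine,18.5
102,spruce,22.0
"""


def load_attributes(text, key):
    """Read a delimited ASCII table and key each row by a feature identifier."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[key]: row for row in reader}


attributes = load_attributes(ASCII_TABLE, "stand_id")
cover = attributes["101"]["cover_group"]
height = float(attributes["102"]["height_m"])   # values arrive as text
```

All values are read as text, so numeric fields must be converted before they can be used in calculations.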
