Chapter 1: Introduction to Spatial Databases

1.1 Overview
1.2 Application domains
1.3 Compare a SDBMS with a GIS
1.4 Categories of Users
1.5 An example of an SDBMS application
1.6 A Stroll through a spatial database
1.6.1 Data Models, 1.6.2 Query Language, 1.6.3 Query Processing,
1.6.4 File Organization and Indices, 1.6.5 Query Optimization,
1.6.6 Data Mining

www.spatial.cs.umn.edu/Book/slides/ch1revised.ppt
Learning Objectives
Learning Objectives (LO)
LO1 : Understand the value of SDBMS
• Application domains
• users
• How is it different from a DBMS?
LO2: Understand the concept of spatial databases
LO3: Learn about the Components of SDBMS

Mapping Sections to learning objectives


LO1 - 1.1, 1.2, 1.4
LO2 - 1.3, 1.5
LO3 - 1.6
Value of SDBMS
Traditional (non-spatial) database management systems provide:
Persistence across failures
Concurrent access to data
Scalability to search queries on very large datasets which do not fit inside
main memories of computers
Efficient for non-spatial queries, but not for spatial queries
Non-spatial queries:
List the names of all bookstores with more than ten thousand titles.
List the names of the top ten customers, in terms of sales, in the year 2001
Spatial Queries:
List the names of all bookstores within ten miles of Minneapolis
List all customers who live in Tennessee and its adjoining states
Value of SDBMS – Spatial Data Examples
Examples of non-spatial data
Names, phone numbers, email addresses of people
Examples of Spatial data
Census Data
NASA satellite imagery - terabytes of data per day
Weather and Climate Data
Rivers, Farms, ecological impact
Medical Imaging
Exercise: Identify spatial and non-spatial data items in
A phone book
A cookbook with recipes
Value of SDBMS – Users, Application Domains
Many important application domains have spatial data and
queries. Some Examples follow:
Army Field Commander: Has there been any significant
enemy troop movement since last night?
Insurance Risk Manager: Which homes are most likely to
be affected in the next great flood on the Mississippi?
Medical Doctor: Based on this patient's MRI, have we
treated somebody with a similar condition?
Molecular Biologist: Is the topology of the amino acid
biosynthesis gene in the genome found in any other
sequence feature map in the database?
Astronomer: Find all blue galaxies within 2 arcmin of
quasars.

Exercise: List two ways you have used spatial data. Which
software did you use to manipulate spatial data?
Learning Objectives
Learning Objectives (LO)
LO1 : Understand the value of SDBMS
LO2: Understand the concept of spatial databases
• What is a SDBMS?
• How is it different from a GIS?
LO3: Learn about the Components of SDBMS
Sections for LO2
Section 1.5 provides an example SDBMS
Sections 1.1 and 1.3 compare SDBMS with DBMS and GIS
What is a SDBMS?
A SDBMS is a software module that
can work with an underlying DBMS
supports spatial data models, spatial abstract data types
(ADTs) and a query language from which these ADTs are
callable
supports spatial indexing, efficient algorithms for
processing spatial operations, and domain specific rules
for query optimization
Example: Oracle Spatial data cartridge, ESRI SDE
can work with Oracle 8i DBMS
Has spatial data types (e.g. polygon), operations (e.g.
overlap) callable from SQL3 query language
Has spatial indices, e.g. R-trees
SDBMS Example
Consider a spatial dataset with:
County boundary (dashed white line)
Census block - name, area,
population, boundary (dark line)
Water bodies (dark polygons)
Satellite Imagery (gray scale pixels)

Storage in a SDBMS table:


create table census_blocks (
name string,
area float,
population number,
boundary polygon );

Fig 1.2
Modeling Spatial Data in Traditional DBMS

•A row in the table census_blocks (Figure 1.3)


• Question: Is Polyline datatype supported in DBMS?

Figure 1.3
Spatial Data Types and Traditional Databases
Traditional relational DBMS
Support simple data types, e.g. number, strings, date
Modeling Spatial data types is tedious
Example: Figure 1.4 shows modeling of polygon using numbers
Three new tables: polygon, edge, points
• Note: Polygon is a polyline where last point and first point are same
A simple unit square is represented as 16 rows across 3 tables
Simple spatial operators, e.g. area(), require joining tables
Tedious and computationally inefficient
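To make the "16 rows across 3 tables" point concrete, here is a minimal Python sketch of a unit square stored in three list-based stand-ins for the polygon, edge and point tables. The column layouts, identifiers (P1, A-D, 1-4) and the assumption that each edge stores its endpoints in traversal order are illustrative choices, not taken from Fig 1.4. Even a simple area() needs a join across all three tables.

```python
# Hypothetical stand-ins for the three tables; column layouts are assumed.
polygon_rows = [("P1", "A"), ("P1", "B"), ("P1", "C"), ("P1", "D")]  # (polygon_id, edge_name)
edge_rows = [                                                        # (edge_name, endpoint_id)
    ("A", 1), ("A", 2), ("B", 2), ("B", 3),
    ("C", 3), ("C", 4), ("D", 4), ("D", 1),
]
point_rows = [(1, 0.0, 0.0), (2, 1.0, 0.0), (3, 1.0, 1.0), (4, 0.0, 1.0)]  # (endpoint_id, x, y)


def polygon_area(polygon_id):
    """Area via the shoelace formula; requires joining all three tables."""
    coords = {pid: (x, y) for pid, x, y in point_rows}
    twice_area = 0.0
    for pid, edge_name in polygon_rows:                   # polygon JOIN edge
        if pid != polygon_id:
            continue
        # Assumes the two endpoint rows of an edge are stored in traversal order.
        start, end = [p for name, p in edge_rows if name == edge_name]
        (x1, y1), (x2, y2) = coords[start], coords[end]   # edge JOIN point
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


total_rows = len(polygon_rows) + len(edge_rows) + len(point_rows)
print(total_rows, polygon_area("P1"))  # prints: 16 1.0
```

With a polygon ADT, the same question would be a single call on one column value rather than a multi-table join.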

Question. Name post-relational database management systems which
facilitate modeling of spatial data types, e.g. polygon.
Mapping “census_table” into a Relational Database

Fig 1.4
Evolution of DBMS technology

Fig 1.5
Spatial Data Types and Post-relational Databases
Post-relational DBMS
Support user defined abstract data types
Spatial data types (e.g. polygon) can be added
Choice of post-relational DBMS
Object oriented (OO) DBMS
Object relational (OR) DBMS
A spatial database is a collection of spatial data types,
operators, indices, processing strategies, etc. and can work
with many post-relational DBMS as well as programming
languages like Java, Visual Basic etc.
How is a SDBMS different from a GIS?
A GIS is software to visualize and analyze spatial data
using spatial analysis functions such as
Search: Thematic search, search by region, (re-)classification
Location analysis: Buffer, corridor, overlay
Terrain analysis: Slope/aspect, catchment, drainage network
Flow analysis: Connectivity, shortest path
Distribution: Change detection, proximity, nearest neighbor
Spatial analysis/Statistics: Pattern, centrality, autocorrelation,
indices of similarity, topology: hole description
Measurements: Distance, perimeter, shape, adjacency, direction
GIS uses SDBMS
to store, search, query, share large spatial data sets
How is a SDBMS different from a GIS?
SDBMS focuses on
Efficient storage, querying, sharing of large spatial datasets
Provides simpler set based query operations
Example operations: search by region, overlay, nearest neighbor,
distance, adjacency, perimeter etc.
Uses spatial indices and query optimization to speed up queries over
large spatial datasets.
SDBMS may be used by applications other than GIS
Astronomy, Genomics, Multimedia information systems, ...
Would one use a GIS or a SDBMS to answer the following?
How many neighboring countries does USA have?
Which country has highest number of neighbors?
Evolution of acronym “GIS”
Geographic Information Systems (1980s)
Geographic Information Science (1990s)
Geographic Information Services (2000s)

Fig 1.1
Three meanings of the acronym GIS
Geographic Information Services
Web-sites and service centers for casual users, e.g. travelers
Example: Service (e.g. AAA, mapquest) for route planning
Geographic Information Systems
Software for professional users, e.g. cartographers
Example: ESRI Arc/View software
Geographic Information Science
Concepts, frameworks, theories to formalize use and
development of geographic information systems and services
Example: design spatial data types and operations for querying
Exercise: Which meaning of the term GIS is closest to the focus of
the book titled “Spatial Databases: A Tour”?
Learning Objectives
Learning Objectives (LO)
LO1 : Understand the value of SDBMS
LO2: Understand the concept of spatial databases
LO3: Learn about the Components of SDBMS
• Architecture choices
• SDBMS components:
– data model, query languages,
– query processing and optimization
– File organization and indices
– Data Mining

Chapter Sections
1.5 second half
1.6 – entire section
Components of a SDBMS
Recall: a SDBMS is a software module that
can work with an underlying DBMS
supports spatial data models, spatial ADTs and a query
language from which these ADTs are callable
supports spatial indexing, algorithms for processing
spatial operations, and domain specific rules for query
optimization
Components include
spatial data model, query language, query processing,
file organization and indices, query optimization, etc.
Figure 1.6 shows these components
We discuss each component briefly in section 1.6 and in
more detail in later chapters.
Three Layer Architecture Fig 1.6
1.6.1 Spatial Taxonomy, Data Models
Spatial Taxonomy:
multitude of descriptions available to organize space.
Topology models homeomorphic relationships, e.g. overlap
Euclidean space models distance and direction in a plane
Graphs model connectivity, e.g. shortest path
Spatial data models
rules to identify identifiable objects and properties of space
Object model helps manage identifiable things, e.g. mountains,
cities, land-parcels etc.
Field model helps manage continuous and amorphous
phenomena, e.g. wetlands, satellite imagery, snowfall etc.
More details in chapter 2.
1.6.2 Spatial Query Language
• Spatial query language
• Spatial data types, e.g. point, linestring, polygon, …
• Spatial operations, e.g. overlap, distance, nearest
neighbor, …
• Callable from a query language (e.g. SQL3) of
underlying DBMS
SELECT S.name
FROM Senator S
WHERE S.district.Area() > 300

• Standards
• SQL3 (a.k.a. SQL 1999) is a standard for query
languages
• OGIS is a standard for spatial data types and operators
• Both standards enjoy wide support in industry
• More details in chapters 2 and 3
Multi-scan Query Example
• Spatial join example
SELECT S.name FROM Senator S, Business B
WHERE S.district.Area() > 300 AND Within(B.location, S.district)
• Non-Spatial Join example
SELECT S.name FROM Senator S, Business B
WHERE S.soc-sec = B.soc-sec AND S.gender = ‘Female’

Fig 1.7
1.6.3 Query Processing
• Efficient algorithms to answer spatial queries
• Common Strategy - filter and refine
• Filter Step: Query Region overlaps with MBRs of B, C and D
• Refine Step: Query Region overlaps with B and C
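The filter and refine strategy can be sketched in a few lines of Python. This is an illustrative sketch, not the data of Fig 1.8: the objects are assumed to be disks given as (cx, cy, radius), the query region is an axis-aligned rectangle (xmin, ymin, xmax, ymax), and all coordinates are made up so that the outcome mirrors the slide (B, C and D pass the filter; only B and C survive refinement).

```python
import math


def mbr(disk):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a disk."""
    cx, cy, r = disk
    return (cx - r, cy - r, cx + r, cy + r)


def rects_overlap(a, b):
    """Cheap test used by the filter step."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def disk_overlaps_rect(disk, rect):
    """Exact test used by the refine step: distance from center to rectangle."""
    cx, cy, r = disk
    nearest_x = min(max(cx, rect[0]), rect[2])
    nearest_y = min(max(cy, rect[1]), rect[3])
    return math.hypot(cx - nearest_x, cy - nearest_y) <= r


def filter_and_refine(objects, query):
    candidates = [n for n, d in objects.items() if rects_overlap(mbr(d), query)]
    answers = [n for n in candidates if disk_overlaps_rect(objects[n], query)]
    return candidates, answers


# Hypothetical objects; D's MBR clips the query corner but the disk itself does not.
objects = {
    "A": (20.0, 20.0, 2.0),
    "B": (5.0, 5.0, 2.0),
    "C": (11.0, 5.0, 2.0),
    "D": (12.0, 12.0, 2.5),
}
query = (0.0, 0.0, 10.0, 10.0)
candidates, answers = filter_and_refine(objects, query)
print(candidates, answers)  # prints: ['B', 'C', 'D'] ['B', 'C']
```

The filter step touches only four numbers per object, so the more expensive exact test runs on a small candidate set.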

Fig 1.8
Query Processing of Join Queries
•Example - Determining pairs of intersecting rectangles
• (a): Two sets R and S of rectangles, (b): A rectangle with 2 opposite corners
marked, (c): Rectangles sorted by smallest X coordinate value
• Plane sweep filter identifies 5 pairs out of 12 for refinement step
•Details of plane sweep algorithm on page 15
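A plane sweep over two sets of rectangles might look like the following Python sketch. It is one common formulation and may differ in detail from the algorithm on page 15; the rectangles are hypothetical (xmin, ymin, xmax, ymax) tuples, not those of Fig 1.9.

```python
def y_overlap(a, b):
    """True when the y-extents of rectangles a and b intersect."""
    return a[1] <= b[3] and b[1] <= a[3]


def plane_sweep_pairs(set_r, set_s):
    """Report (r, s) pairs whose rectangles intersect, sweeping along x."""
    set_r = sorted(set_r, key=lambda rect: rect[0])  # sort by smallest x
    set_s = sorted(set_s, key=lambda rect: rect[0])
    i = j = 0
    pairs = []
    while i < len(set_r) and j < len(set_s):
        if set_r[i][0] <= set_s[j][0]:
            # Sweep line stops at r; scan S rectangles that start within r's x-extent.
            r = set_r[i]
            k = j
            while k < len(set_s) and set_s[k][0] <= r[2]:
                if y_overlap(r, set_s[k]):
                    pairs.append((r, set_s[k]))
                k += 1
            i += 1
        else:
            # Symmetric case: sweep line stops at s; scan R rectangles.
            s = set_s[j]
            k = i
            while k < len(set_r) and set_r[k][0] <= s[2]:
                if y_overlap(set_r[k], s):
                    pairs.append((set_r[k], s))
                k += 1
            j += 1
    return pairs


# Hypothetical input rectangles.
rects_r = [(0, 0, 2, 2), (1, 3, 4, 5), (5, 0, 7, 2)]
rects_s = [(1, 1, 3, 4), (6, 1, 8, 3), (3, 6, 5, 8), (9, 9, 10, 10)]
for r, s in plane_sweep_pairs(rects_r, rects_s):
    print(r, s)
```

Sorting lets each rectangle be compared only against rectangles whose x-extents can overlap it, instead of against every rectangle in the other set.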

Fig 1.9
1.6.4 File Organization and Indices
• A difference between GIS and SDBMS assumptions
•GIS algorithms: dataset is loaded in main memory (Fig. 1.10(a))
•SDBMS: dataset is on secondary storage, e.g. disk (Fig. 1.10(b))
•SDBMS uses space filling curves and spatial indices
•to efficiently search disk resident large spatial datasets

Fig 1.10
Organizing spatial data with space filling curves
•Issue:
•Sorting is not naturally defined on spatial data
•Many efficient search methods are based on sorting datasets
•Space filling curves
•Impose an ordering on the locations in a multi-dimensional space
•Examples: row-order (Fig. 1.11(a)), z-order (Fig. 1.11(b))
• Allow use of traditional efficient search methods on spatial data
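As an illustration of how a space filling curve imposes an ordering, the Python sketch below computes a z-order value by interleaving the bits of x and y, then uses ordinary binary search on the sorted values. The bit convention (x in the even bit positions) and the sample points are assumptions; Fig 1.11 may number cells differently.

```python
from bisect import bisect_left


def z_value(x, y, bits=16):
    """Interleave the bits of non-negative integers x and y (x in even positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Hypothetical grid locations, sorted by their position along the curve.
points = [(3, 5), (0, 0), (1, 1), (2, 0), (0, 1)]
keys = sorted(z_value(x, y) for x, y in points)


def contains(x, y):
    """Exact-match lookup using plain binary search on the one-dimensional keys."""
    k = z_value(x, y)
    i = bisect_left(keys, k)
    return i < len(keys) and keys[i] == k


print(keys)                            # prints: [0, 2, 3, 4, 39]
print(contains(1, 1), contains(5, 5))  # prints: True False
```

Once locations have linear keys, any sort-based structure such as a B-tree can index them.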

Fig 1.11
Spatial Indexing: Search Data-Structures
•Choice for spatial indexing:
•B-tree is a hierarchical collection of ranges of linear keys, e.g. numbers
•B-tree index is used for efficient search of traditional data
•B-tree can be used with space filling curve on spatial data
•R-tree provides better search performance yet!
•R-tree is a hierarchical collection of rectangles
•More details in chapter 4

Fig 1.12: B-tree Fig. 1.13: R-tree


1.6.5 Query Optimization
•Query Optimization
• A spatial operation can be processed using different strategies
• Computation cost of each strategy depends on many parameters
•Query optimization is the process of
•ordering operations in a query and
•selecting efficient strategy for each operation
•based on the details of a given dataset
•Example Query:
SELECT S.name FROM Senator S, Business B
WHERE S.soc-sec = B.soc-sec AND S.gender = ‘Female’
•Optimization decision examples
•Process (S.gender = ‘Female’) before (S.soc-sec = B.soc-sec )
•Do not use index for processing (S.gender = ‘Female’)
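The effect of ordering operations can be sketched with a crude cost model in Python: count the row pairs a nested-loop join has to compare. The rows, column layouts and the cost measure are assumptions made for illustration; a real optimizer uses far richer statistics.

```python
# Hypothetical rows: Senator(name, soc_sec, gender), Business(soc_sec, name).
senators = [("S1", 101, "Female"), ("S2", 102, "Male"),
            ("S3", 103, "Female"), ("S4", 104, "Male")]
businesses = [(101, "B1"), (103, "B2"), (104, "B3"), (999, "B4")]


def join_then_select(s_rows, b_rows):
    """Compare every (senator, business) pair, applying the gender test last."""
    comparisons, result = 0, []
    for s in s_rows:
        for b in b_rows:
            comparisons += 1
            if s[1] == b[0] and s[2] == "Female":
                result.append(s[0])
    return result, comparisons


def select_then_join(s_rows, b_rows):
    """Apply the gender selection first, then join the smaller input."""
    females = [s for s in s_rows if s[2] == "Female"]  # one sequential scan, no index
    comparisons, result = 0, []
    for s in females:
        for b in b_rows:
            comparisons += 1
            if s[1] == b[0]:
                result.append(s[0])
    return result, comparisons


print(join_then_select(senators, businesses))  # prints: (['S1', 'S3'], 16)
print(select_then_join(senators, businesses))  # prints: (['S1', 'S3'], 8)
```

Both plans return the same answer; the selection-first plan does half the join work on this toy data because the gender predicate removes half of the Senator rows.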
1.6.6 Data Mining
• Analysis of spatial data is of many types
• Deductive Querying, e.g. searching, sorting, overlays
• Inductive Mining, e.g. statistics, correlation, clustering, classification, …
• Data mining is a systematic and semi-automated search for
interesting non-trivial patterns in large spatial databases

•Example applications include


•Infer land-use classification from satellite imagery
•Identify cancer clusters and geographic factors with high correlation
•Identify crime hotspots to assign police patrols and social workers
1.7 Summary
SDBMS is valuable to many important applications
SDBMS is a software module
works with an underlying DBMS
provides spatial ADTs callable from a query language
provides methods for efficient processing of spatial
queries
Components of SDBMS include
spatial data model, spatial data types and operators,
spatial query language, processing and optimization
spatial data mining
SDBMS is used to store, query and share spatial
data for GIS as well as other applications
Chapter 2: Spatial Concepts and Data Models
2.1 Models of Spatial Information
2.2 Three-Step Database Design
2.3 Extending ER with Spatial Concepts
2.4 Conceptual Data Modeling with UML
2.5 Summary
Learning Objectives
• Learning Objectives (LO)
• LO1: Understand concept of data models
• What is a data model?
• Why use data models?
• LO2 : Understand the models of spatial information
• LO3: Understand the 3-step design of databases
• LO4: Learn about the trends in spatial data models

• Mapping Sections to learning objectives


• LO2 - 2.1
• LO3 - 2.2
• LO4 - 2.3, 2.4
What is a Data Model?
•What is a model? (Dictionary meaning)
• A set of plans (blueprint drawing) for a building
•A miniature representation of a system to analyze properties of interest

•What is a Data Model?


• Specify structure or schema of a data set
•Document description of data
•Facilitates early analysis of some properties, e.g. querying ability, redundancy,
consistency, storage space requirements, etc.

• Examples:
•GIS organize a spatial dataset as a set of layers
•Databases organize a dataset as a collection of tables
Why Data Models?
• Data models facilitate
• Early analysis of properties, e.g. storage cost, querying ability, ...
• Reuse of shared data among multiple applications
• Exchange of data across organization
• Conversion of data to new software / environment
• Example: Y2K crisis for year 2000
Many computer software systems were developed without well-defined data
models in the 1960s and 1970s. These systems used a variety of data models for
representing time and date. Some of the representations used two digits to
represent years. In the late 1990s, people worried that the 2-digit representation of
the year might lead to erroneous behavior. For example, the age of a person born in
1960 (represented as 60) in year 2000 (represented as 00) may appear
negative and may be flagged as an illegal data item. A large amount of effort and
resources (hundreds of billions of dollars) was spent in revising the software.
Proper use of a data model might have significantly reduced the costs. If time and
date had been modeled as abstract data types in the software, only the small portion of
the software implementing the date ADT would have had to be reviewed and revised.
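The date-as-ADT argument can be illustrated with a small Python sketch. The class and function names are hypothetical; the point is only that application code written against the ADT's interface does not change when the year representation does.

```python
class TwoDigitYear:
    """Year ADT with the problematic two-digit representation."""

    def __init__(self, yy):
        self.yy = yy

    def years_until(self, other):
        return other.yy - self.yy


class FourDigitYear:
    """Revised ADT: same interface, four-digit representation."""

    def __init__(self, year):
        self.year = year

    def years_until(self, other):
        return other.year - self.year


def age(birth, today):
    """Application code: depends only on the ADT interface, not the representation."""
    return birth.years_until(today)


print(age(TwoDigitYear(60), TwoDigitYear(0)))          # prints: -60
print(age(FourDigitYear(1960), FourDigitYear(2000)))   # prints: 40
```

Only the ADT implementation was replaced; the age() function was left untouched.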
Types of Data Models
•Two Types of data models
•Generic data models
•Developed for business data processing
•Support simple abstract data types (ADTs), e.g. numbers, strings, date
•Not convenient for spatial ADTs, e.g. polygons
•Recall a polygon becomes dozens of rows in 3 tables (Fig. 1.4, p. 8)
•Need to extend with spatial concepts, e.g. ADTs
•Application Domain specific, e.g. spatial models
•Set of concepts developed in Geographic Info. Science
•Common spatial ADTs across different GIS applications
•Plan of Study
•First study concepts in spatial models
•Then study generic model
•Finally put the two together
Learning Objectives
• Learning Objectives (LO)
• LO1: Understand concept of data models
• LO2 : Understand the models of spatial information
• Field based model
• Object based model
• LO3: Understand the 3-step design of databases
• LO4: Learn about the trends in spatial data models

• Mapping Sections to learning objectives


• LO2 - 2.1
• LO3 - 2.2
• LO4 - 2.3, 2.4
2.1 Models of Spatial Information
• Two common models
• Field based
• Object based
• Example: Forest stands
• Fig. 2.1
• (a) forest stand map
• (b) Object view has 3
polygons
• (c) Field view has a
function
2.1.1 Field based Model
• Three main concepts:
• Spatial Framework is a partitioning of space
• e.g., Grid imposed by Latitude and Longitude
• Field Functions:
f: Spatial Framework → Attribute Domain
• Field Operations
• Examples: addition (+) and composition (∘).
Types of Field Operations
• Local: value of the new field at a given location in the spatial framework
depends only on the value of the input field at that location (e.g., Thresholding)
• Focal: value of the resulting field at a given location depends on the values
that the input field assumes in a small neighborhood of the location (e.g.,
Gradient)
• Zonal: Zonal operations are naturally associated with aggregate operators or
the integration function. An operation that calculates the average height of the
trees for each species is a zonal operation.

• Exercise: Classify following operations on elevation field


• (I) Identify peaks (points higher than their neighbors)
• (II) Identify mountain ranges (elevation over 2000 feet)
• (III) Determine average elevation of a set of river basins
Local Operations
Functions f and g are defined as:
Focal Operation
An example is the limit operation in calculus.

Let E(x,y) be the elevation field of the state park, i.e., E(x,y) gives
the value of the elevation at the location (x,y) in the spatial
framework F. Then computing the gradient of the elevation field,
∇E(x,y), is a focal operation.
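The three kinds of field operations can be sketched on a tiny raster in Python. The 3x3 elevation grid, the zone labels and the unit cell spacing are all made-up assumptions; the gradient is approximated with central differences and is only defined here for interior cells.

```python
# Hypothetical elevation field and zone map on a 3x3 grid (unit cell spacing assumed).
elevation = [[1.0, 2.0, 3.0],
             [4.0, 5.0, 6.0],
             [7.0, 8.0, 9.0]]
zones = [["a", "a", "b"],
         ["a", "b", "b"],
         ["c", "c", "c"]]


def local_threshold(field, limit):
    """Local: each output cell depends only on the same input cell."""
    return [[value > limit for value in row] for row in field]


def focal_gradient(field, r, c):
    """Focal: central-difference gradient from the cell's neighbors (interior cells only)."""
    gx = (field[r][c + 1] - field[r][c - 1]) / 2.0
    gy = (field[r + 1][c] - field[r - 1][c]) / 2.0
    return gx, gy


def zonal_average(field, zone_map):
    """Zonal: aggregate the field over each zone."""
    sums, counts = {}, {}
    for field_row, zone_row in zip(field, zone_map):
        for value, zone in zip(field_row, zone_row):
            sums[zone] = sums.get(zone, 0.0) + value
            counts[zone] = counts.get(zone, 0) + 1
    return {zone: sums[zone] / counts[zone] for zone in sums}


print(local_threshold(elevation, 5.0)[1])   # prints: [False, False, True]
print(focal_gradient(elevation, 1, 1))      # prints: (1.0, 3.0)
print(zonal_average(elevation, zones)["c"]) # prints: 8.0
```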
2.1.2 Object Model
• Object model concepts
• Objects: distinct identifiable things relevant to an application
• Objects have attributes and operations
• Attribute: a simple (e.g. numeric, string) property of an object
• Operations: functions that map object attributes to other objects
• Example from a roadmap
• Objects: roads, landmarks, ...
• Attributes of road objects:
• spatial: location, e.g. polygon boundary of land-parcel
• non-spatial: name (e.g. Route 66), type (e.g. interstate, residential
street), number of lanes, speed limit, …
• Operations on road objects: determine center line, determine
length, determine intersection with other roads, ...
Classifying Spatial objects
• Spatial objects are spatial attributes of general objects
• Spatial objects are of many types
•Simple
•0-dimensional (points), 1-dimensional (curves), 2-dimensional (surfaces)
•Example given at the bottom of this slide
•Collections
•Polygon collection (e.g. boundary of Japan or Hawaii), …
•See more complete list in Figure 2.2

Spatial Object Type    Example Object    Dimension
Point                  City              0
Curve                  River             1
Surface                Country           2
Spatial Object Types in OGIS Data Model
Fig 2.2: Each rectangle shows a distinct spatial object type
Classifying Operations on spatial objects in Object Model
•Classifying operations (Tables 2.1, 2.2, pp. 29-31)
• Set based: 2-dimensional spatial objects (e.g. polygons) are sets of points
• a set operation (e.g. intersection) of 2 polygons produces another polygon
• Topological operations: Boundary of USA touches boundary of Canada
• Directional: New York City is to the east of Chicago
• Metric: Chicago is about 700 miles from New York City.
• Q? Identify classes of spatial operations not listed in this slide.

Set theory based: Union, Intersection, Containment
Topological: Touches, Disjoint, Overlap, etc.
Directional: East, North-West, etc.
Metric: Distance
Topological Relationships
• Topological Relationships
• invariant under elastic deformation (without tear, merge).
• Two countries which touch each other in a planar paper map will
continue to do so in spherical globe maps.
• Topology is the study of topological relationships
• Example queries with topological operations
• What is the topological relationship between two objects A and B ?
• Find all objects which have a given topological relationship to
object A ?
Topological Concepts
• Interior, boundary, exterior
• Let A be an object in a “Universe” U.

Green is the interior of A
Red is the boundary of A
Blue - (Green + Red) is the exterior of A
(Figure labels: U, A)

• Question: Define Interior, boundary, exterior on curves and points.


Nine-Intersection Model of Topological Relationships
•Many topological relationships between A and B can be
•specified using the 9-intersection model
•Examples on next slide
•Nine intersections
•intersections between interior, boundary, exterior of A, B
•A and B are spatial objects in a two dimensional plane.
•Can be arranged as a 3 by 3 matrix
•Each matrix element takes a value of 0 (false) or 1 (true).
•Q? Determine the number of distinct 3 by 3 boolean matrices.
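A discretized Python sketch of the nine-intersection matrix follows. It is an approximation under stated assumptions: objects are grid-aligned rectangles represented by sets of integer grid points, the universe is a small finite grid, and rows and columns are ordered interior, boundary, exterior. Real implementations work on continuous geometry.

```python
def rectangle(x0, y0, x1, y1):
    """Return (interior, boundary) point sets of a grid-aligned rectangle."""
    interior = {(x, y) for x in range(x0 + 1, x1) for y in range(y0 + 1, y1)}
    closure = {(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)}
    return interior, closure - interior


# Finite stand-in for the universe U; large enough to surround every object below.
UNIVERSE = {(x, y) for x in range(-1, 13) for y in range(-1, 7)}


def nine_intersection(a, b):
    """3x3 matrix; entry is 1 when the corresponding parts of A and B intersect."""
    def parts(obj):
        interior, boundary = obj
        return [interior, boundary, UNIVERSE - interior - boundary]
    return [[int(bool(pa & pb)) for pb in parts(b)] for pa in parts(a)]


a = rectangle(0, 0, 4, 4)
far = rectangle(6, 0, 10, 4)       # no shared points with a
adjacent = rectangle(4, 0, 8, 4)   # shares the edge x = 4 with a

print(nine_intersection(a, far))       # prints: [[0, 0, 1], [0, 0, 1], [1, 1, 1]]
print(nine_intersection(a, adjacent))  # prints: [[0, 0, 1], [0, 1, 1], [1, 1, 1]]
```

The two matrices differ only in the boundary-boundary entry, which is what separates a disjoint pair from a touching pair in this sketch.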
Specifying topological operation in 9-Intersection Model
Fig 2.3: 9 intersection matrices for a few topological operations

Question: Can this model specify topological operation between a polygon
and a curve?
Using Object Model of Spatial Data
• Object model of spatial data
• OGIS standard set of spatial data types and operations
• Similar to the object model in computer software
• Easily used with many computer software systems
• Programming languages like Java, C++, Visual Basic
• Example use in a Java program is in section 2.1.6
• Post-relational databases, e.g. OODBMS, ORDBMS
• Example usage in chapters 3 through 6
Learning Objectives
• Learning Objectives (LO)
• LO1: Understand concept of data models
• LO2 : Understand the models of spatial information
• LO3: Understand the 3-step design of databases
• Conceptual - ER model
• Logical - Relational model
• Physical
• Translation from Conceptual to Logical
• LO4: Learn about the trends in spatial data models

• Mapping chapter sections to learning objectives


• LO2 - 2.1
• LO3 - 2.2
• LO4 - 2.3, 2.4
2.2 Three-Step Database Design
• Database applications are modeled using a three-step
design process
• Conceptual - datatypes, relationships and constraints (ER model)
• Logical - mapping to a Relational model and associated query
language (Relational Algebra)
• Physical - file structures, indexing
• Scope
• We discuss conceptual and logical data models in section 2.2
• Physical model is discussed in chapter 4
Example Application Domain
• Database design is for a specific application domain
• Often a requirements document is available
• Designers discuss requirements with end-users as needed
• We will use a simple spatial application domain
• to illustrate concepts in conceptual and logical data models
• to illustrate translation of conceptual DM to logical DM
• Spatial application domain
• A state-park consists of forests.
• A forest is a collection of forest-stands of different species
• State-Park is accessed by roads and has a manager
• State-Park has facilities
• River runs through state-park and supplies water to the facilities
2.2.1 Conceptual DM: The ER Model
• 3 basic concepts
• Entities have an independent conceptual or physical existence.
• Examples: Forest, Road, Manager, ...
• Entities are characterized by Attributes
• Example: Forest has attributes of name, elevation, etc.
• An Entity interacts with another Entity through relationships.
• Roads allow access to Forest interiors.
• This relationship may be named “Accesses”
• Comparison with Object model of spatial information
• Entities are collections of attributes and are like objects
• However ER model does not permit general user defined operations
• Relationships are not directly supported in Object model
• but may be simulated via operations
Relationship Types
• Relationships can be categorized by
• cardinality constraints
• other properties, e.g. number of participating entities
• Binary relationship: two entities participate
• Types of Cardinality constraints for binary relationships
• One-One: An instance of an entity relates to a unique instance of other entity.
• Many-One: Many instances of an entity relate to an instance of another.
• Many-Many: Many instances of one entity relate to multiple instances of another.

• Exercise: Identify type of cardinality constraint for following:


• Many facilities belong to a forest. Each facility belongs to one forest.
• A manager manages 1 forest. Each forest has 1 manager.
• A river supplies water to many facilities. A facility gets water from many rivers.
ER Diagrams Graphical Notation
•ER Diagrams are graphic representations of ER models
•Several different graphic notations are used
•We use a simple notation summarized below
•Example ER Diagram for Forest example in next slide

•Q? Compare and contrast “Attributes” and “Multi-valued attributes”.

Concept                    Symbol
Entities
Attributes
Multi-valued Attributes
Relationships
ER Diagram for “State-Park”

Fig 2.4

•Exercise:
•List the entities, attributes, relationships in this ER diagram
•Identify cardinality constraint for each relationship.
•How many roads “Accesses” a “Forest_stand”? (one or many)
2.2.2 Logical Data Model: The Relational Model
• Relational model is based on set theory
• Main concepts
• Domain: a set of values for a simple attribute
• Relation: a subset of the cross-product of a set of domains
• Represents a table, i.e. homogeneous collection of rows (tuples)
• The set of columns (i.e. attributes) are same for each row
• Comparison to concepts in conceptual data model
• Relations are similar to but not identical to entities
• Domains are similar to attributes
• Translation rules establishing exact correspondence are discussed in 2.2.3
Relational Schema
• Schema of a Relation
• Enumerates columns, identifies primary key and foreign keys.
• Primary Key :
• one or more attributes uniquely identify each row within a table
• Foreign keys
• R’s attributes which form primary key of another relation S
• Value of a foreign key in any tuple of R matches values in some row of S
• Relational schema of a database
• collection of schemas of all relations in the database
• Example: Figure 2.5 (next slide)
• A blueprint summary drawing of the database table structures
• Allows analysis of storage costs, data redundancy, querying capabilities
• Some databases were designed as relational schema in 1980s
• Nowadays, databases are designed as ER models and relational schema is
generated via CASE tools
Relational Schema Example

•Exercise:
•Identify relations with
•primary keys
•foreign keys
•other attributes
•Compare with ER diagram
•Figure 2.4, p. 37

Fig 2.5
Relational Schema for “Point”, “Line”, “Polygon” and “Elevation”

•Relational model restricts attribute domains


•simple atomic values, e.g. a number
•Disallows complex values (e.g. polygons) for columns
•Complex values need to be decomposed into simpler domains
•A polygon may be decomposed into edges and vertices (Fig. 2.6)

Fig 2.6
More on Relational Model
• Integrity Constraints
• Key: Every relation has a primary key.
• Entity Integrity: Value of primary key in a row is never undefined
• Referential Integrity: Value of an attribute of a Foreign Key must appear as a value
in the primary key of another relation or must be null.

• Normal Forms (NF) for Relational schema


• Reduce data redundancy and facilitate querying
• 1st NF: Each column in a relation contains an atomic value.
• 2nd and 3rd NF: Values of non-key attributes are fully determined by the values of
the primary key, only the primary key, and nothing but the primary key.
• Other normal forms exist but are seldom used
• Translating a well-designed ER model yields a relational schema in 3rd NF
• satisfying definition of 1st, 2nd and 3rd normal forms
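The key and referential integrity constraints can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the two-table schema and all row values are hypothetical, and SQLite is used only because it ships with Python (it enforces foreign keys only after the PRAGMA shown).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.execute("CREATE TABLE forest (name TEXT PRIMARY KEY)")
conn.execute("""CREATE TABLE facility (
    name   TEXT PRIMARY KEY,
    forest TEXT REFERENCES forest(name))""")

# Made-up rows for illustration.
conn.execute("INSERT INTO forest VALUES ('Pine Ridge')")
conn.execute("INSERT INTO facility VALUES ('Visitor Center', 'Pine Ridge')")
conn.execute("INSERT INTO facility VALUES ('Storage Shed', NULL)")  # null foreign key is allowed


def try_insert(sql):
    """Run an INSERT and report whether an integrity constraint rejected it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


# Referential integrity: the referenced forest does not exist.
dangling = try_insert("INSERT INTO facility VALUES ('Boat Dock', 'No Such Forest')")
# Key constraint: the primary key value is already present.
duplicate = try_insert("INSERT INTO forest VALUES ('Pine Ridge')")
print(dangling, duplicate)  # prints: rejected rejected
```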
2.2.3 Mapping ER to Relational
•Highlights of translation rules (section 2.2.3)
•Entity becomes Relation
•Attributes become columns in the relation
•Multi-valued attributes become a new relation
•includes foreign key to link to relation for the entity
•Relationships (1:1, 1:N) become foreign keys
•M:N Relationships become a relation
•containing foreign keys to the relations of the participating entities
•Example and Exercise
•Compare Fig. 2.4 and Fig. 2.5
•Identify the relational schema components for
•entity Facility, its attributes and its relationships
•Note an empty relation box in Fig. 2.5. Fill in its schema.
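The rule that an M:N relationship becomes its own relation can be sketched with sqlite3 from Python. The table names, columns and rows are hypothetical and need not match Fig. 2.5; the relationship used is the many-to-many "supplies water to" between rivers and facilities from the example domain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE river    (name TEXT PRIMARY KEY);   -- entity becomes a relation
CREATE TABLE facility (name TEXT PRIMARY KEY);   -- entity becomes a relation

-- The M:N relationship becomes a relation of its own,
-- holding foreign keys to both participating entities.
CREATE TABLE supplies_water_to (
    river    TEXT REFERENCES river(name),
    facility TEXT REFERENCES facility(name),
    PRIMARY KEY (river, facility)
);

-- Made-up rows for illustration.
INSERT INTO river VALUES ('R1'), ('R2');
INSERT INTO facility VALUES ('F1'), ('F2');
INSERT INTO supplies_water_to VALUES ('R1', 'F1'), ('R1', 'F2'), ('R2', 'F2');
""")

rows = conn.execute(
    "SELECT river FROM supplies_water_to WHERE facility = 'F2' ORDER BY river"
).fetchall()
print(rows)  # prints: [('R1',), ('R2',)]
```

A 1:N relationship would not need the third table; a foreign key column on the "many" side would be enough.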
Learning Objectives
• Learning Objectives (LO)
• LO1: Understand concept of data models
• LO2 : Understand the models of spatial information
• LO3: Understand the 3-step design of databases
• LO4: Learn about the trends in spatial data models
• Pictograms in conceptual models
• UML class diagrams

• Mapping Sections to learning objectives


• LO2 - 2.1
• LO3 - 2.2
• LO4 - 2.3, 2.4
2.3 Extending ER with Spatial Concepts
•Motivation
•ER Model is based on discrete sets with no implicit relationships
•Spatial data comes from a continuous set with implicit relationships
•Any pair of spatial entities has relationships like distance, direction, …
•Explicitly drawing all spatial relationships
•clutters ER diagram
•generates additional tables in relational schema
•Misses implicit constraints in spatial relationships (e.g. partition)
•Pictograms
•Label spatial entities along with their spatial data types
•Allows inference of spatial relationships and constraints
•Reduces clutter in ER diagram and relational schema
•Example: Fig. 2.7 (next slide) is simpler than Fig. 2.4
ER Diagram with Pictograms: An Example

Fig 2.7
Specifying Pictograms
•Grammar based approach
•Rewrite rule
•like English syntax diagrams
•Classes of pictograms
•Entity pictograms
•basic: point, line, polygon
•collection of basic
•...
•Relationship pictograms
•partition, network
Entity Pictograms: Basic shapes, Collections
Entity Pictograms: Derived and Alternate Shapes
•Derived shape example is city center point from boundary polygon
•Alternate shape example: A road is represented as a polygon for construction
•or as a line for navigation
2.4 Conceptual Data Modeling with UML
•Motivation
•ER Model does not allow user defined operations
•Object oriented software development uses UML
•UML stands for Unified Modeling Language
•It is a standard consisting of several diagrams
•class diagrams are most relevant for data modeling
•UML class diagrams concepts
•Attributes are simple or composite properties
•Methods represent operations, functions and procedures
•Class is a collection of attributes and methods
•Relationships relate classes
•Example UML class diagram: Figure 2.8
UML Class Diagram with Pictograms: Example
•Exercise: Identify classes, attributes, methods, relationships in Fig. 2.8.
•Compare Fig. 2.8 with corresponding ER diagram in Fig. 2.7.

Fig 2.8
Comparing UML Class Diagrams to ER Diagrams
•Concepts in UML class diagram vs. those in ER diagrams
•Class without methods is an Entity
•Attributes are common in both models
•UML does not have key attributes and integrity constraints
• ERD does not have methods
•Relationship properties are richer in ERDs
•Entities in ER diagram relate to datasets, but UML class diagram
•can contain classes which have little to do with data
2.5 Summary
• Spatial Information modeling can be classed into Field
based and Object based
• Field based for modeling smoothly varying entities, like
rainfall
• Object based for modeling discrete entities, like country
Summary
• A data model is a high level description of the data
• it can help in early analysis of storage cost, data quality
• There are two popular models of spatial information
• Field based and Object based
• Databases are designed in 3 steps
• Conceptual, Logical and Physical
• Pictograms can simplify Conceptual data models
Chapter 3: Spatial Query Languages
3.1 Standard Database Query Languages
3.2 Relational Algebra
3.3 Basic SQL Primer
3.4 Extending SQL for Spatial Data
3.5 Example Queries that emphasize spatial aspects
3.6 Trends: Object-Relational SQL
Learning Objectives
• Learning Objectives (LO)
• LO1: Understand concept of a query language
• What is a query language?
• Why use query languages?
• LO2 : Learn to use standard query language (SQL)
• LO3: Learn to use spatial ADTs with SQL
• LO4: Learn about the trends in query languages

• Mapping Sections to learning objectives


• LO2 - 3.2, 3.3
• LO3 - 3.4, 3.5
• LO4 - 3.6
What is a query?
• What is a Query ?
• A query is a “question” posed to a database
• Queries are expressed in a high-level declarative manner
• Algorithms needed to answer the query are not specified in the query
• Examples:
• A mouse click on a map symbol (e.g. road) may mean
• What is the name of the road pointed to by the mouse cursor?
• Typing a keyword in a search engine (e.g. Google, Yahoo) means
• Which documents on the web contain the given keywords?
• SELECT S.name FROM Senator S WHERE S.gender = 'F' means
• Which senators are female?
What is a query language?
• What is a query language?
• A language to express interesting questions about data
• A query language restricts the set of possible queries
• Examples:
• Natural language, e.g. English, can express almost all queries
• Computer programming languages, e.g. Java,
• can express computable queries
• however, the algorithm to answer each query must be written by the programmer
• Structured Query Language(SQL)
• Can express common data intensive queries
• Not suitable for recursive queries
• Graphical interfaces, e.g. web-search, mouse clicks on a map
• can express only a few kinds of queries
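The contrast between a programming language and a declarative query language can be sketched as follows. This is a minimal illustration, assuming Python with its built-in sqlite3 module; the Senator rows are invented for the example and do not come from the text.

```python
import sqlite3

# Hypothetical rows, invented only for this illustration.
rows = [("Ann", "F"), ("Bob", "M"), ("Cara", "F")]

# Procedural style: the program spells out how the answer is computed.
procedural = sorted(name for name, gender in rows if gender == "F")

# Declarative style: the SQL text states what is wanted;
# the DBMS chooses the algorithm.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Senator (name TEXT, gender TEXT)")
conn.executemany("INSERT INTO Senator VALUES (?, ?)", rows)
declarative = sorted(
    r[0] for r in conn.execute("SELECT S.name FROM Senator S WHERE S.gender = 'F'")
)

print(procedural, declarative)
```

Both variables hold the same names; only the SQL version leaves the search strategy to the system.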
An Example World Database
• Purpose: Use an example database to learn query language SQL
• Conceptual Model
• 3 Entities: Country, City, River
• 2 Relationships: capital-of, originates-in
• Attributes listed in Figure 3.1
An Example Database - Logical Model

•3 Relations
Country(Name, Cont, Pop, GDP, Life-Exp, Shape)
City(Name, Country, Pop, Capital, Shape)
River(Name, Origin, Length, Shape)
• Keys
•Primary keys are Country.Name, City.Name, River.Name
• Foreign keys are River.Origin, City.Country
•Data for 3 tables
•Shown on next slide
World database data tables
RA(Relational Algebra)
• Two distinct elements
• Ωa : set of operands
• Ωo : set of operations
• Basic operations
• Select
• Project
• Union
• Cross-product
• Difference
• intersection
Select and Project Operations
Output of select and Project operation
Set Operations
• Union
R ∪ S
• Difference
R − S
• Intersection
R ∩ S
• Cross-product
R × S
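The basic operations listed above can be sketched on relations modeled as Python sets of tuples. The rows and the helper names (select, project, cross) are assumptions made for this illustration, not definitions from the text.

```python
# Relations modeled as sets of tuples; attribute positions are fixed by convention.
# Hypothetical (city, country) rows.
R = {("Ottawa", "Canada"), ("Havana", "Cuba")}
S = {("Havana", "Cuba"), ("Brasilia", "Brazil")}

def select(relation, predicate):
    """Keep the tuples (rows) that satisfy the predicate."""
    return {t for t in relation if predicate(t)}

def project(relation, positions):
    """Keep only the attributes (columns) at the given positions."""
    return {tuple(t[i] for i in positions) for t in relation}

def cross(r, s):
    """Pair every tuple of r with every tuple of s."""
    return {a + b for a in r for b in s}

union = R | S
difference = R - S
intersection = R & S
product = cross(R, S)
cuban_rows = select(R, lambda t: t[1] == "Cuba")
countries = project(R, [1])
```

Because sets discard duplicates, project behaves like the algebra operator rather than like a SQL SELECT without DISTINCT.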
Join Operation
• Conditional Join

• Natural Join
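A sketch of the two join flavors, assuming rows are represented as Python dicts keyed by attribute name. The function names and sample rows (including the continent code) are hypothetical. A conditional join is a selection over the cross-product; a natural join matches rows on every attribute name the two relations share.

```python
def conditional_join(r, s, condition):
    """Selection applied to the cross-product of r and s."""
    return [{**a, **b} for a in r for b in s if condition(a, b)]

def natural_join(r, s):
    """Join on equality of all attribute names common to both relations."""
    out = []
    for a in r:
        for b in s:
            common = a.keys() & b.keys()
            if all(a[k] == b[k] for k in common):
                out.append({**a, **b})  # shared attributes appear once
    return out

# Hypothetical rows for illustration.
city = [{"City": "Ottawa", "Country": "Canada"},
        {"City": "Havana", "Country": "Cuba"}]
country = [{"Country": "Canada", "Cont": "NAM"},
           {"Country": "Cuba", "Cont": "NAM"}]

natural = natural_join(city, country)
conditional = conditional_join([{"A": 1}, {"A": 5}], [{"B": 3}],
                               lambda a, b: a["A"] < b["B"])
```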
Learning Objectives
• Learning Objectives (LO)
• LO1: Understand concept of a query language
• LO2 : Learn to use standard query language (SQL)
• How to create and populate tables?
• How to query given tables?
• LO3: Learn to use spatial ADTs with SQL
• LO4: Learn about the trends in query languages

• Mapping Sections to learning objectives


• LO2 - 3.2, 3.3
• LO3 - 3.4, 3.5
• LO4 - 3.6
What is SQL?
• SQL - General Information
• is a standard query language for relational databases
• It supports logical data model concepts, such as relations, keys, ...
• Supported by major brands, e.g. IBM DB2, Oracle, MS SQL Server, Sybase, ...
• 3 versions: SQL1 (1986), SQL2 (1992), SQL 3 (1999)
• Can express common data intensive queries
• SQL 1 and SQL 2 are not suitable for recursive queries
• SQL and spatial data management
• ESRI Arc/Info included a custom relational DBMS named Info
• Other GIS software can interact with DBMS using SQL
• using open database connectivity (ODBC) or other protocols
• In fact, many software packages use SQL to manage data in a back-end DBMS
• And the vast majority of SQL queries are generated by other software
• Although we will be writing SQL queries manually!
Three Components of SQL
• Data Definition Language (DDL)
• Creation and modification of relational schema
• Schema objects include relations, indexes, etc.
• Data Manipulation Language (DML)
• Insert, delete, update rows in tables
• Query data in tables
• Data Control Language (DCL)
• Concurrency control, transactions
• Administrative tasks, e.g. set up database users, security permissions
• Focus for now
• A little bit of table creation (DDL) and population (DML)
• Primarily Querying (DML)
Creating Tables in SQL
• Table definition
• “CREATE TABLE” statement
• Specifies table name, attribute names and data types
• Creates a table with no rows.
• See an example at the bottom
• Related statements
• ALTER TABLE statement modifies table schema if needed
• DROP TABLE statement removes a table, together with any rows it still holds
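The three statements can be tried out with Python's built-in sqlite3 module. This is a sketch only: the column types are assumptions, and the Shape column is added as plain TEXT because SQLite has no built-in spatial type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CREATE TABLE: names the table, its attributes and their data types.
# The types chosen here are assumptions for illustration.
conn.execute(
    """CREATE TABLE River (
           Name   VARCHAR(35) PRIMARY KEY,
           Origin VARCHAR(35),
           Length INTEGER
       )"""
)
row_count = conn.execute("SELECT COUNT(*) FROM River").fetchone()[0]

# ALTER TABLE: modifies the schema of an existing table.
conn.execute("ALTER TABLE River ADD COLUMN Shape TEXT")
columns = [c[1] for c in conn.execute("PRAGMA table_info(River)")]

# DROP TABLE: removes the table itself.
conn.execute("DROP TABLE River")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
```

PRAGMA table_info and sqlite_master are SQLite-specific ways of inspecting the schema; other products expose the catalog differently.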
Populating Tables in SQL
• Adding a row to an existing table
• “INSERT INTO” statement
• Specifies table name, attribute names and values
• Example:
INSERT INTO River(Name, Origin, Length) VALUES('Mississippi', 'USA', 6000)

• Related statements
• SELECT statement with INTO clause can insert multiple rows in a table
• Bulk load, import commands also add multiple rows
• DELETE statement removes rows
•UPDATE statement can change values within selected rows
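A short sketch of these statements using Python's sqlite3 module. The River row is the one from the slide; the updated length and the LongRiver table are hypothetical. SQLite copies rows with INSERT INTO ... SELECT, whereas some other products offer SELECT ... INTO for the same purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE River (Name TEXT, Origin TEXT, Length INTEGER)")
conn.execute("CREATE TABLE LongRiver (Name TEXT, Origin TEXT, Length INTEGER)")

# INSERT INTO adds one row.
conn.execute(
    "INSERT INTO River(Name, Origin, Length) VALUES ('Mississippi', 'USA', 6000)"
)

# UPDATE changes values within the selected rows (6020 is a made-up value).
conn.execute("UPDATE River SET Length = 6020 WHERE Name = 'Mississippi'")
updated = conn.execute("SELECT Length FROM River").fetchone()[0]

# A SELECT can feed many rows into another table in one statement.
conn.execute("INSERT INTO LongRiver SELECT * FROM River WHERE Length > 1000")
copied = conn.execute("SELECT COUNT(*) FROM LongRiver").fetchone()[0]

# DELETE removes the selected rows.
conn.execute("DELETE FROM River WHERE Origin = 'USA'")
remaining = conn.execute("SELECT COUNT(*) FROM River").fetchone()[0]
```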
Querying populated Tables in SQL
• SELECT statement
• The commonly used statement to query data in one or more tables
•Returns a relation (table) as result
• Has many clauses
• Can refer to many operators and functions
• Allows nested queries which can be hard to understand
• Scope of our discussion
• Learn enough SQL to appreciate spatial extensions
•Observe example queries
• Read and write simple SELECT statements
• Understand frequently used clauses, e.g. SELECT, FROM, WHERE
• Understand a few operators and functions
SELECT Statement- General Information
• Clauses
•SELECT specifies desired columns
•FROM specifies relevant tables
•WHERE specifies qualifying conditions for rows
•ORDER BY specifies sorting columns for results
•GROUP BY, HAVING specifies aggregation and statistics
•Operators and functions
•arithmetic operators, e.g. +, -, …
•comparison operators, e.g. =, <, >, BETWEEN, LIKE…
•logical operators, e.g. AND, OR, NOT, EXISTS,
•set operators, e.g. UNION, IN, ALL, ANY, …
•statistical functions, e.g. SUM, COUNT, ...
• many other operators on strings, date, currency, ...
SELECT Example 1.
• Simplest Query has SELECT and FROM clauses
• Query: List all the cities and the country they belong to.

SELECT Name, Country
FROM CITY

Result 🡪
SELECT Example 2.
• Commonly 3 clauses (SELECT, FROM, WHERE) are used
•Query: List the names of the capital cities in the CITY table.
SELECT *
FROM CITY
WHERE CAPITAL = 'Y'

Result 🡪
Query Example…Where clause
Query: List the attributes of countries in the Country relation
where the life-expectancy is less than seventy years.

SELECT Co.Name, Co.Life-Exp
FROM Country Co
WHERE Co.Life-Exp < 70

Note: use of alias ‘Co’ for Table ‘Country’

Result 🡪
Multi-table Query Examples
Query: List the capital cities and populations of countries
whose GDP exceeds one trillion dollars.
Note: Tables City and Country are joined by matching “City.Country =
Country.Name”. This simulates the relational operator “join” discussed in 3.2

SELECT Ci.Name, Co.Pop
FROM City Ci, Country Co
WHERE Ci.Country = Co.Name
AND Co.GDP > 1000.0
AND Ci.Capital = 'Y'
Multi-table Query Example
Query: What is the name and population of the capital city in the
country where the St. Lawrence River originates?

SELECT Ci.Name, Ci.Pop
FROM City Ci, Country Co, River R
WHERE R.Origin = Co.Name
AND Co.Name = Ci.Country
AND R.Name = 'St. Lawrence'
AND Ci.Capital = 'Y'

Note: The three tables are joined a pair at a time. River.Origin is matched
with Country.Name and City.Country is matched with Country.Name. The
order of the joins is decided by the query optimizer and does not affect the result.
Exercise
• Write a query to find the names of the customers and
salesmen who live in the same city.
Salesman table
salesman_id | name | city | commission
-------------+------------+----------+------------
5001 | James Hoog | New York | 0.15
5002 | Nail Knite | Paris | 0.13
5005 | Pit Alex | London | 0.11
5006 | Mc Lyon | Paris | 0.14
5007 | Paul Adam | Rome | 0.13
5003 | Lauson Hen | San Jose | 0.12
Cont...
Customer table

customer_id | cust_name | city | grade | salesman_id
-------------+---------------------+--------------+--------+-------------
3002 | Nick Rimando | New York | 100 | 5001
3007 | Brad Davis | New York | 200 | 5001
3005 | Graham Zusi | California | 200 | 5002
3008 | Julian Green | London | 300 | 5002
3004 | Fabian Johnson | Paris | 300 | 5006
Solution
SELECT customer.cust_name,
salesman.name, salesman.city
FROM salesman, customer
WHERE salesman.city = customer.city;

Output
cust_name | name | city
Nick Rimando | James Hoog | New York
Brad Davis | James Hoog | New York
Julian Green | Pit Alex | London
Fabian Johnson | Mc Lyon | Paris
Exercise
• Write a query to find the names of all the customers along
with the salesmen who work with them.
Solution
SELECT customer.cust_name, salesman.name
FROM customer,salesman
WHERE salesman.salesman_id = customer.salesman_id;

Output
cust_name | name
Nick Rimando | James Hoog
Brad Davis | James Hoog
Graham Zusi | Nail Knite
Julian Green | Nail Knite
Fabian Johnson | Mc Lyon
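The solution above can be checked by loading the Salesman and Customer rows exactly as printed in the slides into an in-memory SQLite database. The column types are assumptions; the data and the query are taken from the exercise.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salesman (salesman_id INTEGER, name TEXT, "
             "city TEXT, commission REAL)")
conn.execute("CREATE TABLE customer (customer_id INTEGER, cust_name TEXT, "
             "city TEXT, grade INTEGER, salesman_id INTEGER)")

conn.executemany("INSERT INTO salesman VALUES (?, ?, ?, ?)", [
    (5001, "James Hoog", "New York", 0.15),
    (5002, "Nail Knite", "Paris", 0.13),
    (5005, "Pit Alex", "London", 0.11),
    (5006, "Mc Lyon", "Paris", 0.14),
    (5007, "Paul Adam", "Rome", 0.13),
    (5003, "Lauson Hen", "San Jose", 0.12),
])
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?, ?)", [
    (3002, "Nick Rimando", "New York", 100, 5001),
    (3007, "Brad Davis", "New York", 200, 5001),
    (3005, "Graham Zusi", "California", 200, 5002),
    (3008, "Julian Green", "London", 300, 5002),
    (3004, "Fabian Johnson", "Paris", 300, 5006),
])

# Same join as the solution: match each customer to the assigned salesman.
pairs = conn.execute(
    """SELECT customer.cust_name, salesman.name
       FROM customer, salesman
       WHERE salesman.salesman_id = customer.salesman_id"""
).fetchall()
```

The row order returned by the DBMS is not guaranteed without ORDER BY, so results are best compared as sets.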
Exercise
Write a SQL statement to display all those orders by the
customers not located in the same cities where their
salesmen live.
ord_no purch_amt ord_date customer_id salesman_id
---------- ---------- ---------- ----------- -----------
70001 150.5 2012-10-05 3005 5002
70009 270.65 2012-09-10 3001 5005
70002 65.26 2012-10-05 3002 5001
70004 110.5 2012-08-17 3009 5003
70007 948.5 2012-09-10 3005 5002
70005 2400.6 2012-07-27 3007 5001
70008 5760 2012-09-10 3002 5001
70010 1983.43 2012-10-10 3004 5006
70003 2480.4 2012-10-10 3009 5003
70012 250.45 2012-06-27 3008 5002
70011 75.29 2012-08-17 3003 5007
Solution
SELECT ord_no, cust_name, orders.customer_id,
orders.salesman_id
FROM salesman, customer, orders
WHERE customer.city <> salesman.city
AND orders.customer_id = customer.customer_id
AND orders.salesman_id = salesman.salesman_id;
Output
ord_no | cust_name | customer_id | salesman_id
70004 | Geoff Cameron | 3009 | 5003
70003 | Geoff Cameron | 3009 | 5003
70011 | Jozy Altidor | 3003 | 5007
70001 | Graham Zusi | 3005 | 5002
Exercise
Write a SQL statement that lists each order number
followed by the name of the customer who placed the
order.
Solution
SELECT orders.ord_no, customer.cust_name
FROM orders, customer
WHERE orders.customer_id = customer.customer_id;
Output
ord_no | cust_name
70009 | Brad Guzan
70002 | Nick Rimando
70004 | Geoff Cameron
70005 | Brad Davis
70008 | Nick Rimando
70010 | Fabian Johnson
70003 | Geoff Cameron
Query Examples…Aggregate Statistics
Query: What is the average population of the noncapital cities listed in the
City table?

SELECT AVG(Ci.Pop)
FROM City Ci
WHERE Ci.Capital = 'N'

Query: For each continent, find the average GDP.

SELECT Co.Cont, Avg(Co.GDP) AS Continent-GDP
FROM Country Co
GROUP BY Co.Cont
Query Example..Having clause, Nested queries
Query: For each country in which at least two rivers originate, find the length
of the shortest river.

SELECT R.Origin, MIN(R.length) AS Min-length
FROM River R
GROUP BY R.Origin
HAVING COUNT(*) > 1

Query: List the countries whose GDP is greater than that of Canada.

SELECT Co.Name
FROM Country Co
WHERE Co.GDP > ANY (SELECT Co1.GDP
FROM Country Co1
WHERE Co1.Name = 'Canada')
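The GROUP BY / HAVING query on rivers can be run against a toy River table with Python's sqlite3 module. The three rows and the origin codes are invented for this sketch, and the alias is written Min_length because a hyphen is not legal in an unquoted SQL identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE River (Name TEXT, Origin TEXT, Length INTEGER)")
# Hypothetical rows: two rivers originate in A, one in B.
conn.executemany("INSERT INTO River VALUES (?, ?, ?)", [
    ("R1", "A", 300),
    ("R2", "A", 120),
    ("R3", "B", 500),
])

# GROUP BY forms one group per origin; HAVING keeps groups with 2+ rivers;
# MIN is then evaluated once per surviving group.
result = conn.execute(
    """SELECT R.Origin, MIN(R.Length) AS Min_length
       FROM River R
       GROUP BY R.Origin
       HAVING COUNT(*) > 1"""
).fetchall()
```

Origin B is dropped by the HAVING clause because only one river originates there.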
Learning Objectives
• Learning Objectives (LO)
• LO1: Understand concept of a query language
• LO2 : Learn to use standard query language (SQL)
• LO3: Learn to use spatial ADTs with SQL
• Learn about OGIS standard spatial data types and operations
• Learn to use OGIS spatial ADTs with SQL
• LO4: Learn about the trends in query languages

• Mapping Sections to learning objectives


• LO2 - 3.2, 3.3
• LO3 - 3.4, 3.5
• LO4 - 3.6
3.4 Extending SQL for Spatial Data
• Motivation
• SQL has simple atomic data-types, like integer, dates and string
• Not convenient for spatial data and queries
• Spatial data (e.g. polygons) is complex
• Spatial operations: topological, Euclidean, directional, metric
• SQL 3 allows user defined data types and operations
• Spatial data types and operations can be added to SQL3
• Open Geodata Interchange Standard (OGIS)
• Half a dozen spatial data types
• Several spatial operations
• Supported by major vendors, e.g. ESRI, Intergraph, Oracle, IBM,...
OGIS Spatial Data Model
• Consists of base-class Geometry and four sub-classes:
• Point, Curve, Surface and GeometryCollection
• Figure 2.2 (pp. 27) lists the spatial data types in OGIS

• Operations fall into three categories:
• Apply to all geometry types
• SpatialReference, Envelope, Export, IsSimple, Boundary
• Predicates for Topological relationships
• Equal, Disjoint, Intersect, Touch, Cross, Within, Contains
• Spatial Data Analysis
• Distance, Buffer, Union, Intersection, ConvexHull, SymDiff
• Table 3.9 (pp. 66) details spatial operations
Spatial Queries with SQL/OGIS
• SQL/OGIS - General Information
•Both standards are being adopted by many vendors
•The choice of spatial data types and operations is similar
•Syntax differs from vendor to vendor
• Readers may need to alter SQL/OGIS queries given in text to make
them run on specific commercial products
• Using OGIS with SQL
• Spatial data types can be used in DDL to type columns
• Spatial operations can be used in DML
• Scope of discussion
• Illustrate use of spatial data types with SQL
• Via a set of examples
List of Spatial Query Examples
• Simple SQL SELECT_FROM_WHERE examples
•Spatial analysis operations
•Unary operator: Area (Q5, pp.69)
•Binary operator: Distance (Q3, pp.68)
•Boolean Topological spatial operations - WHERE clause
•Touch (Q1, pp. 67)
•Cross (Q2, pp. 68)
•Using spatial analysis and topological operations
•Buffer, overlap (Q4)
•Complex SQL examples
• Aggregate SQL queries
• Nested queries
Using spatial operation in SELECT clause
Query: List the name, population, and area of each country listed in
the Country table.

SELECT C.Name, C.Pop, Area(C.Shape) AS "Area"
FROM Country C

Note: This query uses the spatial operation Area(). Note the use of a spatial
operation in place of a column in the SELECT clause.
Using spatial operator Distance
Query: List the GDP and the distance of a country’s capital
city to the equator for all countries.

SELECT Co.GDP, Distance(Point(0, Ci.Shape.y), Ci.Shape) AS "Distance"
FROM Country Co, City Ci
WHERE Co.Name = Ci.Country
AND Ci.Capital = 'Y'
Using Spatial Operation in WHERE clause
Query: Find the names of all countries which are neighbors of the United
States (USA) in the Country table.

SELECT C1.Name AS "Neighbors of USA"
FROM Country C1, Country C2
WHERE Touch(C1.Shape, C2.Shape) = 1
AND C2.Name = 'USA'

Note: The spatial operator Touch() is used in the WHERE clause to join the Country
table with itself. This query is an example of a spatial self-join operation.
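A spatial self-join of this kind can be imitated in plain SQLite by registering a user-defined Touch function from Python. Everything here is an assumption made for illustration: shapes are simplified to axis-parallel rectangles stored as 'xmin ymin xmax ymax' text, the countries P, Q, R, S are invented, and this Touch is not the OGIS implementation.

```python
import sqlite3

def touch(a, b):
    """Return 1 if two rectangles meet only along their boundaries."""
    ax1, ay1, ax2, ay2 = map(float, a.split())
    bx1, by1, bx2, by2 = map(float, b.split())
    w = min(ax2, bx2) - max(ax1, bx1)  # extent of the overlap along x
    h = min(ay2, by2) - max(ay1, by1)  # extent of the overlap along y
    # They meet (w, h >= 0) but share no interior (w or h is zero).
    return int(w >= 0 and h >= 0 and (w == 0 or h == 0))

conn = sqlite3.connect(":memory:")
conn.create_function("Touch", 2, touch)  # make Touch callable from SQL
conn.execute("CREATE TABLE Country (Name TEXT, Shape TEXT)")
conn.executemany("INSERT INTO Country VALUES (?, ?)", [
    ("P", "0 0 1 1"),
    ("Q", "1 0 2 1"),
    ("R", "2 0 3 1"),
    ("S", "5 5 6 6"),   # an "island": touches nothing
])

neighbors_of_q = [r[0] for r in conn.execute(
    """SELECT C1.Name
       FROM Country C1, Country C2
       WHERE Touch(C1.Shape, C2.Shape) = 1
       AND C2.Name = 'Q'
       ORDER BY C1.Name"""
)]
```

A rectangle does not touch itself under this definition (its overlap with itself has positive area), so Q is not reported as its own neighbor.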
Spatial Query with multiple tables
Query: For all the rivers listed in the River table, find the countries through
which they pass.

SELECT R.Name, C.Name
FROM River R, Country C
WHERE Cross(R.Shape, C.Shape) = 1

Note: The spatial operation “Cross” is used to join the River and Country
tables. This query represents a spatial join operation.

Exercise: Modify the above query to report the length of each river in each country.
Hint: Q6, pp. 69
Example Spatial Query…Buffer and Overlap

Query: The St. Lawrence River can supply water to cities that are
within 300 km. List the cities that can use water from the St.
Lawrence River.

SELECT Ci.Name
FROM City Ci, River R
WHERE Overlap(Ci.Shape, Buffer(R.Shape, 300)) = 1
AND R.Name = 'St. Lawrence'

Note: This query uses the spatial operation Buffer, which is
illustrated in Figure 3.2 (pp. 69).
Recall List of Spatial Query Examples
• Simple SQL SELECT_FROM_WHERE examples
•Spatial analysis operations
•Unary operator: Area
•Binary operator: Distance
•Boolean Topological spatial operations - WHERE clause
•Touch
•Cross
•Using spatial analysis and topological operations
•Buffer, overlap
•Complex SQL examples
• Aggregate SQL queries (Q9, pp. 70)
• Nested queries (Q3 pp. 68, Q10, pp. 70)
Using spatial operation in an aggregate query
Query: List all countries, ordered by number of neighboring countries.

SELECT Co.Name, Count(Co1.Name)
FROM Country Co, Country Co1
WHERE Touch(Co.Shape, Co1.Shape)
GROUP BY Co.Name
ORDER BY Count(Co1.Name)

Notes: This query can be used to differentiate the querying capabilities of
simple GIS software (e.g. Arc/View) and a spatial database. It is quite
tedious to carry out this query in a GIS.

Earlier versions of OGIS did not provide spatial aggregate operations to
support GIS operations like reclassify.
Using Spatial Operation in Nested Queries
Query: For each river, identify the closest city.

SELECT C1.Name, R1.Name
FROM City C1, River R1
WHERE Distance(C1.Shape, R1.Shape) <= ALL (SELECT Distance(C2.Shape, R1.Shape)
FROM City C2
WHERE C1.Name <> C2.Name)

Note: The spatial operation Distance is used in the context of a nested query.

Exercise: The SQL expression to find the smallest distance from each river to its
nearest city is much simpler and does not require a nested query. The audience
is encouraged to write a SQL expression for this query.
Nested Spatial Query
Query: List the countries with only one neighboring country. A country is a
neighbor of another country if their land masses share a boundary. According to
this definition, island countries, like Iceland, have no neighbors.
SELECT Co.Name
FROM Country Co
WHERE Co.Name IN (SELECT Co.Name
FROM Country Co,Country Co1
WHERE Touch(Co.Shape,Co1.Shape)
GROUP BY Co.Name
HAVING Count(*)=1)

Note: This is a complex nested query with aggregate operations. Such queries can be
rewritten as two expressions, namely a view definition and a query on the view. The inner
query becomes a view and the outer query is run on the view. This is illustrated in the next slide.
Rewriting nested queries using Views
•Views are like tables
•Represent derived data or result of a query
•Can be used to simplify complex nested queries
•Example follows:
CREATE VIEW Neighbor AS
SELECT Co.Name, Count(Co1.Name) AS num_neighbors
FROM Country Co, Country Co1
WHERE Touch(Co.Shape, Co1.Shape)
GROUP BY Co.Name

SELECT Name, num_neighbors
FROM Neighbor
WHERE num_neighbors = (SELECT Max(num_neighbors)
FROM Neighbor)
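The view-based rewrite can be tried without any spatial extension by assuming the Touch results are already stored as rows. In this sketch the Border table, its column names and its four rows are hypothetical stand-ins for the pairs that Touch would produce.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical precomputed adjacency: one row per (country, neighbor) pair.
conn.execute("CREATE TABLE Border (Name TEXT, Neighbor_name TEXT)")
conn.executemany("INSERT INTO Border VALUES (?, ?)", [
    ("P", "Q"), ("Q", "P"), ("Q", "R"), ("R", "Q"),
])

# The inner aggregate query becomes a view ...
conn.execute(
    """CREATE VIEW Neighbor AS
       SELECT Name, COUNT(Neighbor_name) AS num_neighbors
       FROM Border
       GROUP BY Name"""
)

# ... and the outer queries are written against the view like a table.
most = conn.execute(
    """SELECT Name, num_neighbors
       FROM Neighbor
       WHERE num_neighbors = (SELECT MAX(num_neighbors) FROM Neighbor)"""
).fetchall()
exactly_one = [r[0] for r in conn.execute(
    "SELECT Name FROM Neighbor WHERE num_neighbors = 1 ORDER BY Name"
)]
```

The same view answers both "most neighbors" and "exactly one neighbor" without repeating the aggregation.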
Learning Objectives
• Learning Objectives (LO)
• LO1: Understand concept of a query language
• LO2 : Learn to use standard query language (SQL)
• LO3: Learn to use spatial ADTs with SQL
• LO4: Learn about the trends in query languages
• Facilities for user defined data types in SQL3

• Mapping Sections to learning objectives


• LO2 - 3.2, 3.3
• LO3 - 3.4, 3.5
• LO4 - 3.6
Defining Spatial Data Types in SQL3
• SQL3 User defined data type - Overview
• CREATE TYPE statements
• Define new data types
• Attributes and methods are defined
• Separate statements for interface and implementation
•Examples of interface in Table 3.12 (pp. 74)

• Additional effort is needed at physical data model level


Defining Spatial Data Types in SQL3
• Libraries, Data cartridge/blades
• Third party libraries implementing OGIS are available
• Almost all users use these libraries
• Few users need to define their own data types

• We will not discuss the detailed syntax of CREATE TYPE


•Interested readers are encouraged to look at section 3.6
Summary
• Queries to databases are posed in high level declarative manner
• SQL is the “lingua-franca” in the commercial database world
• Standard SQL operates on relatively simple data types
• SQL3/OGIS supports several spatial data types and operations
• Additional spatial data types and operations can be defined
• CREATE TYPE statement
Chapter 4: Spatial Storage and Indexing
4.1 Storage: Disks and Files
4.2 Spatial Indexing
4.3 Trends
4.4 Summary
Learning Objectives
• Learning Objectives (LO)
• LO1: Understand concept of a physical data model
• What is a physical data model?
• Why learn about physical data models?
• LO2 : Learn how to efficiently use storage devices
• LO3: Learn how to structure data files
• LO4: Learn how to use auxiliary data-structures
• LO5: Learn about technology trends in physical data model
• Focus on concepts not procedures!
• Mapping Sections to learning objectives
• LO2, LO3 - 4.1
• LO4 - 4.2
• LO5 - 4.3
Physical model in 3 level design?
• Recall 3 levels of database design
• Conceptual model: high level abstract description
• Logical model: description of a concrete realization
• Physical model: implementation using basic components
• Analogy with vehicles
• Conceptual model: mechanisms to move, turn, stop, ...
• Logical models:
• Car: accelerator pedal, steering wheel, brake pedal, …
• Bicycle: pedal forward to move, turn handle, pull brakes on handle
• Physical models :
• Car: engine, transmission, master cylinder, brake lines, brake pads, …
• Bicycle: chain from pedal to wheels, gears, wire from handle to brake pads
• We now go, so to speak, “under the hood”
What is a physical data model?
• What is a physical data model of a database?
• Concepts to implement logical data model
• Using current components, e.g. computer hardware, operating systems
• In an efficient and fault-tolerant manner
• Why learn physical data model concepts?
• To be able to choose between DBMS brand names
• Some brand names do not have spatial indices!
• To be able to use DBMS facilities for performance tuning
• For example, if a query is running slowly,
• one may create an index to speed it up
• For example, if loading a large number of tuples takes forever,
• one may drop the indices on the table before the inserts
• and recreate them after the inserts are done!
Concepts in a physical data model
• Database concepts
• Conceptual data model - entity, (multi-valued) attributes, relationship, …
• Logical model - relations, atomic attributes, primary and foreign keys
• Physical model - secondary storage hardware, file structures, indices, …
• Examples of physical model concepts from relational DBMS
• Secondary storage hardware: Disk drives
• File structures - sorted
• Auxiliary search structure -
• search trees (hierarchical collections of one-dimensional ranges)
An interesting fact about physical data model
• Physical data model design is a trade-off between
• Efficient support for a small set of basic operations on a few data types
• Simplicity of the overall system
• Each DBMS physical model
• Chooses a few physical DM techniques
• The choice depends on the chosen sets of operations and data types
• Relational DBMS physical model
• Data types: numbers, strings, date, currency
• one-dimensional, totally ordered
• Operations:
• search on one-dimensional totally ordered data types
• insert, delete, ...
Physical data model for SDBMS
• Is relational DBMS physical data model suitable for spatial data?
• Relational DBMS has simple values like numbers
• Sorting, search trees are efficient for numbers
• These concepts are not natural for Spatial data (e.g. points in a plane)
• Reusing relational physical data model concepts
• Space filling curves define a total order for points
• This total order helps in using ordered files, search trees
• But may lead to computational inefficiency!
• New spatial techniques
• Spatial indices, e.g. grids, hierarchical collection of rectangles
• Provide better computational performance
Common assumptions for SDBMS physical model
• Spatial data
• Dimensionality of space is low, e.g. 2 or 3
• Data types: OGIS data types
• Approximations for extended objects (e.g. linestrings, polygons)
• Minimum Orthogonal Bounding Rectangle (MOBR or MBR)
• MBR(O) is the smallest axis-parallel rectangle enclosing an object O
• Supports filter and refine processing of queries
• Spatial operations
• OGIS operations, e.g. topological, spatial analysis
• Many topological operations are approximated by “Overlap”
• Common spatial queries - listed in next slide
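The MBR approximation and the filter-and-refine idea can be sketched in a few lines. This is an illustration under stated assumptions: objects are modeled as small sets of points so that the exact test stays trivial, and the object names, coordinates and query window are invented.

```python
def mbr(points):
    """Smallest axis-parallel rectangle (xmin, ymin, xmax, ymax) enclosing the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def overlap(a, b):
    """True if two rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def inside(p, window):
    return window[0] <= p[0] <= window[2] and window[1] <= p[1] <= window[3]

# Hypothetical objects, each a set of points.
objects = {
    "A": [(0, 0), (4, 4)],
    "B": [(2, 2), (5, 5)],
    "C": [(10, 10), (12, 12)],
}
window = (1, 1, 3, 3)

# Filter step: cheap test on MBRs, may keep false positives.
candidates = sorted(n for n, pts in objects.items() if overlap(mbr(pts), window))

# Refine step: exact test on the real geometry of the candidates only.
answers = sorted(n for n in candidates
                 if any(inside(p, window) for p in objects[n]))
```

Object A survives the filter step because its MBR overlaps the window, but it is rejected in the refine step since none of its points lies inside; C is eliminated without ever looking at its geometry.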
Common Spatial Queries and Operations
•Physical model provides simpler operations needed by spatial queries!
•Common Queries
•Point query: Find all rectangles containing a given point.
•Range query: Find all objects within a query rectangle.
•Nearest neighbor: Find the point closest to a query point.
•Intersection query: Find all the rectangles intersecting a query rectangle.
•Distance scan: Enumerate points in increasing distance from a query point.
•Containment query: Find all the rectangles completely within a query rectangle.
•Spatial join query: Find all pairs of rectangles that overlap each other.
•Common operations across spatial queries
•find : retrieve records satisfying a condition on attribute(s)
•findnext : retrieve next record in a dataset with total order
•after the last one retrieved via previous find or findnext
•Nearest neighbor of a given object in a spatial dataset
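Several of the common queries above can be written as brute-force scans, which is the baseline that spatial indices try to beat. The rectangles, points and query parameters below are hypothetical, and rectangles are stored as (xmin, ymin, xmax, ymax) tuples.

```python
import math
from itertools import combinations

rects = {"r1": (0, 0, 2, 2), "r2": (1, 1, 4, 4), "r3": (5, 5, 6, 6)}
points = [(3, 4), (1, 1), (0, 2)]

def contains_point(r, p):
    return r[0] <= p[0] <= r[2] and r[1] <= p[1] <= r[3]

def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def within(r, q):
    """True if rectangle r lies completely inside rectangle q."""
    return q[0] <= r[0] and q[1] <= r[1] and r[2] <= q[2] and r[3] <= q[3]

point_query = sorted(n for n, r in rects.items() if contains_point(r, (1.5, 1.5)))
intersection_query = sorted(n for n, r in rects.items()
                            if intersects(r, (3, 3, 5.5, 5.5)))
containment_query = sorted(n for n, r in rects.items() if within(r, (0, 0, 4, 4)))

# Distance scan: points in increasing distance from the query point;
# the nearest neighbor is simply the first element of that scan.
distance_scan = sorted(points, key=lambda p: math.dist(p, (0, 0)))
nearest = distance_scan[0]

# Spatial join: all pairs of rectangles that overlap each other.
spatial_join = [(a, b) for a, b in combinations(sorted(rects), 2)
                if intersects(rects[a], rects[b])]
```

Every query here inspects every record, so the cost grows linearly (quadratically for the join) with the size of the dataset.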
Scope of discussion
• Learn basic concepts in physical data model of SDBMS
• Review related concepts from physical DM of relational DBMS
• Reusing relational physical data model concepts
• Space filling curves define a total order for points
• This total order helps in using ordered files, search trees
• But may lead to computational inefficiency!
• New techniques
• Spatial indices, e.g. grids, hierarchical collection of rectangles
• Provide better computational performance
Learning Objectives
• Learning Objectives (LO)
• LO1: Understand concept of a physical data model
• LO2 : Learn how to efficiently use storage devices
• Concepts in Storage Hierarchy
• Characteristics of secondary storage
• Using secondary storage efficiently
• LO3: Learn how to structure data files
• LO4: Learn how to use auxiliary data-structures
• LO5: Learn about technology trends in physical data model

• Mapping Sections to learning objectives


• LO2, LO3 - 4.1 (4.1.1)
• LO4 - 4.2
• LO5 - 4.3
Storage Hierarchy in Computers
•Computers have several components
•Central Processing Unit (CPU)
•Input, output devices, e.g. mouse, keyboard, monitors, printers
•Communication mechanisms, e.g. internal bus, network card, modem
•Storage Hierarchy
•Types of storage Devices
•Main memories - fast but content is lost when power is off
•Secondary storage - slower, retains content without power
•Tertiary storage - very slow, retains content, very large capacity
•DBMS usually manage data
•on secondary storage, e.g. disks
•Use main memory to improve performance
•Use tertiary storage (e.g. tapes) for backup, archival etc.
Secondary Storage Hardware: Disk Drives
• Disk concepts
• Circular platters with magnetic storage medium
• Multiple platters are mounted on a spindle
• Platters are divided into concentric tracks
• A cylinder is a collection of tracks across platters with a common radius
• Tracks are divided into sectors
• A sector may hold a few kilobytes
•Disk drive concepts
• Disk heads to read and write
• There is a disk head for each platter (recording surface)
• A head assembly moves all the heads together in the radial direction
• Spindle rotates at a high speed, e.g. thousands of revolutions per minute
• Accessing a sector has three major steps:
• Seek: Move head assembly to relevant track
• Latency: Wait for spindle to rotate relevant sector under disk head
• Transfer: Read or write the sector
• Other steps involve communication between disk controller and CPU
Using Disk Hardware Efficiently
• Disk access costs are affected by
• Placement of data on the disk
• The fact that seek cost > latency cost > transfer cost (See Table 4.2, pp. 86)
• A few common observations follow
• Size of sectors
• Larger sectors provide faster transfer of large data sets
• But waste storage space inside sectors for small data sets
•Placement of most frequently accessed data items
• On middle tracks rather than innermost or outermost tracks
• Reason: minimize average seek time
• Placement of items in a large data set requiring many sectors
• Choose sectors from a single cylinder
• Reason: Minimize seek cost in scanning the entire data set.
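A back-of-the-envelope model shows why placing a data set on one cylinder pays off. All figures are hypothetical round numbers chosen for illustration, not values from Table 4.2, and the model is deliberately simplified: a scattered read pays seek and rotational latency for every sector, while a single-cylinder read pays them once.

```python
# Hypothetical drive parameters, for illustration only.
RPM = 7200
AVG_SEEK_MS = 8.0
TRANSFER_MS_PER_SECTOR = 0.1

# On average the wanted sector is half a revolution away from the head.
avg_latency_ms = 0.5 * 60_000 / RPM

def random_read_ms(num_sectors):
    """Sectors scattered over the disk: seek + latency + transfer for each."""
    return num_sectors * (AVG_SEEK_MS + avg_latency_ms + TRANSFER_MS_PER_SECTOR)

def same_cylinder_read_ms(num_sectors):
    """Consecutive sectors on one cylinder: one seek, one latency, then transfers."""
    return AVG_SEEK_MS + avg_latency_ms + num_sectors * TRANSFER_MS_PER_SECTOR

scattered = random_read_ms(100)
clustered = same_cylinder_read_ms(100)
print(f"scattered: {scattered:.1f} ms, same cylinder: {clustered:.1f} ms")
```

With these assumed numbers the ordering seek > latency > transfer holds, and almost all of the scattered cost is head movement and rotation rather than data transfer.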
Software view of Disks: Fields, Records and File
• Views of secondary storage (e.g. disks)
• Hardware views - discussed in last few slides
• Software views - Data on disks is organized into fields, records, files
• Concepts
• A field represents a property or attribute of a relation or an entity
• A record represents a row in a relational table
•Collection of fields for attributes in relational schema of the table
•Files are collections of records
•Homogeneous collection of records may represent a relation
•Heterogeneous collections may be a union of related relations.
Comparison
Mapping Records and files to Disk
• Records Fig 4.1
•Often smaller than a sector
•Many records in a sector
• Files with many records
•Many sectors per file
• File system
•Collection of files
•Organized into directories
•Mapping tables to disk
•Figure 4.1
•City table takes 2 sectors
•Others take 1 sector each
Cont...
● There are many ways of organizing fields, records and files to
suit a particular application.
● Each type of organisation has pros and cons.
● The Binary Large Object (BLOB) field type has played an
important role in the development of spatial databases.
● Traditional databases store complex data in a BLOB field and
treat it as unformatted data with no structure.
● No query operations are available on a BLOB field.
4.1.2 Buffer Management
•Motivation
• Accessing a sector on disk is much slower than accessing main memory
• Idea: Keep repeatedly accessed data in main memory buffers
•To improve the completion time of queries
•Reducing load on disk drive
•Buffer Manager software module decides
•Which sectors stay in main memory buffers?
•Which sector is moved out if we run out of memory buffer space?
•When to pre-fetch sector before access request from users?
•These decisions are based on the disk access patterns of queries!
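A minimal sketch of a buffer manager follows. The slides do not name a replacement policy, so least-recently-used (LRU) eviction is an assumption here, as are the class name BufferPool and the stand-in read_sector function; pre-fetching is not modeled.

```python
from collections import OrderedDict

class BufferPool:
    """Keeps up to `capacity` sectors in memory and evicts the least recently used."""

    def __init__(self, capacity, read_sector):
        self.capacity = capacity
        self.read_sector = read_sector  # callable that fetches a sector from disk
        self.frames = OrderedDict()     # sector id -> data, oldest use first
        self.disk_reads = 0

    def get(self, sector_id):
        if sector_id in self.frames:
            # Hit: no disk access, just mark the sector as recently used.
            self.frames.move_to_end(sector_id)
            return self.frames[sector_id]
        # Miss: go to disk, then make room if the pool is over capacity.
        self.disk_reads += 1
        data = self.read_sector(sector_id)
        self.frames[sector_id] = data
        if len(self.frames) > self.capacity:
            self.frames.popitem(last=False)  # evict the least recently used sector
        return data

# Usage: a two-frame pool and a hypothetical access pattern.
pool = BufferPool(2, lambda s: f"sector-{s}")
for sector in [1, 2, 1, 3, 1, 2]:
    pool.get(sector)
```

For this access pattern two of the six requests are served from memory; a different pattern, or a different policy, would give a different hit rate, which is why the decisions depend on the access patterns of the queries.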
Learning Objectives
• Learning Objectives (LO)
• LO1: Understand concept of a physical data model
• LO2 : Learn how to efficiently use storage devices
• LO3: Learn how to structure data files
• What is a file structure? Why structure files?
• What are common structures for spatial datafile?
• LO4: Learn how to use auxiliary data-structures
• LO5: Learn about technology trends in physical data model

• Mapping Sections to learning objectives


• LO2, LO3 - 4.1
• LO4 - 4.2
• LO5 - 4.3
4.1.4 File Structures
• What is a file structure?
• A method of organizing records in a file
• For efficient implementation of common file operations on disks
•Example: ordered files
• Measure of efficiency
• I/O cost: Number of disk sectors retrieved from secondary storage
• CPU cost: Number of CPU instructions used
• See Table 4.1 for relative importance of cost components
•Total cost = sum of I/O cost and CPU cost
4.1.4 File Structures - selected file operations
•Common file operations
•Find: key value --> record matching key values
•Findnext --> Return next record after find if records were sorted
•Insert --> Add a new record to file without changing file-structure
•Nearest neighbor of an object in a spatial dataset
•Examples using Figure 4.1, pp. 88
•find(Name = Canada) on Country table returns record about Canada
•findnext() on Country table returns record about Cuba
•since Cuba is next value after Canada in sorted order of Name
•insert(record about Panama) into Country table
•adds a new record
•location of record in Country file depends on file-structure
•the nearest neighbor of Argentina in the Country table is Brazil
4.1.4 Common File Structures
• Common file structures
• Heap or unordered or unstructured
• Ordered
• Hashed
• Clustered
• Descriptions follow
• Basic Comparison of Common File Structures
• Heap file is efficient for inserts and used for logfiles
•But find, findnext, etc. are very slow
• Hashed files are efficient for find, insert, delete etc.
•But findnext is very slow
•Ordered file organization is very fast for findnext
•and pretty competent for find, insert, etc.
4.1.4 File Structures: Heap, Ordered
•Heap
• Records are in no particular order (Example: Figure 4.1)
•insert can simply add the record to the last sector
•find, findnext, nearest neighbor scan the entire file
•Ordered
• Records are sorted by a selected field (Example Fig. 4.3 below)
• findnext can simply pick up physically next record
• find, insert, delete may use binary search, which is very efficient
• nearest neighbor processed as a range query (see pp. 95 for details)
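The operations on an ordered file can be sketched with Python's bisect module, treating a sorted list of key values as the file. The country names are ones mentioned in these slides; an in-memory list is of course only a stand-in for sorted records on disk.

```python
import bisect

# Key field of the Country file, kept in sorted order.
names = ["Argentina", "Brazil", "Canada", "Cuba", "USA"]

def find(key):
    """Binary search: position of the record with this key, or None."""
    i = bisect.bisect_left(names, key)
    return i if i < len(names) and names[i] == key else None

def findnext(position):
    """The physically next record, or None at the end of the file."""
    return position + 1 if position + 1 < len(names) else None

pos = find("Canada")
after_canada = names[findnext(pos)]

# insert must keep the file sorted, so the record goes into the middle.
bisect.insort(names, "Panama")
```

find needs only a logarithmic number of probes and findnext is a single step, which matches the comparison of file structures given earlier; the price is that insert has to preserve the order.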

Figure 4.3
File Structure : Hash
•Components of a Hash file structure (Fig. 4.2)
• A set of buckets (sectors)
• Hash function : key value --> bucket
• Hash directory: bucket --> sector
• Operations
•find, insert, delete are fast
•compute hash function
•lookup directory
•fetch relevant sector
•findnext, nearest neighbor are slow
•no order among records
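The three components of a hash file can be mimicked with small Python structures. The hash function, the number of buckets, the sector addresses and the sample records are all assumptions made for this sketch.

```python
NUM_BUCKETS = 4

def bucket_of(key):
    """Hash function: key value -> bucket number (a deliberately simple choice)."""
    return sum(key.encode()) % NUM_BUCKETS

# Hash directory: bucket number -> hypothetical sector address.
directory = {b: 100 + b for b in range(NUM_BUCKETS)}

# The "disk": sector address -> records stored in that sector.
sectors = {address: [] for address in directory.values()}

def insert(record):
    sectors[directory[bucket_of(record[0])]].append(record)

def find(key):
    # Compute hash, look up the directory, fetch one sector, scan it.
    for record in sectors[directory[bucket_of(key)]]:
        if record[0] == key:
            return record
    return None

insert(("Canada", "record about Canada"))
insert(("Cuba", "record about Cuba"))
found = find("Canada")
missing = find("Peru")
```

find touches a single sector regardless of file size, but the records of neighboring keys end up in unrelated buckets, so there is no cheap way to implement findnext.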

Fig 4.2
4.1.5 Spatial File Structures: Clustering
● The goal of clustering is to reduce seek (ts) and latency (tl) time
in answering common large queries.
● Three types of clustering supported by SDBMS to provide
efficient query processing are:
● Internal clustering: to speed up access to single object by storing in
single disk page.
● Local clustering: to speed up access to several objects by storing
grouped set of spatial objects onto one page.
● Global clustering: Spatially adjacent objects stored on several
physically consecutive pages that can be accessed by a single read
request.
Clustering

• Motivation:
•Ordered files are not natural for spatial data
• Clustering records in sector by space filling curve is an alternative
•In general, clustering groups records
•accessed by common queries
•into common disk sectors
•to reduce I/O costs for selected queries
•Clustering using Space filling curves
•Z-curve
•Hilbert-curve
•Details on following 3 slides
Z-Curve
•What is a Z-curve?
• A space filling curve
• Generated from interleaving bits
•of x, y coordinates
•See Fig. 4.6
•Alternative generation method
•see Fig. 4.5
•Connecting points by z-order
•see Fig. 4.4
•looks like Ns or Zs
•Implementing file operations
•similar to ordered files
Fig 4.4
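Bit interleaving can be sketched as follows. The choice of placing the x bit before the y bit is an assumption; the convention in Fig. 4.6 may differ, which would only mirror the curve.

```python
def z_value(x, y, bits):
    """Interleave the bits of x and y, most significant bit first.

    Assumed convention: each x bit is placed before the matching y bit.
    """
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z

# Z-values of all cells in a 4 x 4 grid (2 bits per coordinate)
grid = {(x, y): z_value(x, y, 2) for x in range(4) for y in range(4)}
```

Sorting records by this value gives a total order, so find and findnext can be implemented as for ordered files.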
Z-curve
Example of Z-values
•Figure 4.7
• Left part shows a map with spatial objects A, B, C
• Right part and left bottom part show Z-values within A, B and C
•Note C gets z-values of 2 and 8, which are not close
•Exercise: Compute z-values for B.

Fig 4.7
Hilbert Curve
• A space filling curve
•Example: Fig. 4.5
•More complex to generate
•due to rotations
•See details on pp. 92-93
•Illustration on next slide!
• Implementing file operations
•similar to ordered files
Fig 4.5
Hilbert Curve
Calculating Hilbert Values (Optional Topic)

•Procedure on pp. 92

Fig 4.8
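For readers who want to experiment, here is a sketch of one widely used iterative way to compute a Hilbert value from cell coordinates. It is not necessarily the procedure on pp. 92, and the orientation of the resulting curve may differ from Fig. 4.8; the function name is hypothetical.

```python
def hilbert_value(n, x, y):
    """Position of cell (x, y) along a Hilbert curve on an n x n grid.

    n must be a power of two. The quadrant is rotated or flipped at each
    level, which is what makes this harder to generate than a Z-value.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate the quadrant
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Cells of an 8 x 8 grid listed in Hilbert order
cells = sorted(((x, y) for x in range(8) for y in range(8)),
               key=lambda c: hilbert_value(8, c[0], c[1]))
```

Unlike the Z-curve, consecutive cells in this order are always neighbors in space.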
Handling Regions with Z-curve

Fig 4.9
Learning Objectives
• Learning Objectives (LO)
• LO1: Understand concept of a physical data model
• LO2 : Learn how to efficiently use storage devices
• LO3: Learn how to structure data files
• LO4: Learn how to use auxiliary data-structures
• Concept of index
• Spatial indices, e.g. Grids / Grid-file and R-tree families
• Focus on concepts not procedures!
• LO5: Learn about technology trends in physical data model

• Mapping Sections to learning objectives
• LO2, LO3 - 4.1
• LO4 - 4.2
• LO5 - 4.3
What is an index?
• Concept of an index
•auxiliary file to search a data file
•Example: Fig. 4.10
•index records have
•key value
•address of relevant data sector
•see arrows in Fig. 4.10
•Index records are ordered
•find, findnext, insert are fast
•Note assumption of total order
•on values of indexed attributes
Classifying indexes
• Classification criteria
• Data-file-structure
• Key data type
• others
•Secondary index
•Heap data file
•1 index record per data record
•Example Fig. 4.10
•Primary index
•Data file ordered by indexed attribute
•1 index record per data sector
•Example: Fig. 4.11

•Q? A table can have at most one primary index. Why?
Attribute data types and Indices
• Index file structure depends on data type of indexed attribute
• Attributes with total order
•Example, numbers, points ordered by space filling curves
•B-tree is a popular index organization
• See Figure 1.12 (pp. 18) and section 1.6.4
• Spatial objects (e.g. polygons)
•Spatial organizations are more efficient
•Hundreds of organizations have been proposed in the literature
•Two main families are Grid Files and R-trees
What is Grid File?
A grid file is a balanced multi-key file structure that provides uniform
access to multidimensional data.

A grid file has relatively good I/O performance for exact-match and
partial-match retrieval.

Its goal is the two-disk-access principle: one access to get the directory
entry and a second to get the actual bucket holding the record.
Ideas behind Grid Files
•Basic idea- Divide space into cells by a grid
•Example: Fig. 4.12,
•Example: latitude-longitude, ESRI Arc/SDE
•Store data in each cell in distinct disk sector
•Efficient for find, insert, nearest neighbor
•But may waste disk storage space
•when data distribution over space is non-uniform

• Refinement of basic idea into Grid Files
•1. Use non-uniform grids (Fig. 4.14)
•Linear scales store row and column boundaries
•2. Allow sharing of disk sectors across grid cells
•See Figure 4.13 on next slide
Fig 4.14
Grid Files

• Grid File components
• Linear scales - row/column boundaries
• Grid directory: cell --> disk sector address
• data sectors on disk
• Operation implementation
•Scales and grid directory in main memory
•Steps for find, nearest neighbour
•Search linear scales
• Identify selected grid directory cells
•Retrieve selected disk sectors
• Performance overview
•Efficient in terms of I/O costs
•Needs large main memory for grid directory

Fig 4.13
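The find steps above (search linear scales, identify directory cell, retrieve sector) can be sketched as below. The boundary values, the sector names and the cell-to-sector sharing are invented for illustration, and bucket splitting on overflow is left out.

```python
import bisect

# Linear scales: interior boundaries of a non-uniform grid (assumed values)
x_scale = [10, 20]   # columns: x < 10, 10 <= x < 20, x >= 20
y_scale = [50]       # rows:    y < 50, y >= 50

# Grid directory: cell --> disk sector address; cells may share a sector
grid_directory = {
    (0, 0): "s1", (1, 0): "s1",
    (2, 0): "s2",
    (0, 1): "s3",
    (1, 1): "s4", (2, 1): "s4",
}

# Data sectors on disk, modeled as lists of (x, y, payload)
sectors = {"s1": [], "s2": [], "s3": [], "s4": []}

def cell_of(x, y):
    """Search the linear scales to identify the grid directory cell."""
    return (bisect.bisect_right(x_scale, x), bisect.bisect_right(y_scale, y))

def insert(x, y, payload):
    sectors[grid_directory[cell_of(x, y)]].append((x, y, payload))

def find(x, y):
    """Exact-match lookup: one directory access, one data-sector access."""
    sector = sectors[grid_directory[cell_of(x, y)]]
    return [p for (px, py, p) in sector if (px, py) == (x, y)]

insert(5, 5, "a")
insert(15, 5, "b")
insert(25, 70, "c")
```

Because scales and directory are assumed to sit in main memory, only the final sector fetch costs disk I/O in this sketch.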
What is R-tree?
● A dynamic index structure for spatial searching
● It is a height-balanced tree
● Index records in its leaf nodes contain pointers to object data
4.2.2 R-Tree Family
•Basic Idea
• Use a hierarchical collection of rectangles to organize spatial data
• Generalizes B-tree to spatial data sets
• Classifying members of R-tree family
• Handling of large spatial objects
• Allow rectangles to overlap - R-tree
• Duplicate objects but keep interior node rectangles disjoint - R+tree
• Selection of rectangles for interior nodes
• greedy procedures - R-tree, R+tree
• procedures to minimize coverage, overlap - packed R-tree
• Other criteria exist
• Scope of our discussion
• Basics of R-tree and R+tree
• Focus on concepts not procedures!
Spatial Objects with R-Tree
•Properties of R-trees
•Balanced
• Nodes are rectangles
• child’s rectangle within parent’s
• possible overlap among rectangles!
• Other properties in section 4.2.2
•Implementation of find operation
•Search root to identify relevant children
•Search selected children recursively
•Ex.: find record for rectangle 5
•Root search identifies child x
•Search of x identifies children b and c
•Search of b does not find object 5
•Search of c finds object 5
Fig 4.15
Fig 4.16
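The recursive find can be sketched as below. The tree loosely mirrors the walk-through above (root, x, b, c, object 5), but all coordinates and the Node class are invented for illustration and are not taken from Fig. 4.15 or 4.16.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def overlaps(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

@dataclass
class Node:
    name: str
    mbr: Rect
    children: List["Node"] = field(default_factory=list)  # empty for data entries

def find(node, query, visited):
    """Return names of data entries whose rectangle overlaps the query."""
    if not overlaps(node.mbr, query):
        return []
    visited.append(node.name)
    if not node.children:
        return [node.name]
    hits = []
    for child in node.children:          # several children may qualify
        hits.extend(find(child, query, visited))
    return hits

b = Node("b", (0, 0, 3.5, 3.5), [Node("3", (0, 3, 1, 3.5)), Node("4", (3, 0, 3.5, 1))])
c = Node("c", (3, 3, 6, 6), [Node("5", (3, 3, 4, 4)), Node("6", (5, 5, 6, 6))])
x = Node("x", (0, 0, 6, 6), [b, c])
y = Node("y", (8, 8, 9, 9), [Node("7", (8, 8, 9, 9))])
root = Node("root", (0, 0, 9, 9), [x, y])

visited = []
hits = find(root, (3.2, 3.2, 3.4, 3.4), visited)
```

Both b and c are visited because their rectangles overlap, which is the cost an R-tree pays for allowing overlap.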
R+tree
• Properties of R+trees
•Balanced
• Interior nodes are rectangles
• child’s rectangle within parent’s
•disjoint rectangles
• Leaf nodes - MOBR of polygons or lines
•leaf’s rectangle overlaps with parent’s
• Data objects may be duplicated across leaves
• Other properties in section 4.2.2
•find operation - same as R-tree
•But only one child is followed down
•Ex.: find record for rectangle 5
•Root search identifies child x
•Search of x identifies children b and c
•Search either b or c to find object 5
Fig 4.17
Fig 4.18
Learning Objectives
• Learning Objectives (LO)
• LO1: Understand concept of a physical data model
• LO2 : Learn how to efficiently use storage devices
• LO3: Learn how to structure data files
• LO4: Learn how to use auxiliary data-structures
• LO5: Learn about technology trends in physical data model

• Mapping Sections to learning objectives
• LO2, LO3 - 4.1
• LO4 - 4.2
• LO5 - 4.3
4.3 Trends
• New developments in physical model
• Use of intra-object indexes
• Support for multiple Concurrent operations
• Index to support spatial join operations
•Use of intra-object indexes
•Motivation: large objects (e.g. polygon boundary of USA has 1000s of edges)
•Algorithms for OGIS operations (e.g. touch, crosses)
•often need to check only a few edges of the polygon
•Relevant edges can be identified by spatial index on edges
•Example: Fig. 4.19, pp. 105, section 4.3.1
•Uniqueness
•intra-object index organizes components within a large spatial object
•traditional index organizes a collection of spatial objects
4.3.2 Trends - Concurrency support
• Why support Concurrent operations?
• SDBMS is shared among many users and applications
• Simultaneous requests from multiple users on a spatial table
• serial processing of requests is not acceptable for performance
• concurrent updates and find can provide incorrect results
• Concurrency control idea for R-tree index
•R-link tree: Add links to chain nodes at each level
•Use links to ensure correct answer from find operations
•Use locks on nodes to coordinate conflicting updates
•Details in section 4.3.2 and Fig. 4.20, pp. 107
4.3.3 Trends: Join Index
• Ideas
• Spatial join is a common operation. Expensive to compute using traditional indexes
• Spatial join index pre-computes and stores id-pairs of matched rows across tables
• Example in Fig. 4.21
• Speeds up computation of spatial join
•Details in section 4.3.3

Fig 4.21
Spatial Join-index Details

Fig 4.22

Fig 4.23
Summary
• Physical DM efficiently implements logical DM on computer hardware
• Physical DM has file-structure, indexes
• Classical methods were designed for data with total ordering
• fall short in handling spatial data
• because spatial data is multi-dimensional
• Two approaches to support spatial data and queries
• Reuse classical method
• Use Space-Filling curves to impose a total order on multi-dimensional data
• Use new methods
• R-trees, Grid files
Ch. 5: Query Processing and Optimization

5.1 Evaluation of Spatial Operations
5.2 Query Optimization
5.3 Analysis of Spatial Index Structures
5.4 Distributed Spatial Database Systems
5.5 Parallel Spatial Database Systems
5.6 Summary
Learning Objectives
Learning Objectives (LO)
LO1: Understand concept of query processing and optimization (QPO)
• What is a QPO ?
• Why learn about QPO ?
LO2 : Learn about alternative algorithms to process spatial queries
LO3: Learn about query optimizer
LO4: Learn about trends
Focus on concepts not procedures!
Mapping Sections to learning objectives
LO2 - 5.1
LO3 - 5.2, 5.3
LO4 - 5.4, 5.5
Analogy of Automatic Transmission in Cars
Manual transmission : automatic :: Java : SQL
Recall Java program (Section 2.1.6, pp. 32-34)
• Algorithm to answer the query was coded in the program
• Similar to manual gear change at start and stop in Cars
In contrast, SQL queries are declarative
• Users do not specify the procedure to answer it
• DBMS needs to pick an algorithm to answer query
• Analogy: automatic transmission choosing gear (1, 2, 3, …)
Relevant SDBMS component
Query processing and optimization (QPO)
• Picks algorithms to process a SQL query
Physical data model : QPO :: engine : automatic transmission
What is Query Processing and Optimization (QPO)?
Basic idea of QPO
In SQL, queries are expressed in high level declarative form
QPO translates a SQL query to an execution plan
• over physical data model
• using operations on file-structures, indices, etc.
Ideal execution plan answers Q in as little time as possible
Constraints: QPO overheads are small
• Computation time for QPO steps << that for execution plan
Why learn about QPO?
Why learn about automatic transmission in a car?
Identify cause of lack of power in a car
• Is it the engine or the transmission ?
Solve performance problem with manual override
• Uphill, downhill driving => lower gears
Why learn about QPO in a SDBMS?
Identify performance bottleneck for a query
• Is it the physical data model or QPO ?
How to help QPO speed up processing of a query ?
• Providing hints, rewriting query, etc.
How to enhance physical data model to speed up queries?
• Add indices, change file-structures, …
Three Key Concepts in QPO
1. Building blocks
Most cars have few motions, e.g. forward, reverse
Similarly, most DBMS have few building blocks:
• select (point query, range query), join, sorting, ...
A SQL query is decomposed into building blocks
2. Query processing strategies for building blocks
Cars have a few gears for forward motion: 1st, 2nd, 3rd, overdrive
DBMS keeps a few processing strategies for each building block
• e.g. a point query can be answered via an index or via scanning data-file
3. Query optimization
Automatic transmission tries to pick the best gear given motion parameters
For each building block of a given query, DBMS QPO tries to choose
• “Most efficient” strategy given database parameters
• Parameter examples: Table size, available indices, …
• Ex. Index search is chosen for a point query if the index is available
QPO Challenges
Choice of building blocks
SQL Queries are based on relational algebra (RA)
Building blocks of RA are select, project, join
• Details in section 3.2 (Note symbols sigma, pi and join)
SQL3 adds new building blocks like transitive closure
• Will be discussed in chapter 6
Choice of processing strategies for building blocks
Constraints: Too many strategies=> higher complexity
Commercial DBMS have a total of 10 to 30 strategies
• 2 to 4 strategies for each building block
How to choose the “best” strategy from among the applicable ones?
May use a fixed priority scheme
May use a simple cost model based on DBMS parameters
QPO Challenges in SDBMS
Building Blocks for spatial queries
Rich set of spatial data types, operations
A consensus on “building blocks” is lacking
Current choices include spatial select, spatial join, nearest neighbor
Choice of strategies
Limited choice for some building blocks, e.g. nearest neighbor
Choosing best strategies
Cost models are more complex since
• Spatial Queries are both CPU and I/O intensive
• while traditional queries are I/O intensive
Cost models of spatial strategies are not mature.
QPO Challenges in SDBMS - Exercise
Learning Aid
Often helpful for readers to try to solve the QPO problem
Before looking at the current solutions
Particularly when solutions are not mature
Try following exercise to get an insight into chapter 5 topics

Exercise:
Propose a few additional building blocks for spatial queries
• besides spatial selection, spatial join and nearest neighbor
• Use GIS operations (Table 1.1, pp. 3) as a guide if needed
Justify the proposal by listing spatial queries needing the component
Detail the proposal by listing a few algorithms for the building block
How would one choose between the available algorithms?
Scope of Discussion
Chapter 5 will discuss
Choice of building blocks for spatial queries
Choice of processing strategies for building blocks
How to choose the “best” strategy from among the applicable ones?
Focus on concepts not procedures
Procedures change with change in computer hardware
Concepts do not change as often
Readers are more likely to remember the concepts after the course
Learning Objectives
Learning Objectives (LO)
LO1: Understand concept of query processing and optimization (QPO)
LO2 : Learn about alternative algorithms to process spatial queries
• What are the building blocks of spatial queries?
• What are common strategies for each building block?
LO3: Learn about query optimizer
LO4: Learn about trends
Focus on concepts not procedures!
Mapping Sections to learning objectives
LO2 - 5.1
LO3 - 5.2, 5.3
LO4 - 5.4, 5.5
Building Blocks for Spatial Queries
Challenges in choosing building blocks
Rich set of data types - point, line string, polygon, …
Rich set of operators - topological, euclidean, set-based, …
Large collection of computational geometry algorithms
• for different spatial operations on different spatial data types
Desire to limit complexity of SDBMS
How to simplify choice of data types and operators?
Reusing a Geographic Information System (GIS)
• which already implements spatial data types and operations
• however may have difficulties processing large data sets on disk
SDBMS reduces set of objects to be processed by a GIS
SDBMS is used as a filter
This is filter and refinement approach
The Filter-Refine Paradigm
• Processing a spatial query Q
•Filter step : find a superset S of the objects in the answer to Q
•Using approximations of spatial data types and operators
•Refinement step : find exact answer to Q reusing a GIS to process S
•Using exact spatial data type and operation

Fig 5.1
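A minimal sketch of filter and refine for the query "which polygons contain point p". The two polygons are invented, the MBR test plays the role of the filter, and a standard ray-casting test stands in for the exact GIS operation.

```python
def mbr(poly):
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    return (min(xs), min(ys), max(xs), max(ys))

def mbr_contains(box, p):
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]

def point_in_polygon(p, poly):
    """Exact test by ray casting (the refinement step)."""
    x, y = p
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

objects = {
    "triangle": [(0, 0), (4, 0), (0, 4)],
    "square": [(5, 5), (7, 5), (7, 7), (5, 7)],
}

def filter_step(p):
    # Cheap approximate test; returns a superset of the answer
    return [name for name, poly in objects.items() if mbr_contains(mbr(poly), p)]

def refine_step(p, candidates):
    # Exact test, applied only to the surviving candidates
    return [name for name in candidates if point_in_polygon(p, objects[name])]

candidates = filter_step((3, 3))
answer = refine_step((3, 3), candidates)
```

Point (3, 3) passes the filter for the triangle but is rejected on refinement, which shows why the filter step alone is not enough.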
Approximate Spatial Data types
Approximating spatial data types
Minimum orthogonal bounding rectangle (MOBR or MBR)
• approximates line string, polygon, …
• See Examples below (black rectangles are MBRs for red objects)
MBRs are used by spatial indexes, e.g. R-tree
Algorithms for spatial operations on MBRs are simple

Q? Which OGIS operation (Table 3.9, pp. 66) returns MBRs ?


Approximate Spatial Operations
Approximating spatial operations
SDBMS processes MBRs in the filter step, leaving exact geometries for the refinement step
Overlap predicate used to approximate topological operations
Example: inside(A, B) replaced by
• overlap(MBR(A), MBR(B)) in filter step
• See picture below - Let A be outer polygon and B be the inner one
• inside(A, B) can be true only if overlap(MBR(A), MBR(B)) is true
• However, overlap is only a filter for the inside predicate, which needs refinement later
Filter Step Example
Query:
List objects in front of a viewer V
Equivalent overlap query
Direction region is a polygon
List objects overlapping with
• polygon( front(V))
Approximate query
List objects overlapping with
• MBR(polygon (front (V)))
Approximate Spatial Operations - 2
Exercise: Approximate following using overlap predicate
Cross(A, B), Touch(A, B), Disjoint(A, B)
See Table 3.9, pp. 66 for definition of these operations.
Exercise: Given MBRs R and S, provide conditions to test
Overlap(R, S)
Use coordinates of lower-left and upper-right corners of MBRs
Choice of building blocks
Choice of building blocks
Varies across software vendors and products
Representative building blocks are listed here
List of building blocks
Point Query- Name a highlighted city on a digital map.
• Return one spatial object out of a table
Range Query- List all countries crossed by the river Amazon.
• Returns several objects within a spatial region from a table
Spatial Join: List all pairs of overlapping rivers and countries.
• Return pairs from 2 tables satisfying a spatial predicate
Nearest Neighbor: Find the city closest to Mount Everest.
• Return one spatial object from a collection
Strategies for Each Building Block
Choice of strategies
Varies across software vendors and products
Representative strategies are listed here
Some strategies need special file-structures or indices
Description of strategies
Main message: there are multiple strategies for each building block!
Focus on concepts rather than procedures
Readers interested in procedural details (e.g. algorithms)
• Refer to papers in Bibliographic notes
• Note: better algorithms appear in literature every year!
Strategies for Point Queries
Recall Point Query Example
Name a highlighted city on a digital map.
Return one spatial object out of a table
List of strategies
Scan all B disk sectors of the data file
If records are ordered using space filling curve (say Z-order)
• then use binary search on the Z-order of search point
• to examine about log(B, base = 2) disk sectors
If an index is available on spatial location of data objects,
• then use find() operation on the index
• number of disk sectors examined = depth of index (typically 4 to 5)
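The sector counts quoted above can be compared with a rough calculation. The function and strategy names are hypothetical, and the model counts only disk sectors examined, ignoring CPU cost.

```python
import math

def point_query_costs(num_sectors, z_ordered=False, index_depth=None):
    """Estimated disk sectors examined by each applicable strategy."""
    costs = {"scan": num_sectors}
    if z_ordered:
        costs["binary search on Z-order"] = math.ceil(math.log2(num_sectors))
    if index_depth is not None:
        costs["index find"] = index_depth
    return costs

# Assumed example: data file of B = 1,000,000 sectors, index of depth 4
costs = point_query_costs(1_000_000, z_ordered=True, index_depth=4)
cheapest = min(costs, key=costs.get)
```

For a file of this assumed size the scan examines a million sectors, binary search about twenty, and the index four.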
Strategies for Range Queries
Recall Range Query Example-
List all countries crossed by the river Amazon.
Returns several objects within a spatial region from a table
List of strategies
Scan all B disk sectors of the data file
If records are ordered using space filling curve (say Z-order)
• then determine range of Z-order values satisfying range query
• Use binary search to get lowest Z-order within query answer
• Scan forward in the data file till the highest z-order satisfying query
If an index is available on spatial location of data objects,
• then use range-query operation on the index
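The Z-order strategy for a rectangular range query can be sketched as follows, assuming point records on an 8 x 8 grid held in a list sorted by Z-value. It relies on the Z-value never decreasing when x or y increases, so every point in the query rectangle lies between the Z-values of its lower-left and upper-right corners.

```python
import bisect

def z_value(x, y, bits=3):
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z

def range_query(sorted_records, xmin, ymin, xmax, ymax):
    """sorted_records: list of (z, x, y) tuples sorted by z."""
    z_lo = z_value(xmin, ymin)
    z_hi = z_value(xmax, ymax)
    # Binary search for the lowest Z-value that can be in the answer
    start = bisect.bisect_left(sorted_records, (z_lo,))
    hits = []
    for z, x, y in sorted_records[start:]:   # scan forward in the file
        if z > z_hi:
            break
        # Some records in the Z-range fall outside the rectangle
        if xmin <= x <= xmax and ymin <= y <= ymax:
            hits.append((x, y))
    return hits

points = [(x, y) for x in range(8) for y in range(8) if (3 * x + y) % 4 == 0]
records = sorted((z_value(x, y), x, y) for x, y in points)
result = range_query(records, 2, 1, 5, 6)
```

The scan may still read records outside the rectangle, so the explicit coordinate check is needed.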
Strategies for Spatial Joins
Recall Spatial Join Example:
List all pairs of overlapping rivers and countries.
Return pairs from 2 tables satisfying a spatial predicate
List of strategies
Nested loop:
• Test all possible pairs for spatial predicate
• All rivers are paired with all countries
Space Partitioning:
• Test pairs of objects from common spatial regions only
• Rivers in Africa are tested with countries in Africa only!
Tree Matching
• Hierarchical pairing of object groups from each table
Other, e.g. spatial-join-index based, external plane-sweep, …
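A sketch contrasting the nested loop and space partitioning strategies on MBRs. The rectangles, names and tile size are invented, and the join works on MBRs only, so in a real system it would be a filter step followed by refinement.

```python
from collections import defaultdict
from itertools import product

def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def nested_loop_join(R, S):
    """Test all possible pairs."""
    return {(i, j) for (i, a), (j, b) in product(R.items(), S.items())
            if overlaps(a, b)}

def tiles_of(rect, tile):
    xmin, ymin, xmax, ymax = rect
    return product(range(int(xmin // tile), int(xmax // tile) + 1),
                   range(int(ymin // tile), int(ymax // tile) + 1))

def partition_join(R, S, tile=10):
    """Test only pairs of objects that fall in a common tile."""
    buckets = defaultdict(lambda: ([], []))
    for i, a in R.items():
        for t in tiles_of(a, tile):
            buckets[t][0].append(i)
    for j, b in S.items():
        for t in tiles_of(b, tile):
            buckets[t][1].append(j)
    result = set()          # a set removes pairs found in several tiles
    for rs, ss in buckets.values():
        for i in rs:
            for j in ss:
                if overlaps(R[i], S[j]):
                    result.add((i, j))
    return result

rivers = {"Amazon": (0, 0, 25, 5), "Nile": (40, 0, 45, 30)}
countries = {"Brazil": (5, 0, 30, 20), "Peru": (0, 0, 8, 12),
             "Egypt": (38, 20, 50, 32), "Chile": (0, -30, 6, -5)}
pairs = partition_join(rivers, countries)
```

Both functions return the same pairs; partitioning just avoids testing objects that share no tile.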
Strategies for Nearest Neighbor Queries
Recall Nearest Neighbor Example
Find the city closest to Mount Everest.
Return one spatial object from city data file C
List of strategies
Two phase approach
• Fetch C’s disk sector(s) containing the location of Mt. Everest
• M = minimum distance( Mt. Everest, cities in fetched sectors)
• Test all cities within distance M of Mt. Everest (Range Query)
Single phase approach
• Recursive algorithm for R-tree
• Eliminate candidates dominated by some other candidate
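The two phase approach can be sketched as below, assuming cities are stored in sectors that correspond to square grid cells. The cell size, city names and coordinates are invented, and the fallback to a full scan when the first sector is empty is an added assumption.

```python
import math
from collections import defaultdict

CELL = 10.0  # assumed spatial extent of one disk sector

def cell_of(p):
    return (int(p[0] // CELL), int(p[1] // CELL))

def build(cities):
    sectors = defaultdict(list)
    for name, p in cities.items():
        sectors[cell_of(p)].append((name, p))
    return sectors

def nearest(sectors, q):
    # Phase 1: fetch the sector containing q, get an upper bound M
    local = sectors.get(cell_of(q), [])
    m = min((math.dist(q, p) for _, p in local), default=math.inf)
    # Phase 2: range query over all sectors within distance M of q
    if math.isinf(m):
        candidates = [c for sec in sectors.values() for c in sec]
    else:
        cx0, cy0 = cell_of((q[0] - m, q[1] - m))
        cx1, cy1 = cell_of((q[0] + m, q[1] + m))
        candidates = [c for cx in range(cx0, cx1 + 1)
                      for cy in range(cy0, cy1 + 1)
                      for c in sectors.get((cx, cy), [])]
    return min(candidates, key=lambda c: math.dist(q, c[1]))[0]

cities = {"A": (1.0, 1.0), "B": (8.0, 5.0), "C": (10.5, 5.0), "D": (30.0, 30.0)}
sectors = build(cities)
closest = nearest(sectors, (9.9, 5.0))
```

The closest city C sits in a neighboring sector, so phase 1 alone would have returned B.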
Learning Objectives
Learning Objectives (LO)
LO1: Understand concept of query processing and optimization (QPO)
LO2 : Learn about alternative algorithms to process spatial queries
LO3: Learn about query optimizers (QOs)
• Steps in Query processing and optimization
• How to compare strategies for a building block?
LO4: Learn about trends
Focus on concepts not procedures!
Mapping Sections to learning objectives
LO2 - 5.1
LO3 - 5.2, 5.3
LO4 - 5.4, 5.5
Query Processing and Optimizer process

• A sightseeing trip
•Start : A SQL Query
•End: An execution plan
•Intermediate Stopovers
•query trees
•logical tree transforms
•strategy selection
• What happens after the journey?
•Execution plan is executed
•Query answer returned

Fig 5.2
Query Trees
• Nodes = building blocks of (spatial) queries
• See section 3.2 (pp.55) for symbols sigma, pi and join
• Children = inputs to a building block
• Leafs = Tables
• Example SQL query and its query tree follows:

Fig 5.3
Logical Transformation of Query Trees
• Motivation
• Transformations do not change the answer of the query
• But can reduce computational cost by
• reducing data produced by sub-queries
• reducing computation needs of parent node
• Example Transformation
• Push down select operation below join
• Example: Fig. 5.4 (compare w/ Fig 5.3, last slide)
• Reduces size of table for join operation
• Other common transformations
• Push project down
• Reorder join operations
• ...

Fig 5.4
Logical Transformation and Spatial Queries
• Traditional logical transform rules
•For relational queries with simple data types and operations
• CPU costs are much smaller than I/O costs
• Need to be reviewed for spatial queries
• complex data types, operations
• CPU cost is higher
•Example:
• Push down spatial selection below join
• May not decrease cost if
•area() is costlier than distance()

Fig 5.5
Execution Plans
An execution plan has 3 components
A query tree
A strategy selected for each non-leaf node
An ordering of evaluation of non-leaf nodes
Example
Strategies for Query tree in Fig. 5.5
• Use scan for Area(L.Geometry) > 20
• Use index for Fa.Name = ‘Campground’
• Use space-partitioning join for
– Distance(Fa, L) < 50
• Use on-the-fly for projection
Ordering
• As listed above
Choosing strategies for building blocks
A priority scheme
Check applicability of each strategy given file-structures and indices
Choose highest priority strategy
This procedure is fast; used for complex queries
Rule based approach
System has a set of rules mapping situations to strategy choices
Example: Use scan for range query if result size > 10 % of data file
Cost based approach
See next slide
Choosing strategies for building blocks - 2
Cost model based approach
Single building block
• Use formulas to estimate cost of each strategy, given table size etc.
• Choose the strategy with least cost
• Example cost models for spatial operation in section 5.3
A query tree
• Least cost combination of strategy choices for non-leaf nodes
• Dynamic programming algorithm
Commercial practice
RDBMS use cost based approach for relational building blocks
But cost models for spatial strategies are not mature
Rule based approach is often used for spatial strategies
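The priority scheme and the rule based approach can be sketched together. The priority order, the strategy names and the table descriptor fields are assumptions; only the 10 % rule comes from the slide above.

```python
# Assumed priority order, highest first
PRIORITY = ["index", "binary search on Z-order", "scan"]

def applicable(strategy, table):
    """Check applicability given file-structures and indices."""
    if strategy == "index":
        return table["has_spatial_index"]
    if strategy == "binary search on Z-order":
        return table["z_ordered"]
    return True  # a scan is always possible

def choose_by_priority(table):
    return next(s for s in PRIORITY if applicable(s, table))

def choose_range_strategy(table, estimated_result_fraction):
    # Rule: use scan for a range query if result size > 10 % of data file
    if estimated_result_fraction > 0.10:
        return "scan"
    return choose_by_priority(table)

table = {"has_spatial_index": False, "z_ordered": True}
choice = choose_range_strategy(table, 0.02)
```

Neither approach estimates costs, which is why both are fast to evaluate.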
Learning Objectives
Learning Objectives (LO)
LO1: Understand concept of query processing and optimization (QPO)
LO2 : Learn about alternative algorithms to process spatial queries
LO3: Learn about query optimizer
LO4: Learn about trends
• Impact of Distributed, Web-based, Parallel Computing Environment
Focus on concepts not procedures!
Mapping Sections to learning objectives
LO2 - 5.1
LO3 - 5.2, 5.3
LO4 - 5.4, 5.5
Trends in Query Processing and Optimization
Motivation
SDBMS and GIS are invaluable to many organizations
Price of success is to get new requests from customers
• to support new computing hardware and environment
• to support new applications
New computing environments
Distributed computing (Section 5.4)
Internet and web (Section 5.4)
Parallel computers (Section 5.5)
New applications
Location based services, transportation (Chapter 6)
Data Mining (Chapter 7)
Raster data (Chapter 8)
5.4 Distributed Spatial Databases
Distributed Environments
Collection of autonomous heterogeneous computers
Connected by networks
Client-server architectures
• Server computer provides well-defined services
• Client computers use the services
New issues for SDBMS
Conceptual data model -
• Translation between heterogeneous schemas
Logical data model
• Naming and querying tables in other SDBMSs
• Keeping copies of tables (in other SDBMSs) consistent with original table
Query Processing and Optimization
• Cost of data transfer over network may dominate CPU and I/O costs
• New strategies to control data transfer costs
5.4 Internet and (World-wide-)web
Internet and Web Environments
Very popular medium of information access in last few years
A distributed environment
Web servers, web clients
• Common data formats (e.g. HTML, XML)
• Common communication protocols (e.g. http)
• Naming - uniform resource locator (url), e.g. www.cs.umn.edu
New issues for SDBMS
Offer SDBMS service on web
Use Web data formats, communication protocols etc.
• Example on next slide
Evaluate and improve web for SDBMS clients and servers
5.4 Web-based Spatial Database Systems

• SDBMS on web
•MapServer case study
• SDBMS talks to a web server
• web server talks to web clients
•Commercial practice
•Several web based products
•Web data formats for spatial data
•GML
•WMS

•Fig 5.10
5.5 Parallel Spatial Databases
Parallel Environments
Computer with multiple CPUs, Disk drives (See Fig. 5.11 for examples)
All CPUs and disk available to a SDBMS
Can speed-up processing of spatial queries!

Fig 5.11
5.5 Parallel Spatial Databases - 2
New issues for DBMS
Physical Data Model
• Declustering: How to partition tables, indices across disk drives?
Query Processing and Optimization
• Query partitioning: How to divide queries among CPUs?
• Cost model of strategies on parallel computers
Example: Techniques for declustering (Fig. 5.12)
Simple technique: round robin based on an order (space filling curve)
Declustering for Data Partitioning
• Example
• A simple technique for declustering (Fig. 5.12)
•1. Order the spatial objects using a space filling curve
•2. Allocate to disk drives in a round robin manner
• Effective for point objects, e.g. pixels in an image
• Many queries, e.g. large MBRs are parallelized well
•Ex. Consider a query to retrieve data in bottom-left quarter of the space
•Two data points retrieved from each disk drive for Z-curve
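The two declustering steps can be sketched as below. The 8 x 8 grid of points and the 8 disk drives are assumed values chosen so the counts match the example above; they may not match Fig. 5.12.

```python
from collections import Counter

def z_value(x, y, bits):
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z

def decluster(points, num_disks, bits):
    # 1. Order the spatial objects using a space filling curve
    ordered = sorted(points, key=lambda p: z_value(p[0], p[1], bits))
    # 2. Allocate to disk drives in a round robin manner
    return {p: i % num_disks for i, p in enumerate(ordered)}

points = [(x, y) for x in range(8) for y in range(8)]
disk_of = decluster(points, num_disks=8, bits=3)

# Query: bottom-left quarter of the space (origin assumed at bottom-left)
quarter = [p for p in points if p[0] < 4 and p[1] < 4]
load = Counter(disk_of[p] for p in quarter)
```

Every disk contributes the same number of points to the query, so the drives can work in parallel.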
A Case Study: High Performance GIS
Goal: Meet the response time constraint for real-time battlefield terrain
visualization in a flight simulator.
Methodology:
Data-partitioning approach
Evaluation on parallel computers, e.g. Cray T3D, SGI Challenge.
Significance:
A major improvement in capability of geographic information systems for
determining the subset of terrain polygons within the view point (Range
Query) of a soldier in a flight simulator using real geographic terrain data set.

Dividing a Map among 4 processors. Polygons within a processor have common color
A Case Study: High Performance GIS
•(1/30) second Response time constraint on Range Query
•Parallel processing necessary since best sequential computer cannot meet requirement
•Green rectangle = a range query, Polygon colors show processor assignment
[Figure: block diagram. Labels: Graphics Display Engine; Local Terrain Database; Remote Terrain Databases; High Performance GIS Component; Set of Polygons; 8Km X 8Km Bounding Box; 25 Km X 25 Km Bounding Box; 30 Hz. View Graphics; 2Hz.]
Summary
Query processing and optimization (QPO)
translates SQL Queries to execution plan
QPO process steps include
Creation of a query tree for the SQL query
Choice of strategies to process each node in query tree
Ordering the nodes for execution
Key ideas for SDBMS include
Filter-Refine paradigm to reduce complexity
New building blocks and strategies for spatial queries
Chapter 6: Graph Database
6.1 Graph
6.2 High level view of the Graph Space
6.3 Models and Graphs
6.4 Querying Graph:
6.4.1 Introduction to Cypher
What is Graph?

Graphs represent entities as nodes and the ways in which those entities relate to the
world as relationships.
Why graphs are important?
• Modeling of biological data
• Road network data
• Social network data
• Hierarchical data
• The web data
What is Graph Database?

A graph database stores data in a graph, the most generic of data
structures, capable of elegantly representing any kind of data in a
highly accessible way.

A graph is a collection of vertices representing entities and edges
representing the relationships among them.

In a property graph both nodes and relationships can have properties.

Graph data model means that data are modelled as a graph.
What is Graph Database?
A database for storing, managing and querying highly connected
and complex data.

A graph database’s architecture makes it particularly well suited for
exploring data to find commonalities and anomalies in large data
volumes and unlocking the value in the data’s relationships.

A (property) graph database is an online database management
system with Create, Read, Update and Delete methods that expose a
(property) graph data model.
Graph Database example
Relational Database example
A High-level View of the Graph Space
Three main graph data models:
1. Property graph
2. Resource Description Framework (RDF) triples
3. Hypergraph
What is property graph?
• Each node/edge is uniquely identified
• Each node has a set of incoming and outgoing edges
• Each node/edge has a collection of properties
• Each edge has a label that defines the relationship between its
two nodes
Resource Description Framework (RDF) triples
Graph Database space
Pictorial overview of some of the graph databases on the market
today, based on their storage and processing models
High-level view of Graph Compute Engine
Periodically, an Extract, Transform, and Load (ETL) job moves
data from the system of record database into the graph compute
engine for offline querying and analysis.
Power of Graph Database
• Performance
• Flexibility
• Agility
Database comparison
Graph Vs RDBMS
A key difference between a graph database and an RDBMS is how
relationships between entities/vertexes are prioritized and managed.
While an RDBMS uses foreign keys to connect entities in a
secondary fashion, edges (the relationships) in a graph database are
of first order importance.
Relationships are explicitly embedded in a graph data model.
A graph-shaped business problem is one in which the concern is
with the relationships (edges) among entities (vertexes) than with
the entities in isolation.
Relational databases lack relationships
• Lack of relationship
• Disambiguate the semantics of the relationships that connect
entities
• Outlier data
Cont...
Relational schema
Simple query on social network domain
Who is friends with Bob?
SELECT p1.Person
FROM Person p1 JOIN PersonFriend
ON PersonFriend.PersonID = p1.ID
JOIN Person p2
ON PersonFriend.FriendID = p2.ID
WHERE p2.Person = 'Bob'
Alice’s friends-of-friends
SELECT p1.Person AS PERSON, p2.Person AS FRIEND_OF_FRIEND
FROM PersonFriend pf1 JOIN Person p1
ON pf1.PersonID = p1.ID
JOIN PersonFriend pf2
ON pf2.PersonID = pf1.FriendID
JOIN Person p2
ON pf2.FriendID = p2.ID
WHERE p1.Person = 'Alice' AND pf2.FriendID <> p1.ID
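The two SQL queries above can be tried on a tiny in-memory database. The table layout is inferred from the column names used in the queries, and the three people and their friendships are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (ID INTEGER PRIMARY KEY, Person TEXT);
CREATE TABLE PersonFriend (PersonID INTEGER, FriendID INTEGER);
INSERT INTO Person VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Zach');
INSERT INTO PersonFriend VALUES (1, 2), (2, 1), (2, 3), (3, 2);
""")

# Who is friends with Bob? (two joins)
friends_of_bob = sorted(row[0] for row in conn.execute("""
SELECT p1.Person
FROM Person p1 JOIN PersonFriend
  ON PersonFriend.PersonID = p1.ID
JOIN Person p2
  ON PersonFriend.FriendID = p2.ID
WHERE p2.Person = 'Bob'
"""))

# Alice's friends-of-friends (three joins, PersonFriend used twice)
friends_of_friends = conn.execute("""
SELECT p1.Person AS PERSON, p2.Person AS FRIEND_OF_FRIEND
FROM PersonFriend pf1 JOIN Person p1
  ON pf1.PersonID = p1.ID
JOIN PersonFriend pf2
  ON pf2.PersonID = pf1.FriendID
JOIN Person p2
  ON pf2.FriendID = p2.ID
WHERE p1.Person = 'Alice' AND pf2.FriendID <> p1.ID
""").fetchall()
```

Each extra hop in the friendship chain adds another join on PersonFriend.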
Querying Graph: An introduction to cypher
• Cypher is a graph database query language.
• Cypher is composed of clauses.
• The simplest queries consist of a MATCH clause followed by
a RETURN clause
Cont..
Here’s an example of a Cypher query that uses three clauses (MATCH, WHERE and RETURN) to
find the mutual friends of a user named Jim:
MATCH (a:Person)-[:KNOWS]->(b)-[:KNOWS]->(c), (a)-[:KNOWS]->(c)
WHERE a.name = 'Jim'
RETURN b, c

Or, equivalently, with the name constraint inlined in the MATCH pattern:
MATCH (a:Person {name:'Jim'})-[:KNOWS]->(b)-[:KNOWS]->(c),
(a)-[:KNOWS]->(c)
RETURN b, c
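To make the shape of this pattern concrete, the following Python sketch mirrors it over a small adjacency map. It is not how a graph database evaluates Cypher; the names in the sample graph and the helper function are assumptions chosen for illustration.

```python
# Directed KNOWS edges as an adjacency map (sample data is made up).
knows = {
    "Jim": {"Ian", "Emil"},
    "Ian": {"Emil"},
    "Emil": set(),
}


def mutual_friend_pairs(a, knows):
    """Return sorted (b, c) pairs where a KNOWS b, b KNOWS c, and a KNOWS c."""
    direct = knows.get(a, set())
    return sorted(
        (b, c)
        for b in direct                 # (a)-[:KNOWS]->(b)
        for c in knows.get(b, set())    # (b)-[:KNOWS]->(c)
        if c in direct                  # (a)-[:KNOWS]->(c)
    )


pairs = mutual_friend_pairs("Jim", knows)
print(pairs)
```

The three conditions in the generator correspond one-to-one to the three KNOWS relationships in the MATCH pattern.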
Cypher clauses
• WHERE
• CREATE and CREATE UNIQUE
• MERGE
• DELETE
• SET
• FOREACH
• UNION
• WITH
• START
Graph model for data center deployment
Query Processing in Graph Database
Summary
• A data model is a high-level description of the data
• It can help in early analysis of storage cost and data quality
• There are two popular models of spatial information
• Field-based and object-based
• Databases are designed in three steps
• Conceptual, logical, and physical
• Pictograms can simplify Conceptual data models
Chapter 7
Chapter 6: Building Graph Database Application
6.1 Data Modeling
6.2 Application Architecture
6.3 Testing
6.4 Graph Database Internals:
6.4.1 Native Graph Processing
6.4.2 Native Graph Storage
6.4.3 Advances in the domain
Data Modeling

Graph data modeling is the process in which a user describes an arbitrary domain as
a connected graph of nodes and relationships with properties and labels.
Data model for the book reviews user story
AS A reader who likes a book, I WANT to know which books
other readers who like the same book have liked, SO THAT I can
find other books to read.
MATCH (:Reader {name:'Alice'})-[:LIKES]->(:Book {title:'Dune'})
<-[:LIKES]-(:Reader)-[:LIKES]->(books:Book)
RETURN books.title
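The traversal behind this query can be sketched in plain Python as follows. The reader names and book titles other than Alice and Dune are made-up sample data, and the helper function is hypothetical. The sketch leaves out the starting reader and the starting book, which is what a recommendation would normally want.

```python
# LIKES relationships as reader -> set of book titles (sample data is made up).
likes = {
    "Alice": {"Dune"},
    "Bob": {"Dune", "Neuromancer"},
    "Carol": {"Dune", "Hyperion"},
    "Dave": {"Emma"},
}


def also_liked(reader, title, likes):
    """Titles liked by other readers who also like `title`."""
    if title not in likes.get(reader, set()):
        return []  # the starting reader must like the starting book
    found = set()
    for other, books in likes.items():
        if other != reader and title in books:   # another reader who likes it
            found.update(books - {title})        # their other liked books
    return sorted(found)


recommendations = also_liked("Alice", "Dune", likes)
print(recommendations)
```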
Model facts as Nodes
When two or more domain entities interact for a period of time, a
fact emerges.
E.g., employment:
CREATE (:Person {name:'Ian'})-[:EMPLOYMENT]->
(employment:Job {start_date:'2011-01-05'})
-[:EMPLOYER]->(:Company {name:'Neo'}),
(employment)-[:ROLE]->(:Role {name:'engineer'})
Cont...
The fact that William Hartnell played The Doctor in the story
The Sensorites can be represented in the graph as follows:
CREATE (:Actor {name:'William Hartnell'})-[:PERFORMED_IN]->
(performance:Performance {year:1964})-[:PLAYED]->
(:Role {name:'The Doctor'}),
(performance)-[:FOR]->(:Story {title:'The Sensorites'})
Cont...
Ian emailed Jim, and copied in Alistair
CREATE (:Person {name:'Ian'})-[:SENT]->(e:Email {content:'...'})
-[:TO]->(:Person {name:'Jim'}),
(e)-[:CC]->(:Person {name:'Alistair'})
Cont...
The act of Alistair reviewing a film can be represented in the
graph as follows:
CREATE (:Person {name:'Alistair'})-[:WROTE]->
(review:Review {text:'...'})-[:OF]->(:Film {title:'...'}),
(review)-[:PUBLISHED_IN]->(:Publication {title:'...'})
Represent Complex Value Types as Nodes
Value types are things that do not have an identity, and whose
equivalence is based solely on their values.
E.g., time.
Time can be modeled in several different ways in the graph, e.g.,
timeline trees and linked lists.
Timeline trees
A timeline tree is helpful for finding all the events that have
occurred over a specific period.
Linked lists
Many events have temporal relationships to the events that precede
and follow them. NEXT and/or PREVIOUS relationships can link such
events into a linked list.
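Both ways of modeling time can be sketched in a few lines of Python. This is an illustration only: nested dictionaries stand in for the year, month, and day nodes of a timeline tree, two plain dictionaries stand in for NEXT and PREVIOUS relationships, and the event names and dates are made up.

```python
from collections import defaultdict
from datetime import date

# Sample events (made up): (day, event name)
events = [
    (date(2011, 1, 5), "hired"),
    (date(2011, 6, 1), "promoted"),
    (date(2012, 3, 9), "left"),
]


def build_timeline(events):
    """Timeline tree: year -> month -> day -> list of event names."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for day, name in events:
        tree[day.year][day.month][day.day].append(name)
    return tree


def events_between(tree, start, end):
    """Walk only the year branches that overlap the requested period."""
    found = []
    for year in sorted(tree):
        if year < start.year or year > end.year:
            continue  # whole subtree lies outside the period
        for month in sorted(tree[year]):
            for day in sorted(tree[year][month]):
                if start <= date(year, month, day) <= end:
                    found.extend(tree[year][month][day])
    return found


def link_events(events):
    """Linked list: NEXT and PREVIOUS maps between consecutive events."""
    ordered = [name for _, name in sorted(events)]
    next_of = dict(zip(ordered, ordered[1:]))
    previous_of = {later: earlier for earlier, later in next_of.items()}
    return next_of, previous_of


timeline = build_timeline(events)
in_2011 = events_between(timeline, date(2011, 1, 1), date(2011, 12, 31))
next_of, previous_of = link_events(events)
print(in_2011, next_of, previous_of)
```

The tree suits range queries because whole branches can be skipped, while the linked list suits questions about what happened immediately before or after a given event.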
Iterative and Incremental Development
MATCH (user:User {id:{userId}})
MATCH (user)-[:DELIVERY_ADDRESS]->(address:Address)
RETURN address

MATCH (user:User {id:{userId}})
MATCH (user)-[:ADDRESS]->(address:Address)
RETURN address
Application Architecture
• Embedded versus Server
• Clustering
• Load Balancing
Embedded Neo4j
• Low latency
• Choice of APIs
• Explicit transactions
• JVM only
• GC behavior
• Database life cycle
Server Mode
• Uses the same embedded instance of Neo4j
• REST API
• Platform independent
• Isolation from application GC behavior
• Network overhead
• Support for server extensions
Embedded versus Server
Most databases run as a server that is accessed through a client library.
Neo4j can run in embedded mode as well as in server mode.
Embedded Neo4j
Embedded Neo4j is ideal for hardware devices, desktop applications, and for
incorporating into our own application servers.
Advantages: Low latency, Choice of APIs, Explicit transactions.
Server mode
Benefits: REST API, Platform independence, Scaling independence, Isolation
from application GC behaviors.
Considerations: Network overhead, Transaction state.
Server extensions
Benefits: Complex transactions, Choice of APIs, Encapsulation, Response
formats.
Clustering
Neo4j clusters for high availability and horizontal read scaling using
master-slave replication.
• Replication
• Buffer writes using queues
• Global clusters
Load Balancing
Load balancing traffic across the cluster helps maximize
throughput and reduce latency. Neo4j relies on the load balancing
capabilities of the network infrastructure.
• Separate read traffic from write traffic
• Cache sharding
• Read your own writes
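The first two ideas in this list can be illustrated with a small routing sketch in Python. This is not a Neo4j API: the ClusterRouter class, its method names, and the host names are all hypothetical, and a real deployment would do this in a load balancer or proxy in front of the cluster. Read-your-own-writes is not covered here.

```python
from itertools import cycle
from zlib import crc32


class ClusterRouter:
    """Hypothetical router: writes go to the master, reads go to the slaves."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)
        self._round_robin = cycle(self.slaves)

    def route_write(self):
        # Separate read traffic from write traffic: all writes hit the master.
        return self.master

    def route_read(self, user_id=None):
        if user_id is None:
            # No affinity requested: spread reads evenly over the slaves.
            return next(self._round_robin)
        # Cache sharding: the same user always lands on the same slave,
        # so that slave's cache stays warm for that user's part of the graph.
        return self.slaves[crc32(user_id.encode("utf-8")) % len(self.slaves)]


# Host names below are placeholders.
router = ClusterRouter("master:7474", ["slave1:7474", "slave2:7474"])
reads = [router.route_read() for _ in range(3)]
print(router.route_write(), reads, router.route_read("alice"))
```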
Testing
Testing verifies that a query or application feature behaves
correctly, and it is also a way of designing and documenting the
application and its data model.
• Test-Driven Data Model Development
• Performance Testing
Query performance tests
Application performance tests
Testing with representative data
Graph Database Internals
This section is about the implementation of a graph database with
native graph storage and a native processing engine for storing and
querying complex, variably structured, densely connected data.
Native Graph Processing
A graph database has native processing capabilities if it exhibits a
property called index-free adjacency.
Index-free adjacency: connected nodes physically point to each
other in the database.
Non-native processing: connected nodes are found through index lookups.
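The contrast between index-free adjacency and index lookups can be sketched in Python as follows. This is a conceptual illustration only, not how any particular engine lays out its data; the Node class, the helper functions, and the sample names are assumptions.

```python
class Node:
    """A node that holds direct references to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []  # index-free adjacency: pointers to other Node objects


def friends_native(node):
    # Follow the stored references; no global structure is consulted.
    return [neighbor.name for neighbor in node.neighbors]


def friends_indexed(name, edge_index):
    # Non-native style: every hop goes through a global index lookup.
    return list(edge_index.get(name, []))


# Sample graph (made up): Alice -> Bob, Bob -> Alice and Zach.
alice, bob, zach = Node("Alice"), Node("Bob"), Node("Zach")
alice.neighbors = [bob]
bob.neighbors = [alice, zach]

edge_index = {"Alice": ["Bob"], "Bob": ["Alice", "Zach"]}

print(friends_native(bob), friends_indexed("Bob", edge_index))
```

Both functions return the same answer; the difference is that the cost of the native version depends only on the number of neighbours, while the indexed version also pays for a lookup in a structure that grows with the whole graph.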
Native Graph Storage
Native Graph Storage: Optimized and designed for storing and
managing graphs.

Non-native Graph Storage: Serializes the graph data into a
relational database, an object-oriented database, or some other
general-purpose data store.
Cont..
