You are on page 1of 20

Multidimensional Access Methods

for Point and Spatial Data

Ramakrishnan&Gehrke: Chap. 28.1-28.3


Various articles

Database Systems Implementation, Bongki Moon 1

Single-Dimensional Indexes
 B+ trees are fundamentally single-dimensional
indexes.
 When we create a composite search key B+ tree,
e.g., an index on <age, sal>, we effectively linearize
the 2-dimensional space since we sort entries first
by age and then by sal. 80
70
60
Consider entries:
50
<11, 80>, <12, 10> 40 B+ tree
<12, 20>, <13, 75> 30 order
20
10
Database Systems Implementation, Bongki Moon 11 12 13 2

1
Multidimensional Queries
 Multidimensional equality queries
– (e.g.) age = 50 AND sal = 70K
 Multidimensional range queries
– 50 < age < 55 AND 80K < sal < 90K
 Similarity queries (NN; content-based retrieval)
– Given a face, find the five most similar faces.
 Spatial queries (GIS, CAD)
– Find all hotels within 5 miles from the conference venue.
– Find the city with population 500,000 or more that is
nearest to Kalamazoo, MI.
– Find all cities that lie on the Nile in Egypt.
– Find all parts that touch the fuselage (in a plane design).
Database Systems Implementation, Bongki Moon 3

Desired Properties
 A multidimensional index clusters entries so as to
exploit “nearness” in multidimensional space.
 Symmetric: efficient access w.r.t. all indexed
attributes.

80
70
60
Consider entries:
50 Spatial
<11, 80>, <12, 10> 40 clusters
<12, 20>, <13, 75> 30
20
10
Database Systems Implementation, Bongki Moon 11 12 13 4

2
What’s the difficulty?

 An index based on spatial location needed.


– Must support range queries as well as point queries.
– Must support inserts and deletes gracefully.
 Need to support non-point data as well.
– (e.g.) lines, shapes, polygons, etc.
 Numerous index structures have been proposed.
– Points: Grid file, k-d-B tree, hB-tree, etc.
– Spatial: R-tree and its variants (R+-, R*-, P-trees).

Database Systems Implementation, Bongki Moon 5

K-d Tree [Bentley 1975]


 Structurally, a binary tree.
 Different (alternate) discriminator at each level.
 Not necessarily balanced.
 Efficient exact-matching, range queries.
 But, main memory data structure.

Database Systems Implementation, Bongki Moon 6

3
Point Quad-tree [Finkel&Bentley 1974]
 Essentially, multidimensional (binary) search tree.
– Fanout is 2d for a d-dimensional space.
 Like k-d trees, decomposes the space by hyperplanes.
 Searching is similar to binary search tree.
– At each level, all d key values are compared.
 To insert a new data point, first perform a search
– then, insert the new data as a new leaf node,
– its corresponding partition is divided into 2d-way
subspaces.
 Not necessarily balanced.

Database Systems Implementation, Bongki Moon 7

Point quad-tree: example

Database Systems Implementation, Bongki Moon 8

4
Grid Files [Nievergelt 1984]
 Dynamic version of multi-attribute hashing.
 Superimpose a d-dimensional orthogonal grid.
 The subspace represented by each grid cell
(bucket) may be of different shape and size.
 Structure
– Scales (for x and y dimensions): stay in memory.
– Directory: mapping grid buckets to data blocks.
– Each bucket corresponds to a disk block. But adjacent
buckets can share the same disk block.

Database Systems Implementation, Bongki Moon 9

Grid Files (contd)


 For any exact-match query, only 2 disk accesses.
– One for directory, and the other for a data block.
 Symmetric structure w.r.t. the index attributes
allows efficient range queries w.r.t. all attributes.
 Adapts to non-uniform data distributions.
– More cuts in scales in the dense data regions.
– Typically cuts occur on pre-defined points in scales.
 If an insertion causes overflow in a data block,
– If the block is associated with several buckets, split.
– O/W, introduce a new split on a scale; Very expensive.
 How about deletion?
Database Systems Implementation, Bongki Moon 10

5
Grid File: example

Database Systems Implementation, Bongki Moon 11

Grid File: disadvantages

 Split and Merge are quite complicated.


 If attributes are correlated, large directory and
many empty buckets; A variant: Rotated grid file.
 The most serious problem with grid files is the
size of directory.
– Directory grows faster than linear even for uniformly
distributed data set: O(n1+(d-1)/(dB+1)) [Regnier 1984].
– The Curse of Dimensionality!!

Database Systems Implementation, Bongki Moon 12

6
k-d-B tree [Robinson 1981]
 Combines (adaptive) k-d tree and B-tree.
 Dynamic, multi-way, balanced, fixed-size page.
 Unlike B-tree, 50% utilization is not guaranteed.
 Node structures
– Internal node: a collection of <d-region, node pointer>.
– Leaf node: <d-point, data pointer>.
 Space decomposition
– Space decomposed in a manner similar to k-d tree.
– Subspaces at the same level are mutually disjoint (no
overlap).
– Union of subspaces constitutes the complete space of their
parent.
Database Systems Implementation, Bongki Moon 13

k-d-B tree: example

Database Systems Implementation, Bongki Moon 14

7
k-d-B tree: Insertion

 If a leaf node overflows, it is split; about half the


entries are shifted to the new leaf node.
– Various heuristics for optimal split.
 Like B-tree, the split may be propagated up the tree.
– However, unlike B-tree, split of an internal node may
affect the space decomposition of its child nodes.
– The hyperplane used for the new split will split all the
descendant nodes as well. Cascading splits!!
– So, impossible to guarantee minimum storage utilization.

Database Systems Implementation, Bongki Moon 15

hB-tree [Lomet&Salzberg 1989]

 A variant of the k-d-B tree.


 Unlike k-d-B tree, split may be done by more than
one hyperplane, allowing L-shaped subspaces.
 Structure
– Each node is a k-d tree, which represents a L-shaped or brick
with holes (holey brick).
– The leaves of the k-d tree point to the lower-level nodes.
– More than one pointer can reference the same node.
 Split guarantees 1:2 data distribution in the worst
case.
 Splits are localized; avoid cascading splits.
Database Systems Implementation, Bongki Moon 16

8
hB-tree: example
 The node u represents an L-shaped subspace
A+B+E+G. G in turn represents an L-shaped.

w u

Database Systems Implementation, Bongki Moon 17

Summary: Point Data


 Data objects are points in a multidimensional space.
 Popular queries are
– Exact-match (point) query,
– Range (including partial-match) query,
– Nearest-neighbor query.
 K-d trees and quad-trees are main memory indexes.
 Grid file, k-d-B tree, hB-tree are for point data.
– Grid file: superlinear directory growth.
– K-d-B tree: cascading splits.
– hB-tree: 33% guaranteed utilization, duplicate references.

Database Systems Implementation, Bongki Moon 18

9
The R-Tree [Guttman 1984]

 Like B-tree, R-tree remains balanced on inserts


and deletes, and guarantees 50% storage
utilization.
 Each key is a minimum bounding rectangle
(MBR). Root of
 Example in 2-D: R Tree

Y Leaf
X level
Database Systems Implementation, Bongki Moon 19

Example: 2-D View


Leaf entry
Index entry
R1 R4
R11 Spatial object
R3 R5 R13
R9 approximated by
R8 bounding box R8
R14
R10

R12
R7
R18
R6 R17
R16
R19

R15 R2

Database Systems Implementation, Bongki Moon 20

10
Example: Tree Structure
R1 R2

R3 R4 R5 R6 R7

R8 R9 R10 R11 R12 R13 R14 R15 R16 R17 R18 R19

Database Systems Implementation, Bongki Moon 21

R-Tree Properties
 Internal node: a set of pairs <d-MBR, node pointer>
– An MBR covers all the MBRs in subtree.
 Leaf node: a set of pairs <d-MBR, data pointer>
– MBR is the tightest bounding box for a data object.
 Nodes can be kept 50% full (except root).
– Root node has at least two entries.
– Each node except root contains between m and M entries.
The 50% utilization can be achieved by properly setting
m.
 MBRs (even at the same level) are allowed to overlap.
Database Systems Implementation, Bongki Moon 22

11
Intersection (or Range) Query

Start at root.
1. If current node is non-leaf, for each
entry <E, ptr>, if box E overlaps Q,
search subtree identified by ptr.
2. If current node is leaf, for each entry
<E, rid>, if E overlaps Q, rid identifies
an object that might overlap Q.

Note: May have to search several subtrees at each node!


(In contrast, a B-tree equality search goes to just one leaf.)

Database Systems Implementation, Bongki Moon 23

Inserting an Entry <MBR, ptr>

 Algorithm ChooseLeaf
– Start at root and go down to best-fit leaf node L.
– At each level, choose one among nodes whose MBR needs least
enlargement to cover the MBR of the new object.
– Tie-breaking by going to a child node with smallest area.
 Algorithm SplitNode
– If best-fit leaf L is already full, split L into L1 and L2.
 Algorithm AdjustTree
– Adjust the MBR of L (or L1) in its parent while ascending to the root.
– If L is split, add an entry for L2 to its parent. (This could cause the
parent node to recursively split.)

Database Systems Implementation, Bongki Moon 24

12
Splitting a Node During Insertion
 The entries in node L plus the newly inserted entry must
be distributed between L1 and L2.
 Goal is to reduce likelihood of both L1 and L2 being
searched on subsequent queries.
 Two alternative goals to avoid multiple paths from the
root and to prune the search space:
• Minimize coverage (area)
• Minimize overlap

GOOD SPLIT!
BAD!
Database Systems Implementation, Bongki Moon 25

Guttman’s Split Algorithm

 Exhaustive : O(2M) to try all possible groupings.


 Quadratic : O(M2*d)
– First, find a pair (R,S) that maximizes
area(MBR(R,S)) - area(MBR(R) ∪ MBR(S)).
– For the rest of entries,
1) Choose an entry E such that |δ δR – δS| is maximal, where δx is the
area increase required in GroupR or GroupS to include the entry E.
2) Add E to a group so that coverage will be enlarged by the least
amount. Tie-breaking by adding to a group with smaller area,
then with fewer entries.
3) Repeat this until all the entries are assigned to groups.

Database Systems Implementation, Bongki Moon 26

13
Guttman’s Split Algorithm

 Linear : O(M*d)
– First, find a pair (R,S) for each dimension that maximizes
disti(R, S). Then, pick one among the d pairs with the
maximum normalized distance.
 disti(R,S) is the distance between R and S along i-axis.
 Normalized distance is disti(R,S) / max_distancei.
– Then, add entries to groups so that coverage will be
enlarged by the least amount.

Database Systems Implementation, Bongki Moon 27

Deleting an Entry

 Deletion consists of searching for the entry to be


deleted, removing it.
 If the node becomes underflow (i.e., contains less
then the minimum entries), re-insert the
remaining entries.

Database Systems Implementation, Bongki Moon 28

14
R*-tree [Beckmann et al. 1990]
 Major insight : node splitting is critical for performance.
 Basically, a set of optimizing techniques for R-tree.
 Algorithm ChooseSubtree
– If the current node is a leaf, done.
– If the current node is a parent of leaf nodes, use minimum overlap
enlargement for selecting a leaf node, with tie-breaking by least area
enlargement then smallest area. Complexity is O(M2). An O(MlogM)
approximate algorithm was suggested.
– If the current node is a parent of internal nodes, use minimum area
enlargement for selecting an internal node, with tie-breaking by
smallest area. [Same as Guttman’s ChooseLeaf algorithm]
 Split algorithm that considers area, margin and overlap.
– Based on plane-sweeping paradigm; O(d*MlogM).

Database Systems Implementation, Bongki Moon 29

R*-tree : Deferred Splitting

 Algorithm Reinsert: a new idea Deferred Splitting


– Dynamic reorganization by forced-reinsert; at the first
time overflow at any level, reinsert p out of M+1
entries.
– The p entries whose centers are farthest from the
center of the enclosing MBR are chosen for reinsertion.
– Significant performance improvement for point data
sets.

Database Systems Implementation, Bongki Moon 30

15
R*-tree : Node Splitting
 Node splitting based on Plane-Sweeping:
(1) Sort MBR’s by xlow coordinates (for each dimension).
(2) Generate (M-2m+2) 2-way partitions:
k=1: m M-m+1
k=2: m+1 M-m
…..
k=M-2m+2: M-m+1 m
(3) For each dimension i, compute Si = ∑margink, where margink is the
sum of perimeters of two MBR’s (by each 2-way partitioning).
(4) Choose a dimension j, such that Sj is minimal.
(5) For the dimension j, select one out of (M-2m+2) 2-way partitions
such that overlap is minimized. Tie-break by the minimum area.

Database Systems Implementation, Bongki Moon 31

R*-tree : Performance Evaluation

Database Systems Implementation, Bongki Moon 32

16
Summary on R-Trees
 Overall, works quite well for low dimensional (d < 10)
data sets; widely used.
 Other variants:
– P-tree [Jagadish 1990]: use a convex polygon for MBR.
– SR-tree [Katayama 1997]: use a spherical bounding region.
– X-tree [Berchtold 1996]: proposed for high dimensional data sets.
– Parallel R-tree [Kamel&Faloutsos 1992]: for multi-disk systems.
– Hilbert R-tree [Kamel&Faloutsos 1994]: use the Hilbert space-
filling curve for deferred splitting.
– Packed R-tree [Rous. 1985; Kamel 1993]: to bulk-load static data.
 Still, one of the hottest research tools!!!

Database Systems Implementation, Bongki Moon 33

Space-filling Curves: background

 Historical background:
– [Peano 1890]: showed there exists a continuous and
surjective mapping, f: [0,1] ⇒ [0,2]2.
– [Hilbert 1891]: demonstrated its existence by a
geometric construction.

Database Systems Implementation, Bongki Moon 34

17
Space-filling Curves for
Multidimensional Indexing
 Main difficulty in multidimensional indexing:
– No total order that preserves spatial proximity.
– If total order found, then one-dimensional access methods can be
used.
 UB-Tree (Universal B-Tree): z-ordering with a variant of B+-Tree
 Linear Mapping by a Space-Filling Curve
– First, partition the m-D space with a grid.
– Map the coordinates of each grid cell into a unique number by an
SFC.
– A spatial object is represented by a list of grid cells intersected.
– The way of mapping determines how closely adjacent cells (or
data in the cells) are stored in an index, or a file on disk. [data
clustering]

Database Systems Implementation, Bongki Moon 35

Z-Curve by Bit-Interleaving

 Popular curves are:


– Z-curve (aka, Morton curve), Hilbert, Gray-coded.
 Z-curve (aka Morton curve)
– Easy to implement, adopted by commercial DBMS.
– A single common prefix can be used to represent a set
of grid cells. [If z1 is the prefix of z2, then the region of z1
encloses that of z2.]
– Clustering not as good as Hilbert curve.

Database Systems Implementation, Bongki Moon 36

18
Z-curve for spatial objects: example

 Nine regions
approximate the
polygon:
– {011100,011110,110100,01
001*,0110**,1100**,000111
,001***,10010*}
 01001*=010010+010011
 0110**=011000+011001+
011010+011011

Database Systems Implementation, Bongki Moon 37

Data Clustering by SFC

 SFC preserves data locality while mapping.


 Cluster is a set of consecutive grid points.
 A 2x2 range query below retrieves 4 grid points in
two clusters, which may amount to two disk seeks.

Database Systems Implementation, Bongki Moon 38

19
Clustering Properties of SFC

 Comparative studies show SFCs have different clustering


properties.
 [Jagadish 1990] For a 2x2 range query, the avg number of clusters is
2.65 (z-curve), 2.5 (Gray-coded), 2.0 (Hilbert curve).
 [Rong 1991] For a 2-d rectangular query, the avg number of clusters
by z-curve is (perimeter + 2*one-side)/3.
 [Moon 1996] For a d-dimensional query region of any rectilinear
polyhedron, the avg number of clusters by Hilbert curve is Sq/2d,
where Sq is the hyper-surface area of the query.
 For a 2-d rectangular query, the avg number of clusters by
Hilbert curve is perimeter/4.

Database Systems Implementation, Bongki Moon 39

20

You might also like