
2-4 Trees

COL 106
Shweta Agrawal, Amit Kumar, Dr.
Ilyas Cicekli, Naveen Garg
Multi-Way Trees

• A binary search tree:
- One value in each node
- At most 2 children
• An M-way search tree:
- Between 1 and (M-1) values in each node
- At most M children per node
M-way Search Tree Details

Each internal node of an M-way search tree has:
- Between 1 and M children
- Up to M-1 keys k1, k2, ..., kM-1

Keys are ordered such that:

  k1 < k2 < ... < kM-1
Multi-way Searching

• Similar to binary searching:
- If the search key s < k1, search the leftmost child
- If s > kd-1, search the rightmost child
- Otherwise, find the two keys ki-1 and ki between which s falls, and search the child vi
• What would an in-order traversal look like?

[Figure: a multiway search tree with root keys 5, 10, 25; searching for s = 8 succeeds, searching for s = 12 fails ("Not found!")]

February 16, 2023
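The search rule above can be made runnable. The sketch below is an assumption, not the slides' code: it models a node as a simple (keys, children) pair, with the example tree only loosely based on the slide's figure.

```python
# A runnable sketch of M-way search: find the key, or descend into the
# child interval the search key falls into.
def mway_search(node, s):
    """node is (keys, children); a leaf has children == None."""
    keys, children = node
    for i, k in enumerate(keys):
        if s == k:
            return True
        if s < k:                   # s falls before k: descend left of k
            return children is not None and mway_search(children[i], s)
    # s is greater than every key: descend into the rightmost child
    return children is not None and mway_search(children[-1], s)

# An example tree loosely based on the slide: root keys 5, 10, 25.
tree = ([5, 10, 25], [
    ([3, 4], None),
    ([6, 8], None),
    ([11, 13, 14, 17, 18, 19, 20, 21, 23, 24], None),
    ([27], None),
])
print(mway_search(tree, 8))    # → True
print(mway_search(tree, 12))   # → False  ("Not found!")
```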
(2,4) Trees

• Properties:
- Each node has at most 4 children
- All external nodes have the same depth
- Height h of a (2,4) tree is O(log n)
• How is the last fact useful in searching?

[Figure: a (2,4) tree with root 12; children 5 10 and 15; leaves 3 4, 6 8, 11, 13 14, 17]


Insertion

• No problem if the node has empty space
• Example: keys 21, 23, 40, 29, 7 are inserted into leaves that still have room

[Figure: a (2,4) tree with root 13 22 32 before the insertions]
Insertion (2)

• Nodes get split if there is insufficient space.

[Figure: inserting 29 and 7 overflows leaves of the tree with root 13 22 32]
Insertion (3)

• One key is promoted to the parent and inserted there

[Figure: the overfull leaf 26 28 29 30 is split and one of its keys is promoted into the parent]
Insertion (4)

• If the parent node does not have sufficient space then it is split.
• In this manner splits can cascade.

[Figure: the promoted key 28 now sits in the internal node 25 28, and inserting 7 overflows the leaf 4 5 6 7]
Insertion (5)

• Eventually we may have to create a new root.
• This increases the height of the tree.

[Figure: the split cascades to the root level, which now holds 5 13 22 32]
Time for Search and Insertion

• A search visits O(log N) nodes
• An insertion requires O(log N) node splits
• Each node split takes constant time
• Hence, the Search and Insert operations each take O(log N) time
Deletion

• Delete 21.
• No problem if the key to be deleted is in a leaf with at least 2 keys.

[Figure: a (2,4) tree with root 13; 21 is simply removed from its leaf]
Deletion (2)

• If the key to be deleted is in an internal node then we swap it with its predecessor (which is in a leaf) and then delete it.
• Delete 25.

[Figure: 25 sits in the internal node 18 25 28; it is swapped with its predecessor in a leaf and then deleted]
Deletion (3)

• If after deleting a key a node becomes empty then we borrow a key from its sibling.
• Delete 20.

[Figure: after 20 is deleted its leaf is empty; a key is borrowed from the sibling leaf through the parent]
Deletion (4)

• If the sibling has only one key then we merge with it.
• The key in the parent node separating these two siblings moves down into the merged node.
• Delete 23.

[Figure: the leaf holding 23 merges with its sibling, and the separating key moves down from the parent]
Deletion (5)

• Moving a key down from the parent corresponds to a deletion in the parent node.
• The procedure is the same as for a leaf node.
• Can lead to a cascade.
• Delete 18.

[Figure: deleting 18 triggers a merge whose parent deletion cascades further up]
(2,4) Conclusion

• The height of a (2,4) tree is O(log n).
• Split, transfer, and merge each take O(1).
• Search, insertion and deletion each take O(log n).
• Why are we doing this?
- (2,4) trees are fun! Why else would we do it?
- Well, there's another reason, too.
- They can be extended to what are called B-trees.
(a,b) Trees

• A multiway search tree.
• Each node has at least a and at most b children.
• The root can have fewer than a children, but it has at least 2 children.
• All leaf nodes are at the same level.
• Height h of an (a,b) tree is at least log_b n and at most log_a n.

[Figure: an example tree with root 12, internal nodes 5 10 and 15, and leaves 3 4, 6 8, 11, 13 14, 17]
Insertion

• No problem if the node has empty space
• Example: keys 21, 23, 29, 7 are inserted into leaves that still have room

[Figure: an (a,b) tree with root 13 22 before the insertions]
Insertion (2)

• Nodes get split if there is insufficient space.
• The median key is promoted to the parent node and inserted there

[Figure: inserting 29 and 7 overflows leaves, whose median keys are promoted]
Insertion (3)

• A node is split when it has exactly b keys.
• One of these is promoted to the parent and the remaining b-1 keys are split between two nodes.
• Thus one node gets ceil((b-1)/2) keys and the other floor((b-1)/2) keys.
• This implies that a-1 <= floor((b-1)/2).
Deletion

• If after deleting a key a node becomes empty then we borrow a key from its sibling.
• Delete 20.

[Figure: an (a,b) tree with root 13; deleting 20 empties its leaf, which borrows from a sibling]
Deletion (2)

• If the sibling has only one key then we merge with it.
• The key in the parent node separating these two siblings moves down into the merged node.
• Delete 23.

[Figure: the leaf holding 23 merges with its sibling and the separating key moves down]
Deletion (3)

• In an (a,b) tree we will merge a node with its sibling if the node has a-2 keys and its sibling has a-1 keys.
• Thus the merged node has 2(a-1) keys (the a-2 and a-1 keys of the two siblings plus the key that moves down from the parent).
• This implies that 2(a-1) <= b-1, which is equivalent to a-1 <= (b-1)/2.
• Earlier too we argued that a-1 <= floor((b-1)/2).
• This implies b >= 2a-1.
• For a = 2, b >= 3.
Conclusion

• The height of an (a,b) tree is O(log n).
• b >= 2a-1.
• For insertion and deletion we take time proportional to the height.
Disk Based Data Structures

• So far search trees were limited to main memory structures
- Assumption: the dataset organized in a search tree fits in main memory (including the tree overhead)
• Counter-example: transaction data of a bank, > 1 GB per day
- use secondary storage media (punch cards, hard disks, magnetic tapes, etc.)
• Consequence: make the search tree structure secondary-storage-enabled


Hard Disks

• Large amounts of storage, but slow access!
• Identifying a page takes a long time (seek time plus rotational delay, 5-10 ms); reading it is fast
- it pays off to read or write data in pages (or blocks) of 2-16 KB in size


Algorithm analysis

• The running time of disk-based algorithms is measured in terms of
- computing time (CPU)
- number of disk accesses
  • sequential reads
  • random reads
• Regular main-memory algorithms that work one data element at a time cannot be "ported" to secondary storage in a straightforward way
Principles

• Pointers in data structures are no longer addresses in main memory but locations in files
• If x is a pointer to an object
- if x is in main memory, key[x] refers to it
- otherwise DiskRead(x) reads the object from disk into main memory (DiskWrite(x) writes it back to disk)
Principles (2)

• A typical working pattern:

  ...
  x ← a pointer to some object
  DiskRead(x)
  operations that access and/or modify x
  DiskWrite(x)  // omitted if nothing changed
  other operations that only access, but do not modify, x
  ...

• Operations:
- DiskRead(x: pointer_to_a_node)
- DiskWrite(x: pointer_to_a_node)
- AllocateNode(): pointer_to_a_node
Binary trees vs. B-trees

• The size of a B-tree node is determined by the page size. One page, one node.
• A B-tree of height 2 may contain > 1 billion keys!
• Heights of binary trees and B-trees are logarithmic
- B-tree: logarithm of a large base, e.g., 1000
- Binary tree: logarithm of base 2

[Figure: a B-tree with 1000 keys per node: 1 node / 1000 keys at the root; 1001 nodes / 1,001,000 keys at depth 1; 1,002,001 nodes / 1,002,001,000 keys at depth 2]
B-tree Definitions

• Node x has fields
- n[x]: the number of keys in the node
- key_1[x] <= ... <= key_{n[x]}[x]: the keys in ascending order
- leaf[x]: true if leaf node, false if internal node
- if internal node, then c_1[x], ..., c_{n[x]+1}[x]: pointers to children
• Keys separate the ranges of keys in the sub-trees: if k_i is an arbitrary key in the subtree rooted at c_i[x], then k_i <= key_i[x] <= k_{i+1}
B-tree Definitions (2)

• Every leaf has the same depth
• In a B-tree of degree t, all nodes except the root node have between t and 2t children (i.e., between t-1 and 2t-1 keys)
• The root node has between 0 and 2t children (i.e., between 0 and 2t-1 keys)
Height of a B-tree

• For a B-tree T of height h, containing n >= 1 keys and minimum degree t >= 2, the following restriction on the height holds:

  h <= log_t((n+1)/2)

[Figure: the sparsest possible B-tree of height h: 1 node at depth 0, 2 nodes at depth 1, 2t nodes at depth 2, ..., 2t^(h-1) nodes at depth h, with t-1 keys in every node except the root]

• Counting keys level by level:

  n >= 1 + (t-1) * sum_{i=1..h} 2t^(i-1) = 2t^h - 1
B-tree Operations

• An implementation needs to support the following B-tree operations
- Searching (simple)
- Creating an empty tree (trivial)
- Insertion (complex)
- Deletion (complex)
Searching

• Straightforward generalization of a binary tree search

BTreeSearch(x, k)
  i ← 1
  while i <= n[x] and k > key_i[x]
    i ← i + 1
  if i <= n[x] and k = key_i[x] then
    return (x, i)
  if leaf[x] then
    return NIL
  else
    DiskRead(c_i[x])
    return BTreeSearch(c_i[x], k)
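The same search can be sketched in runnable form. This is an illustrative in-memory version (an assumption, not the slides' disk-based code), so the DiskRead step is only marked by a comment:

```python
# A runnable sketch of BTreeSearch on a simple in-memory node class.
class BNode:
    def __init__(self, keys, children=None, leaf=True):
        self.keys, self.children, self.leaf = keys, children or [], leaf

def btree_search(x, k):
    i = 0
    while i < len(x.keys) and k > x.keys[i]:
        i += 1
    if i < len(x.keys) and k == x.keys[i]:
        return (x, i)                      # found: node and key index
    if x.leaf:
        return None                        # NIL
    return btree_search(x.children[i], k)  # a DiskRead would happen here

root = BNode([10, 20], leaf=False,
             children=[BNode([2, 5]), BNode([12, 17]), BNode([25, 30])])
node, idx = btree_search(root, 17)
print(node.keys, idx)            # → [12, 17] 1
print(btree_search(root, 7))     # → None
```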
Creating an Empty Tree

• Empty B-tree = create a root & write it to disk!

BTreeCreate(T)
  x ← AllocateNode()
  leaf[x] ← TRUE
  n[x] ← 0
  DiskWrite(x)
  root[T] ← x
Splitting Nodes

• Nodes fill up and reach their maximum capacity of 2t - 1 keys
• Before we can insert a new key, we have to "make room," i.e., split nodes
Splitting Nodes (2)

• Result: one key of x moves up to the parent, plus 2 nodes with t-1 keys each

[Figure: node y = c_i[x] holding P Q R S T V W splits around its median key S; S moves up into x between N and W, leaving y = c_i[x] = P Q R and a new node z = c_{i+1}[x] = T V W]
Splitting Nodes (3)

// x: parent node
// y: node to be split, the i-th child of x
// z: new node

BTreeSplitChild(x, i, y)
  z ← AllocateNode()
  leaf[z] ← leaf[y]
  n[z] ← t-1
  for j ← 1 to t-1
    key_j[z] ← key_{j+t}[y]
  if not leaf[y] then
    for j ← 1 to t
      c_j[z] ← c_{j+t}[y]
  n[y] ← t-1
  for j ← n[x]+1 downto i+1
    c_{j+1}[x] ← c_j[x]
  c_{i+1}[x] ← z
  for j ← n[x] downto i
    key_{j+1}[x] ← key_j[x]
  key_i[x] ← key_t[y]
  n[x] ← n[x]+1
  DiskWrite(y)
  DiskWrite(z)
  DiskWrite(x)
Split: Running Time

• A local operation that does not traverse the tree
• Θ(t) CPU time, since two loops run t times
• 3 disk I/Os
Inserting Keys

• Done recursively, by starting from the root and traversing down the tree to the leaf level
• Before descending to a lower level in the tree, make sure that the node contains < 2t - 1 keys:
- so that if we split a node in a lower level we will have space to include a new key
Inserting Keys (2)

• Special case: root is full (BTreeInsert)

BTreeInsert(T, k)
  r ← root[T]
  if n[r] = 2t - 1 then
    s ← AllocateNode()
    root[T] ← s
    leaf[s] ← FALSE
    n[s] ← 0
    c_1[s] ← r
    BTreeSplitChild(s, 1, r)
    BTreeInsertNonFull(s, k)
  else BTreeInsertNonFull(r, k)
Splitting the Root

• Splitting the root requires the creation of a new root

[Figure: the full root r = A D F H L N P splits around its median H; a new root s holds H with children A D F and L N P]

• The tree grows at the top instead of the bottom
Inserting Keys (3)

• BTreeInsertNonFull tries to insert a key k into a node x, which is assumed to be non-full when the procedure is called
• BTreeInsert and the recursion in BTreeInsertNonFull guarantee that this assumption is true!
Inserting Keys: Pseudo Code

BTreeInsertNonFull(x, k)
  i ← n[x]
  if leaf[x] then                      // leaf insertion
    while i >= 1 and k < key_i[x]
      key_{i+1}[x] ← key_i[x]
      i ← i - 1
    key_{i+1}[x] ← k
    n[x] ← n[x] + 1
    DiskWrite(x)
  else                                 // internal node: traversing the tree
    while i >= 1 and k < key_i[x]
      i ← i - 1
    i ← i + 1
    DiskRead(c_i[x])
    if n[c_i[x]] = 2t - 1 then
      BTreeSplitChild(x, i, c_i[x])
      if k > key_i[x] then
        i ← i + 1
    BTreeInsertNonFull(c_i[x], k)
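The split and insert procedures fit together as sketched below. This is a compact in-memory assumption (0-indexed Python lists instead of the slides' 1-indexed disk-resident arrays, and a simple sort in the leaf instead of the shifting loop), not the slides' exact code:

```python
# A runnable sketch of split-child + insert-nonfull + insert with t = 2.
T = 2  # minimum degree

class BNode:
    def __init__(self, leaf=True):
        self.keys, self.children, self.leaf = [], [], leaf

def split_child(x, i):
    y = x.children[i]
    z = BNode(leaf=y.leaf)
    z.keys = y.keys[T:]                  # upper t-1 keys move to z
    if not y.leaf:
        z.children = y.children[T:]
        y.children = y.children[:T]
    x.keys.insert(i, y.keys[T - 1])      # median key moves up into x
    y.keys = y.keys[:T - 1]
    x.children.insert(i + 1, z)

def insert_nonfull(x, k):
    if x.leaf:
        x.keys.append(k)
        x.keys.sort()                    # stands in for the shifting loop
    else:
        i = 0
        while i < len(x.keys) and k > x.keys[i]:
            i += 1
        if len(x.children[i].keys) == 2 * T - 1:   # full child: split first
            split_child(x, i)
            if k > x.keys[i]:
                i += 1
        insert_nonfull(x.children[i], k)

def insert(root, k):
    if len(root.keys) == 2 * T - 1:      # special case: root is full
        s = BNode(leaf=False)
        s.children.append(root)
        split_child(s, 0)
        root = s
    insert_nonfull(root, k)
    return root

root = BNode()
for k in [10, 20, 5, 6, 12, 30, 7, 17]:
    root = insert(root, k)
print(root.keys)   # → [10, 20]
```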
Insertion: Example

initial tree (t = 3): root G M P X; leaves A C D E | J K | N O | R S T U V | Y Z

B inserted: root G M P X; leaves A B C D E | J K | N O | R S T U V | Y Z

Q inserted (R S T U V splits, T moves up): root G M P T X; leaves A B C D E | J K | N O | Q R S | U V | Y Z
Insertion: Example (2)

L inserted (the full root splits, P moves up): new root P; children G M | T X; leaves A B C D E | J K L | N O | Q R S | U V | Y Z

F inserted (A B C D E splits, C moves up): root P; children C G M | T X; leaves A B | D E F | J K L | N O | Q R S | U V | Y Z
Insertion: Running Time

• Disk I/O: O(h), since only O(1) disk accesses are performed during recursive calls of BTreeInsertNonFull
• CPU: O(th) = O(t log_t n)
• At any given time only O(1) disk pages are held in main memory
Deleting Keys

• Done recursively, by starting from the root and traversing down the tree to the leaf level
• Before descending to a lower level in the tree, make sure that the node contains >= t keys (cf. insertion: < 2t - 1 keys)
• BTreeDelete distinguishes three different cases for deletion
- Case 1: key k found in a leaf node
- Case 2: key k found in an internal node
- Case 3: key k suspected in a lower-level node
Deleting Keys (2)

initial tree: root P; children C G M | T X; leaves A B | D E F | J K L | N O | Q R S | U V | Y Z

F deleted (case 1): the leaf D E F becomes D E

• Case 1: If the key k is in node x, and x is a leaf, delete k from x
Deleting Keys (3)

• Case 2: If the key k is in node x, and x is not a leaf, delete k from x:
- a) If the child y that precedes k in node x has at least t keys, then find the predecessor k' of k in the sub-tree rooted at y. Recursively delete k', and replace k with k' in x.
- b) Symmetrically for the successor node z

M deleted (case 2a): M in node x = C G M is replaced by its predecessor L from child y, giving C G L
Deleting Keys (4)

• c) If both y and z have only t - 1 keys, merge k with the contents of z into y, so that x loses both k and the pointer to z, and y now contains 2t - 1 keys. Free z and recursively delete k from y.

G deleted (case 2c): x loses G; y absorbs G and the contents of z (y = y + k + z), then G is deleted from the merged node
Deleting Keys - Distribution

• Descending down the tree: if k is not found in the current node x, find the sub-tree c_i[x] that has to contain k.
• If c_i[x] has only t - 1 keys, take action to ensure that we descend to a node of size at least t.
• We can encounter two cases.
- If c_i[x] has only t-1 keys, but a sibling with at least t keys: give c_i[x] an extra key by moving a key from x down into c_i[x], moving a key from c_i[x]'s immediate left or right sibling up into x, and moving the appropriate child pointer from the sibling into c_i[x] (distribution)
Deleting Keys - Distribution (2)

[Figure: the key k moves from x down into c_i[x], and the sibling's key k' moves up into x]

Example: delete B. c_i[x] = A B has t-1 keys, but its sibling E J K has t keys.

B deleted: C moves down from x into c_i[x] and E moves up from the sibling, giving root E L P T X and leaves A C | J K | N O | Q R S | U V | Y Z
Deleting Keys - Merging

• If c_i[x] and both of c_i[x]'s siblings have t - 1 keys: merge c_i[x] with one sibling, which involves moving a key k from x down into the new merged node to become the median key of that node

[Figure: c_i[x] ending in l and its sibling starting with m merge with k from x into one node ... l k m ...]
Deleting Keys - Merging (2)

Example: delete D. c_i[x] = C L and its sibling T X both have t - 1 keys, so they merge with P from the root:

D deleted: the new root is C L P T X, with leaves A B | E J K | N O | Q R S | U V | Y Z; the tree shrinks in height
Deletion: Running Time

• Most of the keys are in the leaves, so deletion most often occurs there!
• In this case deletion happens in one downward pass to the leaf level of the tree
• Deletion from an internal node might require "backing up" (case 2)
• Disk I/O: O(h), since only O(1) disk operations are produced during recursive calls
• CPU: O(th) = O(t log_t n)
Two-pass vs. One-pass Operations

• Two-pass operations are simpler to implement
• One-pass operations save the time of traversing the tree from root to leaf twice, but may cause more splits/merges than two-pass ones
B-Trees

COL 106
Shweta Agrawal, Amit Kumar

Slide Credit: Yael Moses, IDC Herzliya

Animated demo: http://ats.oka.nu/b-tree/b-tree.html
https://www.youtube.com/watch?v=coRJrcIYbF4
Motivation

• Large differences between the access times of disk, cache memory and core memory
• Minimize expensive accesses (e.g., disk accesses)
• B-tree: a dynamic-set structure that is optimized for disks
B-Trees

A B-tree is an M-way search tree with the following properties:
1. It is perfectly balanced: every leaf node is at the same depth
2. Every internal node other than the root is at least half-full, i.e. M/2 - 1 <= #keys <= M-1
3. Every internal node with k keys has k+1 non-null children

For simplicity we consider M even and we use t = M/2:
2.* Every internal node other than the root is at least half-full, i.e. t-1 <= #keys <= 2t-1 and t <= #children <= 2t
Example: a 4-way B-tree

[Figure: left, a B-tree with root 20 40 and leaves 0 5 10 | 25 35 | 45 55; right, a 4-way search tree over the same keys that is not a B-tree]

B-tree properties:
1. It is perfectly balanced: every leaf node is at the same depth.
2. Every node, except maybe the root, is at least half-full: t-1 <= #keys <= 2t-1
3. Every internal node with k keys has k+1 non-null children
B-tree Height

Claim: any B-tree with n keys, height h and minimum degree t satisfies:

  h <= log_t((n+1)/2)

Proof:
• The minimum number of keys for a tree with height h is obtained when:
- The root contains one key
- All other nodes contain t-1 keys
B-Tree: Insert X

1. As in an M-way tree, find the leaf node to which X should be added
2. Add X to this node in the appropriate place among the values already there (there are no subtrees to worry about)
3. Number of values in the node after adding the key:
- At most 2t-1: done
- Equal to 2t: overflow
4. Fix the overflowed node
Fix an Overflowed Node

1. Split the node into three parts (M = 2t):
- Left: the first values, which become a left child node
- Middle: the middle value, which goes up to the parent
- Right: the remaining values, which become a right child node
2. Continue with the parent:
1. Until no overflow occurs in the parent
2. If the root overflows, split it too, and create a new root node

[Figure: the overfull node y = 60 65 68 83 86 90 splits; 68 is promoted into the parent x (... 56 68 98 ...), leaving y = 60 65 and z = 83 86 90]
Insert example (M = 6, t = 3)

Initial tree: root 20 40 60 80; leaves 0 5 10 15 | 25 35 | 45 55 | 62 66 70 74 78 | 87 98

Insert 3: the first leaf becomes 0 3 5 10 15 (no overflow)

Insert 61: the leaf 61 62 66 70 74 78 now has 2t = 6 values. OVERFLOW; SPLIT IT: 70 is promoted, giving root 20 40 60 70 80 and leaves ... | 61 62 66 | 74 78 | ...
Insert 38 (M = 6, t = 3): the leaf 25 35 becomes 25 35 38 (no overflow); the tree is otherwise unchanged
Insert 4 (M = 6, t = 3): the leaf 0 3 4 5 10 15 now has 2t = 6 values. OVERFLOW; SPLIT IT: 5 is promoted, but now the root 5 20 40 60 70 80 overflows as well. SPLIT IT: 60 is promoted into a new root:

  root 60; children 5 20 40 | 70 80; leaves 0 3 4 | 10 15 | 25 35 38 | 45 55 | 61 62 66 | 74 78 | 87 98
Complexity: Insert

• Inserting a key into a B-tree of height h is done in a single pass down the tree and a single pass up the tree

Complexity: O(h) = O(log_t n)
B-Tree: Delete X

• Delete as in an M-way tree
• A problem: it might cause underflow: the number of keys remaining in a node is < t-1

Recall: the root should have at least 1 value in it, and all other nodes should have at least t-1 values in them
Underflow Example (M = 6, t = 3)

Delete 87: the leaf 87 98 becomes 98, which has fewer than t-1 = 2 keys. UNDERFLOW

[Figure: the B-tree with root 60 before and after the deletion]
B-Tree: Delete k

• Delete as in an M-way tree
• A problem: it might cause underflow: the number of keys remaining in a node is < t-1
• Solution: make sure a node that is visited has at least t instead of t-1 keys.
- If it doesn't have that many keys:
  • (1) either take a key from a sibling via a rotate, or
  • (2) merge with a sibling, moving a key down from the parent
- If it does: see the next slides

Recall: the root should have at least 1 value in it, and all other nodes should have at least t-1 (at most 2t-1) values in them
B-Tree-Delete(x, k)

1st case: k is in x and x is a leaf ⇒ delete k

Example (k = 66): x = 62 66 70 74 becomes x = 62 70 74

How many keys are left?
Example (t = 3), 2nd case a:

k = 50 is in the internal node x = 30 50 70 90; its predecessor 45 in the leaf y = 35 40 45 replaces it, giving x = 30 45 70 90 and y = 35 40
Example (t = 3), 2nd case cont.:

c. Both a and b are not satisfied: y and z have t-1 keys
- Merge the two children, y and z
- Recursively delete k from the merged node

[Figure: x = 30 50 70 90 with children y = 35 40 and z = 55 60; y, the key 50 and z merge into 35 40 50 55 60, then 50 is deleted and x becomes 30 70 90]
Questions

• When does the height of the tree shrink?
• Why do we need the number of keys to be at least t and not t-1 when we proceed down the tree?

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Delete Complexity

• Basically a downward pass:
- Most of the keys are in the leaves: one downward pass
- When deleting a key in an internal node, we may have to go one step up to replace the key with its predecessor or successor

Complexity: O(h) = O(log_t n)
Run Time Analysis of B-Tree Operations

• For a B-Tree of order M = 2t:
- #keys in an internal node: at most M-1
- #children of an internal node: between M/2 and M
- ⇒ depth of a B-Tree storing n items is O(log_{M/2} n)
• Find run time:
- O(log M) to binary search which branch to take at each node; since M is constant it is O(1)
- Total time to find an item is O(h · log M) = O(log n)
• Insert & Delete
- Similar to Find, but updating a node may take O(M) = O(1)

Note: if M > 32 it is worth using binary search at each node

A typical B-Tree

[Figure: a typical B-tree]
Why B-Tree?

• B-trees are an implementation of dynamic sets that is optimized for disks
- The memory has a hierarchy and there is a tradeoff between the size of units/blocks and access time
- The goal is to minimize the number of accesses to "expensive access time memory"
- The size of a node is determined by characteristics of the disk: block size / page size
- The number of accesses is proportional to the tree depth
Binary Heaps

COL 106
Shweta Agrawal and Amit Kumar

Revisiting FindMin

• Application: find the smallest (or highest priority) item quickly
- Operating system needs to schedule jobs according to priority instead of FIFO
- Event simulation (bank customers arriving and departing, ordered according to when the event happened)
- Find the student with the highest grade, the employee with the highest salary, etc.
Priority Queue ADT

• A priority queue can efficiently do:
- FindMin (and DeleteMin)
- Insert
• What if we use...
- Lists: if sorted, what is the run time for Insert and FindMin? Unsorted?
- Binary Search Trees: what is the run time for Insert and FindMin?
- Hash Tables: what is the run time for Insert and FindMin?
Less flexibility ⇒ More speed

• Lists
- If sorted: FindMin is O(1) but Insert is O(N)
- If not sorted: Insert is O(1) but FindMin is O(N)
• Balanced Binary Search Trees (BSTs)
- Insert is O(log N) and FindMin is O(log N)
• Hash Tables
- Insert is O(1) but there is no hope for FindMin
• BSTs look good but...
- BSTs are efficient for all Finds, not just FindMin
- We only need FindMin
Better than a speeding BST

• Can we do better than Balanced Binary Search Trees?
• Very limited requirements: Insert, FindMin, DeleteMin. The goals are:
- FindMin is O(1)
- Insert is O(log N)
- DeleteMin is O(log N)
Binary Heaps

• A binary heap is a binary tree (NOT a BST) that is:
- Complete: the tree is completely filled except possibly the bottom level, which is filled from left to right
- Satisfies the heap order property:
  • every node is less than or equal to its children,
  • or every node is greater than or equal to its children
• The root node is always the smallest node
- or the largest, depending on the heap order
Heap order property

• A heap provides limited ordering information
• Each path is sorted, but the subtrees are not sorted relative to each other
- A binary heap is NOT a binary search tree

[Figure: three valid (minimum) binary heaps]
Binary Heap vs Binary Search Tree

[Figure: a binary heap with the min value 5 at the root, vs. a binary search tree with the min value 5 at the leftmost node]

• Binary heap: the parent is less than both the left and right children
• Binary search tree: the parent is greater than the left child and less than the right child
Structure property

• A binary heap is a complete tree
- All nodes are in use except for possibly the right end of the bottom row
Examples

[Figure: four small trees: a complete tree with "max" heap order; a tree that is not complete; a complete tree with "min" heap order; a complete tree where the min heap order is broken]
Array Implementation of Heaps (Implicit Pointers)

• Root node = A[1]
• Children of A[i] = A[2i], A[2i + 1]
• Parent of A[j] = A[j/2]
• Keep track of the current size N (number of nodes)

[Figure: the heap with root 2, children 4 and 6, and leaves 7 and 5 stored as values (-, 2, 4, 6, 7, 5) at indices 0..5, with N = 5]
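The implicit-pointer arithmetic above can be checked directly. A minimal sketch (an assumption, not the slides' code), using the slide's example array with index 0 unused:

```python
# Index 0 is unused so that the children of A[i] sit at A[2i] and A[2i+1].
A = [None, 2, 4, 6, 7, 5]    # the example heap from the slide
N = 5                        # number of nodes

def parent(j): return j // 2
def left(i):   return 2 * i
def right(i):  return 2 * i + 1

# The root is A[1]; the children of the root are A[2] and A[3].
assert A[1] == 2 and A[left(1)] == 4 and A[right(1)] == 6
assert A[parent(4)] == 4     # A[4] = 7, and its parent is A[2] = 4
```

Integer division makes `parent` work for both left and right children: 4 // 2 and 5 // 2 both give 2.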
FindMin and DeleteMin

• FindMin: easy!
- Return the root value A[1]
- Run time = ?
• DeleteMin:
- Delete (and return) the value at the root node

[Figure: a min-heap with root 2, children 4 and 3, then 7 5 8 10, then 11 9 6 14]
DeleteMin

• Delete (and return) the value at the root node

[Figure: the root 2 is removed from the heap]
Maintain the Structure Property

• We now have a "hole" at the root
- We need to fill the hole with another value
• When we get done, the tree will have one less node and must still be complete

[Figure: the heap with a hole at the root; the last node (14) must move]
Maintain the Heap Property

• The last value has lost its node
- we need to find a new place for it

[Figure: 14 is detached from the bottom level; the hole remains at the root]
DeleteMin: Percolate Down

[Figure: the hole moves down from the root: 3 is copied up, then 8; 14 finally lands in a leaf position]

• Keep comparing with the children A[2i] and A[2i + 1]
• Copy the smaller child up and go down one level
• Done if both children are >= item, or if we reached a leaf node
• What is the run time?
Percolate Down

Example array (indices 1..6): 6 | 10 | 8 | 13 | 14 | 25

PercDown(i: integer, x: integer) {
  // N is the number of elements, i is the hole, x is the value to insert
  Case {
    2i > N : A[i] := x;                        // no children: at the bottom
    2i = N : if A[2i] < x then                 // one child at the end
               A[i] := A[2i]; A[2i] := x;
             else A[i] := x;
    2i < N : if A[2i] < A[2i+1] then j := 2i;  // two children
             else j := 2i+1;
             if A[j] < x then
               A[i] := A[j]; PercDown(j, x);
             else A[i] := x;
}}

12/26/03 Binary Heaps - Lecture 11
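The same procedure can be written iteratively. A runnable sketch (an assumption, not the slides' exact code), on a 1-indexed Python list:

```python
def perc_down(A, N, i, x):
    """Fill the hole at index i with value x, sifting the smaller child up."""
    while True:
        if 2 * i > N:                 # no children: place x here
            A[i] = x
            return
        j = 2 * i                     # left child
        if j < N and A[j + 1] < A[j]:
            j += 1                    # right child exists and is smaller
        if A[j] < x:
            A[i] = A[j]               # copy smaller child up; hole moves down
            i = j
        else:
            A[i] = x
            return

A = [None, 25, 10, 8, 13, 14]         # hole at index 1, value 25 to place
perc_down(A, 5, 1, 25)
print(A[1:])                          # → [8, 10, 25, 13, 14]
```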


DeleteMin: Run Time Analysis

• Run time is O(depth of heap)
• A heap is a complete binary tree
• Depth of a complete binary tree of N nodes?
- depth = floor(log2(N))
• Run time of DeleteMin is O(log N)
Insert

• Add a value to the tree
• The structure and heap order properties must still be correct when we are done

[Figure: the value 2 is to be inserted into the heap with root 3]
Maintain the Structure Property

• The only valid place for a new node in a complete tree is at the end of the array
• We need to decide on the correct value for the new node, and adjust the heap accordingly
Maintain the Heap Property

• The new value goes where?

[Figure: 2 is appended as the last node of the bottom level]
Insert: Percolate Up

[Figure: the hole moves up from the last position: 5 is copied down, then 4, then 3; 2 finally lands at the root]

• Start at the last node and keep comparing with the parent A[i/2]
• If the parent is larger, copy the parent down and go up one level
• Done if parent <= item, or if we reached the top node A[1]
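The percolate-up insertion can be sketched as follows (a runnable assumption, not the slides' exact code), using the slide's heap on a 1-indexed list:

```python
def heap_insert(A, x):
    """Append x and sift the hole up until the parent is <= x."""
    A.append(x)
    i = len(A) - 1                 # index of the new hole
    while i > 1 and A[i // 2] > x:
        A[i] = A[i // 2]           # copy the larger parent down
        i //= 2
    A[i] = x

A = [None, 3, 4, 8, 7, 5, 14, 10, 11, 9, 6]   # the heap from the slide
heap_insert(A, 2)
print(A[1:])   # → [2, 3, 8, 7, 4, 14, 10, 11, 9, 6, 5]
```

The result matches the "Insert: Done" slide: 2 percolates past 5, 4 and 3 all the way to the root.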
Insert: Done

[Figure: the final heap: root 2, children 3 and 8, then 7 4 14 10, then 11 9 6 5]

• Run time?
Binary Heap Analysis

• Space needed for a heap of N nodes: O(MaxN)
- An array of size MaxN, plus a variable to store the size N
• Time
- FindMin: O(1)
- DeleteMin and Insert: O(log N)
- BuildHeap from N inputs: ???

Build Heap

BuildHeap {
  for i = N/2 downto 1
    PercDown(i, A[i])
}

[Figure: N = 11; the array 11 5 10 9 4 8 12 2 7 6 3 viewed as a tree; PercDown starts at node N/2 = 5, where 4 swaps with its smaller child 3]
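BuildHeap can be run on the slide's N = 11 example. A runnable sketch (an assumption, not the slides' code), with perc_down inlined so the block is self-contained:

```python
def perc_down(A, N, i, x):
    while 2 * i <= N:
        j = 2 * i
        if j < N and A[j + 1] < A[j]:
            j += 1                   # pick the smaller child
        if A[j] < x:
            A[i] = A[j]              # copy smaller child up
            i = j
        else:
            break
    A[i] = x

def build_heap(A, N):
    for i in range(N // 2, 0, -1):   # leaves need no work
        perc_down(A, N, i, A[i])

A = [None, 11, 5, 10, 9, 4, 8, 12, 2, 7, 6, 3]   # the slide's example
build_heap(A, 11)
print(A[1:])   # → [2, 3, 8, 5, 4, 10, 12, 9, 7, 6, 11]
```

The output matches the last tree in the slides' Build Heap sequence.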


Build Heap (cont.)

[Figure: PercDown continues at nodes 4, 3, 2 and finally the root; the array evolves into the heap 2 | 3 8 | 5 4 10 12 | 9 7 6 11]

Time Complexity

• Naive considerations:
- n/2 calls to PercDown, each takes c log n
- Total: O(n log n)
• More careful considerations:
- Only O(n)
Analysis of Build Heap

• Assume n = 2^(h+1) - 1, where h is the height of the tree
- Thus, level h has 2^h nodes but there is nothing to PercDown
- At level h-1 there are 2^(h-1) nodes, each might percolate down 1 level
- At level h-j, there are 2^(h-j) nodes, each might percolate down j levels

Total time = sum_{j=1..h} j * 2^(h-j) = 2^h * sum_{j=1..h} j/2^j <= 2^h * 2 = O(n)


Other Heap Operations

• Find(X, H): find the element X in heap H of N elements
- What is the running time? O(N)
• FindMax(H): find the maximum element in H, where FindMin is O(1)
- What is the running time? O(N)
• We sacrificed the performance of these operations in order to get O(1) performance for FindMin

Other Heap Operations

• DecreaseKey(P, Δ, H): decrease the key value of the node at position P by a positive amount Δ, e.g., to increase priority
- First, subtract Δ from the current value at P
- The heap order property may be violated
- so percolate up to fix it
- Running time: O(log N)

Other Heap Operations

• IncreaseKey(P, Δ, H): increase the key value of the node at position P by a positive amount Δ, e.g., to decrease priority
- First, add Δ to the current value at P
- The heap order property may be violated
- so percolate down to fix it
- Running time: O(log N)

Other Heap Operations

• Delete(P, H): e.g., delete a job waiting in the queue that has been preemptively terminated by the user
- Use DecreaseKey(P, Δ, H) followed by DeleteMin
- Running time: O(log N)

Other Heap Operations

• Merge(H1, H2): merge two heaps H1 and H2 of size O(N). H1 and H2 are stored in two arrays.
- Can do O(N) Insert operations: O(N log N) time
- Better: copy H2 at the end of H1 and use BuildHeap. Running time: O(N)

Heap Sort

• Idea: buildHeap, then call deleteMin n times

E[] input = buildHeap(...);
E[] output = new E[n];
for (int i = 0; i < n; i++) {
  output[i] = deleteMin(input);
}

• Runtime? Best-case ___ Worst-case ___ Average-case ___
• Stable? _____
• In-place? _____

CSE373: Data Structures & Algorithms
Heap Sort (answers)

• Runtime? Best-case, worst-case and average-case: O(n log n)
• Stable? No
• In-place? No. But it could be, with a slight trick...
In-place Heap Sort
(but this reverse sorts: how would you fix that?)

• Treat the initial array as a heap (via buildHeap)
• When you delete the ith element, put it at arr[n-i]
- That array location isn't needed for the heap anymore!

  4 7 5 9 8 6 10 | 3 2 1
      heap part    sorted part

Put the min at the end of the heap data:

  arr[n-i] = deleteMin()

  5 7 6 9 8 10 | 4 3 2 1
     heap part   sorted part
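The in-place scheme can be sketched end to end. This runnable version is an assumption, not the slides' code: it uses a 0-indexed min-heap and stores each deleted minimum just past the shrinking heap, so (as the slide notes) the result comes out reverse-sorted:

```python
def perc_down(A, n, i):
    """Sift A[i] down within A[0:n] (0-indexed min-heap)."""
    while 2 * i + 1 < n:
        j = 2 * i + 1
        if j + 1 < n and A[j + 1] < A[j]:
            j += 1                        # smaller child
        if A[j] < A[i]:
            A[i], A[j] = A[j], A[i]
            i = j
        else:
            break

def in_place_heap_sort(A):
    n = len(A)
    for i in range(n // 2 - 1, -1, -1):   # buildHeap
        perc_down(A, n, i)
    for size in range(n - 1, 0, -1):      # repeatedly deleteMin
        A[0], A[size] = A[size], A[0]     # min goes to the freed slot
        perc_down(A, size, 0)

A = [4, 7, 5, 9, 8, 6, 10, 3, 2, 1]
in_place_heap_sort(A)
print(A)   # → [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
```

One answer to the slide's question: use a max-heap instead of a min-heap, and the same swap-to-the-end trick produces an ascending array.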
"AVL sort"? "Hash sort"?

AVL Tree: sure, we can also use an AVL tree to sort:
- insert each element: total time O(n log n)
- repeatedly deleteMin: total time O(n log n)
  • Better: an in-order traversal is O(n), but still O(n log n) overall
- But this cannot be done in-place and has worse constant factors than heap sort
Searching on Multi-Dimensional Data

COL 106
Slide Courtesy: Dan Tromer, Piotyr Indyk, George Bebis
Query Types

• Exact match query: asks for the object(s) whose key matches the query key exactly.
• Range query: asks for the objects whose key lies in a specified query range (interval).
• Nearest-neighbor query: asks for the objects whose key is "close" to the query key.
Range Query

• Example:
- key = Age: retrieve all records satisfying 20 < Age < 50
- key = #Children: retrieve all records satisfying 1 < #Children < 4

[Table: records with fields ID, Name, Age, Salary, #Children]
Nearest Neighbor Search

Problem definition:
• Given: a set P of n points in R^d, with some distance metric
• Find the nearest neighbor p of q in P

[Figure: a query point q among the points of P]
Nearest Neighbor Query in High Dimensions

• Very important and practical problem!
- Image retrieval: given a feature vector (f1, f2, ..., fk), find the N closest matches (i.e., the N nearest neighbors)
Nearest Neighbor Query in High Dimensions (2)

- Face recognition: find the closest match (i.e., the nearest neighbor)
Nearest Neighbor(s) Query

- What is the closest restaurant to my hotel?
Other Applications

• Classification
• Clustering
• Indexing
• Copyright violation detection

[Figure: points plotted by weight vs. color, with a query point q]
We will see three solutions (or as many as time permits)...

• Quad Trees
• k-d Trees
• Locality Sensitive Hashing
Interpreting Queries Geometrically

• Multi-dimensional keys can be thought of as "points" in high dimensional spaces.
• Queries about records ⇒ queries about points
Example 1: Range Search in 2D

  age = 10,000 x year + 100 x month + day

Example 2: Range Search in 3D
1D Range Search

• Range: [x, x']
• Updates take O(n) time
• Does not generalize well to high dimensions.
• Example: retrieve all points in [25, 90]
1D Range Search
" Example: retrieve all points in [25, 90]
3 The search path for 25 is:

14
1D Range Search
3 The search for 90 is:

1D Range Search
" Examine the leaves in the sub-trees between
the two traversing paths from the root.

split node

retrieve all points in [25, 90]
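The boundary-path idea above can be sketched as a recursive range query on a binary search tree. This is a minimal illustration (the `Node`/`range_search` names are mine, not from the slides): each call skips a subtree entirely when the range cannot reach it, so only the two boundary paths plus the subtrees between them are visited.

```python
# Minimal sketch of 1D range search on a BST (illustrative names).

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def range_search(node, lo, hi, out):
    """Report every key in [lo, hi]; O(log n + k) on a balanced tree."""
    if node is None:
        return
    if lo < node.key:                # left subtree may still contain keys >= lo
        range_search(node.left, lo, hi, out)
    if lo <= node.key <= hi:         # in-order visit => sorted output
        out.append(node.key)
    if node.key < hi:                # right subtree may still contain keys <= hi
        range_search(node.right, lo, hi, out)

root = None
for k in [30, 10, 70, 25, 90, 5, 40, 95]:
    root = insert(root, k)
found = []
range_search(root, 25, 90, found)
print(found)  # [25, 30, 40, 70, 90]
```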


Quad Trees

• A tree in which each internal node has up to four children.
• Every node in the quadtree corresponds to a square.
• The children of a node v correspond to the four quadrants of the square of v.
• The children of a node are labelled NE, NW, SW, and SE to indicate to which quadrant they correspond.
Quadtree Construction (data stored at leaves)

Input: point set P
while some cell C contains more than k points do
    split cell C
end while

(Figure: a 400 × 400 region containing points a–l; the root cell splits at X=50, Y=200, with further splits at X=75, Y=100 and X=25, Y=300; children are ordered SW, SE, NW, NE.)
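The construction loop above can be sketched directly: keep splitting any cell that holds more than k points into its four quadrants. This is a sketch with illustrative names (`QuadNode`, midpoint splitting), assuming distinct points so the recursion terminates.

```python
# Sketch of region-quadtree construction, data stored at leaves
# (illustrative class; splits each overfull cell at its midpoint).

class QuadNode:
    def __init__(self, x0, y0, x1, y1, points, k=1):
        self.bounds = (x0, y0, x1, y1)
        self.children = None            # None => leaf
        self.points = points
        if len(points) > k:             # "while some cell C contains more than k points: split C"
            xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
            quads = {'SW': (x0, y0, xm, ym), 'SE': (xm, y0, x1, ym),
                     'NW': (x0, ym, xm, y1), 'NE': (xm, ym, x1, y1)}
            self.children = {}
            for label, (a, b, c, d) in quads.items():
                sub = [p for p in points if a <= p[0] < c and b <= p[1] < d]
                self.children[label] = QuadNode(a, b, c, d, sub, k)
            self.points = []            # interior nodes hold no data

tree = QuadNode(0, 0, 400, 400, [(50, 50), (75, 80), (90, 65), (35, 85)], k=1)
print(tree.children is not None)  # True: the root cell was split
```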
Quadtree – Exact Match Query

(Figure: partitioning of the plane and the corresponding quad tree for points A(50,50), B(75,80), C(90,65), D(35,85), E(25,25).)

To search for P(55, 75):
• Since XA < XP and YA < YP → go to NE (i.e., B).
• Since XB > XP and YB > YP → go to SW, which in this case is null.
Quadtree – Nearest Neighbor Query

(Figures: the cell containing the query point is located by descending through the NW/NE/SW/SE children; neighboring cells are then examined.)
Quadtree – Nearest Neighbor Search Algorithm

Initialize a range search with a large radius r
Put the root on a stack
Repeat:
  – Pop the next node T from the stack
  – For each child C of T:
      • if C intersects the circle (ball) of radius r around q, add C to the stack
      • if C is a leaf, examine the point(s) in C and update r

• Whenever a point is found, update r (i.e., the current minimum).
• Only investigate nodes with respect to the current r.
Quadtree (cont'd)
• Simple data structure.
• Easy to implement.
• But it might not be efficient: two close points may require many levels in the tree to split them.

The following image shows an original image and its PR quadtree decomposition.
KD Tree
• A binary search tree where every node is a k-dimensional point.

Example (k = 2): a tree with root (53, 14); its children are (27, 28) and (65, 51), followed by (30, 11), (31, 85), (70, 3), (99, 90) at the next level, and so on down to leaves such as (15, 61).
KD Tree (cont'd)
Example: data stored at the leaves
KD Tree (cont'd)
• Every node (except leaves) represents a hyperplane that divides the space into two parts.
• Points to the left (right) of this hyperplane lie in the left (right) sub-tree of that node: Pleft and Pright.
KD Tree (cont'd)
As we move down the tree, we divide the space along alternating (but not always) axis-aligned hyperplanes:

• Split by x-coordinate: split by a vertical line that has (ideally) half the points left or on, and half right.
• Split by y-coordinate: split by a horizontal line that has (ideally) half the points below or on, and half above.
KD Tree – Example
Split by x-coordinate: split by a vertical line that has approximately half the points left or on, and half right.

KD Tree – Example
Split by y-coordinate: split by a horizontal line that has half the points below or on and half above.

KD Tree – Example
Split by x-coordinate again: split each side by a vertical line that has half its points left or on, and half right.

KD Tree – Example
Split by y-coordinate again: split by horizontal lines that have half the points below or on and half above.
Node Structure
• A KD-tree node has 5 fields:
  – Splitting axis
  – Splitting value
  – Data
  – Left pointer
  – Right pointer
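The 5-field node above maps directly onto a small class. A minimal sketch (the `KDNode` name and field types are mine, not from the slides):

```python
# Sketch of the 5-field KD-tree node described on this slide.
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class KDNode:
    axis: int                       # splitting axis (0 = x, 1 = y, ...)
    value: float                    # splitting value along that axis
    data: Any                       # the point (or payload) stored here
    left: Optional['KDNode'] = None
    right: Optional['KDNode'] = None

node = KDNode(axis=0, value=53, data=(53, 14))
print(node.axis, node.value)  # 0 53
```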
Splitting Strategies
• Divide based on order of point insertion
  – Assumes that points are given one at a time.
• Divide by finding the median
  – Assumes all the points are available ahead of time.
• Divide perpendicular to the axis with the widest spread
  – Split axes might not alternate … and more!
Example – using order of point insertion (data stored at nodes)

Example – using median (data stored at the leaves)
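The median strategy can be sketched recursively: sort on the current axis, put the median at the node, and recurse on the two halves with the axis alternating. This sketch stores data at internal nodes for brevity (the slides' leaf-based variant differs) and assumes distinct coordinate values; `build_kd` is an illustrative name.

```python
# Sketch of median-based KD-tree construction with alternating axes.

def build_kd(points, depth=0):
    if not points:
        return None
    axis = depth % 2                       # alternate x (0) and y (1)
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # the median point becomes this node
    return {
        'point': points[mid],
        'axis': axis,
        'left': build_kd(points[:mid], depth + 1),
        'right': build_kd(points[mid + 1:], depth + 1),
    }

pts = [(30, 40), (5, 25), (10, 12), (70, 70), (50, 30), (35, 45)]
tree = build_kd(pts)
print(tree['point'], tree['axis'])  # (35, 45) 0
```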
KD Tree – Exact Search
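Exact search descends the tree comparing only the splitting coordinate at each node. A self-contained sketch (the median builder is repeated here so the snippet runs on its own; names are illustrative, and distinct coordinate values are assumed so ties need no special handling):

```python
# Sketch of exact-match search in a median-built KD tree.

def build_kd(points, depth=0):
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {'point': points[mid], 'axis': axis,
            'left': build_kd(points[:mid], depth + 1),
            'right': build_kd(points[mid + 1:], depth + 1)}

def kd_search(node, target):
    while node is not None:
        if node['point'] == target:
            return True
        axis = node['axis']
        # descend by comparing only the splitting coordinate
        node = node['left'] if target[axis] < node['point'][axis] else node['right']
    return False

tree = build_kd([(30, 40), (5, 25), (10, 12), (70, 70), (50, 30), (35, 45)])
print(kd_search(tree, (50, 30)), kd_search(tree, (51, 30)))  # True False
```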
Nearest Neighbor with KD Trees

• Traverse the tree, looking for the rectangle that contains the query.
• Explore the branch of the tree that is closest to the query point first.
• When we reach a leaf, compute the distance to each point in the node.
• Then, backtrack and try the other branch at each node visited.
• Each time a new closest node is found, we can update the distance bounds.
• Using the distance bounds and the bounds of the data below each node, we can prune parts of the tree that could NOT include the nearest neighbor.
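The descend-then-backtrack procedure above can be sketched as follows (the median builder is repeated so the snippet is self-contained; names are illustrative). The pruning test checks whether the splitting plane is closer to the query than the current best distance; only then can the far branch contain a better point.

```python
# Sketch of backtracking nearest-neighbor search with pruning on a KD tree.
import math

def build_kd(points, depth=0):
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {'point': points[mid], 'axis': axis,
            'left': build_kd(points[:mid], depth + 1),
            'right': build_kd(points[mid + 1:], depth + 1)}

def kd_nearest(node, q, best=None):
    """Returns (distance, point); best tracks the current distance bound."""
    if node is None:
        return best
    d = math.dist(node['point'], q)
    if best is None or d < best[0]:
        best = (d, node['point'])
    axis = node['axis']
    near, far = ((node['left'], node['right'])
                 if q[axis] < node['point'][axis]
                 else (node['right'], node['left']))
    best = kd_nearest(near, q, best)      # explore the closest branch first
    # prune: visit the far branch only if the splitting plane is within the bound
    if abs(q[axis] - node['point'][axis]) < best[0]:
        best = kd_nearest(far, q, best)
    return best

tree = build_kd([(30, 40), (5, 25), (10, 12), (70, 70), (50, 30), (35, 45)])
print(kd_nearest(tree, (40, 50)))
```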
"Curse" of dimensionality
• Much real-world data is high dimensional.
• Quad trees and KD-trees are not suitable for efficiently finding the nearest neighbor in high-dimensional spaces: searching is exponential in d.
• As d grows large, this quickly becomes intractable.
Dimensionality Reduction
Idea: find a mapping T to reduce the dimensionality of the data.
Drawback: we may not be able to find all similar objects (i.e., distance relationships might not be preserved).
Locality Sensitive Hashing
• Hash the high-dimensional points down to a smaller space.

• Use a family of hash functions such that close points tend to hash to the same bucket.

• Put all points of P in their buckets. Ideally we want the query q to find its nearest neighbor in its bucket.
Locality-Sensitive Hashing

• Hash functions are locality-sensitive if, for a random hash function h and for any pair of points p, q we have:
  – Pr[h(p)=h(q)] is "high" if p is "close" to q
  – Pr[h(p)=h(q)] is "low" if p is "far" from q
Do such functions exist?
• Consider the hypercube, i.e.,
  – points from {0,1}^d
  – Hamming distance D(p,q) = # positions on which p and q differ
• Define the hash function h by choosing a set I of k random coordinates, and setting
  h(p) = projection of p on I
Example
• Take
  – d = 10, p = 0101110010
  – k = 2, I = {2, 5}
• Then h(p) = 11

One can show that this function is locality sensitive.
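The projection hash is a one-liner once I is fixed. A sketch reproducing the worked example (the `make_hash` helper and its seeding are mine, not from the slides; coordinates are 1-indexed as in the example):

```python
# Sketch of the coordinate-projection hash for points in {0,1}^d.
import random

def make_hash(d, k, seed=None):
    """Pick a set I of k random coordinates; h(p) projects p on I."""
    rng = random.Random(seed)
    I = sorted(rng.sample(range(1, d + 1), k))   # 1-indexed coordinates
    def h(p):
        return ''.join(p[i - 1] for i in I)
    return h, I

# Reproducing the worked example with I fixed to {2, 5}:
p = "0101110010"
h_p = ''.join(p[i - 1] for i in [2, 5])
print(h_p)  # 11
```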


Another example
Divide the space using randomly chosen hyperplanes;
each hyperplane separates the space into two halves (see the next page for an example).
Locality Sensitive Hashing

• Take random projections of the data.
• Quantize each projection with a few bits.

(Figure: an input vector is projected onto random directions; each projection contributes one bit of the code, e.g. 101… [Fergus et al.])
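The random-hyperplane variant can be sketched with one bit per hyperplane: the sign of the dot product with a random direction. This is an illustrative sketch (function names and the Gaussian choice of directions are assumptions, and a fixed seed makes it deterministic); a point and its negation get complementary codes, while nearby points tend to share most bits.

```python
# Sketch of random-hyperplane LSH: one sign bit per random direction.
import random

def make_hyperplane_hash(dim, bits, seed=0):
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
    def h(v):
        # bit = 1 iff v lies on the non-negative side of the hyperplane
        return ''.join('1' if sum(a * b for a, b in zip(w, v)) >= 0 else '0'
                       for w in planes)
    return h

h = make_hyperplane_hash(dim=3, bits=8)
u = [1.0, 0.9, 1.1]
v = [1.0, 1.0, 1.0]          # close in direction to u: codes mostly agree
w = [-1.0, -1.0, -1.0]       # opposite direction: every sign bit flips
print(h(u), h(v), h(w))
```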
How to search from the hash table?

• A set of N data points Xi is hashed by h_{r1…rk} into a hash table.
• For a new query Q, compute its hash code (e.g., 110101) with the same h_{r1…rk} and search the hash table for a small set of candidate images (<< N of them).

[Kristen Grauman et al.]


COL106: Data Structures and Algorithms

Lecture 26

Keerti Choudhary
Department of Computer Science, IIT Delhi
Today's Lecture

Revisit 234 Trees used in Assignment 4

Some Resources:

234 Trees: https://people.ksp.sk/~kuko/gnarley-trees/234tree.html

Trees: https://people.ksp.sk/~kuko/gnarley-trees/Intro.html
234 Trees
Property 1: Every node is one of the following types:

• 2-node: has one data element, and zero or two children
• 3-node: has two data elements, and zero or three children
• 4-node: has three data elements, and zero or four children

      2-node            3-node               4-node
        a                 a b                 a b c
      N1 N2            N1 N2 N3           N1 N2 N3 N4

Subtree key ranges: for a 2-node, N1 holds keys < a and N2 keys > a; for a 3-node, N1 < a, N2 in [a, b], N3 > b; for a 4-node, N1 < a, N2 in [a, b], N3 in [b, c], N4 > c.
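The node types above differ only in how many sorted keys they carry, so a single class covers all three. A sketch with illustrative names (this is not the assignment's API; a k-node holds k−1 keys and, if internal, k children):

```python
# Sketch of a 234-tree node: 1-3 sorted keys, one more child than keys.

class Node234:
    def __init__(self, keys, children=None):
        assert 1 <= len(keys) <= 3 and keys == sorted(keys)
        assert children is None or len(children) == len(keys) + 1
        self.keys = keys
        self.children = children or []   # empty list => leaf

    def kind(self):
        return f"{len(self.keys) + 1}-node"

leaf = Node234([11, 12])
root = Node234([30, 50],
               [Node234([5, 10, 15]), Node234([40]), Node234([55, 65])])
print(root.kind(), leaf.kind())  # 3-node 3-node
```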


234 Trees
Property 2: All leaves are at the same level.

30 50

5 10 15 40 55 65

4 8 11 12 20 35 45 55 60 70
Trees with a Single Overflow
Definition: contains exactly one 5-node (a node with four data elements).

30 50
5-node

5 10 15 25 40 55 65

4 8 11 12 20 28 35 45 55 60 70
Why study Overflow?
Reason : We can get a 5-Node after insertions

30 50

5 10 15 40 55 65

4 8 11 12 20 35 45 55 60 70

20 22 25 28

5-node obtained after insertion of 22, 25, 28
Handling Insertion

Algorithm:

Suppose a key k needs to be inserted.

1. The first step is to find the leaf node, say x, that can accommodate key k without breaking the sorted ordering. This takes O(log n) time.

2. After inserting key k, the node x transitions from either 2-node to 3-node, or 3-node to 4-node, or 4-node to 5-node.

3. If x becomes a 5-node, then the insertion problem reduces to that of fixing a tree with a single overflow.

Question: How to fix trees with a single overflow?
Handling Overflow

Case 1: The parent node is originally a 2-node [x].
The 5-node child [a b c d] splits into [a] and [c d]; the key b is promoted into the parent, which becomes the 3-node [b x]. The subtrees T<a, T[a,b], T[b,c], T[c,d], T>d are distributed accordingly, and T>x stays with the parent.

Case 2: The parent node is originally a 3-node [x y].
The child splits the same way, and promoting b makes the parent the 4-node [x b y].

Case 3: The parent node is originally a 4-node [x y z].
Promoting b makes the parent the 5-node [x b y z]; the overflow is transferred to the parent node.
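All three cases share the same split step; only the parent's resulting size differs. A minimal sketch working on key lists at the leaf level (the `split_overflow` function and list representation are illustrative, not the assignment's API):

```python
# Sketch of fixing a single overflow: split the 5-node [a, b, c, d]
# into [a] and [c, d], promoting b into the parent.

def split_overflow(parent_keys, child_keys):
    """child_keys is an overflowing 5-node's four keys, in order.
    Returns the new parent keys and the two replacement children."""
    a, b, c, d = child_keys
    left, right = [a], [c, d]               # the 5-node splits in two
    new_parent = sorted(parent_keys + [b])  # b is promoted to the parent
    return new_parent, left, right

# Case 2 flavour: parent [x, y] = [20, 40], overflowing child [22, 25, 28, 33]
parent, left, right = split_overflow([20, 40], [22, 25, 28, 33])
print(parent, left, right)  # [20, 25, 40] [22] [28, 33]
```

If the returned parent now has four keys itself (Case 3), the same step is applied one level up, which is exactly the cascading described in the conclusion.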
Handling Overflow
Conclusion:

Handling overflow takes O(log n) time in the worst case, since splits may cascade up the tree.

Furthermore, if in the process the overflow reaches the root, then the height of the tree increases by one.
Trees with a Single Underflow
Definition: contains exactly one "1-node" (a node with zero data elements).

30 50

5 10 15 1-node 55 65

4 8 11 12 20 40 55 60 70
Why study Underflow?
Reason : We can get a 1-Node after deletions

(originally 30)

35 50

5 10 15 40 55 65

4 8 11 12 20 35 45 55 60 70

1-node obtained after deletion of 30
Algorithm:

1. Suppose a key k needs to be deleted. We first search for k. If k is not at a leaf, then we go to its successor node (say x) and swap.

2. Because the successor is at a leaf, we may get a 1-node at the leaf level.
Handling Underflow

Case 1: All siblings are 2-nodes, and the parent is either a 3-node or a 4-node.
Merge: for a parent [b c] with children [a], [×] (empty), and [d], the key c is pulled down and merged with the sibling, giving parent [b] with children [a] and [c d]. The subtrees T<a, T[a,b], T[b,c], T[c,d], T>d are redistributed accordingly.
Handling Underflow

Case 2: At least one sibling is a 3-node or 4-node.
Borrow through the parent: for a parent [a c] with children [×] (empty), [b], and [d e], keys rotate so that the parent becomes [b d] with children [a], [c], and [e].

In this case the structure of the parent node remains intact.
Handling Underflow

Case 3: Parent and sibling are both 2-nodes.
For a parent [a] with children [×] (empty) and [b], the two children merge into the single node [a b] and the parent becomes empty, with subtrees T<a, T[a,b], T>b attached to the merged node.

In this case we have propagated the underflow to the parent node.
Handling Underflow
Conclusion:

Handling underflow takes O(log n) time in the worst case, since it may propagate up the tree.

Furthermore, if in the process the underflow reaches the root, then the height of the tree decreases by one.
