
Efficient Historical R-trees

Yufei Tao and Dimitris Papadias
Department of Computer Science
Hong Kong University of Science and Technology
Clear Water Bay, Hong Kong
{taoyf, dimitris}@cs.ust.hk

Abstract

The Historical R-tree is a spatio-temporal access method aimed at the retrieval of window queries in the past. The concept behind the method is to keep an R-tree for each timestamp in history, but allow consecutive trees to share branches when the underlying objects do not change. New branches are only created to accommodate updates from the previous timestamp. Although existing implementations of HR-trees process timestamp (window) queries very efficiently, they are hardly applicable in practice due to excessive space requirements and poor interval query performance. This paper addresses these problems by proposing the HR+-tree, which occupies a small fraction of the space required for the corresponding HR-tree (for typical conditions about 20%), while improving interval query performance several times. Our claims are supported by extensive experimental evaluation.

1. Introduction

The most fundamental type of query in spatial databases is the window query, which retrieves all objects that intersect a window specified by the user. In spatio-temporal databases, due to the inclusion of temporal information, there exist two types of window queries: (i) timestamp (or timeslice) queries, which retrieve all objects that intersect a window at a specific timestamp, and (ii) interval queries, which involve several continuous timestamps. Since window queries, especially timestamp queries, are usually the building blocks for other more sophisticated operations, their efficient processing is vital to overall system performance. Supporting such queries in spatio-temporal databases demands new querying languages, modeling methods, novel attribute representations [5], and, very importantly, new access methods [14].

Considerable work has been done on indexing static spatial objects [6]. Probably the most popular index is the R-tree [7], a balanced structure that clusters objects by their spatial proximity; R-tree variants are currently incorporated in many commercial DBMS. A straightforward solution towards indexing spatio-temporal data is to create an R-tree for each timestamp in history. Such an approach certainly achieves excellent performance for timestamp queries, as they degenerate into traditional window queries. The obvious disadvantage, however, is the excessive space required to store all the trees. In fact, it is not necessary to preserve a complete tree for each timestamp, because consecutive trees may have many identical branches. This is especially true if only a small percentage of the objects move at each timestamp.

The MR-tree [16] is the first structure that takes advantage of this observation. In MR-trees, consecutive trees share branches when the underlying objects do not move, and new branches are only created to accommodate changes from the previous timestamp. The first concrete update algorithms were presented in [9], which proposed the HR-tree based on the same idea. No experimental evaluation was available for these methods until [10] compared the HR-tree with several 3D R-tree implementations (in 3D R-trees, time is incorporated as an extra dimension). It was revealed that HR-trees outperform 3D R-trees on timestamp and short-interval queries. This is due to the fact that the timestamp query performance of 3D R-trees does not depend on the number of live entries at the query timestamp, but on the total number of entries in history: since all objects are indexed by a single tree, its size and height are expected to be larger than those of the corresponding HR-tree at the query timestamp. However, the space requirements of HR-trees are still prohibitive in practice, because for most typical datasets HR-trees almost degenerate to independent R-trees, one for each timestamp. Furthermore, their performance deteriorates very fast for interval queries as the interval length increases.

Given the fundamental importance of these queries in any system that deals with historical information retrieval, a space-efficient method that performs satisfactorily on both timestamp and interval queries is necessary. In this paper we propose the HR+-tree, a new access method that overcomes the disadvantages of HR-trees and outperforms them significantly with respect to both space requirements and query performance. For typical conditions the new method consumes less than 20% of the space required by the corresponding HR-tree, yet answers interval queries several times faster.

The rest of the paper is organized as follows. Section 2 introduces the HR-tree and analyzes its problems. Section 3 presents HR+-trees and the corresponding update and query processing algorithms. Section 4 contains an extensive experimental evaluation, and Section 5 summarizes the contributions and provides directions for future work.

2. Historical R-trees

Historical R-trees (HR-trees) [9] are based on the overlapping technique [4, 12], which transforms a single-version data structure into a partially persistent one. The structure maintains an R-tree for each timestamp in history, but common branches of consecutive trees are stored only once in order to save space; positive and negative pointers distinguish exclusive from shared nodes. Figure 1 illustrates part of an HR-tree for timestamps 0 and 1: node A0 is shared by both trees, meaning that its content has not changed during these timestamps, whereas the path containing the updated entry is duplicated.

Figure 1: Example of HR-tree

Updates are applied to the current tree only; the trees of previous timestamps are never modified. When the new version e1 of object e0 is inserted at timestamp 1, the leaf that receives the entry is found by applying the R*-tree choose subtree algorithm (all implementations in this paper are R*-trees [2], since they are considered the most efficient R-tree variant). If a node on the insertion path is shared by some earlier tree, it is duplicated and the entry is inserted into the new copy; in Figure 1, e1 is placed in D1, a copy of D0. The change is propagated up to the root of the current tree (causing the creation of B1 and C1), and the negative pointers (e.g., to B0) are replaced with positive ones (to B1). Thus, even if only one object changes its position, the entire insertion path may need to be duplicated.

A timestamp query is directed to the corresponding R-tree and search is performed inside this tree only; the query degenerates into a traditional window query and is handled very efficiently. An interval query involving several timestamps should search the corresponding trees of all the timestamps involved. In order to avoid multiple visits to the same node via different roots, the tree associated with the earliest timestamp is searched first and all (positive and negative) qualifying pointers are followed to the leaves; the trees of the other timestamps are then searched in chronological order by following only positive pointers. (Interval queries were not discussed in the original HR-tree papers [9, 10].)

Interval query performance, however, is seriously affected by the large size of the structure and by the fact that the HR-tree contains multiple copies of the same object at different timestamps although the object has not moved (e.g., a0, b0, c0 and d0 in Figure 1). We call this phenomenon version redundancy. Although to some extent version redundancy is unavoidable for maintaining satisfactory timestamp query performance, it is rather excessive in HR-trees (this is experimentally demonstrated in Section 4). Consider, for instance, a node capacity of 100 rectangles (a rather common value): even if less than 1% of the objects move between two consecutive timestamps, it is possible that the whole R-tree needs to be replicated, since a single moving object may cause the duplication of multiple nodes. Although it is claimed in [10] that the structure can achieve up to 33% space savings with respect to the naive multiple-tree implementation, the size is still prohibitive for practical applications. The underlying reason is that HR-trees do not allow entries of different versions (timestamps) to be placed in the same node, so for most typical datasets they almost degenerate to independent R-trees, one for each timestamp.

3. HR+-Trees

Like HR-trees, the HR+-tree is a partially persistent access method [12], in the sense that updates (insertions or deletions) can be applied to the current timestamp only. HR+-trees, however, break the constraint that a node may contain entries of a single timestamp, which harms space utilization and query performance in HR-trees; new nodes are created only when an overflow occurs, so space is utilized better. Like most transaction-time access methods, HR+-trees still permit some redundancy in order to achieve query efficiency, and appropriate data duplication is introduced to ensure that timestamp query performance does not degenerate, as will be elaborated in subsequent sections. Allowing several versions in the same node makes it necessary to store temporal information along with the entries (this is not needed in HR-trees).

Each entry has the form <S, tstart, tend, pointer>, where S denotes the MBR as defined in R-trees, and tstart (tend) represents the timestamp at which the entry is inserted (deleted). The lifespan of an entry is the semi-closed interval [tstart, tend), and an entry is alive at every timestamp contained in its lifespan. If an entry has not been deleted until the current time, its tend is marked as "*" (a reserved word meaning now-time). For leaf entries the pointer points to the actual record, while for intermediate entries it points to a node at the next level. In our implementation, instead of storing the actual timestamps (which require 4 bytes each with a standard integer implementation), we keep relative timestamps in the range [0, 255], which occupy only 1 byte each: the relative time is the actual time minus the creation time of the node, and the creation time (4 bytes) is stored once per node. With this mechanism the fanout of HR+-tree nodes is only about 8% smaller than that of HR-tree nodes.

Part of an HR+-tree is shown in Figure 2 (the node capacity is 10 for all the following examples). The leaf node C contains entries of two versions, i.e., timestamps 0 and 1: entries a0, b0, c0 and d0 were inserted at timestamp 0, while e1, f1 and g1 were inserted at timestamp 1, so node C is shared by the two trees. Entries u and v are the parent entries of C at timestamps 0 and 1 respectively: u bounds the objects alive at timestamp 0 and v those alive at timestamp 1, so objects that have already been deleted are "invisible" to v. Entry d0 is still alive at timestamp 1, which indicates that d0 does not change its position at timestamp 1.

Figure 2: HR+-tree and HR-tree

Figure 3 lists the temporal information stored with the entries of Figure 2 (the current timestamp is 1); for example, <a, 0, *> denotes that a was inserted at timestamp 0 and is still alive, while the intermediate entries for node C are <u, 0, 1, C> and <v, 1, *, C>. The subscript of an entry in the figures denotes its insertion timestamp.

Figure 3: Entry information in the HR+-tree

The inclusion of different versions in the same node calls for some method to distinguish these versions (e.g., to separate a0, b0 and c0 from the rest of the entries in node C during a query). In order to ensure that timestamp query performance does not degenerate, it is necessary to guarantee that each node contains a minimum number of entries alive at any given timestamp of its lifespan. Motivated by the multi-version B-tree [1], we require every node of the HR+-tree to satisfy the weak version condition: for each timestamp in a node's lifespan, the node must contain either none or at least B·Pweak live entries, where B is the node capacity and Pweak is a tree parameter ranging from 0 to 1. The tree construction algorithms (described shortly) maintain this condition and group temporally adjacent entries together, so that the processing of a timestamp or short-interval query accesses only a small number of nodes.
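To make the entry format and the weak version condition concrete, the following minimal Python sketch (ours, not part of the paper; class and function names are illustrative) models an entry with a semi-closed lifespan [tstart, tend) and checks whether a set of entries satisfies the condition at a given timestamp.

    NOW = None  # stands for the paper's reserved value "*" (entry still alive)

    class Entry:
        def __init__(self, mbr, t_start, t_end=NOW, pointer=None):
            self.mbr = mbr          # (x1, y1, x2, y2), the MBR S
            self.t_start = t_start  # insertion timestamp
            self.t_end = t_end      # deletion timestamp, or NOW if still alive
            self.pointer = pointer  # record id (leaf entry) or child node (intermediate entry)

        def alive_at(self, t):
            # the lifespan is the semi-closed interval [t_start, t_end)
            return self.t_start <= t and (self.t_end is NOW or t < self.t_end)

    def satisfies_weak_version_condition(entries, capacity, p_weak, timestamp):
        """Either no entry or at least capacity * p_weak entries must be alive."""
        live = sum(1 for e in entries if e.alive_at(timestamp))
        return live == 0 or live >= capacity * p_weak

For instance, with the values used later in the experiments (B = 46, Pweak = 0.4), every node must hold either zero or roughly 18 live entries at each timestamp of its lifespan.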

3.1 Insertion algorithms and overflow handling

Similar to R*-trees, the insertion procedure of HR+-trees includes three main steps: (i) the leaf node to accommodate the new entry is located by the choose subtree algorithm; (ii) if entering the new entry causes the node to overflow, the treat overflow function is called; (iii) the information along the insertion path is adjusted to reflect the changes. In the sequel we describe these algorithms in detail and elaborate on their differences from those of R-trees.

3.1.1 The choose subtree algorithm. The choose subtree algorithm determines the leaf node that will receive the new entry. Starting from the root of the current tree, the branch to be followed at each level is selected among the live entries of the node similarly to R*-trees: (i) if the node is at the level just above the leaves, the entry that incurs the minimum overlap enlargement is chosen; (ii) if it is at a higher level, the entry leading to the minimum area enlargement is chosen (ties are resolved as described in [2]).

A difference from R*-trees is that the selected entry may be duplicated in order to guarantee good timestamp query performance; an object that is clustered well with other objects at some timestamp may not necessarily be clustered well at other timestamps. In Figure 4, entry u spatially bounds the entries a, b, c, d. If the new entry falls inside the extent of the chosen subtree (e.g., the insertion of h into D does not enlarge D's MBR), the insertion does not incur any structural changes at this level. If, however, the insertion causes enlargement, a new entry u' with the enlarged MBR and tstart = 2 is created, while u is logically deleted (its tend is set to 2) and keeps bounding the entries alive in the interval [0, 2). In this way, the entries at every timestamp are bounded by the tightest possible MBR. Figure 5 describes the choose subtree algorithm formally.

Figure 4: Duplicating an intermediate entry

Algorithm Choose Subtree (new_entry)
1. let N be the root associated with the current timestamp
2. if N is a leaf, return N
3. let S = {all the live entries in N}
4. if N is just above the leaf level, select the entry e in S for which inserting new_entry incurs the minimum overlap enlargement
5. else select the entry e in S for which the minimum area enlargement is necessary
6. if e.MBR must be enlarged and e.tstart < the current timestamp, insert a copy e' of e (with the enlarged MBR) into N, set e'.tstart to the current time and e.tend to the current time (delete e)
7. set N to the child pointed to by e (or e', if created) and go to line 2

Figure 5: Algorithm choose subtree

3.1.2 Handling overflows. An overflow occurs when an entry is inserted into a node that already contains the maximum number of entries. Unlike conventional R-trees, which split the overflowing node into two new ones optimising spatial criteria such as overlap, area and margin, HR+-trees, like most transaction-time access methods [8, 15], perform a version split. A version split generates a new node B containing the live entries of the overflowing node A, with the following modifications: (i) all the entries in B have tstart equal to the current timestamp; (ii) the entries that used to be alive in node A are logically deleted (their tend is set to the current timestamp); (iii) the entries of A that were inserted at the current timestamp are physically deleted from A (otherwise their tstart would equal their tend, i.e., their lifespans would be empty). In Figure 6, node A overflows at timestamp 2 during the insertion of entry h and is version split; the entries that are dead at timestamp 2 (e.g., s and t) are not copied, so dead branches are eliminated from the current tree, while the live entries (e.g., e, h and i) are duplicated in both the new and the old node although the corresponding objects remain static.

Figure 6: Example of version split

Version splits therefore result in version redundancy: an object that remains static at a certain position for a number of timestamps should ideally be represented by a single record, yet it may now appear in several nodes. In addition to the space overhead, version redundancy complicates query processing, since it may cause duplicate visits to the same node (for intermediate-level copies) or duplicate reports of the same result (for leaf-level copies). We discuss how to avoid these problems in Section 3.3.

The new node created by a version split may be almost full, so that a few insertions at subsequent timestamps would cause it to version split again, resulting in more redundancy. To avoid this situation we introduce the strong version overflow (also motivated by related work on temporal data structures [1]), which occurs when the node produced by a version split contains more than B·PSVO entries, where B denotes the node capacity and PSVO is a tree parameter. Strong version overflows are treated by key splits: a key split simply distributes the entries of the node into two new ones, optimising the spatial criteria of R*-trees, and, because all these entries were inserted at the current timestamp, it is time-independent and does not incur version redundancy. A key split is also applied directly, without a preceding version split, when all the entries of an overflowing node were inserted at the current timestamp. Note that, unlike R*-trees, no re-insertions are attempted the first time a node overflows, because re-insertions can potentially lead to version splits in other nodes (the same approach is followed in our implementation of HR-trees). In Figure 7 (PSVO = 0.85), the insertion of entry k at timestamp 2 causes node D, selected by the choose subtree algorithm, to overflow; the version split produces a node with more than B·PSVO entries, i.e., a strong version overflow, and the node is therefore key split into nodes B and C (Figure 7(c)).

Figure 7: Example of strong version overflow
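As an illustration of rules (i)-(iii) above, the sketch below (an assumption-laden reading of the text, reusing the Entry class from the earlier sketch) version splits a node's entry list at the current timestamp and tests whether the resulting node incurs a strong version overflow.

    import copy

    def version_split(node_entries, now):
        """Split a node's entry list at time `now`; returns (old_entries, new_entries)."""
        old_entries, new_entries = [], []
        for e in node_entries:
            if e.alive_at(now):
                dup = copy.copy(e)
                dup.t_start = now            # (i) copies in the new node start their life now
                new_entries.append(dup)
            if e.t_start == now:
                continue                     # (iii) entries born now are physically removed from the old node
            if e.alive_at(now):
                e.t_end = now                # (ii) the surviving originals are logically deleted
            old_entries.append(e)            # dead entries stay in the old node untouched
        return old_entries, new_entries

    def strong_version_overflow(new_entries, capacity, p_svo):
        # the freshly created node must be key split if it holds more than B * P_svo entries
        return len(new_entries) > capacity * p_svo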
Figure 8 describes the treat overflow algorithm formally; the algorithm key split is omitted because it is exactly the same as the R*-tree split.

Algorithm Treat Overflow
1. set N to the node that incurs the overflow
2. if all entries in N were inserted at the current timestamp, key split N into itself and a new node N1 and go to line 7
3. create a new node N1
4. for each live entry e in N, duplicate e into N1 with insertion time set to the current time
5. if e was inserted at the current timestamp, physically delete e from N; otherwise set e.tend to the current time
6. if N1 strong version overflows, key split it into N1 and N2
7. insert entries for N1 (and N2, if created) into the parent node to reflect the changes

Figure 8: Algorithm treat overflow

There can be multiple roots in an HR+-tree, so a root table is maintained to record the corresponding roots for different timestamps. Each root has a jurisdiction interval, which is the minimum bounding lifespan of all the entries in the root; the root table maps every timestamp in history to the root whose jurisdiction interval covers it. When the root of the current tree incurs a version split, a new root is generated and a new entry is created in the root table; when it incurs a key split, a new root containing entries for the two resulting nodes is created one level higher, as shown in Figure 9, where the jurisdiction interval of R contains timestamps 0 and 1 and R' becomes the root for the current timestamp. Note that the root table of an HR+-tree is smaller than that of an HR-tree, since each root is now responsible for multiple timestamps.

Figure 9: Creating a new root

Figure 10 presents the complete insertion algorithm.

Algorithm Insert (new_entry)
1. call choose subtree to locate the leaf node NL to accommodate new_entry
2. if NL overflows after entering new_entry, invoke treat overflow
3. ascend to the root; for each node NI in the path, adjust (insert, if necessary) the appropriate entries to reflect the changes that occurred in the levels below
4. if NI overflows, invoke treat overflow
5. if the root incurs a version split, create a new entry in the root table for this timestamp
6. if the root incurs a key split, create a new root (Figure 9)
7. update the most recent entry in the root table

Figure 10: Algorithm insert
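The root table described above can be pictured as a sorted list of (first timestamp, root) pairs, so that a timestamp query locates its root by binary search and an interval query collects every root whose jurisdiction interval overlaps the interval. The following sketch is only illustrative; the structure and method names are our assumptions.

    import bisect

    class RootTable:
        """Maps timestamps to logical roots via the first timestamp of each jurisdiction interval."""
        def __init__(self):
            self._starts = []   # first timestamp covered by each root
            self._roots = []    # the corresponding root nodes

        def new_root(self, root, start_timestamp):
            # called when a version split (or key split) of the current root creates a new one
            self._starts.append(start_timestamp)
            self._roots.append(root)

        def root_for(self, timestamp):
            # jurisdiction intervals partition history, so binary search suffices
            i = bisect.bisect_right(self._starts, timestamp) - 1
            return self._roots[i] if i >= 0 else None

        def roots_for_interval(self, t1, t2):
            # an interval query may have to visit several logical trees
            lo = max(bisect.bisect_right(self._starts, t1) - 1, 0)
            hi = bisect.bisect_right(self._starts, t2)
            return self._roots[lo:hi]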

3.2 Deletion algorithms and underflow handling

Deletion in HR+-trees also follows the framework of R-trees: (i) the algorithm find leaf identifies the leaf node containing the entry to be deleted; (ii) the entry is removed from the leaf and the entries along the deletion path are adjusted; (iii) underflows are handled whenever necessary. As with many transaction-time access methods, deletions are allowed for the current timestamp only, and only live entries can be deleted; a deletion is normally logical, i.e., the tend of the entry is set to the current time. Similar to insertions, deletions may also create new entries in order to ensure good query performance. In Figure 11, for example, we want to delete entry h from node D, which is reached by following entry u in node A; since the removal of h causes the MBR of D to decrease, a new entry u' is created to bound the entries alive at the current timestamp (all entries but h), while u is logically deleted.

Figure 11: Duplicating an entry during deletion

3.2.1 The find leaf algorithm. Similar to insertion, the search for the entry to be deleted is directed to the current tree. At each level, only the live entries whose MBR contains the MBR of the target entry and whose lifespan contains its lifespan need to be followed. Figure 12 presents the formal description.

Algorithm Find Leaf (N, entry_to_find)
1. if N is a leaf, return N if it contains entry_to_find, else return not found
2. for each live entry e in N
3. if e.MBR contains entry_to_find.MBR and e.lifespan contains entry_to_find.lifespan, call find leaf on the node pointed to by e
4. if entry_to_find was found by the recursion, return it
5. return not found

Figure 12: Algorithm find leaf
3.2.2 Handling underflows. An underflow occurs as a consequence of a violation of the weak version condition, i.e., when a deletion leaves a node with fewer than B·Pweak (but more than zero) live entries. In the sequel we describe three alternatives for handling underflows.

The first approach, derived directly from R*-trees, reinserts the live entries of the node that underflows immediately. In Figure 13, the deletion of entry i leaves node A with only three live entries (g, h and j), violating the weak version condition (Pweak = 0.4); the three live entries are reinserted with tstart set to the current time, while their original entries are deleted from A. Entry j is physically deleted from node A because it was inserted at the current timestamp, whereas g and h are logically deleted. Entry re-insertion may lead to version redundancy and should therefore be minimised.

Figure 13: Reinserting entries in a node

The second alternative is based on the observation that a node which underflows may obtain enough live entries again through subsequent insertions at the same timestamp. When an underflow happens we do not handle it immediately, but simply add the node to a linked list storing all the nodes that incurred underflows at this timestamp. Before processing the first record of the next timestamp, we check each node in the linked list and reinsert its live entries only if the underflow has not been recovered. Because underflows are repaired at the end of a timestamp, re-insertion (and hence version redundancy) is reduced.

The third approach does not apply re-insertion at all, but tries to merge the node that underflows with a sibling node. The sibling must be live (i.e., contain some live entries), be a child of the same father, and minimise the area enlargement of the merged node. Merging is similar to performing version splits on both nodes, the difference being that the live entries of the two nodes are inserted into a single new node; if that node incurs a strong version overflow, it is key split. Figure 14 demonstrates a leaf node A that underflows after the deletion of entry i; node B is identified for merging with A, and their live entries are placed in a new node C. Figure 15 describes the treat underflow algorithm based on merging (the formal descriptions of the first two approaches are omitted since they are relatively straightforward).

Figure 14: Use merging to handle underflows

Algorithm Treat Underflow
1. let N be the node that underflows and let P be its parent
2. S = {the child nodes (other than N) pointed to by the live entries in P}
3. find the node Ns such that, among all the nodes in S, merging Ns with N gives the minimum area enlargement
4. create a node T1 containing the live entries of Ns and N, and set the tstart of these entries to the current time
5. if T1 strong version overflows, key split it into T1 and T2
6. logically delete the live entries of N and Ns (set their tend to the current time); entries inserted at the current timestamp are physically deleted, and a node all of whose entries were inserted at the current time is discarded
7. create entries in P for T1 (and T2, if created) and modify (delete, if necessary) the entries pointing to N and Ns
8. if P underflows, repeat the procedure for P

Figure 15: Algorithm treat underflow
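The deferred re-insertion alternative (the second approach above, used by the HR+DFR variant in Section 4) can be sketched as follows; the node and tree interfaces are assumed placeholders rather than the authors' implementation.

    class DeferredUnderflowHandler:
        def __init__(self, capacity, p_weak):
            self.threshold = capacity * p_weak
            self.pending = []                    # nodes that underflowed at the current timestamp

        def report_underflow(self, node):
            self.pending.append(node)            # remember the node, do nothing yet

        def end_of_timestamp(self, tree, now):
            # called before the first record of the next timestamp is processed
            for node in self.pending:
                live = [e for e in node.entries if e.alive_at(now)]
                if 0 < len(live) < self.threshold:      # underflow not repaired by later insertions
                    for e in live:
                        if e.t_start == now:
                            node.entries.remove(e)      # born at this timestamp: physically delete
                        else:
                            e.t_end = now               # otherwise logically delete
                        tree.insert(e.mbr, now)         # re-insert as a fresh entry with t_start = now
            self.pending.clear()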

Figure 16 summarises the complete deletion algorithm.

Algorithm Delete (entry_to_del)
1. invoke find leaf to locate the node NL that contains entry_to_del
2. delete entry_to_del from NL and invoke treat underflow if NL underflows
3. ascend to the root; for each node NI in the path, adjust (insert or delete, if necessary) the entries in NI to reflect the changes in the lower level
4. if NI overflows, invoke treat overflow; if NI is not the root and underflows, invoke treat underflow
5. if the root has only one entry but is not a data page, make the child of the entry the new root
6. if the root incurs a version split, insert a new entry in the root table
7. update the most recent entry in the root table

Figure 16: Algorithm delete

3.3 Window query processing

Query processing in HR+-trees is similar to that of HR-trees. For timestamp queries, search is directed to the root whose jurisdiction interval covers the query timestamp and proceeds to the appropriate branches by considering both the spatial and the temporal extents of the entries; the query is thus answered within a single logical tree.

The processing of interval queries is more complicated. Since a node can be shared by multiple roots, it may be visited many times during an interval query (the method proposed to solve this problem in MVB-trees [3] cannot be applied in our case). In Figure 17, node C can be reached via entry u in node A and via entry v in node B; if node C has already been visited through u, following v is unnecessary. Such duplicate visits are caused by version splits and entry re-insertions at intermediate levels. Similarly, because of redundant leaf entries, an object can be reported more than once in the result of an interval query. We distinguish the redundant versions by using negative ids (similar to the negative pointers of HR-trees): all copies of an object have the same spatial extent but different lifespans, and we only report the copy that contains the first timestamp of the query interval. In Figure 17, for instance, one copy has a negative id and its lifespan starts at timestamp 2, implying that an earlier copy (that of node A) intersects the interval and is the one reported.

Figure 17: Duplicate visits in a query

One approach to avoiding duplicate visits is the following: whenever a new entry pointing to the same child node is created from an old one, the spatial extent of the old entry is stored in the new one. If, for example, information about entry u is stored in v, we can decide whether to visit node C from v by checking whether entry u intersects the query window. Storing such information, however, significantly lowers the fanout of the tree.

Another solution is to perform (interval) queries in a breadth-first way. We start with the set of roots whose associated logical trees will be accessed. By examining the entries in these nodes, we can decide which nodes need to be visited at the next level; instead of accessing these nodes immediately, we save their block addresses and check for duplicates. Only when we have finished all the nodes at this level are the nodes at the next level searched, using the address information saved. With this approach duplicate visits are trivially avoided without any additional space overhead in the index, although an amount of memory is needed to maintain the block addresses. A priority heap is used to keep the addresses in memory, because such a heap supports search and update in logarithmic worst-case time (CPU time is negligible compared to I/O cost in most cases). The memory required depends on the maximum number of nodes accessed at one level of a query: each block address is represented by 4 bytes, so even a query that needs to access 1,000 leaf nodes (a really large query window) requires, assuming 1K pages, at most 5 pages at any time. When additional memory is needed to maintain the block addresses, pages are allocated from the buffer, and when less memory is required for this purpose, the pages are returned to the buffer; typical queries demand a very small fraction of the buffer pages, and we show in the next section that this memory overhead hardly affects performance.
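A possible rendering of the breadth-first interval search with duplicate elimination is sketched below. It keeps the next level in a dictionary keyed by node address (the paper uses a priority heap of block addresses; here Python object identity stands in for a disk address) and uses the absolute value of a leaf id to collapse redundant copies marked with negative ids. All names are illustrative assumptions.

    def intersects(r, w):
        return r[0] <= w[2] and w[0] <= r[2] and r[1] <= w[3] and w[1] <= r[3]

    def interval_query(roots, window, t1, t2):
        """Report the objects intersecting `window` at some timestamp in [t1, t2]."""
        results = set()
        level = {id(r): r for r in roots}            # object identity stands in for a block address
        while level:
            next_level = {}                          # duplicate child addresses collapse here
            for node in level.values():
                for e in node.entries:
                    overlaps_time = e.t_start <= t2 and (e.t_end is None or e.t_end > t1)
                    if not overlaps_time or not intersects(e.mbr, window):
                        continue
                    if node.is_leaf:
                        results.add(abs(e.pointer))  # negative ids mark redundant copies of the same object
                    else:
                        next_level[id(e.pointer)] = e.pointer
            level = next_level
        return results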

4. Experiments

In this section we compare HR+-trees with HR-trees through extensive experimentation. Due to the lack of real data, we generated synthetic datasets with real-world semantics using the GSTD method [13], which has been widely employed (e.g., [10, 11]) as a benchmarking environment for access methods dealing with moving points and regions. Each of the following datasets contains 10,000 regions with density 0.5, evolving over 100 timestamps in a spatial universe modeled as a unit square (coordinates are floats in the range 0 to 1). The objects' initial positions (i.e., at timestamp 0) follow a Gaussian distribution, and the objects move in such a way that they eventually tend to scatter uniformly across the universe. A dataset has agility p if, on average, p% of the objects change their positions at each timestamp; unless specifically stated, the agility of the datasets in the sequel is 5%.

The performance of the access methods is measured by running workloads. Each workload contains 500 queries with the same window area and interval length, generated in a completely random manner: (i) the extents of each query window are distributed uniformly in the spatial universe, and (ii) the starting point of each query's interval is distributed uniformly over the part of history that allows the whole interval to fit. The query areas correspond to 1%, 5% or 10% of the universe, and the intervals involve 1 (timestamp queries), 5, 10, 15 or 20 timestamps, i.e., up to 20% of the entire history. We refer to a workload as WRKLDarea,length, denoting its query area and interval length. The page size is set to 1,024 bytes for all experiments and an LRU buffer of 200 pages (200K bytes) is assumed.

HR+-trees involve two parameters, Pweak and PSVO. Small values for Pweak and PSVO reduce the number of underflows and version splits respectively, and hence avert version redundancy, leading to smaller trees and better interval query performance. On the other hand, lowering Pweak reduces node utilisation with respect to a single timestamp, so timestamp query performance is compromised, while lowering PSVO introduces more key splits. Notice also that PSVO must be at least twice as large as Pweak to guarantee that the weak version condition still holds after a key split caused by a strong version overflow. A set of experiments was performed to explore the optimal settings; the best overall performance was achieved for Pweak = 0.4 and PSVO = 0.85, and these values are used in the sequel.

Four HR+-tree versions were implemented. HR+NRML, HR+DFR and HR+MRG correspond to the underflow treatments based on immediate re-insertion, deferred re-insertion and merging, respectively (as described in Section 3.2.2). The last version, HR+PRE, handles underflows by immediate re-insertion but additionally stores the extents of the previous version in each intermediate entry (as described in Section 3.3); as a result, the node capacity for HR-trees, HR+NRML, HR+DFR, HR+MRG and HR+PRE is 50, 46, 46, 46 and 33 entries, respectively. Figure 18 compares the sizes of the four HR+-tree implementations (in Megabytes), while Figures 19(a) and (b) show the page accesses required for various interval lengths by workloads WRKLD1%,1~20 and WRKLD10%,1~20. The diagrams suggest that HR+MRG is less efficient than HR+NRML, and that HR+PRE performs noticeably worse than the other implementations, indicating that the speed-up achieved by storing the spatial extents of previous versions does not pay off due to the lower node capacity. HR+DFR (deferred re-insertion) is clearly the most efficient implementation, in terms of both space and query cost, and we use this version to represent HR+-trees in the remaining experiments, which compare HR-trees and HR+-trees on several aspects.

Figure 18: Size comparison of HR+-tree variations
Figure 19: Performance of HR+-tree implementations
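Under the stated rules, a workload such as WRKLD1%,10 could be generated along the following lines (a sketch with assumed parameter names; square windows and integer timestamps are our simplifications).

    import random

    def make_workload(area, length, history=100, n_queries=500, seed=0):
        """Generate n_queries (window, t_start, t_end) triples for WRKLD_area,length."""
        rng = random.Random(seed)
        side = area ** 0.5                      # square windows of the requested area (an assumption)
        workload = []
        for _ in range(n_queries):
            x = rng.uniform(0.0, 1.0 - side)
            y = rng.uniform(0.0, 1.0 - side)
            t1 = rng.randrange(0, history - length + 1)   # interval start uniform over the feasible range
            workload.append(((x, y, x + side, y + side), t1, t1 + length - 1))
        return workload

    # e.g. WRKLD_1%,10:  make_workload(area=0.01, length=10)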

Figure 20 shows the sizes of the two methods (in Megabytes) as a function of dataset agility. HR+-trees are much smaller than HR-trees: the size of HR-trees grows very fast with agility and appears to stabilize after about 6% agility, implying that at this point HR-trees have degenerated into individual R-trees, one for each timestamp. The stabilized size is around 33 Megabytes, about 100 times the size for agility 0% (static objects). HR+-trees, on the other hand, grow linearly with agility at a reasonable speed, although eventually (agility 100%) both methods tend to have similar sizes, as the high mobility of the data forces both structures to degenerate into individual trees.

Figure 20: Size comparison under different agilities

Figure 21 compares query performance using workloads whose windows cover 1% and 10% of the workspace (agility 5%); the page accesses are averaged over the number of queries in a workload. For timestamp queries, HR+-trees are 5% to 10% less efficient than HR-trees, which is within our expectation because the node capacity of HR+-trees is smaller. For interval queries, which are answered by searching the trees in the breadth-first manner of Section 3.3, HR+-trees outperform their competitor by a significant factor, and the difference increases with the interval length. This is not surprising considering the large difference between the sizes of the two structures.

Figure 21: Query Performance (agility 5%)

In some cases (e.g., batched workloads), queries can be sorted chronologically before being submitted to the system; this usually reduces the I/O in the presence of buffers, because queries with close temporal extents deploy similar parts of the index. In the following experiment the queries of each workload were sorted according to the starting time of their intervals. Figure 22 shows the page accesses of HR- and HR+-trees as a function of the interval length for workloads with 5%-area windows. Although the performance of both methods improves, the HR-tree has only marginal improvements, whereas the HR+-tree receives larger improvements for all interval lengths; its efficiency eventually improves by more than 50%.

Figure 22: Sorted queries (WRKLD5%, 1~20)

Finally, we investigated the performance of both methods when the cache size varies, setting it to 100, 1,000, 2,000, 3,000 and 4,000 pages. Figure 23 shows the results for workloads of randomized timestamp queries with windows covering 5% of the workspace. The improvement of the HR+-tree increases with the buffer size: for 1,000 or more buffer pages, the HR+-tree outperforms the HR-tree even on timestamp queries. This is reasonable because each logical tree in an HR+-tree is responsible for multiple timestamps, so adjacent timestamp queries are often answered within the same tree, which utilizes the buffer more efficiently.

Figure 23: Random timestamp queries / buffer size
5. Conclusions and Future Work

Although spatio-temporal databases have received extensive attention during the past few years, many problems remain unsolved, and various applications place very different demands on the indexing methods. In this paper we propose the HR+-tree, a time- and space-efficient method for the retrieval of historical information regarding moving regions and points. Compared to previous structures, HR+-trees have the following properties: (i) they consume a small fraction of the space required for the corresponding HR-trees (usually less than 20%), yet perform much better on interval queries; (ii) they inherit HR-trees' efficiency on timestamp queries; and (iii) the improvement increases with the buffer size. Furthermore, unlike 3D R-tree based methods [10], the HR+-tree does not assume that the data are known a priori, and can therefore be used as an on-line spatio-temporal access method.

Potential applications include urban planning and traffic management systems, for instance scenarios where region objects move at steady speeds. For such data, the most common method of handling timestamp information is to record the objects' positions at every timestamp; however, this would result in huge space requirements, and attempting to update the database whenever the objects change their positions will cause the STDBMS to spend most of its time just handling updates. A better solution may be to store information about the objects' motion patterns, such as their velocities. We are currently investigating solutions based on such ideas. Future work could also focus on the following issues: (i) accurate analytical cost models for HR+-trees; (ii) query algorithms that avoid all duplicate visits while incurring no memory overhead; and (iii) efficient algorithms for other operations (e.g., spatio-temporal joins) on HR+-trees.

Acknowledgments

This work was supported by the Research Grants Council of the Hong Kong SAR, grants HKUST 6090/99E and HKUST 6070/00E.

References

[1] Becker, B., Gschwind, S., Ohler, T., Seeger, B., Widmayer, P. An Asymptotically Optimal Multiversion B-Tree. VLDB Journal 5(4): 264-275, 1996.
[2] Beckmann, N., Kriegel, H., Schneider, R., Seeger, B. The R*-tree: an Efficient and Robust Access Method for Points and Rectangles. ACM SIGMOD, 1990.
[3] Bercken, J., Seeger, B. Query Processing Techniques for Multiversion Access Methods. VLDB, 1996.
[4] Driscoll, J., Sarnak, N., Sleator, D., Tarjan, R. Making Data Structures Persistent. Journal of Computer and System Science 38(1): 86-124, 1989.
[5] Forlizzi, L., Güting, R., Nardelli, E., Schneider, M. A Data Model and Data Structures for Moving Objects Databases. ACM SIGMOD, 2000.
[6] Gaede, V., Gunther, O. Multidimensional Access Methods. ACM Computing Surveys 30(2): 170-231, 1998.
[7] Guttman, A. R-trees: A Dynamic Index Structure for Spatial Searching. ACM SIGMOD, 1984.
[8] Lomet, D., Salzberg, B. Access Methods for Multiversion Data. ACM SIGMOD, 1989.
[9] Nascimento, M., Silva, J. Towards Historical R-trees. ACM SAC, 1998.
[10] Nascimento, M., Silva, J., Theodoridis, Y. Evaluation of Access Structures for Discretely Moving Points. International Workshop on Spatio-Temporal Database Management, 1999.
[11] Pfoser, D., Jensen, C., Theodoridis, Y. Novel Approaches to the Indexing of Moving Object Trajectories. VLDB, 2000.
[12] Salzberg, B., Tsotras, V. A Comparison of Access Methods for Temporal Data. ACM Computing Surveys 31(2): 158-221, 1999.
[13] Theodoridis, Y., Silva, J., Nascimento, M. On the Generation of Spatiotemporal Datasets. SSD, 1999.
[14] Theodoridis, Y., Sellis, T., Papadopoulos, A., Manolopoulos, Y. Specifications for Efficient Indexing in Spatiotemporal Databases. SSDBM, 1998.
[15] Varman, P., Verma, R. An Efficient Multiversion Access Structure. IEEE TKDE 9(3): 391-409, 1997.
[16] Xu, X., Han, J., Lu, W. RT-tree: An Improved R-tree Index Structure for Spatiotemporal Data. SSDBM, 1990.