Professional Documents
Culture Documents
LSD - Tree PDF
LSD - Tree PDF
- 45 -
dimtial2
4
100
*O 4 9
60
40
20 1
0e
d ioio
Figure 2.1: Possible partition of the data space for an LSD tree
- 46 -
n b b
a strategy is to split a cell into two cells of equal
areas.Note that this halving split strategyrelies on
the assumptionof a uniform distributionof the objects.
Since the LSD directory is a binary tree, any type of
split strategycan be implementedin an easy and efficient
manner. Note that data dependentsplit strategiescannot
be realized by data structuresbased on hashing (see e.g.
[HSW88al, [KS861,[KSSSI,lNI-IS841).
We now turn our attentionto the directory of the LSD
tree. As already mentionedin the previous section,if the
numberof nodes in the directory T exceedsthe maximal
possiblenumberof internal nodes,a subtreeof T is written
onto secondarystorage,i.e. stored in a directory page. In
sucha directorypagea subtreeis organixedas a sequential
heapof fixed height hr. Hence,wheneverthe height of the
associatedsubtreeexceedsh, after an additional insertion,
a directory page split has to be perfomxd. The directory
page split algorithm is simple: the left and right subtreeof
Figure 2.3: Overall structure of the LSD tree the root are storedin two distinct directory pagesand the
root is insertedinto the directory T by calling the directory
directory root insertion algorithm.
Wearenow in a positionto describehow a new nodeq,
P resultingfrom a bucketor a directory pagesplit, is inserted
split of into the directory T by the directory insertion algorithm.
The heart of this algorithm is the paging algorithm we
explain afterwards. In the following, we assumethat the
main memory capacity reservedfor the directory T is ni.
The directory insertion algorithm assuresthat the internal
prefix treeTi of the directoryT containsat most~-1 nodes.
uu lb10 casesmay occur:
Figure 2.4: Effect of a bucket split case1: The fatherp of the new nodeq is an internalnode.
The searchfor the bucketb which will receivethe new cast 1.1: The number of internal nodes is less than
object is guided by thedirectory as in kd trees. If b does ni-1. Then insert q into Ti and finish.
not overflow, the insertionis finished,otherwise the bucket case 1.2: The number of internal nodes is equal to
split algorithm createstwo new bucketsb and b, from b ni-1. Then insert q into Ti, cdl the paging
accordingto a split strategy we describeafterwards.In case algorithm for T, and finish. Note that after
of a bucket split the pointer in the directory referencingb the execution of the paging algorithm the
is changedto a pointer referencinga new directorynode q numberof internal nodesis at most b-1.
representingthe split decisionconcerningb, i.e. the new
nodeq is insertedinto the directoryby calling the directory case2: The fatherp of thenew nodeq is a nodein a subtree
insertion algorithm explainedlater. The new nodeq points TP of T storedin an externaldirectory page.
to the new bucketsbi and b,,. Figure 2.4 depictsthe effect
of a bucket split. case21: After the insertion of q the height of TP is
We distinguishbetweentwo inherently distinct types at most hr.; then finish.
of split strategies: case2.2: After the msertionof q the height of TP is
greaterthan hr. Then call the directory page
1. Data dependentsplit strategies split algorithm for TP and finish.
These strategiesdependonly on the objects storedin
the bucket to be split. A typical examplefor such a
strategyis to choosefor the split position the aver- The paging algorithm is called when after an insertion
age of all object coordinateswith respectto a certain of an additionalnode the size of the internal PE~IXtree Ti
dimension. reachesthe maximalpossiblenumberni of internal nodes.
2. Distribution dependentsplit strategies The algorithm searchesfor a subtreeTI in Ti such that
Thesestrategieschoosethe split dimensionandthe split pagingTI preservesthe externalbalancingproperty&fined
position independentlyof the actual objects storedin in section2.1. This property is preserved if T, is a paging
the bucket to be split. A typical examplefor such candidate, i.e. T fultllls the following properties: *
- 47 -
is scanneduntil the object searchedfor is located or the
searchends unsuccessfully.
233 Insertion
The insertionalgorithmhasbeenexplainedin detail in
the previous section.
233 Deletion
The deletion of a node in the directory T basically
.... ........ ........ ....... ..... ........ ........ ......
works inversely to the insertion of a node. Due to space
LJ limitationswe cannotdiscussthis topic in more detail.
Figure 2.5: Directory before and after paging subiree T, 23.4 Range query
1. Any path from the mot of TI down to a bucketcontains In a rangequery all points locatedin the query region
the minimal numberof externaldirectorypages(of all are reported. According to Fredman lFred801,a query
pat.hsinthedirectoryT). regionmaybe a rectangle(orthogond range query), a circle
2. The height of Ts is at most hr.
(circular range query) or a polygon@olygonalrange query).
In the following, we restrict the discussionto orthogonal
Figure 2.5 shows a directory before and after paging rangequeries,becausethealgorithmis the samefor all query
the subtreeTs. types,exceptfor the proceduresevaluatingwhethera data
If more than one paging candidateoccurs in Ti, the region is enclosedby, intersectedby, or disjoint from the
paging algorithm choosesa candidatewith the maximal queryregion. But thesearedetailsleft to the implementation
possiblenumber of nodes. level.
In order to direct the searchfor a paging candidatein In order to report all points locatedin the query region
Ti the following numbersare attachedto eachnodev in Ti: Q we haveto traversethe LSD tree to determineall buckets
n-(v), resp. nep&v),: the minimal, resp. maximal, whose associatedcells intersect Q. The query algorithm
number of external directory pagesoccurring on any movesdown the LSD tree branchingat eachdirectorynode
path in T containing v. w accordingto the following criteria: (Here D(w) denotes
s(v): the numberof nodesof the biggestpaging candidate the data region which is the union of all data cells whose
which can be reachedfrom v. correspondingbucketscan be reachedfrom w.)
h(v): The height of the subtreewith root v in Ti.
The paging algorithm movesdown the internal direc- 1. If&nD(rightsonofw) = 0,
tory Ti branchingat eachnodew on the searchpath accord- continuewith the left son of w.
ing to the following criteria: 2. IfQnD(leftsonofw) = 0,
continuewith the right son of w.
1. If neM(left son of w) # nw(right son of w), 3. Otherwisecontinuewith both sonsof w.
continuewith the son with lower rim.
2. If new(left son of w) = n~(right son of w), Note thatQ n D(w) # 0 is the invariantcondition of
continuewith the son with greaters. the loop of the query algorithm. Hence, in l., resp. 2..
C$; z; son of w) # 0, resp. Q rl D(right son of w)
The root r of a paging candidateT is determinedif , .
1. h(r)<h,and 3. The LSD tree for non-point objects
2. n-(r) = nep&r).
We explain the non-point situation for k-dimensional
The secondcondition assuresthat after pagingTs the exter- intervalswhich serveas boundingboxes for arbitrary geo-
nal balancingproperty is preservedfor T. metric objects in many applications. We restrict the dis-
It shouldbe clearthataftertheinsertionof a new nodeq cussionto the 2-dimensionalsituation,i.e. to rectanglesin
into the internaldirectoryTi, resp. the pagingof a subtreeTI the plane,becausea generalizationto higher dimensionsis
of Ti, the numbersnw, nep,,,,, s and h mustbe updated straightforward.
foreachnodewonthepathPfromtherootofTitothe To store a set of rectanglesin the LSD tree we use
fatherof q, msp. to the fatherof the root of Ts (now stored the transform&on technique (lI-Iin851, [SK883), i.e. 2-
in an externaldirectory page). Thesenumberscan easily dimensionalrectanglesare storedas 4dimensional points.
be recomputedfrom the existingnumbersof the nodes(and We choosethe simple corner representation[SK881which
their direct sons)on the path P. considersfor each of the two dimensionsthe lower and
2.3 The operations upperboundsof the rectanglesto be distinct dimensions.
The idea is simple but severalsevereproblemsarise
23.1 Exact match from this approach.First, there is a strong conelation be-
In an exact match operationthe directory is traversed tweenupperand lower bounds,becausefor eachdimension
until the correspondingbucket is determined. The bucket the upper bound of a rectangleis always greaterthan (or
- 48 -
equal to) the lower bound. Becauseof the correlationall
pointsare locatedin a triangularshapedsubspaceof the im-
age space.Furthermore,sincein almostall applicationsall
rectanglesare small comparedto the data space,the points
are locatedin a small strip abovethe diagonal.
Datastructureswhich rely on a rectangularshapeddata
spaceand partition the data space into rectangularcells
tend to degeneratefor such applications,especiallyif they
are basedon hashingtechniques. However, the LSD tree
overcomesthe drawbacksof the transformationtechnique
if a refined bucket split strategyis used. Since the split
strategyis crucial to the efficiencyof the LSD tree for non-
point objects,we devotethe next sectionto this topic.
3.1 The split strategies
Figure 3.1: Split positions achieved by two basic split strategies
In this section,we discusssplit strategiessuitablefor
the skew data distributionsinducedby the transformation SP2. This effect is desirablefor the usual situationwhere
technique.First of all, a suitablesplit strategymusttakeinto rectanglestend to be small comparedto the data spaceand
accountthe correlationbetweenlower and upperboundsof hencethe imagepoints tend to be locatedin a strip above
rectanglesin the original dimension I, resp, 2, stored in the diagonal. We performedsimulationswith other roots
dimensions1 and 2, resp. 3 and 4. but the lo* root behavedwell in all cases.
We will explain two different split strategies,a data
dependentand a distribution dependentone. The data 33 The operations
dependentsplit strategy is simple: For the split dimension In this section,we discussthe LSD tree operationsfor
under concern the averageover all coordinatesof objects non-point objects. The operationsexuct match, insertion
storedin the bucket to be split including the object to be and deletion are identical to the correspondingoperations
insertedis chosenas the split position. for 4dimensional points. In the case of a range query
The distribution dependentsplit strategy is a combi- the situationis different,becausethe original 2dimensional
nation of two basic (distributiondependent)split strategies query region and the 4dimensional image query region
each of them designedfor an extremesituation. The first differ substantiallybecauseof the different representations
split strategyrelies on the (fictitious) assumptionthat all of the objects.
rectanglesare degeneratedto points, i.e. the upper and In a range query, the query region can either be an
lower bounds coincide for each dimension. Here, all im- orthogonal rectangle,a circle or a polygon. Independentof
age points are locatedon the diagonalw.r.t the dimensions the threekinds
1 and 2, resp. 3 and 4. A suitablesplit strategyfor this case query types forofa query
set of
regionswe distinguishbetweentwo
rectangles% (see [SKSS]):
is to split the data cell into two cells containing equally
long parts of the diagonal. The split position achievedby 1. Rectangleintersection:
this split strategyis denotedbp SPr in Figure 3.1. GivenaqueryregionQfindallR~~s.t. QnR# 0.
The secondbasic split strategyrelies on the assump- 2. Rectangleenclosure:
tion that all imagepoints am uniformly distributedover the Given a queryregion Q tind all R E % s.t. R E Q.
triangular subspaceof the image spaceb&t from dimen- In contrast to the point situation, the algorithm for
sions 1 and 2, resp. 3 and 4. Here, a suitablesplit strategy circular and polygonal range queries is different from the
halvesthe datacell into two cells of equal areas.The split algorithm for orthogonalrange queries. However, due to
position achievedby this split strategyis denotedbp SPs spacelimitations of the paper we discussonly orthogonal
in Figure 3.1. range queries.
The split position SP calculatedby the combinedsplit Webeginexplainingthe rectangle intersection problem
strategyis the weightedsum of SPt and SP2: for orthogonal query regions. In this case,the original 2-
dimensionalquery region for rectanglespi, ut] x ps,us] is
SP = aSPI +(1 -a)SPs ) where transformedinto a 4dimensional query region for points.
For eachoriginal dimensiond E (1,2) we define
Then for the original query region pi, ui] x [1s,us] the im-
(Here Ld denotesthe lower and & the upperbound of the age tegion is given by cp(@t , ui]) x(p([ls, ui]). Figure 3.2
data spacewf.t the split dimensiond) illustratesthe transformationof a query interval &I, ttd].
The effect of the choice of Q is that for large data The areaof the.imageregion can be reducedby using
cells SPapproachesSPi while for smalIcelIsSPapproaches the transformationp instead of 9. Let & denote the
- 49 -
Ld d d Iowa bounds
- 50 -
storage utilizatim
zky
2,865 69.8 % 68.2 %
282 70.9 % 68.1%
2,838 70.5 % 68.8 %
Table 4.1: Size of the directory and storageutilization (directory page size = 512 bytes; max. number of internal nodes = 1OW)
- 51-
number of rectangle
-gl= distribution
unifoml
*,
skew
I
unifolm
skew
I lOO,OCHI skew
Table 4.2: Rang query pexformauce (directory page size = 512 bytes; max. number of internal no&s = 1000)
below 69% but still above 60% and hencenot bad at all. A The hit ratio is higher for smaller bucket capacities, be-
comparisonof the overall storageutilization and the bucket causeof the higher selectivity. For the data dependentsplit
utilization convincingly demonstratesthat the storagespace strategy, bucket capacity 5, skew distribution, and query
neededto accommodatethe directory is rather small com- type 2, the hit ratio is 66.4% if only bucket accessesare
pared to the data storage space. counted. This is nearly optimal with respect to a normal
We now turn our attention to the performance of the bucket utilization of 69.3%. For smaller query regions and
LSD tree operations. Clearly, in an exact match at most i larger bucket capacities the hit ratio deteriorates: Changing
directory pagesand one bucket must be read if the directory the bucket capacity to 50 yields 54.2%.
contains i external levels. The performanceof the insertion The performanceresults in the sorted situation can be
procedure is easy to estimate, too. For bucket capacity 5, explained by the fact that the data cells tend to be long and
resp. 50, we have between 4 and 5, resp. less than 3, ex- small for the dam dependentsplit strategy in this case.
ternal accesses(directory page and bucket I/G) per inserted For the remainder of this section we comparethe per-
object, if 100,000objects are inserted into an initially empty formance of the LSD tree and the multilayer gridfile [SW881
LSD tree irrespective of the split strategy used. using 5 layers (5L-GF for short). According to the multi-
Hence, we focus on range queries. We concentrateon layer philosophy layer 5 is implemented as a clipping grid
the intersection query becauseits performance is an upper file. We have inserted 50,000 uniform distributed rectan-
bound of the enclosure query performance. (Experiments gles in random order into both initially empty structures.
show that the enclosurequeriescan be carried out 10%faster The cover quotient is 2.5 and the bucket capacity varies
than intersection queries on the average.) Table 4.2 shows from 5 to 30 in stepsof 5. It turns out that the SL-GF is not
the average number of external accessesfor two types of able to work with bucket capacity 5. After the insertion of
range queries. For square regions of sixes 0.5% and 5% 20,974 rectangles a bucket of the clipping layer could not
of the size of the data space,we have performed 20 range be split becauseeachrectangle stored in this bucket covered
queries each, at random positions. the whole corresponding data cell. (For the skew data dis-
As with all other data structures larger query regions tribution the 5L-GF runs into a similar error situation even
lead to fewer disk accessesper found object, becausethe for bucket capacity 10.)
number of buckets completely contained in the query region Figure 4.3 showsthe bucket utilization for the LSD tree
grows faster than the number of buckets intersectedby the with the data dependent split strategy (LSD&J, the LSD
region boundary. 2 wiem~G~tribution dependentsplit strategy (LSD&,
Another important characteristic number is the hit ratio, .
defined as The range query performance is illustrated in Figures
4.4 and 4.5. We have used the samequery types as before.
number of objects found For the first, resp. secondtype, 308, resp. 2712, objectsare
selected on the average.
bucket capacity x disk accesses
- 52 -
height proves the efficiency of the LSD tree [HSW89]. At
*w . . . . . . . . . . . . . . . . . .. . the moment, we are implementing more general (spatial)
qsts
70% operations, like non-orthogonal range queries, point queries
[SK881 and queries where geometric as well as standard
attributes are qualified. Furthermore, we are embedding the
LSD tree as spatial accesspath into the geometric database
b system Gral [Gtit89]. Hence, an empirical study about the
5 10 15 20 25 30 bu&aapcity benefits of the LSD tree in such an environment can be
carried out in the near future.
Figure 4.3: Bucket utilization
A hit ratio References
[Ben751 Bentley, J.L.: Multidimensional Binary Search Trees Used in
Database Applications. Ccmmunications of the ACM, Vol. 18, 9,
509-517, 1975
[FSR87] Faloutsos,C., Sellis, T., Roussopoulos,N.: Analysis of Object
Oriented Spatial Access Methods, Proc. ACM SIGMOD Int. Conf.
on Management of Data, 426-439, 1987
zm-.......................... [Fred801Fredman, ML.: The Inherent Complexity of Dynamic Data
StructuresWhich AccommodateRange Queries, IEEE. CH1498-5/80.
3;;;;;; b 1980
5 10 1s 20 25 30 bwk~capcity [Frcc87] Freeston.M.: Jbe BANG file: a new kind of grid tile, Proc.
ACM SIGMOD Int. Conf. on Managementof Data, 260269. 1987
Figure 4.4: Range query performance (0.5% of the data space) [Gut841Guttman, A.: R-Trees: A Dynamic Index Structure for Spatial
Searching. Proc. ACM SIGMOD Jnt Conf. on Managementof Data,
47-57. 1984
[Gut891 Gitting. R.H.: Gral An Extensible Relational DatabaseSystem
for Geometric Applications, Pmt. lsth Int. Conf. on VLDB (1989).
to appear
[Him851Hiichs, K.: The Grid Fide System: Implementation and Case
Studiesof Apphcations, Doctoral Thesis No. 7734, ETH Zurich, 1985
[HSWSSa] Hutflesx, A., Six, H.-W.. Widmayer, P.: Globally Order
Preserving Multidbnensional Linear Hashing. Ptoc. IEEE 4* Jnt,
Cod. on Data Engineering. 572579. 1988
[HSW88b] Huttlesx, A., Six, H.-W., Widmayer, P.: Twin Grid Fiies:
Space Optimizing Access Schemes,Proc. ACM SIGMOD Jnt. Conf.
cu Management of Data, 183-190, 1988
Figure 4.5: Range query performance (5% of the data space) [HSW89] Hemich. A., Six, H.-W.. Widmayer.P.: Paging binary treeswith
external balancing, Proc. Jnt. Workshop on GraphtbeoreticConcepts
It turns out that the LSD tree with the data dependent in Computer Science (WG 89), Springer Lecture Notes in Comp.
Science, to appear
split strategy clearly outperforms the SL-GF while the LSD [KS861 Kriegel, H.-P., Seeger, B.: Multidimensional Order Preserving
tme with the distribution dependentsplit strategy is at least Linear Hashing with Partial Expansions, Proc. Jnt. Conf. on Database
as efficient as the SL-GF. We have not comparedthe exact lltcory. 203-220, 1986
match and insertion performancebecauseit is obvious that [KS881 Kriegel, H.-P., Seeger, B.: PLOP-Hashing: A Grid File without
the 5 layers of the SL-GF do not allow a competitive Directory, Proc. IEEE 4* Jnt. Conf. on Data Engineering, 369-376,
pXfOIRllUlCe. 1988
[KW85] Krishnamtuthy. R.. Whang. K.-Y.: Multilevel Grid Files, IBM
It should be noted that besidesits better overall perfor- ResearchRepott, Yorktown Heights, 1985
mance the LSD tree is much easier to implement than the [rZL88] Litwin. W., Zegour, D., Levy, G.: Multilevel Trie Hashing,
SL-GF and does not need an additional (completely differ- Proc. Jnt. Conference Extending Database Technology (BDBT 88).
ent) overtlow data structure for storing objects which do Springer Lecture Notes in Ccmp. Science, 309-335. 1988
not fit into the main structure. [NHSM] Nievergelt, J.. Hinterberger, H., Sevcik. K.C.: Ibe Grid File:
An Adapable Symmetric Multikey File Structure, ACM Transactions
5. Conclusion cm DatabaseSystems. Vol. 9. 1. 38-71. 1984
[Otoo86] Otoo, E.J.: Balanced Multidimensional Extendible Hash Tree.
We have proposed the LSD tree, a data structure sup- Proc. 9 ACM SlGACf / SIGMOD Symposium on Principles of
porting efficient spatial accessto geometricobjects. Its main DatabaseSystems, 100-113. 1986
advantagesover other structuresare that it performs well for [Rob811Robinson. J.T.: The K-D-B-Tree: A Search Structurefor Large
all reasonabledata distributions, cover quotients, and bucket MultidirnensicmalDynamic Indexes. Proc. ACM SIGMOD Jnt. Conf.
capacities,and that it maintains multidimensional points as on Managementof Data. 10-18. 1981
well as arbitrary geometric objects. These properties make [SK881Seeger,B., Kriegel. H.-P.: Techniquesfor Design and Jmplemen-
tatiut of Bfiicient Spatial Access Methods, Proc. IS* Int. Conf. on
the LSD tree extremely suitable for the implementation of VLDB. 360-371. 1988
spatial accesspaths in geometric databases. [SW881 Six, H.-W., Widmap P.: Spatial Searching in Geometric
In addition to the performance evaluation an analysis Databases,Prcc. IEEE 4 Jnt. Conf. on Data Engineering, 496-503,
of the expectedstorageutilization and the expectedexternal 1988
- 53 -
- 54 -