You are on page 1of 10

The LSD tree: spatial accessto multidimensional point and non- saint objects *

AndreasHenrich Hans-WernerSix PeterWidmayer


FemUniversit&Hagen FemUniversit&Hagen UniversitiitFreiburg
5800 Hagen 5800 Hagen 7800 Freiburg
WestGermany WestGermany WestGermany

Abstract An obvious approachfor storing intervals uses data


structuresfor points. Here, intervals need not be entirely
Weproposethe Local Split Decisiontree(LSD tree,for containedin a cell, but may insteadintersectseveralcells.
short),a data structuresupportingefficientspatialaccessto Hence,in order to perform rangequeriesefficiently, infor-
geometricobjects.Its main advantagesoverother structures mation about an interval must be stored in each bucket,
arethat it performswell for all reasonabledatadistributions, whose correspondingcell intersectsthe interval. Using
cover quotients(which measurethe overlappingof the data this so-called clipping technique, the spacerequirementsin-
objects),and bucket capacities,and that it maintainsmul- creasesubstantiallydue to the redundantinformation.
tidimensionalpoints as well as arbitrary geometricobjects. The clipping approachis basedon datastructureswhich
Thesepropertiesdemonstratedby an extensiveperformance, partition the data space into pairwise disjoint cells. To
evaluationmakethe LSD treeextremelysuitablefor the im- avoid clipping problems, in R-Trees ([Gut84], [FSR87]).
plementationof spatialaccesspathsin geometricdatabases. respectivelymultilayer grid files [SWSS],the data spaceis
The paging algorithm for the binary tree directory is inter- divided into overlapping cells, such that all, respectively
esting in its own right becausea practical solution for the most, intervals are entirely containedin a cell, i.e. need
problem of how to page a (multidimensional)binary tree not be clipped. Unfortunately, the R-Tree suffers from a
without accesspath degenerationis presented. poor exact match performanceand often from inefficient
1. Introduction range queries becausecells may overlap considerablyin
In non-standardapplicationssuchascartography,CAD a dynamic setting. In the multilayer grid file with each
and robotics,DatabaseManagementSystemshaveto orga- layer a grid file is associatedinducing a directoryoverhead
nize large sets of multidimensionalgeometricobjects on which deterioratesthe efficiency of operationsconcerning
secondarystoragesuch that these objects can quickly be few objects.*
retrievedaccordingto their spatial locations. Typical spa- In the so-called tran.+ormation technique (lJIin85],
tial queriesare the retrievalof an object by its coordinates [SK88]),k-dimensionalintervalsare interpretedaspointsin
(exactmatch)and rangequerieswhere all objectsgeomet- a 2kdimensionalspace,in order to usepoint datastructures
rically intersectingthe query region are selectedfor further in a standardway. For instance,a l-dimensionalinterval
processingor presentationon a screen.Sincethe set of ob- [a, b] may be interpretedas point (a, b). Sincea 5 b holds,
jects variesover time, insertionsand deletionshave to be the imagespaceis a triangle. The main drawbacksof this
perfomml as well. approachare that the point distributionin the triangularim-
Data structures,which efficiently support spatial ac- age spaceis extremely skew and that a (bounded)range
cessto geometricobjects,usuallydivide the dataspaceinto queryon intervalsbecomesa partly unboundedrangequery
cells and store all objects located in a cell in an associ- on image points.
ated data bucket. As far as multidimensionalpoints are From a wide spectrumof performancetests we have
concerned,variousefficientdata structureshave beenpro- got the experiencethat the efficiencyof spatialdata struc-
posed (see e.g. @ee87], [HSW8&], [HSWSSb],[KS86], tures depends at least on the object distribution, the cover
lKS881,[KWSSI,[NHS841;[Otoo86], [Rob81]). In typical quotient definedas the sum of all object areasdivided by
applications,however,mostobjectsare arbitrarygeometric, the areaof the data space,and the bucket capacity,i.e. the
i.e. non-point, objects. In many situations,it has proven maximal number of objects in a bucket. For an increas-
to be useful to representnon-point objectsby their (mini- ing cover quotient as well as for small bucket capacities,
mal)boundingboxes,servingas simplegeometrickeys, We clipping andoverlappingcell techniquesdeterioratesubstan-
thereforeconcentrateon multidimensionalintervals,as far tially. On the other hand, all non-tree structures(seee.g.
as non-pointobjectsare concerned [HSW88a],Ir<S86],[KS881,[NHS84]) degeneratefor skew
object distributions.
In this paper, we propose a data structure support-
permission to copy without fee all of part of this material is
ing spatial accessto k-dimensionalpoints as well as k-
grunted provided that the copies arc not made or distributed for dimensional intervals. The accessto intervalsis basedon
direct commercial advantage, the VLDB copyright notice and the transformation technique.A sophisticateddirectorytree
the title of the publication and its date appear, and notice is togetherwith a refinedsplitting techniqueeliminatesthepre-
given that copying w by permisrion of the Very Large Data Base
Endowment.TOcopyotherwise, or to republish, requirea a fee l This we& has been supported by DFG grants Si 374/l and Wi 81OfZ.
and/or special permission from the Endowment.

proceedings of the Fifteenth International Amsterdam, 1989


Conference on Very Large Data Bases

- 45 -
dimtial2
4
100

*O 4 9
60
40
20 1
0e
d ioio
Figure 2.1: Possible partition of the data space for an LSD tree

vious drawbacksof this technique.The performanceeval-


uation convincinglydemonstratesthat the new structureis Figure 2.2: LSD tree associated with
well qualified for maintaininglarge sets of geometricob- the data space partition of Figure 2.1
jects. However,the main advantageof the new structureis
not only the efficient spatialaccessbut also its robustness.
By robustnesswe meanthat the new structurebehaveswell directory is directly related to the number of buckets,i.e.
for all reasonabledata distributions, cover quotients and for n bucketsthe d&ctory containsn-l nodes. This is in
bucket capacities. contrastto the grid file whereseveralentriesin the directory
Section 2 explains the new data structure for point may point to the samebucket.
objects while in section 3 the generalizationto non-point Besidesthe advantagesof the LSD tree directorythere
objectsis provided. In section4 a performanceevaluation are somedrawbackswhich are typical for multidimensional
of the new structureand a comparisonwith the multilayer binary tree structures:
grid file is presented.Section5 concludesthe paper. 1. A multidimensionalbii tree may become unbal-
2. The LSD tree for points anced, i.e. may contain long paths with almost no
branches,and
2. no suitablemethodfor paging a multidimensionalbi-
2.1 Basic ideas and properties nary tree is known. (The interestingpaging technique
Like most structures,the new structurepartitions the presentedin &ZL88] is suitable only for the one-
dataspaceinto painvisedisjoint cells with associatedbuck- dimensionalcase.)
ets of fixed size. In contrastto the grid file [NHS84], how-
ever, it is not grid oriented, i.e. all cell boundariesmay We overcometheseproblemsby introducinga paging
occur at arbitrary positions. The free choice of split posi- algorithmwhich preservesthefollowing external balancing
tionsis the basisof the gracefuladaptationto arbitraryskew PropeW
object distributions.Sincea new split position can be cho- The number of external directory pages which are tra-
senlocally optimal,i.e. optimalwith respectonly to the cell versed on any two paths from the directory root to a
to be split and independentfrom other existingcell bound- bucket diJers by at most 1.
aries, we call the new strocture Local Split Decision tree When geometricobjects are insertedinto an initiahy
(LSD tree, for short). Figure 2.1 showsa possiblepartition empty LSD tree the directory grows up to a size when it
of a zdimensional data spacefor an LSD tree. cannot be kept in the dedicatedpart of the main memory
The LSD directory maintainingthe flexible data space any longer. Then the pagingalgorithmdeterminesa subtree
partitionis a binary tree similar to a kd tree [Ben75]. Each to be paged on secondarystorage such that the external
nodeof this tree representsone split decisionby storingthe balancingproperty is preserved.If tbe subtreeconsistsof
split dimensionand the split position. Figure 2.2 illustrates n, nodes,the mainmemoryis thenable to receiveadditional
the LSD tree associatedwith the data spacepartition of n, nodesuntil a further invocationof the paging algorithm
Figure 2.1. must take place.
It should be obvious that the directory provides the Figure 2.3 showsthe overall structureof the LSD tree.
freedom for using the split strategybest suitable for the
actualapplication.This is an importantadvantageoverother 2.2 A closer hk
structures (see e.g. tFn3387], [HSW88al, [KS861,lKS881, In this section,we discussthe LSD tree in more detail
[NHS84]) wheresplit decisionsare more or lessinfluenced by explainingthe insertionof a new geometricobject into
by previous split decisions. Furthermore,the size of the the structure.

- 46 -
n b b
a strategy is to split a cell into two cells of equal
areas.Note that this halving split strategyrelies on
the assumptionof a uniform distributionof the objects.
Since the LSD directory is a binary tree, any type of
split strategycan be implementedin an easy and efficient
manner. Note that data dependentsplit strategiescannot
be realized by data structuresbased on hashing (see e.g.
[HSW88al, [KS861,[KSSSI,lNI-IS841).
We now turn our attentionto the directory of the LSD
tree. As already mentionedin the previous section,if the
numberof nodes in the directory T exceedsthe maximal
possiblenumberof internal nodes,a subtreeof T is written
onto secondarystorage,i.e. stored in a directory page. In
sucha directorypagea subtreeis organixedas a sequential
heapof fixed height hr. Hence,wheneverthe height of the
associatedsubtreeexceedsh, after an additional insertion,
a directory page split has to be perfomxd. The directory
page split algorithm is simple: the left and right subtreeof
Figure 2.3: Overall structure of the LSD tree the root are storedin two distinct directory pagesand the
root is insertedinto the directory T by calling the directory
directory root insertion algorithm.
Wearenow in a positionto describehow a new nodeq,
P resultingfrom a bucketor a directory pagesplit, is inserted
split of into the directory T by the directory insertion algorithm.
The heart of this algorithm is the paging algorithm we
explain afterwards. In the following, we assumethat the
main memory capacity reservedfor the directory T is ni.
The directory insertion algorithm assuresthat the internal
prefix treeTi of the directoryT containsat most~-1 nodes.
uu lb10 casesmay occur:
Figure 2.4: Effect of a bucket split case1: The fatherp of the new nodeq is an internalnode.
The searchfor the bucketb which will receivethe new cast 1.1: The number of internal nodes is less than
object is guided by thedirectory as in kd trees. If b does ni-1. Then insert q into Ti and finish.
not overflow, the insertionis finished,otherwise the bucket case 1.2: The number of internal nodes is equal to
split algorithm createstwo new bucketsb and b, from b ni-1. Then insert q into Ti, cdl the paging
accordingto a split strategy we describeafterwards.In case algorithm for T, and finish. Note that after
of a bucket split the pointer in the directory referencingb the execution of the paging algorithm the
is changedto a pointer referencinga new directorynode q numberof internal nodesis at most b-1.
representingthe split decisionconcerningb, i.e. the new
nodeq is insertedinto the directoryby calling the directory case2: The fatherp of thenew nodeq is a nodein a subtree
insertion algorithm explainedlater. The new nodeq points TP of T storedin an externaldirectory page.
to the new bucketsbi and b,,. Figure 2.4 depictsthe effect
of a bucket split. case21: After the insertion of q the height of TP is
We distinguishbetweentwo inherently distinct types at most hr.; then finish.
of split strategies: case2.2: After the msertionof q the height of TP is
greaterthan hr. Then call the directory page
1. Data dependentsplit strategies split algorithm for TP and finish.
These strategiesdependonly on the objects storedin
the bucket to be split. A typical examplefor such a
strategyis to choosefor the split position the aver- The paging algorithm is called when after an insertion
age of all object coordinateswith respectto a certain of an additionalnode the size of the internal PE~IXtree Ti
dimension. reachesthe maximalpossiblenumberni of internal nodes.
2. Distribution dependentsplit strategies The algorithm searchesfor a subtreeTI in Ti such that
Thesestrategieschoosethe split dimensionandthe split pagingTI preservesthe externalbalancingproperty&fined
position independentlyof the actual objects storedin in section2.1. This property is preserved if T, is a paging
the bucket to be split. A typical examplefor such candidate, i.e. T fultllls the following properties: *

- 47 -
is scanneduntil the object searchedfor is located or the
searchends unsuccessfully.
233 Insertion
The insertionalgorithmhasbeenexplainedin detail in
the previous section.
233 Deletion
The deletion of a node in the directory T basically
.... ........ ........ ....... ..... ........ ........ ......
works inversely to the insertion of a node. Due to space
LJ limitationswe cannotdiscussthis topic in more detail.
Figure 2.5: Directory before and after paging subiree T, 23.4 Range query

1. Any path from the mot of TI down to a bucketcontains In a rangequery all points locatedin the query region
the minimal numberof externaldirectorypages(of all are reported. According to Fredman lFred801,a query
pat.hsinthedirectoryT). regionmaybe a rectangle(orthogond range query), a circle
2. The height of Ts is at most hr.
(circular range query) or a polygon@olygonalrange query).
In the following, we restrict the discussionto orthogonal
Figure 2.5 shows a directory before and after paging rangequeries,becausethealgorithmis the samefor all query
the subtreeTs. types,exceptfor the proceduresevaluatingwhethera data
If more than one paging candidateoccurs in Ti, the region is enclosedby, intersectedby, or disjoint from the
paging algorithm choosesa candidatewith the maximal queryregion. But thesearedetailsleft to the implementation
possiblenumber of nodes. level.
In order to direct the searchfor a paging candidatein In order to report all points locatedin the query region
Ti the following numbersare attachedto eachnodev in Ti: Q we haveto traversethe LSD tree to determineall buckets
n-(v), resp. nep&v),: the minimal, resp. maximal, whose associatedcells intersect Q. The query algorithm
number of external directory pagesoccurring on any movesdown the LSD tree branchingat eachdirectorynode
path in T containing v. w accordingto the following criteria: (Here D(w) denotes
s(v): the numberof nodesof the biggestpaging candidate the data region which is the union of all data cells whose
which can be reachedfrom v. correspondingbucketscan be reachedfrom w.)
h(v): The height of the subtreewith root v in Ti.
The paging algorithm movesdown the internal direc- 1. If&nD(rightsonofw) = 0,
tory Ti branchingat eachnodew on the searchpath accord- continuewith the left son of w.
ing to the following criteria: 2. IfQnD(leftsonofw) = 0,
continuewith the right son of w.
1. If neM(left son of w) # nw(right son of w), 3. Otherwisecontinuewith both sonsof w.
continuewith the son with lower rim.
2. If new(left son of w) = n~(right son of w), Note thatQ n D(w) # 0 is the invariantcondition of
continuewith the son with greaters. the loop of the query algorithm. Hence, in l., resp. 2..
C$; z; son of w) # 0, resp. Q rl D(right son of w)
The root r of a paging candidateT is determinedif , .
1. h(r)<h,and 3. The LSD tree for non-point objects
2. n-(r) = nep&r).
We explain the non-point situation for k-dimensional
The secondcondition assuresthat after pagingTs the exter- intervalswhich serveas boundingboxes for arbitrary geo-
nal balancingproperty is preservedfor T. metric objects in many applications. We restrict the dis-
It shouldbe clearthataftertheinsertionof a new nodeq cussionto the 2-dimensionalsituation,i.e. to rectanglesin
into the internaldirectoryTi, resp. the pagingof a subtreeTI the plane,becausea generalizationto higher dimensionsis
of Ti, the numbersnw, nep,,,,, s and h mustbe updated straightforward.
foreachnodewonthepathPfromtherootofTitothe To store a set of rectanglesin the LSD tree we use
fatherof q, msp. to the fatherof the root of Ts (now stored the transform&on technique (lI-Iin851, [SK883), i.e. 2-
in an externaldirectory page). Thesenumberscan easily dimensionalrectanglesare storedas 4dimensional points.
be recomputedfrom the existingnumbersof the nodes(and We choosethe simple corner representation[SK881which
their direct sons)on the path P. considersfor each of the two dimensionsthe lower and
2.3 The operations upperboundsof the rectanglesto be distinct dimensions.
The idea is simple but severalsevereproblemsarise
23.1 Exact match from this approach.First, there is a strong conelation be-
In an exact match operationthe directory is traversed tweenupperand lower bounds,becausefor eachdimension
until the correspondingbucket is determined. The bucket the upper bound of a rectangleis always greaterthan (or

- 48 -
equal to) the lower bound. Becauseof the correlationall
pointsare locatedin a triangularshapedsubspaceof the im-
age space.Furthermore,sincein almostall applicationsall
rectanglesare small comparedto the data space,the points
are locatedin a small strip abovethe diagonal.
Datastructureswhich rely on a rectangularshapeddata
spaceand partition the data space into rectangularcells
tend to degeneratefor such applications,especiallyif they
are basedon hashingtechniques. However, the LSD tree
overcomesthe drawbacksof the transformationtechnique
if a refined bucket split strategyis used. Since the split
strategyis crucial to the efficiencyof the LSD tree for non-
point objects,we devotethe next sectionto this topic.
3.1 The split strategies
Figure 3.1: Split positions achieved by two basic split strategies
In this section,we discusssplit strategiessuitablefor
the skew data distributionsinducedby the transformation SP2. This effect is desirablefor the usual situationwhere
technique.First of all, a suitablesplit strategymusttakeinto rectanglestend to be small comparedto the data spaceand
accountthe correlationbetweenlower and upperboundsof hencethe imagepoints tend to be locatedin a strip above
rectanglesin the original dimension I, resp, 2, stored in the diagonal. We performedsimulationswith other roots
dimensions1 and 2, resp. 3 and 4. but the lo* root behavedwell in all cases.
We will explain two different split strategies,a data
dependentand a distribution dependentone. The data 33 The operations
dependentsplit strategy is simple: For the split dimension In this section,we discussthe LSD tree operationsfor
under concern the averageover all coordinatesof objects non-point objects. The operationsexuct match, insertion
storedin the bucket to be split including the object to be and deletion are identical to the correspondingoperations
insertedis chosenas the split position. for 4dimensional points. In the case of a range query
The distribution dependentsplit strategy is a combi- the situationis different,becausethe original 2dimensional
nation of two basic (distributiondependent)split strategies query region and the 4dimensional image query region
each of them designedfor an extremesituation. The first differ substantiallybecauseof the different representations
split strategyrelies on the (fictitious) assumptionthat all of the objects.
rectanglesare degeneratedto points, i.e. the upper and In a range query, the query region can either be an
lower bounds coincide for each dimension. Here, all im- orthogonal rectangle,a circle or a polygon. Independentof
age points are locatedon the diagonalw.r.t the dimensions the threekinds
1 and 2, resp. 3 and 4. A suitablesplit strategyfor this case query types forofa query
set of
regionswe distinguishbetweentwo
rectangles% (see [SKSS]):
is to split the data cell into two cells containing equally
long parts of the diagonal. The split position achievedby 1. Rectangleintersection:
this split strategyis denotedbp SPr in Figure 3.1. GivenaqueryregionQfindallR~~s.t. QnR# 0.
The secondbasic split strategyrelies on the assump- 2. Rectangleenclosure:
tion that all imagepoints am uniformly distributedover the Given a queryregion Q tind all R E % s.t. R E Q.
triangular subspaceof the image spaceb&t from dimen- In contrast to the point situation, the algorithm for
sions 1 and 2, resp. 3 and 4. Here, a suitablesplit strategy circular and polygonal range queries is different from the
halvesthe datacell into two cells of equal areas.The split algorithm for orthogonalrange queries. However, due to
position achievedby this split strategyis denotedbp SPs spacelimitations of the paper we discussonly orthogonal
in Figure 3.1. range queries.
The split position SP calculatedby the combinedsplit Webeginexplainingthe rectangle intersection problem
strategyis the weightedsum of SPt and SP2: for orthogonal query regions. In this case,the original 2-
dimensionalquery region for rectanglespi, ut] x ps,us] is
SP = aSPI +(1 -a)SPs ) where transformedinto a 4dimensional query region for points.
For eachoriginal dimensiond E (1,2) we define

~#d,ud]) sf [Ld,Ud] x [Id,Ud] .

Then for the original query region pi, ui] x [1s,us] the im-
(Here Ld denotesthe lower and & the upperbound of the age tegion is given by cp(@t , ui]) x(p([ls, ui]). Figure 3.2
data spacewf.t the split dimensiond) illustratesthe transformationof a query interval &I, ttd].
The effect of the choice of Q is that for large data The areaof the.imageregion can be reducedby using
cells SPapproachesSPi while for smalIcelIsSPapproaches the transformationp instead of 9. Let & denote the

- 49 -
Ld d d Iowa bounds

Figure 3.2: Transformation of query intend [Id, ud]

Figure 4.1: Uniformly distributed rectangles


d
4. Performance evaluation
d
To assessthe meritsof the LSD tree we haveevaluated
the performancefor rectanglesin the plane. We do not
discuss the efficiency of the LSD tree for points, because
the performancefor rectanglesis an upper bound of the
performancefor points, We haveimplementedan LSD tree
u,, lowerbound8 on a SUN workstationin Modula-2.
The inter& directory is stored in an array storing 1000
Figure 3.3: Improved transformation of query interval [b. ~1 nodes.An external directory page of sire 512 bytescontains
subtreesup to a height of 6 organizedas sequentialheaps.
greatestextensionof an insertedrectanglefor dimension We choosebucketcapacitiesof 5 and 50 rectangles.
d, then The simulationsare basedon a sophisticatedrandom
.rectanglegeneratorcreating sets of 10,000 and 100,000
rectanglesaccordingto two different distributions. These
% &ud]) sf [Id - Ed,Ud] x [Id,Ud + Ed] . distributionsare illustratedfor 1,000rectanglesin Figures
4.1 and 4.2. Since the cover quotient remainsconstantat
Figure3.3 illustratesthe improvedtransformationof a query 2.5, in the case of 10,000,resp. 100,000,rectanglesthe
interval [ld, udl. averageareaof a rectangleis 10, resp. 100, times smaller
For the image region the range query algorithm for than for 1,000rectangles.
points can directly be used. Bucketsplitsareperformedaccordingto the split strate-
Wecontinuetbe discussionwith therectangleenclosure gies describedin section3.1. In the caseof the datadepen-
problemfor orthogonal query regions. Sincein this casefor dent split strategywe useda refinedinsertionprocedure:If
each dimensiond both, the lower and the upper bound of a new rectanglecausesthe split of a bucket bi which has
a rectangle,must be enclosedin the query interval [ld, ~1, a brother bucket b, i.e. both bucketsstem from the same
we use the simple transformation bucket split with split line S, and the capacityof b is not
exhausted,the objectwhich is closestto S in bi is movedto
b and S is updatedin the directory. Thenthe new rectangle
19(b,ud]) d [bud] x [bud] . canbe insertedwithout a bucketsplit. Otherwise,bt is split.
First, we focus on the directory evaluation. We have
The image query region for the rectangleenclosure randomly inserted 10,000, resp. 1OO,ooO, uniformly dis-
problemis smallerthan for the rectangleintersectionprob- tributed rectanglesand 10,000,resp. lOO,OCQ skew dis-
lem. Hence, an enclosurequery can be performedmore tributedrectanglesinto an initially emptyLSD tree. To sim-
efficiently than an intersectionquery. Note that this holds ulatea worst casesituation,100,ooOuniformly distributed
only for transformationtechniquesand not for clipping or rectangleshavebeen insertedin sorted order. The sort-
overlappingcell techniques. ing hasbeencarriedout by randominsertionsinto an LSD

- 50 -
storage utilizatim

buckets bucket overall


utilia- storage
lion UtiliZa-
lion (loo

zky
2,865 69.8 % 68.2 %
282 70.9 % 68.1%
2,838 70.5 % 68.8 %

290 69.0 % 66.3 8


392 60.6 % 59.0 %

3266 61.2 % 59.6 %

25,883 77.3 % 71.8 %

2593 77.2 % 71.7 %

10,ooo 3.419 994 21 1 221 18.7 46


23,085 997 204 2 4,836 8.7 %

loo.ooo distrib. 33,787 985 16 2 3501 16.2 %

Table 4.1: Size of the directory and storageutilization (directory page size = 512 bytes; max. number of internal nodes = 1OW)

case the distribution dependentsplit strategy is the winner.


For the sorted caseand the data dependentsplit strategy the
unbalanceof the directory is reflected mainly by the height
of the internal directory. Becauseof the external balancing
property the number of external directory levels is 2 as for
the distribution dependentsplit strategy. The utilization of
the directory pagesis mainly influenced by the split strategy
and the bucket capacity (and, for the data dependentsplit
strategy, of course by the order of insertion).
For the sametest set we have also measuredthe bucker
utilization which is defined as
number of stored objects
number of buckets x bucket capacity
and the overall storage utilization defined as
number of stored objects x 100bytes
bytes needed for the LSD tree
which includes the storage space needed for the directory
and some administrative informations. Empty buckets are
Figure 4.2: Skew distributed rectangles not allocatedbut representedby nil-pointers in the directory.
The results m shown in pdble 4.1.
tree with bucket capacity 5 followed by a left to right scan For the data dependentsplit strategy the bucket utiliza-
through the LSD leaves. The results are shown in Table 4.1. tion is independent of the object distribution and slightly
It comesout very clearly that the size of the directory above the theoretical value of 69.3% (In 2). Due to the
does not depend on the data distribution but on the split refinement of the insertion procedure which prefers small
strategy (and of course on the size of the data set and the bucket capacities the utilization is even higher for bucket
bucket capacity). For unsorted situations, the data depen- capacity 5. The bucket utilization of 86.6% for the sorted
dent split strategy performs significantly better than the dis- situation is a consequenceof the same effect. For the dis-
tribution dependentvariant, while, as expected,in the sorted tribution dependent split strategy the bucket utilization is

- 51-
number of rectangle
-gl= distribution

unifoml
*,

skew
I

unifolm

skew

I lOO,OCHI skew

Table 4.2: Rang query pexformauce (directory page size = 512 bytes; max. number of internal no&s = 1000)

below 69% but still above 60% and hencenot bad at all. A The hit ratio is higher for smaller bucket capacities, be-
comparisonof the overall storageutilization and the bucket causeof the higher selectivity. For the data dependentsplit
utilization convincingly demonstratesthat the storagespace strategy, bucket capacity 5, skew distribution, and query
neededto accommodatethe directory is rather small com- type 2, the hit ratio is 66.4% if only bucket accessesare
pared to the data storage space. counted. This is nearly optimal with respect to a normal
We now turn our attention to the performance of the bucket utilization of 69.3%. For smaller query regions and
LSD tree operations. Clearly, in an exact match at most i larger bucket capacities the hit ratio deteriorates: Changing
directory pagesand one bucket must be read if the directory the bucket capacity to 50 yields 54.2%.
contains i external levels. The performanceof the insertion The performanceresults in the sorted situation can be
procedure is easy to estimate, too. For bucket capacity 5, explained by the fact that the data cells tend to be long and
resp. 50, we have between 4 and 5, resp. less than 3, ex- small for the dam dependentsplit strategy in this case.
ternal accesses(directory page and bucket I/G) per inserted For the remainder of this section we comparethe per-
object, if 100,000objects are inserted into an initially empty formance of the LSD tree and the multilayer gridfile [SW881
LSD tree irrespective of the split strategy used. using 5 layers (5L-GF for short). According to the multi-
Hence, we focus on range queries. We concentrateon layer philosophy layer 5 is implemented as a clipping grid
the intersection query becauseits performance is an upper file. We have inserted 50,000 uniform distributed rectan-
bound of the enclosure query performance. (Experiments gles in random order into both initially empty structures.
show that the enclosurequeriescan be carried out 10%faster The cover quotient is 2.5 and the bucket capacity varies
than intersection queries on the average.) Table 4.2 shows from 5 to 30 in stepsof 5. It turns out that the SL-GF is not
the average number of external accessesfor two types of able to work with bucket capacity 5. After the insertion of
range queries. For square regions of sixes 0.5% and 5% 20,974 rectangles a bucket of the clipping layer could not
of the size of the data space,we have performed 20 range be split becauseeachrectangle stored in this bucket covered
queries each, at random positions. the whole corresponding data cell. (For the skew data dis-
As with all other data structures larger query regions tribution the 5L-GF runs into a similar error situation even
lead to fewer disk accessesper found object, becausethe for bucket capacity 10.)
number of buckets completely contained in the query region Figure 4.3 showsthe bucket utilization for the LSD tree
grows faster than the number of buckets intersectedby the with the data dependent split strategy (LSD&J, the LSD
region boundary. 2 wiem~G~tribution dependentsplit strategy (LSD&,
Another important characteristic number is the hit ratio, .
defined as The range query performance is illustrated in Figures
4.4 and 4.5. We have used the samequery types as before.
number of objects found For the first, resp. secondtype, 308, resp. 2712, objectsare
selected on the average.
bucket capacity x disk accesses

- 52 -
height proves the efficiency of the LSD tree [HSW89]. At
*w . . . . . . . . . . . . . . . . . .. . the moment, we are implementing more general (spatial)
qsts
70% operations, like non-orthogonal range queries, point queries
[SK881 and queries where geometric as well as standard
attributes are qualified. Furthermore, we are embedding the
LSD tree as spatial accesspath into the geometric database
b system Gral [Gtit89]. Hence, an empirical study about the
5 10 15 20 25 30 bu&aapcity benefits of the LSD tree in such an environment can be
carried out in the near future.
Figure 4.3: Bucket utilization
A hit ratio References
[Ben751 Bentley, J.L.: Multidimensional Binary Search Trees Used in
Database Applications. Ccmmunications of the ACM, Vol. 18, 9,
509-517, 1975
[FSR87] Faloutsos,C., Sellis, T., Roussopoulos,N.: Analysis of Object
Oriented Spatial Access Methods, Proc. ACM SIGMOD Int. Conf.
on Management of Data, 426-439, 1987
zm-.......................... [Fred801Fredman, ML.: The Inherent Complexity of Dynamic Data
StructuresWhich AccommodateRange Queries, IEEE. CH1498-5/80.
3;;;;;; b 1980
5 10 1s 20 25 30 bwk~capcity [Frcc87] Freeston.M.: Jbe BANG file: a new kind of grid tile, Proc.
ACM SIGMOD Int. Conf. on Managementof Data, 260269. 1987
Figure 4.4: Range query performance (0.5% of the data space) [Gut841Guttman, A.: R-Trees: A Dynamic Index Structure for Spatial
Searching. Proc. ACM SIGMOD Jnt Conf. on Managementof Data,
47-57. 1984
[Gut891 Gitting. R.H.: Gral An Extensible Relational DatabaseSystem
for Geometric Applications, Pmt. lsth Int. Conf. on VLDB (1989).
to appear
[Him851Hiichs, K.: The Grid Fide System: Implementation and Case
Studiesof Apphcations, Doctoral Thesis No. 7734, ETH Zurich, 1985
[HSWSSa] Hutflesx, A., Six, H.-W.. Widmayer, P.: Globally Order
Preserving Multidbnensional Linear Hashing. Ptoc. IEEE 4* Jnt,
Cod. on Data Engineering. 572579. 1988
[HSW88b] Huttlesx, A., Six, H.-W., Widmayer, P.: Twin Grid Fiies:
Space Optimizing Access Schemes,Proc. ACM SIGMOD Jnt. Conf.
cu Management of Data, 183-190, 1988
Figure 4.5: Range query performance (5% of the data space) [HSW89] Hemich. A., Six, H.-W.. Widmayer.P.: Paging binary treeswith
external balancing, Proc. Jnt. Workshop on GraphtbeoreticConcepts
It turns out that the LSD tree with the data dependent in Computer Science (WG 89), Springer Lecture Notes in Comp.
Science, to appear
split strategy clearly outperforms the SL-GF while the LSD [KS861 Kriegel, H.-P., Seeger, B.: Multidimensional Order Preserving
tme with the distribution dependentsplit strategy is at least Linear Hashing with Partial Expansions, Proc. Jnt. Conf. on Database
as efficient as the SL-GF. We have not comparedthe exact lltcory. 203-220, 1986
match and insertion performancebecauseit is obvious that [KS881 Kriegel, H.-P., Seeger, B.: PLOP-Hashing: A Grid File without
the 5 layers of the SL-GF do not allow a competitive Directory, Proc. IEEE 4* Jnt. Conf. on Data Engineering, 369-376,
pXfOIRllUlCe. 1988
[KW85] Krishnamtuthy. R.. Whang. K.-Y.: Multilevel Grid Files, IBM
It should be noted that besidesits better overall perfor- ResearchRepott, Yorktown Heights, 1985
mance the LSD tree is much easier to implement than the [rZL88] Litwin. W., Zegour, D., Levy, G.: Multilevel Trie Hashing,
SL-GF and does not need an additional (completely differ- Proc. Jnt. Conference Extending Database Technology (BDBT 88).
ent) overtlow data structure for storing objects which do Springer Lecture Notes in Ccmp. Science, 309-335. 1988
not fit into the main structure. [NHSM] Nievergelt, J.. Hinterberger, H., Sevcik. K.C.: Ibe Grid File:
An Adapable Symmetric Multikey File Structure, ACM Transactions
5. Conclusion cm DatabaseSystems. Vol. 9. 1. 38-71. 1984
[Otoo86] Otoo, E.J.: Balanced Multidimensional Extendible Hash Tree.
We have proposed the LSD tree, a data structure sup- Proc. 9 ACM SlGACf / SIGMOD Symposium on Principles of
porting efficient spatial accessto geometricobjects. Its main DatabaseSystems, 100-113. 1986
advantagesover other structuresare that it performs well for [Rob811Robinson. J.T.: The K-D-B-Tree: A Search Structurefor Large
all reasonabledata distributions, cover quotients, and bucket MultidirnensicmalDynamic Indexes. Proc. ACM SIGMOD Jnt. Conf.
capacities,and that it maintains multidimensional points as on Managementof Data. 10-18. 1981
well as arbitrary geometric objects. These properties make [SK881Seeger,B., Kriegel. H.-P.: Techniquesfor Design and Jmplemen-
tatiut of Bfiicient Spatial Access Methods, Proc. IS* Int. Conf. on
the LSD tree extremely suitable for the implementation of VLDB. 360-371. 1988
spatial accesspaths in geometric databases. [SW881 Six, H.-W., Widmap P.: Spatial Searching in Geometric
In addition to the performance evaluation an analysis Databases,Prcc. IEEE 4 Jnt. Conf. on Data Engineering, 496-503,
of the expectedstorageutilization and the expectedexternal 1988

- 53 -
- 54 -

You might also like