Database Management Sys1

Published by: deepagan_v_g6903 on Nov 17, 2009
Copyright:Attribution Non-commercial


Access Path Selection in a Relational Database Management System

P. Griffiths Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, T. G. Price

IBM Research Division, San Jose, California 95193

ABSTRACT: In a high level query and data manipulation language such as SQL, requests are stated non-procedurally, without reference to access paths. This paper describes how System R chooses access paths for both simple (single relation) and complex queries (such as joins), given a user specification of desired data as a boolean expression of predicates. System R is an experimental database management system developed to carry out research on the relational model of data. System R was designed and built by members of the IBM San Jose Research Laboratory.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1979 ACM 0-89791-001-X/79/0500-0023 $00.75

1. Introduction

System R is an experimental database management system based on the relational model of data which has been under development at the IBM San Jose Research Laboratory since 1975 <1>. The software was developed as a research vehicle in relational database, and is not generally available outside the IBM Research Division.

This paper assumes familiarity with relational data model terminology as described in Codd <7> and Date <8>. The user interface in System R is the unified query, data definition, and manipulation language SQL <5>. Statements in SQL can be issued both from an on-line casual-user-oriented terminal interface and from programming languages such as PL/I and COBOL.

In System R a user need not know how the tuples are physically stored and what access paths are available (e.g. which columns have indexes). SQL statements do not require the user to specify anything about the access path to be used for tuple retrieval. Nor does a user specify in what order joins are to be performed. The System R optimizer chooses both join order and an access path for each table in the SQL statement.
Of the many possible choices, the optimizer chooses the one which minimizes "total access cost" for performing the entire statement.

This paper will address the issues of access path selection for queries. Retrieval for data manipulation (UPDATE, DELETE) is treated similarly. Section 2 will describe the place of the optimizer in the processing of a SQL statement, and section 3 will describe the storage component access paths that are available on a single physically stored table. In section 4 the optimizer cost formulas are introduced for single table queries, and section 5 discusses the joining of two or more tables, and their corresponding costs. Nested queries (queries in predicates) are covered in section 6.

2. Processing of an SQL statement

A SQL statement is subjected to four phases of processing. Depending on the origin and contents of the statement, these phases may be separated by arbitrary intervals of time. In System R, these arbitrary time intervals are transparent to the system components which process a SQL statement. These mechanisms and a description of the processing of SQL statements from both programs and terminals are further discussed in <2>. Only an overview of those processing steps that are relevant to access path selection will be discussed here.

The four phases of statement processing are parsing, optimization, code generation, and execution. Each SQL statement is sent to the parser, where it is checked for correct syntax.
A query block is represented by a SELECT list, a FROM list, and a WHERE tree, containing, respectively, the list of items to be retrieved, the table(s) referenced, and the boolean combination of simple predicates specified by the user. A single SQL statement may have many query blocks because a predicate may have one operand which is itself a query.

If the parser returns without any errors detected, the OPTIMIZER component is called. The OPTIMIZER accumulates the names of tables and columns referenced in the query and looks them up in the System R catalogs to verify their existence and to retrieve information about them.

The catalog lookup portion of the OPTIMIZER also obtains statistics about the referenced relations, and the access paths available on each of them. These will be used later in access path selection. After catalog lookup has obtained the datatype and length of each column, the OPTIMIZER rescans the SELECT-list and WHERE-tree to check for semantic errors and type compatibility in both expressions and predicate comparisons.

Finally the OPTIMIZER performs access path selection. It first determines the evaluation order among the query blocks in the statement. Then for each query block, the relations in the FROM list are processed. If there is more than one relation in a block, permutations of the join order and of the method of joining are evaluated. The access paths that minimize total cost for the block are chosen from a tree of alternate path choices. This minimum cost solution is represented by a structural modification of the parse tree. The result is an execution plan in the Access Specification Language (ASL) <10>.

After a plan is chosen for each query block and represented in the parse tree, the CODE GENERATOR is called. The CODE GENERATOR is a table-driven program which translates ASL trees into machine language code to execute the plan chosen by the OPTIMIZER. In doing this it uses a relatively small number of code templates, one for each type of join method (including no join). Query blocks for nested queries are treated as "subroutines" which return values to the predicates in which they occur. The CODE GENERATOR is further described in <9>.

During code generation, the parse tree is replaced by executable machine code and its associated data structures. Either control is immediately transferred to this code or the code is stored away in the database for later execution, depending on the origin of the statement (program or terminal). In either case, when the code is ultimately executed, it calls upon the System R internal storage system (RSS) via the storage system interface (RSI) to scan each of the physically stored relations in the query. These scans are along the access paths chosen by the OPTIMIZER. The RSI commands that may be used by generated code are described in the next section.
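
As a rough illustration of the query block representation described in this section, the sketch below models a block as a SELECT list, a FROM list, and a WHERE tree, with a nested block appearing as the operand of a predicate. The class and function names (QueryBlock, BoolNode, Predicate, count_blocks) and the sample statement are hypothetical; the paper does not give the actual layout of System R's parse tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Predicate:
    """A simple predicate; its operand may itself be a query block."""
    column: str
    op: str
    operand: Union[str, int, float, "QueryBlock"]


@dataclass
class BoolNode:
    """Interior node of a WHERE tree: AND/OR over predicates or sub-trees."""
    connective: str  # "AND" or "OR"
    children: List[Union["BoolNode", Predicate]] = field(default_factory=list)


@dataclass
class QueryBlock:
    select_list: List[str]
    from_list: List[str]
    where_tree: Optional[Union[BoolNode, Predicate]] = None


def count_blocks(block: QueryBlock) -> int:
    """Count this block plus every block nested inside its predicates."""
    def walk(node) -> int:
        if node is None:
            return 0
        if isinstance(node, Predicate):
            if isinstance(node.operand, QueryBlock):
                return count_blocks(node.operand)
            return 0
        return sum(walk(child) for child in node.children)
    return 1 + walk(block.where_tree)


# Hypothetical statement (not taken from the paper):
#   SELECT NAME FROM EMP
#   WHERE DNO = 50 AND SALARY > (SELECT AVG(SALARY) FROM EMP)
inner = QueryBlock(select_list=["AVG(SALARY)"], from_list=["EMP"])
outer = QueryBlock(
    select_list=["NAME"],
    from_list=["EMP"],
    where_tree=BoolNode("AND", [
        Predicate("DNO", "=", 50),
        Predicate("SALARY", ">", inner),
    ]),
)
print(count_blocks(outer))
```

The nested block is reached only through the predicate that uses it, which mirrors the statement that a single SQL statement has several query blocks when a predicate operand is itself a query.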

3. The Research Storage System

The Research Storage System (RSS) is the storage subsystem of System R. It is responsible for maintaining physical storage of relations, access paths on these relations, locking (in a multi-user environment), and logging and recovery facilities. The RSS presents a tuple-oriented interface (RSI) to its users. Although the RSS may be used independently of System R, we are concerned here with its use for executing the code generated by the processing of SQL statements in System R, as described in the previous section. For a complete description of the RSS, see <1>.

Relations are stored in the RSS as a collection of tuples whose columns are physically contiguous. These tuples are stored on 4K byte pages; no tuple spans a page. Pages are organized into logical units called segments. Segments may contain one or more relations, but no relation may span a segment. Tuples from two or more relations may occur on the same page. Each tuple is tagged with the identification of the relation to which it belongs.

The primary way of accessing tuples in a relation is via an RSS scan.
A scan returns a tuple at a time along a given access path. OPEN, NEXT, and CLOSE are the principal commands on a scan.

Two types of scans are currently available for SQL statements. The first type is a segment scan to find all the tuples of a given relation. A series of NEXTs on a segment scan simply examines all pages of the segment which contain tuples, from any relation, and returns those tuples belonging to the given relation.

The second type of scan is an index scan. An index may be created by a System R user on one or more columns of a relation, and a relation may have any number (including zero) of indexes on it. These indexes are stored on separate pages from those containing the relation tuples. Indexes are implemented as B-trees whose leaves are pages containing sets of (key, identifiers of tuples which contain that key). A series of NEXTs on an index scan does a sequential read along the leaf pages of the index, obtaining the tuple identifiers matching a key, and using them to find and return the data tuples to the user in key value order. Index leaf pages are chained together so that NEXTs need not reference any upper level pages of the index.

In a segment scan, all the non-empty pages of a segment will be touched, regardless of whether there are any tuples from the desired relation on them. However, each page is touched only once. When an entire relation is examined via an index scan, each page of the index is touched only once, but a data page may be examined more than once if it has two tuples on it which are not "close" in the index ordering. If the tuples are inserted into segment pages in the index ordering, and if this physical proximity corresponding to index key value is maintained, we say that the index is clustered. A clustered index has the property that not only each index page, but also each data page containing a tuple from that relation will be touched only once in a scan on that index.

An index scan need not scan the entire relation. Starting and stopping key values may be specified in order to scan only those tuples which have a key in a range of index values. Both index and segment scans may optionally take a set of predicates, called search arguments (or SARGS), which are applied to a tuple before it is returned to the RSI caller. If the tuple satisfies the predicates, it is returned; otherwise the scan continues until it either finds a tuple which satisfies the SARGS or exhausts the segment or the specified index value range. This reduces cost by eliminating the overhead of making RSI calls for tuples which can be efficiently rejected within the RSS. Not all predicates are of the form that can become SARGS. A sargable predicate is one of the form (or which can be put into the form) "column comparison-operator value".
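
The effect of search arguments on a segment scan can be illustrated with a small simulation. This is only a sketch under stated assumptions: pages are modeled as Python lists of (relation, tuple) pairs, and the names segment_scan, satisfies, and OPS are hypothetical, not RSI commands. The search argument is written as a list of AND-groups that are ORed together (disjunctive normal form), with each predicate in the form (column, comparison-operator, value).

```python
import operator
from typing import Dict, Iterator, List, Optional, Tuple

# Comparison operators allowed in a sargable predicate (illustrative set).
OPS = {"=": operator.eq, "<": operator.lt, "<=": operator.le,
       ">": operator.gt, ">=": operator.ge, "<>": operator.ne}

Row = Dict[str, object]
Pred = Tuple[str, str, object]   # (column, comparison-operator, value)
Sarg = List[List[Pred]]          # OR of AND-groups (disjunctive normal form)
Page = List[Tuple[str, Row]]     # each stored tuple is tagged with its relation


def satisfies(row: Row, sarg: Sarg) -> bool:
    """True if the row satisfies at least one AND-group of the SARG."""
    return any(all(OPS[op](row[col], val) for col, op, val in group)
               for group in sarg)


def segment_scan(segment: List[Page], relation: str,
                 sarg: Optional[Sarg] = None) -> Iterator[Row]:
    """Visit every page once; yield tuples of `relation` that pass the SARG."""
    for page in segment:
        for tag, row in page:
            if tag != relation:
                continue         # tuple of another relation sharing the page
            if sarg is None or satisfies(row, sarg):
                yield row        # stands in for one tuple returned to the caller


# Hypothetical segment holding tuples of two relations.
segment = [
    [("EMP", {"NAME": "SMITH", "SALARY": 20000}), ("DEPT", {"DNO": 50})],
    [("EMP", {"NAME": "JONES", "SALARY": 15000})],
    [("EMP", {"NAME": "DOE", "SALARY": 9000})],
]

all_rows = list(segment_scan(segment, "EMP"))
# SARG: SALARY = 20000 OR (SALARY < 10000 AND NAME = 'DOE')
sarg = [[("SALARY", "=", 20000)],
        [("SALARY", "<", 10000), ("NAME", "=", "DOE")]]
filtered = list(segment_scan(segment, "EMP", sarg))
print(len(all_rows), len(filtered))
```

Without the search argument the caller receives every EMP tuple and has to reject the unwanted ones itself; with it, the rejected tuple never crosses the simulated interface, which is the saving the text attributes to SARGS.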
SARGS are expressed as a boolean expression of such predicates in disjunctive normal form.

4. Costs for single relation access paths

In the next several sections we will describe the process of choosing a plan for evaluating a query. We will first describe the simplest case, accessing a single relation, and show how it extends and generalizes to 2-way joins of relations, n-way joins, and finally multiple query blocks (nested queries).

The OPTIMIZER examines both the predicates in the query and the access paths available on the relations referenced by the query, and formulates a cost prediction for each access plan, using the following cost formula:

COST = PAGE FETCHES + W * (RSI CALLS).

This cost is a weighted measure of I/O (pages fetched) and CPU utilization (instructions executed). W is an adjustable weighting factor between I/O and CPU. RSI CALLS is the predicted number of tuples returned from the RSS. Since most of System R's CPU time is spent in the RSS, the number of RSI calls is a good approximation for CPU utilization. Thus the choice of a minimum cost path to process a query attempts to minimize total resources required.

During execution of the type-compatibility and semantic checking portion of the OPTIMIZER, each query block's WHERE tree of predicates is examined. The WHERE tree is considered to be in conjunctive normal form, and every conjunct is called a boolean factor. Boolean factors are notable because every tuple returned to the user must satisfy every boolean factor. An index is said to match a boolean factor if the boolean factor is a sargable predicate whose referenced column is the index key; e.g., an index on SALARY matches the predicate 'SALARY = 20000'. More precisely, we say that a predicate or set of predicates matches an index access path when the predicates are sargable and the columns mentioned in the predicate(s) are an initial substring of the set of columns of the index key. For example, a NAME, LOCATION index matches NAME = 'SMITH' AND LOCATION = 'SAN JOSE'. If an index matches a boolean factor, an access using that index is an efficient way to satisfy the boolean factor. Sargable boolean factors can also be efficiently satisfied if they are expressed as search arguments. Note that a boolean factor may be an entire tree of predicates headed by an OR.

During catalog lookup, the OPTIMIZER retrieves statistics on the relations in the query and on the access paths available on each relation. The statistics kept are the following:

For each relation T,
- NCARD(T), the cardinality of relation T.
- TCARD(T), the number of pages in the segment that hold tuples of relation T.
- P(T), the fraction of data pages in the segment that hold tuples of relation T. P(T) = TCARD(T) / (no. of non-empty pages in the segment).

For each index I on relation T,
- ICARD(I), number of distinct keys in index I.
- NINDX(I), the number of pages in index I.

These statistics are maintained in the System R catalogs, and come from several sources. Initial relation loading and index creation initialize these statistics. They are then updated periodically by an UPDATE STATISTICS command, which can be run by any user. System R does not update these statistics at every INSERT, DELETE, or UPDATE because of the extra database operations and the locking bottleneck this would create at the system catalogs. Dynamic updating of statistics would tend to serialize accesses that modify the relation contents.

Using these statistics, the OPTIMIZER assigns a selectivity factor 'F' for each boolean factor in the predicate list. This selectivity factor very roughly corresponds to the expected fraction of tuples which will satisfy the predicate. TABLE 1 gives the selectivity factors for different kinds of predicates. We assume that a lack of statistics implies that the relation is small, so an arbitrary factor is chosen.
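
To make the cost formula, the P(T) statistic, and the index-matching rule from this section concrete, the sketch below evaluates COST = PAGE FETCHES + W * (RSI CALLS) for two candidate plans and checks whether predicate columns form an initial substring of an index key. The function names, the value of W, and all page and call counts are hypothetical inputs chosen only for illustration; how System R predicts page fetches and RSI calls for a given access path is not reproduced here. Treating the predicate columns as an unordered set is also an assumption of this sketch.

```python
from typing import Dict, Sequence, Set, Tuple


def cost(page_fetches: float, rsi_calls: float, w: float) -> float:
    """COST = PAGE FETCHES + W * (RSI CALLS)."""
    return page_fetches + w * rsi_calls


def fraction_of_pages(tcard: int, non_empty_pages: int) -> float:
    """P(T) = TCARD(T) / (no. of non-empty pages in the segment)."""
    return tcard / non_empty_pages


def index_matches(index_key: Sequence[str], predicate_columns: Set[str]) -> bool:
    """True if the predicate columns are exactly an initial run of the index key."""
    n = len(predicate_columns)
    return 0 < n <= len(index_key) and set(index_key[:n]) == predicate_columns


# Hypothetical estimates for two ways of reading the same relation.
W = 0.5  # assumed weighting between I/O and CPU
plans: Dict[str, Tuple[float, float]] = {
    "segment scan": (200.0, 1000.0),       # (page fetches, RSI calls)
    "index scan on SALARY": (30.0, 40.0),
}
costs = {name: cost(pages, calls, W) for name, (pages, calls) in plans.items()}
best = min(costs, key=costs.get)
print(costs, best)

# A NAME, LOCATION index matches predicates on NAME and LOCATION together,
# but not a predicate on LOCATION alone.
print(index_matches(["NAME", "LOCATION"], {"NAME", "LOCATION"}))
print(index_matches(["NAME", "LOCATION"], {"LOCATION"}))
print(fraction_of_pages(50, 200))
```

With these made-up inputs the index scan is chosen because both its page fetches and its RSI calls are lower; a different W changes only how strongly the tuple count weighs against I/O.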
