
Chapter 11: Storage and File Structure

Rev. Aug 1, 2008

Database System Concepts, 5th Ed.
Silberschatz, Korth and Sudarshan
See www.db-book.com for conditions on re-use

Chapter 11: Storage and File Structure
• Overview of Physical Storage Media
• Magnetic Disks
• RAID
• Tertiary Storage
• Storage Access
• File Organization
• Organization of Records in Files
• Data-Dictionary Storage

Classification of Physical Storage Media
• Speed with which data can be accessed
• Cost per unit of data
• Reliability
  - data loss on power failure or system crash
  - physical failure of the storage device
• Can differentiate storage into:
  - volatile storage: loses contents when power is switched off
  - non-volatile storage:
    - Contents persist even when power is switched off.
    - Includes secondary and tertiary storage, as well as battery-backed-up main memory.

Physical Storage Media
• Cache: fastest and most costly form of storage; volatile; managed by the computer system hardware
  - (Note: "cache" is pronounced as "cash")
• Main memory:
  - fast access (10s to 100s of nanoseconds; 1 nanosecond = 10^-9 seconds)
  - generally too small (or too expensive) to store the entire database
    - capacities of up to a few gigabytes widely used currently
    - Capacities have gone up and per-byte costs have decreased steadily and rapidly (roughly factor of 2 every 2 to 3 years)
  - Volatile: contents of main memory are usually lost if a power failure or system crash occurs.

Physical Storage Media (Cont.)
• Flash memory
  - Data survives power failure
  - Data can be written at a location only once, but location can be erased and written to again
    - Can support only a limited number (10K to 1M) of write/erase cycles.
    - Erasing of memory has to be done to an entire bank of memory
  - Reads are roughly as fast as main memory
  - But writes are slow (few microseconds), erase is slower

Physical Storage Media (Cont.)
• Flash memory
  - NOR flash
    - Fast reads, very slow erase, lower capacity
    - Used to store program code in many embedded devices
  - NAND flash
    - Page-at-a-time read/write, multi-page erase
    - High capacity (several GB)
    - Widely used as data storage mechanism in portable devices

Physical Storage Media (Cont.)
• Magnetic disk
  - Data is stored on spinning disk, and read/written magnetically
  - Primary medium for the long-term storage of data; typically stores entire database.
  - Data must be moved from disk to main memory for access, and written back for storage
  - Direct access: possible to read data on disk in any order, unlike magnetic tape
  - Survives power failures and system crashes
    - disk failure can destroy data; this is rare but does happen

Physical Storage Media (Cont.)
• Optical storage
  - non-volatile, data is read optically from a spinning disk using a laser
  - CD-ROM (640 MB) and DVD (4.7 to 17 GB) most popular forms
  - Write-once, read-many (WORM) optical disks used for archival storage (CD-R, DVD-R, DVD+R)
  - Multiple-write versions also available (CD-RW, DVD-RW, DVD+RW, and DVD-RAM)
  - Reads and writes are slower than with magnetic disk
  - Juke-box systems, with large numbers of removable disks, a few drives, and a mechanism for automatic loading/unloading of disks, available for storing large volumes of data

Physical Storage Media (Cont.)
• Tape storage
  - non-volatile, used primarily for backup (to recover from disk failure), and for archival data
  - sequential access: much slower than disk
  - very high capacity (40 to 300 GB tapes available)
  - tape can be removed from drive, so storage costs much cheaper than disk, but drives are expensive
  - Tape jukeboxes available for storing massive amounts of data
    - hundreds of terabytes (1 terabyte = 10^12 bytes) to even a petabyte (1 petabyte = 10^15 bytes)

Storage Hierarchy

[Figure: the storage hierarchy]

Storage Hierarchy (Cont.)
• primary storage: Fastest media but volatile (cache, main memory).
• secondary storage: next level in hierarchy, non-volatile, moderately fast access time
  - also called on-line storage
  - E.g. flash memory, magnetic disks
• tertiary storage: lowest level in hierarchy, non-volatile, slow access time
  - also called off-line storage
  - E.g. magnetic tape, optical storage

Magnetic Hard Disk Mechanism

[Figure: magnetic hard disk mechanism]

NOTE: Diagram is schematic, and simplifies the structure of actual disk drives

Magnetic Disks
• Read-write head
  - Positioned very close to the platter surface (almost touching it)
  - Reads or writes magnetically encoded information.
• Surface of platter divided into circular tracks
  - Over 50K to 100K tracks per platter on typical hard disks
• Each track is divided into sectors.
  - Sector size typically 512 bytes
  - Typical sectors per track: 500 (on inner tracks) to 1000 (on outer tracks)
• To read/write a sector
  - disk arm swings to position head on right track
  - platter spins continually; data is read/written as sector passes under head

Magnetic Disks (Cont.)
• Head-disk assemblies
  - multiple disk platters on a single spindle (1 to 5 usually)
  - one head per platter, mounted on a common arm.
• Cylinder i consists of the ith track of all the platters
• Earlier generation disks were susceptible to head crashes leading to loss of all data on disk
  - Current generation disks are less susceptible to such disastrous failures, but individual sectors may get corrupted

Disk Controller
• Disk controller: interfaces between the computer system and the disk drive hardware.
  - accepts high-level commands to read or write a sector
  - initiates actions such as moving the disk arm to the right track and actually reading or writing the data
  - Computes and attaches checksums to each sector to verify that data is read back correctly
    - If data is corrupted, with very high probability stored checksum won't match recomputed checksum
  - Ensures successful writing by reading back sector after writing it
  - Performs remapping of bad sectors

Disk Subsystem

• Disk interface standards families
  - ATA (AT adaptor) range of standards
  - SATA (Serial ATA)
  - SCSI (Small Computer System Interface) range of standards
  - Several variants of each standard (different speeds and capabilities)

Performance Measures of Disks
• Access time: the time it takes from when a read or write request is issued to when data transfer begins. Consists of:
  - Seek time: time it takes to reposition the arm over the correct track.
    - Average seek time is 1/2 the worst case seek time.
    - Would be 1/3 if all tracks had the same number of sectors, and we ignore the time to start and stop arm movement
    - 4 to 10 milliseconds on typical disks
  - Rotational latency: time it takes for the sector to be accessed to appear under the head.
    - Average latency is 1/2 of the worst case latency.
    - One full rotation (the worst case) takes 4 to 11 milliseconds on typical disks (5400 to 15000 r.p.m.), so the average latency is roughly 2 to 5.5 milliseconds
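
The relationship between rotation speed and rotational latency can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the original slides; the function names are made up for illustration, and the seek time passed in is simply a value from the 4 to 10 millisecond range quoted above.

```python
def rotation_time_ms(rpm: float) -> float:
    """Time for one full revolution in milliseconds (worst-case latency)."""
    return 60_000.0 / rpm


def average_rotational_latency_ms(rpm: float) -> float:
    """On average the wanted sector is half a revolution away."""
    return rotation_time_ms(rpm) / 2


def average_access_time_ms(avg_seek_ms: float, rpm: float) -> float:
    """Access time = seek time + rotational latency (averages)."""
    return avg_seek_ms + average_rotational_latency_ms(rpm)


if __name__ == "__main__":
    for rpm in (5400, 7200, 15000):
        print(rpm, "rpm:",
              round(rotation_time_ms(rpm), 2), "ms per rotation,",
              round(average_rotational_latency_ms(rpm), 2), "ms average latency")
```

At 15000 r.p.m. a rotation takes 4 ms and at 5400 r.p.m. about 11 ms, which is where the slide's 4 to 11 millisecond range comes from.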

Performance Measures (Cont.)
• Data-transfer rate: the rate at which data can be retrieved from or stored to the disk.
  - 25 to 100 MB per second max rate, lower for inner tracks
  - Multiple disks may share a controller, so rate that controller can handle is also important
    - E.g. ATA-5: 66 MB/sec, SATA: 150 MB/sec, Ultra 320 SCSI: 320 MB/s
    - Fiber Channel (FC2Gb): 256 MB/s

Performance Measures (Cont.)
• Mean time to failure (MTTF): the average time the disk is expected to run continuously without any failure.
  - Typically 3 to 5 years
  - Probability of failure of new disks is quite low, corresponding to a "theoretical MTTF" of 500,000 to 1,200,000 hours for a new disk
    - E.g., an MTTF of 1,200,000 hours for a new disk means that given 1000 relatively new disks, on an average one will fail every 1200 hours
  - MTTF decreases as disk ages
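
The 1200-hour figure follows from dividing the per-disk MTTF by the number of disks. The sketch below is illustrative only; it assumes disks fail independently and at a constant rate, and the function name is hypothetical.

```python
def hours_between_failures(mttf_hours: float, num_disks: int) -> float:
    """Expected time between failures somewhere in a population of disks.

    Assumes independent failures at a constant rate, so the failure
    rates of the individual disks simply add up.
    """
    return mttf_hours / num_disks


if __name__ == "__main__":
    # The slide's example: 1000 disks, each with an MTTF of 1,200,000 hours.
    print(hours_between_failures(1_200_000, 1000))
```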

Optimization of Disk-Block Access
• Block: a contiguous sequence of sectors from a single track
  - data is transferred between disk and main memory in blocks
  - Typical block sizes today range from 4 to 16 kilobytes
• Disk-arm-scheduling algorithms order pending accesses to tracks so that disk arm movement is minimized
  - elevator algorithm: move disk arm in one direction (from outer to inner tracks or vice versa), processing next request in that direction, till no more requests in that direction, then reverse direction and repeat
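
The elevator algorithm described above can be sketched as a function that orders a fixed set of pending track requests. This is a simplified illustration under stated assumptions: requests are plain track numbers, no new requests arrive while the arm is moving, and the name `elevator_order` is made up for this example.

```python
def elevator_order(pending, head, direction=1):
    """Return the order in which pending track requests are served.

    pending:   iterable of track numbers with outstanding requests
    head:      current track of the disk arm
    direction: +1 if the arm is moving toward higher track numbers,
               -1 if it is moving toward lower track numbers
    """
    # Requests that lie in the current direction of travel (or at the head).
    ahead = sorted(t for t in pending if (t - head) * direction >= 0)
    # Requests that lie behind the arm; served after the arm reverses.
    behind = sorted(t for t in pending if (t - head) * direction < 0)
    if direction == 1:
        return ahead + behind[::-1]
    return ahead[::-1] + behind


if __name__ == "__main__":
    print(elevator_order([10, 95, 52, 40, 70], head=50, direction=1))
```

With the arm at track 50 and moving upward, the requests are served as 52, 70, 95, then 40 and 10 on the way back.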

Optimization of Disk-Block Access (Cont.)
• File organization: optimize block access time by organizing the blocks to correspond to how data will be accessed
  - E.g. store related information on the same or nearby blocks/cylinders.
    - File systems attempt to allocate contiguous chunks of blocks (e.g. 8 or 16 blocks) to a file
  - Files may get fragmented over time
    - E.g. if data is inserted to/deleted from the file
    - Or free blocks on disk are scattered, and newly created file has its blocks scattered over the disk
    - Sequential access to a fragmented file results in increased disk arm movement
  - Some systems have utilities to defragment the file system, in order to speed up file access

Optimization of Disk-Block Access (Cont.)
• Non-volatile write buffers speed up disk writes by writing blocks to a non-volatile RAM buffer immediately
  - Non-volatile RAM: battery-backed-up RAM or flash memory
    - Even if power fails, the data is safe and will be written to disk when power returns
  - Controller then writes to disk whenever the disk has no other requests or request has been pending for some time
  - Database operations that require data to be safely stored before continuing can continue without waiting for data to be written to disk
  - Writes can be reordered to minimize disk arm movement

Optimization of Disk-Block Access (Cont.)
• Log disk: a disk devoted to writing a sequential log of block updates
  - Used exactly like non-volatile RAM
    - Write to log disk is very fast since no seeks are required
    - No need for special hardware (NV-RAM)
• File systems typically reorder writes to disk to improve performance
  - Journaling file systems write data in safe order to NV-RAM or log disk
  - Reordering without journaling: risk of corruption of file system data

RAID
• RAID: Redundant Arrays of Independent Disks
  - disk organization techniques that manage a large number of disks, providing a view of a single disk of
    - high capacity and high speed by using multiple disks in parallel, and
    - high reliability by storing data redundantly, so that data can be recovered even if a disk fails
• The chance that some disk out of a set of N disks will fail is much higher than the chance that a specific single disk will fail.
  - E.g., a system with 100 disks, each with MTTF of 100,000 hours (approx. 11 years), will have a system MTTF of 1000 hours (approx. 41 days)

Improvement of Reliability via Redundancy
• Redundancy: store extra information that can be used to rebuild information lost in a disk failure
• E.g., Mirroring (or shadowing)
  - Duplicate every disk. Logical disk consists of two physical disks.
  - Every write is carried out on both disks
    - Reads can take place from either disk
  - If one disk in a pair fails, data still available in the other
    - Data loss would occur only if a disk fails, and its mirror disk also fails before the system is repaired
    - Probability of combined event is very small, except for dependent failure modes such as fire or building collapse or electrical power surges

Improvement of Reliability via Redundancy
• Mean time to data loss depends on mean time to failure, and mean time to repair
  - E.g. MTTF of 100,000 hours, mean time to repair of 10 hours gives mean time to data loss of 500 * 10^6 hours (or 57,000 years) for a mirrored pair of disks (ignoring dependent failure modes)
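
The slide does not state the formula behind the 500 * 10^6 figure. A common approximation for a mirrored pair, assumed here, is MTTF^2 / (2 * MTTR): data is lost only when the second disk fails during the repair window of the first. The sketch below uses that assumed formula and a hypothetical function name; it reproduces the slide's numbers.

```python
HOURS_PER_YEAR = 24 * 365


def mirrored_pair_mttdl_hours(mttf_hours: float, mttr_hours: float) -> float:
    """Approximate mean time to data loss for a mirrored pair of disks.

    Assumed model: independent failures; data is lost only if the
    surviving disk fails while the failed one is still being repaired.
    """
    return mttf_hours ** 2 / (2 * mttr_hours)


if __name__ == "__main__":
    hours = mirrored_pair_mttdl_hours(100_000, 10)
    print(hours, "hours, about", round(hours / HOURS_PER_YEAR), "years")
```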

Improvement in Performance via Parallelism
• Two main goals of parallelism in a disk system:
  1. Load balance multiple small accesses to increase throughput
  2. Parallelize large accesses to reduce response time.
• Improve transfer rate by striping data across multiple disks.
• Bit-level striping: split the bits of each byte across multiple disks
  - But seek/access time worse than for a single disk
    - Bit-level striping is not used much any more
• Block-level striping: with n disks, block i of a file goes to disk (i mod n) + 1
  - Requests for different blocks can run in parallel if the blocks reside on different disks
  - A request for a long sequence of blocks can utilize all disks in parallel
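
The block-level striping rule, disk (i mod n) + 1, is easy to check in code. This is a minimal sketch; the function name is illustrative and disks are numbered from 1 as on the slide.

```python
def disk_for_block(i: int, n: int) -> int:
    """Disk (numbered 1..n) that holds block i under block-level striping."""
    return (i % n) + 1


if __name__ == "__main__":
    n = 4
    # Eight consecutive blocks of a file spread round-robin over 4 disks,
    # so a long sequential read keeps all disks busy.
    print([disk_for_block(i, n) for i in range(8)])
```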

RAID Levels
• RAID organizations, or RAID levels, have differing cost, performance and reliability characteristics
• RAID Level 0: Block striping; non-redundant.
  - Used in high-performance applications where data loss is not critical.
• RAID Level 1: Mirrored disks with block striping
  - Offers best write performance.
  - Popular for applications such as storing log files in a database system.

RAID Levels (Cont.)
• RAID Level 2: Memory-Style Error-Correcting-Codes (ECC) with bit striping.
• RAID Level 3: Bit-Interleaved Parity
  - a single parity bit is enough for error correction, not just detection, because the disk controller already knows which disk has failed
    - When writing data, corresponding parity bits must also be computed and written to a parity bit disk
    - To recover data in a damaged disk, compute XOR of bits from other disks (including parity bit disk)

RAID Levels (Cont.)
• RAID Level 3 (Cont.)
  - Faster data transfer than with a single disk, but fewer I/Os per second since every disk has to participate in every I/O.
• RAID Level 4: Block-Interleaved Parity; uses block-level striping, and keeps a parity block on a separate disk for corresponding blocks from N other disks.
  - When writing data block, corresponding block of parity bits must also be computed and written to parity disk
  - To find value of a damaged block, compute XOR of bits from corresponding blocks (including parity block) from other disks.

RAID Levels (Cont.)
• RAID Level 4 (Cont.)
  - Provides higher I/O rates for independent block reads than Level 3
    - block read goes to a single disk, so blocks stored on different disks can be read in parallel
  - Before writing a block, parity data must be computed
    - Can be done by using old parity block, old value of current block and new value of current block (2 block reads + 2 block writes)
    - Or by recomputing the parity value using the new values of blocks corresponding to the parity block
      - More efficient for writing large amounts of data sequentially
  - Parity block becomes a bottleneck for independent block writes since every block write also writes to parity disk
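
The parity operations described for Levels 3 and 4 are all bytewise XOR. The sketch below is an illustration, not a RAID implementation: blocks are short `bytes` values of equal length, and `xor_blocks` is a helper name invented for this example. It shows computing a parity block, rebuilding a lost block, and the "2 reads + 2 writes" small-write update.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


if __name__ == "__main__":
    d0, d1, d2 = b"\x0f\x10", b"\xf0\x01", b"\x33\x44"

    # Parity block for the three corresponding data blocks.
    parity = xor_blocks(d0, d1, d2)

    # Recover a damaged block from the surviving blocks plus parity.
    assert xor_blocks(d0, d2, parity) == d1

    # Small write: update parity from old parity, old block and new block,
    # without reading the other data blocks.
    new_d1 = b"\xaa\x55"
    new_parity = xor_blocks(parity, d1, new_d1)
    assert new_parity == xor_blocks(d0, new_d1, d2)
```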

RAID Levels (Cont.)
• RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk.
  - E.g., with 5 disks, parity block for nth set of blocks is stored on disk (n mod 5) + 1, with the data blocks stored on the other 4 disks.
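
The parity placement rule for the 5-disk example can be written out directly. This is a small sketch with an illustrative function name; it only computes which disk holds parity and which hold data for the nth set of blocks.

```python
def raid5_layout(n: int, num_disks: int = 5):
    """Return (parity_disk, data_disks) for the nth set of blocks.

    Disks are numbered 1..num_disks; parity goes to disk (n mod num_disks) + 1.
    """
    parity_disk = (n % num_disks) + 1
    data_disks = [d for d in range(1, num_disks + 1) if d != parity_disk]
    return parity_disk, data_disks


if __name__ == "__main__":
    for n in range(6):
        print(n, raid5_layout(n))
```

Because the parity disk rotates from one set of blocks to the next, no single disk receives every parity write.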

RAID Levels (Cont.)
• RAID Level 5 (Cont.)
  - Higher I/O rates than Level 4.
    - Block writes occur in parallel if the blocks and their parity blocks are on different disks.
  - Subsumes Level 4: provides same benefits, but avoids bottleneck of parity disk.
• RAID Level 6: P+Q Redundancy scheme; similar to Level 5, but stores extra redundant information to guard against multiple disk failures.
  - Better reliability than Level 5 at a higher cost; not used as widely.

Choice of RAID Level
• Factors in choosing RAID level
  - Monetary cost
  - Performance: Number of I/O operations per second, and bandwidth during normal operation
  - Performance during failure
  - Performance during rebuild of failed disk
    - Including time taken to rebuild failed disk
• RAID 0 is used only when data safety is not important
  - E.g. data can be recovered quickly from other sources
• Level 2 and 4 never used since they are subsumed by 3 and 5
• Level 3 is not used since bit-striping forces single block reads to access all disks, wasting disk arm movement
• Level 6 is rarely used since levels 1 and 5 offer adequate safety for most applications
• So competition is mainly between 1 and 5

Choice of RAID Level (Cont.)
• Level 1 provides much better write performance than level 5
  - Level 5 requires at least 2 block reads and 2 block writes to write a single block, whereas Level 1 only requires 2 block writes
  - Level 1 preferred for high update environments such as log disks
• Level 1 had higher storage cost than level 5
  - disk drive capacities increasing rapidly (50%/year) whereas disk access times have decreased much less (x 3 in 10 years)
  - I/O requirements have increased greatly, e.g. for Web servers
  - When enough disks have been bought to satisfy required rate of I/O, they often have spare storage capacity
    - so there is often no extra monetary cost for Level 1!
• Level 5 is preferred for applications with low update rate, and large amounts of data
• Level 1 is preferred for all other applications

Hardware Issues
• Software RAID: RAID implementations done entirely in software, with no special hardware support
• Hardware RAID: RAID implementations with special hardware
  - Use non-volatile RAM to record writes that are being executed
  - Beware: power failure during write can result in corrupted disk
    - E.g. failure after writing one block but before writing the second in a mirrored system
    - Such corrupted data must be detected when power is restored
      - Recovery from corruption is similar to recovery from failed disk
      - NV-RAM helps to efficiently detect potentially corrupted blocks
        - Otherwise all blocks of disk must be read and compared with mirror/parity block

Hardware Issues (Cont.)
• Hot swapping: replacement of disk while system is running, without power down
  - Supported by some hardware RAID systems,
  - reduces time to recovery, and improves availability greatly
• Many systems maintain spare disks which are kept online, and used as replacements for failed disks immediately on detection of failure
  - Reduces time to recovery greatly
• Many hardware RAID systems ensure that a single point of failure will not stop the functioning of the system by using
  - Redundant power supplies with battery backup
  - Multiple controllers and multiple interconnections to guard against controller/interconnection failures

RAID Terminology in the Industry
• RAID terminology not very standard in the industry
  - E.g. many vendors use
    - RAID 1: for mirroring without striping
    - RAID 10 or RAID 1+0: for mirroring with striping
  - "Hardware RAID" implementations often just offload RAID processing onto a separate subsystem, but don't offer NV-RAM.
    - Read the specs carefully!
• Software RAID supported directly in most operating systems today

Optical Disks
• Compact disk-read only memory (CD-ROM)
  - Seek time about 100 msec (optical read head is heavier and slower)
  - Higher latency (3000 RPM) and lower data-transfer rates (3 to 6 MB/s) compared to magnetic disks
• Digital Video Disk (DVD)
  - DVD-5 holds 4.7 GB, variants up to 17 GB
  - Slow seek time, for same reasons as CD-ROM
• Record-once versions (CD-R and DVD-R)

Magnetic Tapes
• Hold large volumes of data and provide high transfer rates
  - Few GB for DAT (Digital Audio Tape) format, 10 to 40 GB with DLT (Digital Linear Tape) format, 100 to 400 GB+ with Ultrium format, and 330 GB with Ampex helical scan format
  - Transfer rates from few to 10s of MB/s
• Currently the cheapest storage medium
  - Tapes are cheap, but cost of drives is very high
• Very slow access time in comparison to magnetic disks and optical disks
  - limited to sequential access.
  - Some formats (Accelis) provide faster seek (10s of seconds) at cost of lower capacity
• Used mainly for backup, for storage of infrequently used information, and as an off-line medium for transferring information from one system to another.
• Tape jukeboxes used for very large capacity storage
  - terabyte (10^12 bytes) to petabyte (10^15 bytes)

Storage Access
• A database file is partitioned into fixed-length storage units called blocks. Blocks are units of both storage allocation and data transfer.
• Database system seeks to minimize the number of block transfers between the disk and memory. We can reduce the number of disk accesses by keeping as many blocks as possible in main memory.
• Buffer: portion of main memory available to store copies of disk blocks.
• Buffer manager: subsystem responsible for allocating buffer space in main memory.

Buffer Manager
• Programs call on the buffer manager when they need a block from disk.
• Buffer manager does the following:
  - If the block is already in the buffer, return the address of the block in main memory
  - If the block is not in the buffer
    1. Allocate space in the buffer for the block
       - Replacing (throwing out) some other block, if required, to make space for the new block.
       - Replaced block written back to disk only if it was modified since the most recent time that it was written to/fetched from the disk.
    2. Read the block from the disk to the buffer, and return the address of the block in main memory to requester.
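
The buffer manager steps above can be sketched with a small in-memory model. This is an illustration under several assumptions that are not in the slides: the "disk" is a Python dict from block id to contents, the replacement policy is LRU, the block contents are returned instead of a memory address, and the class and method names are invented for this example.

```python
from collections import OrderedDict


class BufferManager:
    """Toy buffer manager: LRU replacement, write-back of modified blocks."""

    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk              # stand-in for the disk: block id -> contents
        self.frames = OrderedDict()   # block id -> [contents, dirty]; LRU first
        self.reads = 0                # number of block reads from "disk"
        self.writes = 0               # number of block writes to "disk"

    def get(self, block_id):
        # Block already buffered: just mark it as most recently used.
        if block_id in self.frames:
            self.frames.move_to_end(block_id)
            return self.frames[block_id][0]
        # Otherwise make space, throwing out the least recently used block.
        if len(self.frames) >= self.capacity:
            victim, (data, dirty) = self.frames.popitem(last=False)
            if dirty:                 # write back only if it was modified
                self.disk[victim] = data
                self.writes += 1
        # Read the requested block from disk into the buffer.
        self.frames[block_id] = [self.disk[block_id], False]
        self.reads += 1
        return self.frames[block_id][0]

    def update(self, block_id, data):
        self.get(block_id)            # ensure the block is buffered
        self.frames[block_id] = [data, True]


if __name__ == "__main__":
    disk = {1: "a", 2: "b", 3: "c"}
    bm = BufferManager(capacity=2, disk=disk)
    bm.get(1)
    bm.update(2, "B")
    bm.get(1)        # hit; block 2 is now least recently used
    bm.get(3)        # evicts block 2, which is dirty and is written back
    print(disk, bm.reads, bm.writes)
```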

Buffer-Replacement Policies
• Most operating systems replace the block least recently used (LRU strategy)
• Idea behind LRU: use past pattern of block references as a predictor of future references
• Queries have well-defined access patterns (such as sequential scans), and a database system can use the information in a user's query to predict future references
  - LRU can be a bad strategy for certain access patterns involving repeated scans of data
    - e.g. when computing the join of 2 relations r and s by nested loops
        for each tuple tr of r do
          for each tuple ts of s do
            if the tuples tr and ts match ...
  - Mixed strategy with hints on replacement strategy provided by the query optimizer is preferable

Buffer-Replacement Policies (Cont.)
• Pinned block: memory block that is not allowed to be written back to disk.
• Toss-immediate strategy: frees the space occupied by a block as soon as the final tuple of that block has been processed
• Most recently used (MRU) strategy: system must pin the block currently being processed. After the final tuple of that block has been processed, the block is unpinned, and it becomes the most recently used block.
• Buffer manager can use statistical information regarding the probability that a request will reference a particular relation
  - E.g., the data dictionary is frequently accessed. Heuristic: keep data-dictionary blocks in main memory buffer
• Buffer managers also support forced output of blocks for the purpose of recovery (more in Chapter 17)
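
A small simulation shows why LRU does badly on the repeated scans of a nested-loop join, and why evicting the most recently used block helps. This is a simplified sketch: it models only the inner relation's blocks, ignores pinning, and uses a hypothetical `count_misses` function with made-up block numbers.

```python
def count_misses(references, capacity, policy):
    """Count buffer misses for a sequence of block references.

    policy is "LRU" (evict least recently used) or "MRU" (evict most
    recently used). The list keeps blocks ordered from least to most
    recently used.
    """
    buffer = []
    misses = 0
    for block in references:
        if block in buffer:
            buffer.remove(block)          # hit: refresh its position below
        else:
            misses += 1
            if len(buffer) >= capacity:
                buffer.pop(0 if policy == "LRU" else -1)
        buffer.append(block)              # block is now most recently used
    return misses


if __name__ == "__main__":
    # Inner relation s occupies 4 blocks and is scanned 3 times,
    # but only 3 buffer frames are available.
    scans = [0, 1, 2, 3] * 3
    print("LRU misses:", count_misses(scans, 3, "LRU"))
    print("MRU misses:", count_misses(scans, 3, "MRU"))
```

Under LRU every reference misses, because each block is evicted just before it is needed again; under MRU most of the relation stays resident across scans.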

File Organization
• The database is stored as a collection of files. Each file is a sequence of records. A record is a sequence of fields.
• One approach:
  - assume record size is fixed
  - each file has records of one particular type only
  - different files are used for different relations
  This case is easiest to implement; will consider variable-length records later.

Fixed-Length Records
• Simple approach:
  - Store record i starting from byte n * (i - 1), where n is the size of each record.
  - Record access is simple but records may cross blocks
    - Modification: do not allow records to cross block boundaries
• Deletion of record i: alternatives:
  - move records i + 1, ..., n to i, ..., n - 1
  - move record n to i
  - do not move records, but link all free records on a free list

Free Lists
• Store the address of the first deleted record in the file header.
• Use this first record to store the address of the second deleted record, and so on
• Can think of these stored addresses as pointers since they "point" to the location of a record.
• More space efficient representation: reuse space for normal attributes of free records to store pointers. (No pointers stored in in-use records.)
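
The free-list idea can be sketched with a list of record slots, where a deleted slot's space is reused to hold the index of the next free slot. This is a toy model, not a file format: slots are Python list entries rather than byte ranges, `free_head` plays the role of the file header, and the class name is invented for this example.

```python
class FixedLengthFile:
    """Fixed-length record slots with a free list threaded through deleted slots."""

    def __init__(self):
        self.slots = []         # slot i holds a record, or the next free slot index
        self.free_head = None   # "file header": index of the first free slot

    def insert(self, record):
        if self.free_head is None:
            # No deleted slot to reuse: append at the end of the file.
            self.slots.append(record)
            return len(self.slots) - 1
        i = self.free_head
        self.free_head = self.slots[i]   # follow the pointer stored in the free slot
        self.slots[i] = record
        return i

    def delete(self, i):
        # Reuse the deleted record's space to store the free-list pointer.
        self.slots[i] = self.free_head
        self.free_head = i


if __name__ == "__main__":
    f = FixedLengthFile()
    for rec in ("r0", "r1", "r2"):
        f.insert(rec)
    f.delete(1)
    f.delete(0)
    print(f.free_head, f.slots)   # header points to slot 0, which points to slot 1
    print(f.insert("new"))        # reuses slot 0
```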

Variable-Length Records
• Variable-length records arise in database systems in several ways:
  - Storage of multiple record types in a file.
  - Record types that allow variable lengths for one or more fields.
  - Record types that allow repeating fields (used in some older data models).

Variable-Length Records: Slotted Page Structure

• Slotted page header contains:
  - number of record entries
  - end of free space in the block
  - location and size of each record
• Records can be moved around within a page to keep them contiguous with no empty space between them; entry in the header must be updated.
• Pointers should not point directly to record; instead they should point to the entry for the record in header.
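
The slotted-page structure can be sketched as a byte array plus a header of (offset, length) entries. This is a simplified illustration: the header is kept as a Python list rather than inside the page bytes, there is no check for running out of space, and the class and method names are assumptions made for this example. External references use the slot number, so records can move within the page without invalidating them.

```python
class SlottedPage:
    """Toy slotted page: records grow down from the end of the page."""

    def __init__(self, size=4096):
        self.data = bytearray(size)
        self.slots = []          # header entries: [offset, length], None if deleted
        self.free_end = size     # end of free space in the block

    def insert(self, record: bytes) -> int:
        self.free_end -= len(record)
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append([self.free_end, len(record)])
        return len(self.slots) - 1          # slot number, not a byte address

    def read(self, slot: int) -> bytes:
        offset, length = self.slots[slot]
        return bytes(self.data[offset:offset + length])

    def delete(self, slot: int) -> None:
        offset, length = self.slots[slot]
        self.slots[slot] = None
        # Shift the records stored below the deleted one to close the gap.
        self.data[self.free_end + length:offset + length] = \
            self.data[self.free_end:offset]
        # Update the header entries of the records that moved.
        for entry in self.slots:
            if entry is not None and entry[0] < offset:
                entry[0] += length
        self.free_end += length


if __name__ == "__main__":
    page = SlottedPage(64)
    a = page.insert(b"alpha")
    b = page.insert(b"be")
    c = page.insert(b"gamma")
    page.delete(b)
    print(page.read(a), page.read(c), page.free_end)
```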

Organization of Records in Files
• Heap: a record can be placed anywhere in the file where there is space
• Sequential: store records in sequential order, based on the value of the search key of each record
• Hashing: a hash function computed on some attribute of each record; the result specifies in which block of the file the record should be placed
• Records of each relation may be stored in a separate file. In a multitable clustering file organization records of several different relations can be stored in the same file
  - Motivation: store related records on the same block to minimize I/O

Sequential File Organization
• Suitable for applications that require sequential processing of the entire file
• The records in the file are ordered by a search key

Sequential File Organization (Cont.)
• Deletion: use pointer chains
• Insertion: locate the position where the record is to be inserted
  - if there is free space insert there
  - if no free space, insert the record in an overflow block
  - In either case, pointer chain must be updated
• Need to reorganize the file from time to time to restore sequential order

Multitable Clustering File Organization (cont.)
• Store several relations in one file using a multitable clustering file organization
• Multitable clustering organization of customer and depositor:
  - good for queries involving depositor ⋈ customer, and for queries involving one single customer and his accounts
  - bad for queries involving only customer

Data Dictionary Storage
Data dictionary (also called system catalog) stores metadata: that is, data about data, such as
• Information about relations
  - names of relations
  - names and types of attributes of each relation
  - names and definitions of views
  - integrity constraints
• User and accounting information, including passwords
• Statistical and descriptive data
  - number of tuples in each relation
• Physical file organization information
  - How relation is stored (sequential/hash/...)
  - Physical location of relation
• Information about indices (Chapter 12)

Data Dictionary Storage (Cont.)
• Catalog structure
  - Relational representation on disk
  - specialized data structures designed for efficient access, in memory
• A possible catalog representation:
    Relation_metadata = (relation_name, number_of_attributes, storage_organization, location)
    Attribute_metadata = (relation_name, attribute_name, domain_type, position, length)
    User_metadata = (user_name, encrypted_password, group)
    Index_metadata = (relation_name, index_name, index_type, index_attributes)
    View_metadata = (view_name, definition)

Extra Slides

Record Representation
• Records with fixed-length fields are easy to represent
  - Similar to records (structs) in programming languages
  - Extensions to represent null values
    - E.g. a bitmap indicating which attributes are null
• Variable-length fields can be represented by a pair (offset, length), where offset is the location within the record and length is the field length.
  - All fields start at predefined location, but extra indirection required for variable-length fields

[Figure: example record structure of an account record, with fields account_number, branch_name, balance and a null bitmap]

End of Chapter

[Figure: File Containing account Records]

[Figure: File of Figure 11.6, with Record 2 Deleted and All Records Moved]

[Figure: File of Figure 11.6, with Record 2 Deleted and Final Record Moved]

[Figure: Byte-String Representation of Variable-Length Records]

[Figure: Clustering File Structure]

[Figure: Clustering File Structure With Pointer Chains]

[Figure: The depositor Relation]

[Figure: The customer Relation]

[Figure: Clustering File Structure]

[Figure 11.4]

[Figure 11.7]

[Figure 11.8]

[Figure 11.20]

Byte-String Representation of Variable-Length Records

• Byte-string representation
  - Attach an end-of-record control character to the end of each record
  - Difficulty with deletion
  - Difficulty with growth

Fixed-Length Representation
• Use one or more fixed-length records:
  - reserved space
  - pointers
• Reserved space: can use fixed-length records of a known maximum length; unused space in shorter records filled with a null or end-of-record symbol.

Pointer Method

• Pointer method
  - A variable-length record is represented by a list of fixed-length records, chained together via pointers.
  - Can be used even if the maximum record length is not known

Pointer Method (Cont.)
• Disadvantage to pointer structure: space is wasted in all records except the first in a chain.
• Solution is to allow two kinds of block in file:
  - Anchor block: contains the first records of chains
  - Overflow block: contains records other than those that are the first records of chains.

Mapping of Objects to Files
• Mapping objects to files is similar to mapping tuples to files in a relational system; object data can be stored using file structures.
• Objects in O-O databases may lack uniformity and may be very large; such objects have to be managed differently from records in a relational system.
  - Set fields with a small number of elements may be implemented using data structures such as linked lists.
  - Set fields with a larger number of elements may be implemented as separate relations in the database.
  - Set fields can also be eliminated at the storage level by normalization.
    - Similar to conversion of multivalued attributes of E-R diagrams to relations

Mapping of Objects to Files (Cont.)
• Objects are identified by an object identifier (OID); the storage system needs a mechanism to locate an object given its OID (this action is called dereferencing).
  - logical identifiers do not directly specify an object's physical location; must maintain an index that maps an OID to the object's actual location.
  - physical identifiers encode the location of the object so the object can be found directly. Physical OIDs typically have the following parts:
    1. a volume or file identifier
    2. a page identifier within the volume or file
    3. an offset within the page

Management of Persistent Pointers
• Physical OIDs may contain a unique identifier. This identifier is also stored in the object and is used to detect references via dangling pointers.

Management of Persistent Pointers (Cont.)
s Implement persistent pointers using OIDs; persistent pointers are substantially longer than in-memory pointers.
s Pointer swizzling cuts down on the cost of locating persistent objects that are already in memory.
s Software swizzling (swizzling on pointer dereference)
q When a persistent pointer is first dereferenced, the pointer is swizzled (replaced by an in-memory pointer) after the object is located in memory.
q Subsequent dereferences of the same pointer become cheap.
q The physical location of an object in memory must not change if swizzled pointers point to it; the solution is to pin pages in memory.
q When an object is written back to disk, any swizzled pointers it contains need to be unswizzled.
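Software swizzling can be illustrated with a small sketch. The classes `PersistentPointer` and `ObjectStore`, and the dictionary standing in for the on-disk database, are assumptions made for the example.

```python
class PersistentPointer:
    """A pointer field holding an OID and, once swizzled, a direct reference."""
    def __init__(self, oid):
        self.oid = oid
        self.target = None          # in-memory reference, set when swizzled

class ObjectStore:
    def __init__(self, disk):
        self.disk = disk            # oid -> object; stand-in for the database
        self.lookups = 0            # counts the expensive OID lookups

    def deref(self, ptr):
        if ptr.target is None:      # first dereference: locate, then swizzle
            self.lookups += 1
            ptr.target = self.disk[ptr.oid]
        return ptr.target           # later dereferences skip the lookup

    def unswizzle(self, ptr):
        """Before write-back, keep only the OID in the pointer."""
        ptr.target = None
```

Only the first call to `deref` pays for the lookup. A real system would also pin the page holding the target while swizzled pointers refer to it, which this sketch omits.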


Hardware Swizzling
s With hardware swizzling, persistent pointers in objects need the same amount of space as in-memory pointers; extra storage external to the object is used to store the rest of the pointer information.
s Uses the virtual memory translation mechanism to efficiently and transparently convert between persistent pointers and in-memory pointers.
s All persistent pointers in a page are swizzled when the page is first read in.
q Thus programmers have to work with just one type of pointer, i.e., the in-memory pointer.
q Some of the swizzled pointers may point to virtual memory addresses that are currently not allocated any real memory (and do not contain valid data).


Hardware Swizzling
s A persistent pointer is conceptually split into two parts: a page identifier, and an offset within the page.
q The page identifier in a pointer is a short indirect pointer: each page has a translation table that provides a mapping from the short page identifiers to full database page identifiers.
q The translation table for a page is small (at most 1024 pointers in a 4096-byte page with 4-byte pointers).
q Multiple pointers in a page to the same page share the same entry in the translation table.
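The size bound and the sharing of entries can be checked with a short sketch. The numbering scheme for short identifiers and the example page IDs are assumptions; the slides do not say how short identifiers are chosen.

```python
PAGE_SIZE = 4096       # bytes, as in the slide's example
POINTER_SIZE = 4       # bytes
MAX_POINTERS = PAGE_SIZE // POINTER_SIZE   # upper bound on table entries

def build_translation_table(full_page_ids):
    """Assign one short identifier per distinct full page ID referenced from a page."""
    table = {}      # short id -> full database page id
    short_of = {}   # full database page id -> short id
    for full in full_page_ids:
        if full not in short_of:        # pointers to the same page share one entry
            short = len(table)          # hypothetical numbering scheme
            short_of[full] = short
            table[short] = full
    return table, short_of
```

A page cannot hold more pointers than `MAX_POINTERS`, so its translation table can never need more entries than that, and it usually needs far fewer because several pointers target the same page.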


Hardware Swizzling (Cont.)

s Page image before swizzling (page located on disk)


Hardware Swizzling (Cont.)
s When the system loads a page into memory, the persistent pointers in the page are swizzled as described below:
    1. Persistent pointers in each object in the page are located using object type information.
    2. For each persistent pointer (p_i, o_i), find its full page ID P_i.
        q If P_i does not already have a virtual memory page allocated to it, allocate a virtual memory page to P_i and read-protect the page.
            Note: there need not be any physical space (whether in memory or in swap space on disk) allocated for the virtual memory page at this point. Space can be allocated later if (and when) P_i is accessed. In this case read protection is not required. Accessing a memory location in the page will result in a segmentation violation, which is handled as described later.
        q Let v_i be the virtual page allocated to P_i (either earlier or above).
        q Replace (p_i, o_i) by (v_i, o_i).
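The load-time procedure can be sketched as follows. This is a simulation under stated assumptions: virtual pages are plain integers, `vm_map` and `free_pages` are hypothetical bookkeeping structures, and "read-protecting" a page just means adding it to a set.

```python
import itertools

def swizzle_page(pointers, translation_table, vm_map, free_pages):
    """
    pointers:          list of (p_i, o_i) pairs located via object type information
    translation_table: short page id p_i -> full page ID P_i
    vm_map:            full page ID P_i -> virtual page v_i (shared by all pages)
    free_pages:        iterator yielding unused virtual page numbers
    Returns the swizzled pointers and the set of newly reserved, protected pages.
    """
    protected = set()
    swizzled = []
    for p_i, o_i in pointers:
        P_i = translation_table[p_i]          # find the full page ID
        if P_i not in vm_map:                 # no virtual page yet: reserve one
            vm_map[P_i] = next(free_pages)
            protected.add(vm_map[P_i])        # and mark it read-protected
        v_i = vm_map[P_i]
        swizzled.append((v_i, o_i))           # replace (p_i, o_i) by (v_i, o_i)
    return swizzled, protected
```

For example, with `vm_map = {"F2": 900}` already holding one mapping and `free_pages = itertools.count(5000)`, two pointers to page "F1" end up sharing the single newly reserved virtual page.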

Hardware Swizzling (Cont.)
s When an in-memory pointer is dereferenced, if the operating system detects that the page it points to has not yet been allocated storage, or is read-protected, a segmentation violation occurs.
s In Unix, mmap() and mprotect() can reserve and protect virtual memory pages, and a handler function installed for the segmentation-violation signal (SIGSEGV), for example with sigaction(), is invoked when the violation occurs.
s The function does the following when it is invoked:
    1. Allocate storage (swap space) for the page containing the referenced address, if storage has not been allocated earlier, and turn off read protection.
    2. Read in the page from disk.
    3. Perform pointer swizzling for each persistent pointer in the page, as described earlier.
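The handler's three steps can be mimicked in a pure simulation. Nothing here touches real virtual memory or signals; the `SimulatedVM` class and its fields are assumptions, and a "fault" is just a check made before each read.

```python
class SimulatedVM:
    def __init__(self, disk, page_of):
        self.disk = disk                  # full page ID -> page contents on disk
        self.page_of = page_of            # virtual page -> full page ID
        self.resident = {}                # virtual page -> contents in memory
        self.protected = set(page_of)     # reserved but not yet loaded

    def on_fault(self, vpage):
        """Plays the role of the segmentation-violation handler."""
        self.resident[vpage] = None                 # 1. allocate storage for the page
        self.protected.discard(vpage)               #    and turn off read protection
        self.resident[vpage] = self.disk[self.page_of[vpage]]  # 2. read the page in
        # 3. the persistent pointers in the new page would be swizzled here

    def read(self, vpage, offset):
        if vpage in self.protected:       # access to a protected page "faults"
            self.on_fault(vpage)
        return self.resident[vpage][offset]   # the original dereference continues
```

The caller of `read` never sees the fault; it only observes that the first access to a page is slower than later ones.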


Hardware Swizzling (Cont.)

Page image after swizzling
s The page with short page identifier 2395 was allocated address 5001. Observe the change in pointers and translation table.
s The page with short page identifier 4867 has been allocated address 4867. No change in pointer and translation table.

Hardware Swizzling (Cont.)
s After swizzling, all short page identifiers point to virtual memory addresses allocated for the corresponding pages.
q Functions accessing the objects are not even aware that they have persistent pointers, and do not need to be changed in any way!
q Can reuse existing code and libraries that use in-memory pointers.
s After this, the pointer dereference that triggered the swizzling can continue.
s Optimizations:
q If all pages are allocated the same address as in the short page identifier, no changes are required in the page!
q No need for deswizzling; a swizzled page can be saved as-is to disk.
q A set of pages (segment) can share one translation table. Pages can still be swizzled as and when fetched (the old copy of the translation table is needed).
s A process should not access more pages than the size of virtual memory.
q Reuse of virtual memory addresses for other pages is expensive.

Disk versus Memory Structure of Objects
s The format in which objects are stored in memory may be different from the format in which they are stored on disk in the database. Reasons are:
q Software swizzling: the structures of persistent and in-memory pointers are different.
q The database may be accessible from different machines, with different data representations.
s Make the physical representation of objects in the database independent of the machine and the compiler.
s Can transparently convert from the disk representation to the form required on the specific machine, language, and compiler, when the object (or page) is brought into memory.
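One common way to get a machine-independent disk representation is to fix the byte order and field sizes explicitly. The record layout below (a 4-byte unsigned id followed by an 8-byte float, big-endian) is a hypothetical example, not a format given in the slides.

```python
import struct

# ">" selects big-endian byte order with standard sizes and no padding,
# so the bytes are the same regardless of the machine that writes them.
DISK_FORMAT = ">Id"    # hypothetical layout: unsigned 32-bit id, 64-bit float

def to_disk(account_id: int, balance: float) -> bytes:
    """Convert the in-memory values to the fixed on-disk representation."""
    return struct.pack(DISK_FORMAT, account_id, balance)

def from_disk(raw: bytes):
    """Convert the on-disk bytes back to native in-memory values."""
    return struct.unpack(DISK_FORMAT, raw)
```

The conversion happens only at the boundary, when an object is read into or written out of memory, so the rest of the program works with native values.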


Large Objects
s Large objects: binary large objects (blobs) and character large objects (clobs)
q Examples include:
    text documents
    graphical data such as images and computer-aided designs
    audio and video data
s Large objects may need to be stored in a contiguous sequence of bytes when brought into memory.
q If an object is bigger than a page, contiguous pages of the buffer pool must be allocated to store it.
q It may be preferable to disallow direct access to data, and only allow access through a file-system-like API, to remove the need for contiguous storage.
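A file-system-like interface can hide the fact that an object's pages are not adjacent in memory. The `LargeObject` class and its `read` method below are a minimal sketch with assumed names; the list of separate chunks models non-contiguous buffer pages.

```python
class LargeObject:
    def __init__(self, data: bytes, page_size: int = 4096):
        self.page_size = page_size
        self.length = len(data)
        # Each page is a separate chunk; nothing requires them to be adjacent.
        self.pages = [data[i:i + page_size] for i in range(0, len(data), page_size)]

    def read(self, offset: int, length: int) -> bytes:
        """Return up to `length` bytes starting at `offset`, crossing pages as needed."""
        end = min(offset + length, self.length)
        out = bytearray()
        while offset < end:
            page_no, within = divmod(offset, self.page_size)
            take = min(self.page_size - within, end - offset)
            out += self.pages[page_no][within:within + take]
            offset += take
        return bytes(out)
```

Because callers only ever see the bytes returned by `read`, the storage manager is free to place each page wherever buffer space is available.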


Modifying Large Objects
s If the application requires insert/delete of bytes from specified regions of an object:
q B+-tree file organization (described later in Chapter 12) can be modified to represent large objects.
q Each leaf page of the tree stores between half a page and one page worth of data from the object.
s Special-purpose application programs outside the database are used to manipulate large objects:
q Text data is treated as a byte string manipulated by editors and formatters.
q Graphical data and audio/video data are typically created and displayed by separate applications.
q Checkout/checkin method for concurrency control and creation of versions.
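The leaf-level behavior can be sketched without the tree itself. The example below models only the sequence of leaf pages as a list of byte chunks and shows an insert that splits an overfull leaf evenly; the tiny `PAGE` size and the function name are assumptions, and the internal B+-tree nodes that would locate a leaf by byte position are omitted.

```python
PAGE = 8   # tiny page size (an assumption) so that splits are easy to see

def insert_bytes(leaves, pos, data):
    """Insert `data` at byte position `pos`; `leaves` are chunks in object order."""
    if not leaves:
        leaves = [b""]
    i = 0
    while i < len(leaves) - 1 and pos > len(leaves[i]):   # find the target leaf
        pos -= len(leaves[i])
        i += 1
    merged = leaves[i][:pos] + data + leaves[i][pos:]
    n = max(1, -(-len(merged) // PAGE))       # number of leaves needed (ceiling)
    base, extra = divmod(len(merged), n)      # spread bytes evenly over n leaves
    pieces, start = [], 0
    for k in range(n):
        size = base + (1 if k < extra else 0)
        pieces.append(merged[start:start + size])
        start += size
    return leaves[:i] + pieces + leaves[i + 1:]
```

Only the affected leaf is rewritten, so an insert in the middle of a large object does not shift all the bytes that follow it. Spreading the bytes evenly keeps each leaf produced by a split at least half full for this page size.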
