
Introduction to Databases

Storage Management
Prof. Beat Signer
Department of Computer Science Vrije Universiteit Brussel
2 December 2005

Context of Today's Lecture
[Figure: components of a DBMS. Programmers, users and DB admins access the DBMS through application programs, queries and the database schema. Components shown: DML preprocessor, query compiler, DDL compiler, program object code, authorisation control, catalogue manager, integrity checker, command processor, query optimiser, transaction manager, data manager, database manager, buffer manager, recovery manager, access methods, file manager, system buffers, and the stored data, indices and system catalogue]

Based on 'Components of a DBMS', Database Systems, T. Connolly and C. Begg, Addison-Wesley 2010

April 20, 2012

Beat Signer - Department of Computer Science -


Storage Device Hierarchy
 Storage devices vary in
  - data capacity
  - access speed
  - cost per byte
 Devices with fastest access time have highest costs and smallest capacity

[Figure: storage device hierarchy, from top to bottom: main memory, flash memory, magnetic disk, optical disk, magnetic tapes]



Cache
 On-board cache on the same chip as the microprocessor
  - level 1 (L1) cache
    - temporary storage of instructions and data
    - typical size of ~64 kB
 Extra cache levels located on separate chips
  - e.g. level 2 (L2) cache
    - typical size of ~1 MB
 Data items in the cache are copies of values in main memory locations
 If data in the cache has been updated, changes must be reflected in the corresponding memory locations

Main Memory
 Main memory can be several gigabytes large
 Normally too small and too expensive for storing the entire database
  - content is lost during power failure or crash (volatile memory)
  - in-memory databases (IMDB) primarily rely on main memory
    - note that IMDBs lack durability (D of the ACID properties)
  - IMDB size limited by the maximal addressable memory space
    - e.g. maximal 4 GB for 32-bit address space
 Random access memory (RAM)
  - time to access data is more or less independent of its location (different from magnetic tapes)
 Typical access time of ~10 nanoseconds (10^-8 seconds)

Secondary Storage (Hard Disk)
 Essentially random access
 Files are moved between a hard disk and main memory (disk I/O) by the operating system (OS) or the DBMS
  - the transfer units are blocks
  - tendency for larger block sizes
 Parts of the main memory are used to buffer blocks
  - the buffer manager of the DBMS manages the loading and unloading of blocks for specific DBMS operations
 Typical block I/O time (seek time) ~10 milliseconds
  - 1'000'000 times slower than main memory access
 Capacity of several hundred gigabytes and a system can use many disk units

Hard Disk
 A hard disk contains one or more platters and one or more heads
 The platters were originally addressed in terms of cylinders, heads and sectors (block)
  - cylinder-head-sector (CHS) scheme
  - max of 1024 cylinders, 16 heads and 63 sectors
 Current hard disks offer logical block addressing (LBA)
  - hides the physical disk geometry
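The relation between the two addressing schemes can be illustrated with the commonly used CHS-to-LBA conversion. The following is a minimal sketch, not taken from the slides: the function names are hypothetical, sectors are assumed to be numbered from 1, and the 512-byte sector size is an assumption used only to estimate the capacity limit of the stated geometry.

```python
# Geometry limits as given on the slide; SECTOR_SIZE is an assumption.
MAX_CYLINDERS = 1024
HEADS_PER_CYLINDER = 16
SECTORS_PER_TRACK = 63
SECTOR_SIZE = 512  # bytes (assumed, not stated in the lecture)


def chs_to_lba(cylinder: int, head: int, sector: int) -> int:
    """Map a CHS triple to a logical block address (sectors are 1-based)."""
    if not (0 <= cylinder < MAX_CYLINDERS
            and 0 <= head < HEADS_PER_CYLINDER
            and 1 <= sector <= SECTORS_PER_TRACK):
        raise ValueError("CHS triple outside the assumed geometry")
    return (cylinder * HEADS_PER_CYLINDER + head) * SECTORS_PER_TRACK + (sector - 1)


def lba_to_chs(lba: int) -> tuple[int, int, int]:
    """Inverse mapping from a logical block address back to a CHS triple."""
    cylinder, rest = divmod(lba, HEADS_PER_CYLINDER * SECTORS_PER_TRACK)
    head, sector_index = divmod(rest, SECTORS_PER_TRACK)
    return cylinder, head, sector_index + 1


TOTAL_SECTORS = MAX_CYLINDERS * HEADS_PER_CYLINDER * SECTORS_PER_TRACK
MAX_CAPACITY_BYTES = TOTAL_SECTORS * SECTOR_SIZE
```

Under these assumptions the CHS limits allow roughly half a gigabyte to be addressed, which hints at why a scheme that hides the geometry became necessary.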

Solid-State Drives (SSD)
 Storage device that uses solid-state memory (flash memory) to persistently store data
 Offers a hard disk interface with a storage capacity of up to a few hundred gigabytes
 Typical block I/O time (seek time) ~0.1 milliseconds
 SSDs might help to reduce the gap between primary and secondary storage in DBMS systems
 Currently there are still some limitations of SSDs
  - the limited number of SSD write operations before failure can be a problem for DBs with a lot of update operations
  - write operations are often still much slower than read operations

Tertiary Storage
 No random access
  - access time depends on data location
 Different devices
  - tapes
  - optical disk jukeboxes
    - racks of CD-ROMs (read only)
  - tape silos
    - room-sized devices holding racks of tapes operated by tape robots
    - e.g. StorageTek PowderHorn with up to 28.8 petabytes

Models of Computation
 RAM model of computation
  - assumes that all data is held in main memory
 DBMS model of computation
  - assumes that data does not fit into main memory
  - efficient algorithms must take into account secondary and even tertiary storage
  - best algorithms for processing large amounts of data often differ from those for the RAM model of computation
  - minimising disk accesses plays a major role
    - I/O model of computation
 I/O model of computation
  - the time to move a block between disk and memory is much higher than the time for the corresponding computation

Accelerating Secondary Storage Access
 Various possible strategies to improve secondary storage access
  - placement of blocks that are often accessed together on the same disk cylinder
  - distribute data across multiple disks to profit from parallel disk accesses (RAID)
  - mirroring of data
  - use of disk scheduling algorithms in OS, DBMS or disk controller to determine order of requested block read/writes
    - e.g. elevator algorithm
  - prefetching of disk blocks
  - efficient caching
    - main memory
    - disk controllers
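The elevator algorithm mentioned above can be sketched as follows. This is a simplified illustration rather than the scheduler of any particular OS or disk controller: it assumes that all requests are known in advance, that the head reverses direction as soon as no request remains ahead of it, and the function names and request list are made up for the example.

```python
def elevator_order(pending: list[int], head: int, moving_up: bool = True) -> list[int]:
    """Return the order in which pending cylinder requests are served.

    The head keeps moving in one direction and serves every request on its
    way, then reverses and serves the remaining requests.
    """
    lower = sorted(c for c in pending if c < head)
    upper = sorted(c for c in pending if c >= head)
    if moving_up:
        return upper + lower[::-1]
    return lower[::-1] + upper


def total_head_movement(order: list[int], head: int) -> int:
    """Sum of cylinder distances travelled when serving requests in this order."""
    distance = 0
    for cylinder in order:
        distance += abs(cylinder - head)
        head = cylinder
    return distance


# Hypothetical request queue with the head initially at cylinder 53.
requests = [98, 183, 37, 122, 14, 124, 65, 67]
scheduled = elevator_order(requests, head=53)
```

For this queue the elevator order needs a total head movement of 299 cylinders, compared to 640 cylinders when the requests are served in arrival order.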

Redundant Array of Independent Disks
 The redundant array of independent disks (RAID) organisation technique provides a single disk view for a number (array) of disks
  - divide and replicate data across multiple hard disks
  - introduced in 1987 by D.A. Patterson, G.A. Gibson and R. Katz
 The main goals of a RAID solution are
  - higher capacity by grouping multiple disks
    - originally a RAID was also a cheaper alternative to expensive large disks
      • original name: Redundant Array of Inexpensive Disks
  - higher performance due to parallel disk access
    - multiple parallel read/write operations
  - increased reliability since data might be stored redundantly
    - data can be restored if a disk fails

RAID ...
 There are three main concepts in RAID systems
  - identical data is written to more than one disk (mirroring)
  - data is split across multiple disks (striping)
  - redundant parity data is stored on separate disks and used to detect and fix problems (error correction)
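The error correction concept can be illustrated with XOR parity, which is one common way to compute parity data. The sketch below is an assumption-laden toy example: the "disks" are short byte strings, one block per disk, and the function name is hypothetical.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    if len({len(block) for block in blocks}) != 1:
        raise ValueError("need at least one block and all blocks of equal size")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data disks holding one block each, plus one parity block.
data_disks = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_blocks(data_disks)

# Assume the second disk fails: its block is the XOR of all surviving blocks.
survivors = [data_disks[0], data_disks[2], parity]
restored = xor_blocks(survivors)
```

Since XOR is its own inverse, any single missing block can be rebuilt from the remaining blocks and the parity block; two simultaneous failures cannot be repaired with a single parity block.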

RAID Reliability
 The mean time between failures (MTBF) is the average time until a disk failure occurs
  - e.g. a hard disk might have a MTBF of 200'000 hours (22.8 years)
    - note that the MTBF decreases as disks get older
 If a DBMS uses an array of disks, then the overall system's MTBF can be much lower
  - e.g. the MTBF for a disk array of 100 of the disks mentioned above is 200'000 hours/100 = 2'000 hours (83 days)
 By storing information redundantly, data can be restored in the case of a disk failure

RAID Reliability ...
 The mean time to data loss (MTTDL) depends on the MTBF and the mean time to repair
  - if we mirror the information on two disks with a MTBF of 200'000 hours and a mean time to repair of 10 hours then the MTTDL is 200'000^2/(2*10) hours = 228'000 years
  - of course in reality it is more likely that an error occurs on multiple disks around the same time
    - drives have the same age
    - power failure, earthquake, fire, ...
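The reliability figures from the two RAID Reliability slides can be checked with a few lines of arithmetic. This is only a sketch of the simple model used in the lecture (independent failures, constant MTBF); the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def array_mtbf(disk_mtbf_hours: float, disks: int) -> float:
    """MTBF of an array without redundancy: any single failure counts."""
    return disk_mtbf_hours / disks


def mirrored_mttdl(disk_mtbf_hours: float, repair_hours: float) -> float:
    """Mean time to data loss for two mirrored disks (MTBF^2 / (2 * MTTR))."""
    return disk_mtbf_hours ** 2 / (2 * repair_hours)


single_disk_years = 200_000 / HOURS_PER_YEAR                  # about 22.8 years
array_days = array_mtbf(200_000, 100) / 24                    # about 83 days
mirror_years = mirrored_mttdl(200_000, 10) / HOURS_PER_YEAR   # about 228'000 years
```

The mirrored figure is an optimistic upper bound because, as noted above, real failures are often correlated.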

RAID Levels [http://en.wikipedia.org/wiki/RAID]
 The different RAID levels offer different cost-performance trade-offs
 RAID 0
  - block level striping without any redundancy
 RAID 1
  - mirroring without striping
 RAID 2
  - bit level striping
  - multiple parity disks
 RAID 3
  - byte level striping
  - one parity disk

RAID Levels ...
 RAID 4
  - block level striping
  - one parity disk
  - similar to RAID 3
 RAID 5
  - block level striping with distributed parity
  - no dedicated parity disk
 RAID 6
  - block level striping with dual distributed parity
  - no dedicated parity disk
  - similar to RAID 5

bsigner@vub. relational model) are mapped to secondary storage    a field contains a fixed.the blocks also represent the units of data transfer  a file contains a collection of blocks and represents a relation  A database is finally mapped to a number of files managed by the underlying operating system  April 18 .Data Representation  A DBMS has to define how the elements of its data model (e.or variable-length sequence of bytes and represents an attribute a record contains a fixed. 2012 index structures are stored in separate files Beat Signer .Department of Computer Science .or variable-length sequence of fields and represents a tuple records are stored in fixed-length physical block storage units representing a set of tuples .ac.

Relational Model Representation
 A number of issues have to be addressed when mapping the basic elements of the relational model to secondary storage
  - how to map the SQL datatypes to fields?
  - how to represent tuples as records?
  - how to represent records in blocks?
  - how to represent a relation as a collection of blocks?
  - how to deal with record sizes that do not fit into blocks?
  - how to deal with variable-length records?
  - how to deal with schema updates and growing record lengths?

Representation of SQL Datatypes
 Fixed-length character string (CHAR(n))
  - represented as a field which is an array of n bytes
  - strings that are shorter than n bytes are filled up with a special "pad" character
 Variable-length character string (VARCHAR(n))
  - two common representations (non-fixed length version later)
  - length plus content
    - allocate an array of n + 1 bytes
    - the first byte represents the length of the string (8-bit integer) followed by the string content
    - limited to a maximal string length of 255 characters
  - null-terminated string
    - allocate an array of n + 1 bytes
    - terminate the string with a special null character (like in C)

Representation of SQL Datatypes ...
 Dates (DATE)
  - fixed-length character string
 Time (TIME(n))
  - the precision n leads to strings of variable length and two possible representations
  - fixed-precision
    - limit the precision to a fixed value and store as VARCHAR(m)
  - true-variable length
    - store the time as true variable length value
 Bits (BIT(n))
  - bit values of size n can be packed into single bytes
  - packing of multiple bit values into a single byte is not recommended
    - makes the retrieval and updating of a value more complex and error-prone
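The CHAR(n) representation and the two VARCHAR(n) representations can be sketched as byte-level encoders. The function names are hypothetical, and the choice of a space as pad character, zero bytes for unused space and ASCII as character encoding are assumptions made for the example.

```python
PAD = b" "  # assumed pad character for CHAR(n)


def encode_char(value: str, n: int) -> bytes:
    """CHAR(n): exactly n bytes, shorter strings are padded."""
    raw = value.encode("ascii")
    if len(raw) > n:
        raise ValueError("string longer than n")
    return raw.ljust(n, PAD)


def encode_varchar_length_prefixed(value: str, n: int) -> bytes:
    """VARCHAR(n) as n + 1 bytes: one length byte followed by the content."""
    raw = value.encode("ascii")
    if n > 255 or len(raw) > n:
        raise ValueError("length does not fit into a single byte or exceeds n")
    return (bytes([len(raw)]) + raw).ljust(n + 1, b"\x00")


def decode_varchar_length_prefixed(field: bytes) -> str:
    length = field[0]
    return field[1:1 + length].decode("ascii")


def encode_varchar_null_terminated(value: str, n: int) -> bytes:
    """VARCHAR(n) as n + 1 bytes: content followed by a null character."""
    raw = value.encode("ascii")
    if len(raw) > n or b"\x00" in raw:
        raise ValueError("string too long or contains a null character")
    return (raw + b"\x00").ljust(n + 1, b"\x00")


def decode_varchar_null_terminated(field: bytes) -> str:
    return field.split(b"\x00", 1)[0].decode("ascii")
```

Both VARCHAR variants occupy n + 1 bytes regardless of the actual string length, so they still lead to fixed-length records.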

Storage Access
 A part of the system's main memory is used as a buffer to store copies of disk blocks
 The buffer manager is responsible for moving data from secondary disk storage into memory
  - the number of block transfers between disk and memory should be minimised
  - as many blocks as possible should be kept in memory
 The buffer manager is called by the DBMS every time a disk block has to be accessed
  - the buffer manager has to check whether the block is already allocated in the buffer (main memory)

Buffer Manager
 If the requested block is already in the buffer, the buffer manager returns the corresponding address
 If the block is not yet in the buffer, the buffer manager performs the following steps
  - allocate buffer space
    - if no space is available, remove an existing block from the buffer (based on a buffer replacement strategy) and write it back to the disk if it has been modified since it was last fetched/written to disk
  - read the block from the disk, add it to the buffer and return the corresponding memory address
 Note the similarities to a virtual memory manager
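The steps performed by the buffer manager can be sketched as follows. This is a toy model under several assumptions: the disk is simulated by a dictionary, LRU is used as the replacement strategy, and the class and method names (`BufferManager`, `get_block`, `mark_dirty`) are hypothetical.

```python
from collections import OrderedDict


class BufferManager:
    """Toy buffer manager with an LRU replacement strategy."""

    def __init__(self, capacity: int, disk: dict[int, bytes]):
        self.capacity = capacity
        self.disk = disk              # block id -> content, stands in for the disk
        self.frames = OrderedDict()   # block id -> [bytearray, dirty flag], LRU first
        self.disk_reads = 0
        self.disk_writes = 0

    def get_block(self, block_id: int) -> bytearray:
        if block_id in self.frames:              # block already in the buffer
            self.frames.move_to_end(block_id)
            return self.frames[block_id][0]
        if len(self.frames) >= self.capacity:    # no space: evict the LRU block
            victim_id, (data, dirty) = self.frames.popitem(last=False)
            if dirty:                            # write back only if modified
                self.disk[victim_id] = bytes(data)
                self.disk_writes += 1
        data = bytearray(self.disk[block_id])    # read the block from disk
        self.disk_reads += 1
        self.frames[block_id] = [data, False]
        return data

    def mark_dirty(self, block_id: int) -> None:
        self.frames[block_id][1] = True


disk = {1: b"a", 2: b"b", 3: b"c"}
manager = BufferManager(capacity=2, disk=disk)
manager.get_block(1)
manager.get_block(2)
manager.get_block(1)[0:1] = b"z"   # update block 1 in the buffer
manager.mark_dirty(1)
manager.get_block(3)               # evicts block 2 (clean, no write)
manager.get_block(2)               # evicts block 1 (dirty, written back)
```

Returning the buffered `bytearray` plays the role of returning the memory address of the block.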

Buffer Replacement Strategies
 Most operating systems use a least recently used (LRU) strategy where the block that was least recently used is moved back from memory to disk
  - use past access pattern to predict future block access
 A DBMS is able to predict future access patterns more accurately than an operating system
  - a request to the DBMS involves multiple steps and the DBMS might be able to determine which blocks will be needed by analysing the different steps of the operation
  - note that LRU might not always be the best replacement strategy for a DBMS

Buffer Replacement Strategies ...
 Let us have a look at the procedure to compute the following natural join query: order ⋈ customer
  - note that we will see more efficient solutions for this problem when discussing query optimisation

for each tuple o of order {
  for each tuple c of customer {
    if o.customerID = c.customerID {
      create a new tuple r with:
        r.customerID := c.customerID
        r.name := c.name
        ...
        r.orderID := o.orderID
        ...
      add tuple r to the result set of the join operation
    }
  }
}
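A runnable version of the nested-loop pseudocode might look as follows. The relations are modelled as lists of dictionaries, and the sample tuples (in particular the order IDs) are made up for illustration.

```python
def nested_loop_join(orders: list[dict], customers: list[dict]) -> list[dict]:
    """Natural join of order and customer on the shared customerID attribute."""
    result = []
    for o in orders:                      # outer loop: each order tuple
        for c in customers:               # inner loop: each customer tuple
            if o["customerID"] == c["customerID"]:
                # the new tuple carries the attributes of both input tuples
                result.append({**c, **o})
    return result


orders = [
    {"orderID": 100, "customerID": 5},
    {"orderID": 101, "customerID": 1},
    {"orderID": 102, "customerID": 5},
]
customers = [
    {"customerID": 1, "name": "Max Frisch"},
    {"customerID": 2, "name": "Eddy Merckx"},
    {"customerID": 5, "name": "Claude Debussy"},
]
joined = nested_loop_join(orders, customers)
```

Note that the inner relation is scanned completely once for every tuple of the outer relation, which is what makes the access pattern of the customer blocks cyclic.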

Buffer Replacement Strategies ...
 We further assume that the two relations order and customer are stored in separate files
 From the pseudocode we can see that
  - once an order tuple has been processed, it is not needed anymore
    - if a whole block of order tuples has been processed, that block is no longer required in memory (but an LRU strategy might keep it)
    - as soon as the last tuple of an order block has been processed, the buffer manager should free the memory space (toss-immediate strategy)
  - once a customer tuple has been processed, it is not accessed again until all the other customer tuples have been accessed
    - when the processing of a customer block has been finished, the least recently used customer block will be requested next
    - we should replace the block that has been most recently used (MRU)
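The difference between LRU and MRU for the repeated scan of the customer relation can be made visible with a small simulation. The sketch makes simplifying assumptions: only the customer blocks compete for the buffer, the block numbers and buffer size are invented, and pinning is ignored.

```python
def count_misses(accesses: list[int], capacity: int, policy: str) -> int:
    """Count buffer misses for a block access sequence under 'LRU' or 'MRU'."""
    buffer: list[int] = []   # ordered from least to most recently used
    misses = 0
    for block in accesses:
        if block in buffer:
            buffer.remove(block)
        else:
            misses += 1
            if len(buffer) >= capacity:
                # LRU evicts the front of the list, MRU evicts the back
                buffer.pop(0 if policy == "LRU" else -1)
        buffer.append(block)
    return misses


customer_blocks = [0, 1, 2, 3]
accesses = customer_blocks * 3   # the inner relation is scanned three times

lru_misses = count_misses(accesses, capacity=3, policy="LRU")
mru_misses = count_misses(accesses, capacity=3, policy="MRU")
```

With four customer blocks and three buffer frames, LRU always evicts exactly the block that is needed next, so every access is a miss, whereas MRU keeps most of the blocks for the next scan.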

Buffer Replacement Strategies ...
 A memory block can be marked to indicate that this block is not allowed to be written back to disk (pinned block)
  - note that if we want to use an MRU strategy for the inner loop of the previous example, the block has to be pinned
    - the block has to be unpinned after the last tuple in the block has been processed
  - the pinning of blocks provides some control to restrict the time when blocks can be written back to disk
    - important for crash recovery
    - blocks that are currently updated should not be written to disk
 The prefetching of blocks might be used to further increase the performance of the overall system
  - e.g. for serial scans (relation scans)

Buffer Replacement Strategies ...
 The buffer manager can also use statistical information about the probability that a request will reference a particular relation (and its related blocks)
  - the system catalogue (data dictionary) with its metadata is one of the most frequently accessed parts of the database
    - if possible, system catalogue blocks should always be in the buffer
  - index files might be accessed more often than the corresponding files themselves
    - do not remove index files from the buffer if not necessary
  - the crash recovery manager can also provide constraints for the buffer manager
    - the recovery manager might demand that other blocks have to be written first (force-output) before a specific block can be written to disk

System Catalogue / Data Dictionary
 Stores metadata about the database
  - names of the relations
  - names, domain and lengths of the attributes of each relation
  - names of views
  - names of indices
    - name of relation that is indexed
    - name of attributes
    - type of index
  - integrity constraints
  - users and their authorisations
  - statistical data
    - number of tuples in relation, storage method, ...

File Organisation
 A file is logically organised as a sequence of records
  - each record contains a sequence of fields
  - name, datatype and offset of record fields are defined by the schema
  - record types (schema) might change over time
 The records are mapped to disk blocks
  - the block size is fixed and defined by the physical properties of the disk and the operating system
  - the record size might vary for different relations and even between tuples of the same relation (variable field size)
 There are different possible mappings of records to files
  - use multiple files and only store fixed-length records in each file
  - store variable-length records in a file

Fixed-Length Records

type customer = record
  cID int
  name varchar(30)
  street varchar(30)
end

 If we assume that an integer requires 2 bytes and characters are represented by one byte, then the customer record is 64 bytes long

[Figure: record layout within a block: cID at offset 0, name at offset 2, street at offset 33, next record at offset 64]

Fixed-Length Records ...
 It might be necessary to ensure that data elements begin at an offset that is a multiple of 4 (8 for 64-bit processors)
  - the first byte of a block loaded from disk is placed at a memory address that is a multiple of 4
  - we have to ensure that we have the appropriate offsets (e.g. divisible by 4)

[Figure: aligned record layout within a block: cID at offset 0, name at offset 4, street at offset 36, next record at offset 68]

Fixed-Length Records ...
 Often a record header is added to each record for managing metadata about
  - the record schema (pointer s to the DBMS schema information)
  - timestamp t about the last access or modification time
  - the length l of the record
    - could be computed from the schema but the information is convenient if we want to quickly access the next record without having to consult the schema

[Figure: record layout with header within a block: header fields s, t and l at offset 0, cID at offset 12, name at offset 16, street at offset 48, next record at offset 80]
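The field offsets shown for the fixed-length customer record can be recomputed with a small layout helper. The helper name and signature are hypothetical; the sketch assumes that each VARCHAR(30) field occupies 31 bytes (n + 1 bytes as in the representations discussed earlier) and that the record header takes 12 bytes, which matches the offsets given in the figures.

```python
def layout(fields: list[tuple[str, int]], alignment: int = 1,
           header_size: int = 0) -> tuple[dict[str, int], int]:
    """Return the offset of each field and the total record size in bytes."""

    def align(offset: int) -> int:
        # round up to the next multiple of the alignment
        return -(-offset // alignment) * alignment

    offsets = {}
    position = align(header_size)
    for name, size in fields:
        offsets[name] = position
        position = align(position + size)
    return offsets, position


# 2-byte integer and two VARCHAR(30) fields of 31 bytes each (assumption).
customer = [("cID", 2), ("name", 31), ("street", 31)]

packed = layout(customer)
aligned = layout(customer, alignment=4)
with_header = layout(customer, alignment=4, header_size=12)
```

The three calls correspond to the packed record, the record with fields aligned to multiples of 4, and the aligned record with a header.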

Fixed-Length Records in Blocks/Files

          header  cID  name             street
record 0  h       1    Max Frisch       Bahnhofstrasse 7
record 1  h       2    Eddy Merckx      Pleinlaan 25
record 2  h       5    Claude Debussy   12 Rue Louise
record 3  h       53   Albert Einstein  Bergstrasse 18
record 4  h       8    Max Frisch       ETH Zentrum

 Problems with this fixed length representation
  - after a record has been deleted, its space has to be filled with another record
    - could move all records after the deleted one but that is too expensive
    - can move the last record to the deleted record's position but also that might require an additional block access
  - if the block size is not a multiple of the record size, some records will cross block boundaries and we need two block accesses to read/write such a record

Fixed-Length Records in Blocks/Files ...
 Since insert operations tend to be more frequent than delete operations, it might be acceptable to leave the space of the deleted record open until a new record is inserted
  - we cannot just add an additional boolean flag ("free") to the record since it will be hard to find the free records
  - allocate a certain amount of bytes for a file header containing metadata about the file
 The block/file header contains a pointer (address) to the first deleted record
  - each deleted record contains a pointer (address) to the next deleted record
  - the linked list of deleted records is called a free list

Fixed-Length Records in Blocks/Files ...

header
record 0  h  1  Max Frisch      Bahnhofstrasse 7
record 1  (deleted)
record 2  h  5  Claude Debussy  12 Rue Louise
record 3  (deleted)
record 4  h  8  Max Frisch      ETH Zentrum

 To insert a new record, the first free record pointed to by the header is used and the address in the header is updated to the free record that the used record was pointing to
  - to save some space, the pointers of the free list can also be stored in the unused space of deleted records (no additional field)
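The free list can be sketched with a list of record slots in which a deleted slot stores the index of the next free slot. The class and attribute names are hypothetical, slot indices stand in for record addresses, and the sketch does not guard against deleting the same slot twice.

```python
class FixedLengthFile:
    """Toy file of fixed-length record slots with a free list."""

    def __init__(self):
        self.slots = []          # a record, or for a deleted slot the next free slot
        self.first_free = None   # header pointer to the first deleted record

    def insert(self, record: tuple) -> int:
        if self.first_free is None:          # no free slot: append at the end
            self.slots.append(record)
            return len(self.slots) - 1
        slot = self.first_free
        self.first_free = self.slots[slot]   # header now points to the next free slot
        self.slots[slot] = record
        return slot

    def delete(self, slot: int) -> None:
        # the pointer is stored in the unused space of the deleted record
        self.slots[slot] = self.first_free
        self.first_free = slot


customers = FixedLengthFile()
for record in [(1, "Max Frisch"), (2, "Eddy Merckx"), (5, "Claude Debussy"),
               (53, "Albert Einstein"), (8, "Max Frisch")]:
    customers.insert(record)

customers.delete(1)
customers.delete(3)
reused = customers.insert((9, "New Customer"))   # hypothetical new record
```

In this variant the most recently deleted slot is reused first, so the new record ends up in slot 3 and the header then points to slot 1.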

Address Space
 There are several ways how the database address space (blocks and block offsets) can be represented
  - physical addresses consisting of byte strings (up to 16 bytes) that address
    - storage device identifier (e.g. hard disk ID)
    - cylinder number of the disk
    - track within the cylinder (for multi-surface disks)
    - block within the track
    - potential offset of record within the block
  - logical addresses consisting of an arbitrary string of length n

Address Space Mapping

[Figure: map table with a logical and a physical column, mapping a logical address to a physical address]

 A map table is stored at a known disk location and provides a mapping between the logical and physical address spaces
  - introduces some indirection since the map table has to be consulted to get the physical address
  - flexibility to rearrange records within blocks or move them to other blocks without affecting the record's logical address
  - different combinations of logical and physical addresses are possible (structured address schemes)

Variable-Length Data
 Records of the same type may have different lengths
 We may want to represent
  - record fields with varying size (e.g. VARCHAR(n))
  - large fields (e.g. images)
  - ...
 We need an alternative data representation to deal with these requirements

Variable-Length Record Fields

[Figure: record layout with the parts cID, record length, name and street]

 Scheme for records with variable-length fields
  - put all fixed-length fields first (e.g. cID)
  - add the length of the record to the record header
  - add the offsets of the variable-length fields to the record header
 Note that if the order of the variable-length fields is always the same, we do not have to store an offset for the first variable-length field (e.g. name)
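One possible byte layout following this scheme is sketched below. The concrete layout is an assumption made for the example (2-byte little-endian integers, header fields placed before cID, UTF-8 content), and the function names are hypothetical.

```python
import struct

# Assumed layout: record length, offset of street, cID (all 16-bit integers).
HEADER = struct.Struct("<HHH")


def encode_customer(cid: int, name: str, street: str) -> bytes:
    name_bytes = name.encode("utf-8")
    street_bytes = street.encode("utf-8")
    # name is the first variable-length field and starts right after the
    # fixed-length part, so only the offset of street has to be stored
    street_offset = HEADER.size + len(name_bytes)
    record_length = street_offset + len(street_bytes)
    return HEADER.pack(record_length, street_offset, cid) + name_bytes + street_bytes


def decode_customer(record: bytes) -> tuple[int, str, str]:
    record_length, street_offset, cid = HEADER.unpack_from(record)
    name = record[HEADER.size:street_offset].decode("utf-8")
    street = record[street_offset:record_length].decode("utf-8")
    return cid, name, street


record = encode_customer(2, "Eddy Merckx", "Pleinlaan 25")
```

The record only occupies as many bytes as its content requires, in contrast to the 64 bytes of the fixed-length representation.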

Variable-Length Records

[Figure: block with an offset table at the beginning, free space in the middle and the records record3, record2 and record1 at the end]

 There are different reasons why we might have to use variable-length records
  - to store records that have at least one field with a variable length
  - to store different record types in a single block/file
 Structured address scheme (slotted-page structure)
  - address of a record consists of the block address in combination with an offset table index
  - records can be moved around
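A slotted page can be sketched as follows. The sketch simplifies in several ways: the offset table is kept as a Python list next to the page bytes rather than inside them, the space taken by the table is ignored, and the class and method names are hypothetical.

```python
class SlottedPage:
    """Toy slotted page: records grow from the end of the block."""

    def __init__(self, size: int = 128):
        self.data = bytearray(size)
        self.slots = []          # offset table: (offset, length) or None if deleted
        self.free_end = size     # records are stored in data[free_end:]

    def insert(self, record: bytes) -> int:
        if len(record) > self.free_end:
            raise ValueError("not enough free space in the page")
        self.free_end -= len(record)
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append((self.free_end, len(record)))
        return len(self.slots) - 1           # index into the offset table

    def read(self, slot: int) -> bytes:
        offset, length = self.slots[slot]
        return bytes(self.data[offset:offset + length])

    def delete(self, slot: int) -> None:
        self.slots[slot] = None
        self.compact()

    def compact(self) -> None:
        """Slide the remaining records together; slot indices stay valid."""
        live = [(i, self.read(i)) for i, s in enumerate(self.slots) if s is not None]
        self.free_end = len(self.data)
        for i, record in live:
            self.free_end -= len(record)
            self.data[self.free_end:self.free_end + len(record)] = record
            self.slots[i] = (self.free_end, len(record))


page = SlottedPage(size=32)
first = page.insert(b"record1")
second = page.insert(b"record2")
third = page.insert(b"record3")
page.delete(second)
```

After the delete operation, the third record has been moved within the page, but it is still reachable through the same offset table index.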

Large Records

[Figure: block 1 with block header, record1 (with record header) and record2a; block 2 with record2b and record3]

 Sometimes we have to deal with values that do not fit into a single block (e.g. audio or movie clips)
  - a record that is split across two or more blocks is called a spanned record
  - spanned records can also be used to pack blocks more efficiently
 Extra header information
  - each record header carries a bit to indicate if it is a fragment
    - fragments have some more bits, telling whether first or last fragment of record
  - potential pointers to previous and next fragment

Storage of Binary Large Objects (BLOBs)
 BLOB is stored as a sequence of blocks
  - often blocks allocated successively on a disk cylinder
 BLOB might be striped across multiple disks for more efficient retrieval
 BLOB field might not be automatically fetched into memory
  - user has to explicitly load parts of the BLOB
  - possibly index structures to retrieve parts of a BLOB

Insertion of Records

[Figure: block with an offset table at the beginning, free space in the middle and the records record3, record2 and record1 at the end]

 If the records are not kept in a particular order, we can just find a block with some empty space or create a new block if there is no such space
 If the record has to be inserted in a particular order, but there is no space in the block, there are two alternatives
  - find space in a nearby block and rearrange some records
  - create an overflow block and link it from the header of the original block
    - note that an overflow block might point to another overflow block and so on

Deletion of Records

[Figure: block with an offset table at the beginning, free space in the middle and the records record3, record2 and record1 at the end]

 If we use an offset table, we may compact the free space in the block (slide around the records)
 If the records cannot be moved, we might have a free list in the header
 We might also be able to remove an overflow block after a delete operation

ac. then we might have to create more space  same options as discussed for insert operation  If the updated record is 46 .Update of Records offset table .. then we may compact some free space or remove overflow blocks  April 20.Department of Computer Science ... free .. record3 record2 record1  If we have to update a fixed-length record there is no problem since we will still use the same space  If the updated record is larger than the original version.bsigner@vub. 2012 similar to delete operation Beat Signer .

Homework
 Study the following chapter of the Database System Concepts book
  - chapter 10: Storage and File Structure
    - sections 10.1-10.9

Exercise 8
 Structured Query Language (SQL)
  - PostgreSQL

References
 H. Garcia-Molina, J.D. Ullman and J. Widom, Database Systems – The Complete Book, Prentice Hall, 2002
 A. Silberschatz, H. Korth and S. Sudarshan, Database System Concepts (Sixth Edition), McGraw-Hill, 2010

Next Lecture
Access Methods