
DBMS Architecture

Instructor

• Zoia Sahab (Assistant Professor)

• Information Systems Department (ISD)


• Computer Science Faculty (CSF)
• Kabul University (KU)

Grading
• Homework/class work: 10-20%
• Midterm exam: 20%
• Final exam: 60-70%

Schedule

• Lecture (Room #2)
– Mondays
• First hour
• Second hour

– Tuesdays
• First hour
• Second hour

Reading Books
– Database Systems: A Practical Approach to Design,
Implementation, and Management. By Thomas
Connolly and Carolyn Begg.
– Fundamentals of Database Systems. By Ramez Elmasri
and Shamkant B. Navathe.
– An Introduction to Database Systems, Eighth Edition.
By C. J. Date.
– Database Concepts. By Kroenke.
– Database Systems: Design, Implementation, and
Management. By P. Rob and C. Coronel.
– Websites

Course Content
• Storage
• File structure and hashing
• Indexing structures
• Query processing
• Transaction processing
• Concurrency control
• Recovery techniques

What is a DBMS?

• A Database Management System (DBMS) is
software designed to store, retrieve, define,
and manage data in a database.

Advantages of Database
Management System
• Reducing Data Redundancy
– File-based data management systems kept
multiple files in many different locations in a
system, or even across multiple systems, so the
same data was often stored more than once.
• Sharing of Data
• Data Integrity
• Data Security
• Privacy
• Backup and Recovery
• Data Consistency
Components of DBMS

• Software
• Hardware
• Data
• Procedures
• Database Access Language
• Query Processor
• Run Time Database Manager
• Data Manager
• Database Engine
• Data Dictionary
• Report Writer
Components of DBMS

• Software: This is the set of programs used
to control and manage the overall database.
• This includes the DBMS software itself, the
operating system, the network software
used to share the data among users,
and the application programs used to
access data in the DBMS.

Components of DBMS

• Hardware: Consists of a set of physical electronic
devices such as computers, I/O devices, and storage
devices. These provide the interface between
computers and real-world systems.

Components of DBMS

• Data: Data is the most important component;
the DBMS exists to collect, store, process,
and access it.
• The database contains both the actual or
operational data and the metadata.

• Procedures: These are the documented instructions
and rules for using the DBMS and for designing
and running the database. They guide the
users who operate and manage it.
Components of DBMS

• Database Access Language: This is used
to write data to and read data from the
database: to enter new data, update existing
data, or retrieve required data.
• The user writes a set of appropriate
commands in a database access language and
submits them to the DBMS, which then
processes the data and generates and
displays a set of results in a user-readable
form.
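
As a rough illustration of this round trip, the sketch below uses SQL as the access language and Python's built-in sqlite3 module as the DBMS. The employee table, its columns, and the sample rows are made up for the example and are not part of the course material.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Define a table, enter new data, and update existing data.
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employee (id, name, salary) VALUES (?, ?, ?)",
    [(1, "Amina", 500.0), (2, "Farid", 650.0)],
)
conn.execute("UPDATE employee SET salary = salary + 50 WHERE id = ?", (1,))

# Retrieve required data; the DBMS returns the result rows.
rows = conn.execute("SELECT name, salary FROM employee ORDER BY id").fetchall()
for name, salary in rows:
    print(f"{name}: {salary}")
conn.close()
```

Each statement is a command submitted to the DBMS; the final SELECT is the one that produces user-readable results.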
Components of DBMS

• Query Processor: This transforms user
queries into a series of low-level instructions.
• It reads the online user’s query and translates
it into an efficient series of operations in a form
that can be sent to the run-time data
manager for execution.

Components of DBMS

• Run Time Database Manager: Sometimes
referred to as the database control system,
this is the central software component of the
DBMS that interfaces with user-submitted
application programs and queries, and
handles database access at run time.
• Its function is to convert the operations in users’
queries from the user’s logical view into
operations on the physical file system.
• It provides control to maintain the
consistency, integrity, and security of the
data.
Components of DBMS

• Data Manager
Also called the cache manager, this is responsible for
handling data in the database and provides a recovery
mechanism that lets the system recover the data after a failure.
• Database Engine
The core service for storing, processing, and securing
data. It provides controlled access and rapid transaction
processing to meet the requirements of the most
demanding data-consuming applications.
• It is often used to create relational databases for online
transaction processing or online analytical processing
data.
Components of DBMS

• Data Dictionary
This is a reserved space within a database used to store
information about the database itself.
• A data dictionary is a set of read-only tables and views
containing information about the data used
in the enterprise, so that the database representation
of the data follows one standard as defined in the
dictionary.

Components of DBMS
• Report Writer: Also referred to as the report
generator, it is a program that extracts information
from one or more files and presents the information in
a specified format.
• Most report writers allow the user to select records
that meet certain conditions and to display selected
fields in rows and columns, or to format the data into
different charts.

Structure of a DBMS

• A database management system is partitioned into
modules that deal with each of the responsibilities
of the overall system.
• The functional components of a DBMS can be
broadly divided into the storage manager and
the query processor components.
• The storage manager is important because
databases typically require a large amount of
storage space.
• The query processor is important because it helps
the database system simplify and facilitate access
to data.
Disk Storage, Basic
File Structures, and
Hashing
Introduction
• The collection of data that makes up a
computerized database must be stored
physically on some computer storage
medium.
• Computer storage media form a storage
hierarchy that includes two main categories:
– Primary storage devices
– Secondary storage devices

Introduction
• Primary storage. This category includes
storage media that can be operated on
directly by the computer’s central processing
unit (CPU), such as the computer’s main
memory and smaller but faster cache
memories.
• Primary storage provides fast access to data but
has limited storage capacity.

Introduction
• Secondary and tertiary storage. This category
includes magnetic disks, optical disks (CD-ROMs,
DVDs, and other similar storage media), and
tapes.
• Removable media such as optical disks and tapes
are considered tertiary storage.
• These devices have larger capacity, cost less, and provide
slower access to data than do primary storage devices.
• Their data cannot be processed directly by the CPU; it must
first be copied into primary storage.
Memory Hierarchies and Storage
Devices
• The highest-speed memory is the most expensive
and is therefore available with the least capacity.
• At the primary storage level, the memory
hierarchy includes, at the most expensive end, cache
memory, which is a static RAM (Random Access
Memory).
• The next level of primary storage is DRAM
(Dynamic RAM), which provides the main work
area for the CPU for keeping program instructions
and data (main memory).
Memory Hierarchies and Storage
Devices
• At the secondary and tertiary storage level,
the hierarchy includes magnetic disks, as well
as mass storage in the form of CD-ROM
(Compact Disk–Read-Only Memory) and DVD
(Digital Video Disk or Digital Versatile Disk)
devices, and finally tapes at the least expensive end of
the hierarchy.

Memory Hierarchies and Storage
Devices
• Between DRAM and magnetic disk storage sits
another form of memory: flash memory.
• Flash memories are high-density, high-
performance memories using EEPROM
(Electrically Erasable Programmable Read-
Only Memory) technology.

Memory Hierarchies and Storage
Devices
• USB (Universal Serial Bus) flash drives have
become the most portable medium for
carrying data between personal computers;
they have a flash memory storage device
integrated with a USB interface.

Memory Hierarchies and Storage
Devices
• CD-ROMs (Compact Disk – Read Only
Memory)
– A CD-ROM is a pre-pressed optical compact disc that
contains data.
– Computers can read it, but cannot write to or erase it.
• WORM (Write-Once-Read-Many)
– These disks hold about half a gigabyte of data each
and last much longer than magnetic disks.

Memory Hierarchies and Storage
Devices
• Optical jukebox memories use an array of CD-
ROM platters, which are loaded onto drives on
demand.
• DVDs (Digital Video/Versatile Disks)

Optical Jukeboxes DVD Jukeboxes

Memory Hierarchies and Storage
Devices
• Data is typically stored in a single track.
• The track is divided into evenly sized sectors that
store items.

Memory Hierarchies and Storage
Devices
• Finally, magnetic tapes are used for archiving
and backup storage of data.

Storage of Databases
• Databases typically store large amounts of
data that must persist over long periods of time;
such data is often referred to as persistent
data.
• Parts of this data are accessed and processed
repeatedly during this period.
• Most databases are stored permanently (or
persistently) on magnetic disk secondary
storage, for the following reasons:

Storage of Databases
• Generally, databases are too large to fit
entirely in main memory.
• The circumstances that cause permanent loss
of stored data arise less frequently for disk
secondary storage than for primary storage.
• Hence, we refer to disk—and other secondary
storage devices—as nonvolatile storage,
whereas main memory is often called volatile
storage.

Storage of Databases
• The cost of storage per unit of data is an order
of magnitude less for disk secondary storage
than for primary storage.

Storage of Databases
• Some of the newer technologies such as
optical disks, DVDs, and tape jukeboxes are
likely to provide viable alternatives to the use
of magnetic disks.
• However, it is anticipated that magnetic disks
will continue to be the primary medium of
choice for large databases for years to come.

Storage of Databases
• Magnetic tapes are frequently used as a
storage medium for backing up databases
because storage on tape costs even less than
storage on disk.
• However, access to data on tape is quite slow.
Data stored on tapes is offline.
• In contrast, disks are online devices that can
be accessed directly at any time.

Storage of Databases
• The techniques used to store large amounts of
structured data on disk are important for
database designers, the DBA, and
implementers of a DBMS.
• Database designers and the DBA must know
the advantages and disadvantages of each
storage technique when they design,
implement, and operate a database on a
specific DBMS.

Storage of Databases
• The data stored on disk is organized as files of
records.
• Each record is a collection of data values that
can be interpreted as facts about entities,
their attributes, and their relationships.
• Records should be stored on disk in a manner
that makes it possible to locate them
efficiently when they are needed.

Storage of Databases
• There are several primary file organizations,
which determine how the file records are
physically placed on the disk, and hence how
the records can be accessed.
• We discuss primary file organizations here.
A secondary organization, or auxiliary access
structure, allows efficient access to file
records based on alternate fields; these are covered
in later lectures.

Disk Storage Devices (Secondary
Storage Devices)
• Magnetic disks are the preferred secondary storage
device for high storage capacity and low cost.
• Data is stored as magnetized areas on magnetic
disk surfaces.
• A disk pack contains several magnetic disks
connected to a rotating spindle.

Disk Storage Devices (Secondary
Storage Devices)
• Disks are divided into concentric circular
tracks on each disk surface.
– Track capacities vary typically from tens of Kbytes
to 150 Kbytes or more
• Because a track usually contains a large
amount of information, it is divided into
smaller blocks or sectors.

Disk Storage Devices (Secondary
Storage Devices)
• The division of a track into sectors is hard-coded on
the disk surface and cannot be changed.
• The division of a track into equal-sized disk blocks (or
pages) is set by the operating system during disk
formatting (or initialization). Block size is fixed during
initialization and cannot be changed dynamically.
• Typical disk block sizes range from 512 to 8192 bytes.
• Blocks are separated by fixed-size interblock gaps,
which include specially coded control information
written during disk initialization.

Disk Storage Devices (Secondary
Storage Devices)
• A read-write head moves to the track that contains the block
to be transferred.
– Disk rotation moves the block under the read-write head for
reading or writing.
• A physical disk block (hardware) address consists of:
– a cylinder number (imaginary collection of tracks of same radius
from all recorded surfaces)
– the track number or surface number (within the cylinder)
– and block number (within track).
• Reading or writing a disk block is time consuming because of
the seek time s and rotational delay (latency).
• Double buffering can be used to speed up the transfer of
contiguous disk blocks.

Disk Storage Devices (Secondary
Storage Devices)
• To transfer a disk block, given its address, the
disk controller must first mechanically position
the read/write head on the correct track. The
time required to do this is called the seek
time.
• Once the head is on the correct track, there is
another delay, called the rotational delay or
latency, while the beginning of the desired
block rotates into position under the
read/write head.
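
The two delays above can be combined into a rough estimate of the time to access one block. The sketch below assumes that, on average, the desired block is half a revolution away from the head, and it adds a block transfer term that these slides do not spell out. The function names and the drive figures are hypothetical.

```python
def average_rotational_delay_ms(rpm: float) -> float:
    """Average rotational delay: half of one full revolution, in milliseconds."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2


def block_access_time_ms(seek_ms: float, rpm: float, block_bytes: int,
                         transfer_bytes_per_ms: float) -> float:
    """Seek time + average rotational delay + time to transfer one block."""
    transfer_ms = block_bytes / transfer_bytes_per_ms
    return seek_ms + average_rotational_delay_ms(rpm) + transfer_ms


# Hypothetical drive: 8 ms seek, 7200 rpm, 4096-byte blocks, 100,000 bytes/ms.
total = block_access_time_ms(seek_ms=8.0, rpm=7200, block_bytes=4096,
                             transfer_bytes_per_ms=100_000)
print(round(total, 2))
```

With figures like these, the seek time and rotational delay dominate the transfer time, which is why file organizations try to minimize the number of block accesses.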

Hard drive types
• SATA drives

• SSD hard drives

• NVMe

Hard drive types
• SATA (Serial Advanced Technology
Attachment) is the default interface for most
desktop and laptop hard drives.
• A single drive can range from 500 GB to 16 TB,
and SATA drives are available at a lower cost than any of
the other drive types.
• They are good drives if you need a lot of
cheap storage and don’t need extremely high
read or write speeds.

Hard drive types
• Since data is physically written to a disk, it
can also become fragmented, meaning that
different sectors can be spread across
different areas of the disk, slowing down the
drive.
• They also are vulnerable to shock and
sudden movement since there are moving
parts in each drive, which makes them a poor
choice for laptops.

Hard drive types
• Pros:
• Low cost
• High disk sizes
• Cons:
• Not good for laptops
• Requires regular de-fragmentation

Hard drive types
• SSD (Solid State Drive): These disks don't
have any moving parts.
• Instead, all of the data is stored on non-volatile
flash memory.
• That means that there is no needle (read/write head) that has
to move to read or write data, and that they are
significantly faster than SATA hard drives.
• It's difficult to give an exact speed because it
varies by manufacturer and form factor, but
even the lower-performing drives are
comparable to SATA drives.
Hard drive types
• The downside is that these drives are
significantly more expensive and don't come in
as many sizes.
• SSD drives range from about 120 GB to 2 TB,
and are about 2-4 times the price of a SATA
hard drive of the same size.
• Since there are no moving parts, these drives
are also a lot more durable, and there are form
factors built specifically for laptops, making
them ideal for storage on the go.
Hard drive types
• Pros:
• Fast
• More durable, especially for laptops
• Cons:
• More expensive than SATA drives
• Smaller disk sizes

Hard drive types
• NVMe (Non-Volatile Memory Express) is a
type of SSD that's attached to a PCI Express
(PCIe) slot on a main board.
• PCIe slots are high-bandwidth links, also used for
graphics cards, so these drives are incredibly fast.
• That can be very useful if you are doing
something that needs a lot of disk throughput,
like gaming or high-resolution video editing.

Hard drive types
• For as fast as they are, there are some drawbacks
to NVMe drives.
• For starters, they are only available on
desktop PCs and are very expensive.
• Also, while they can be used as secondary
drives, to use one to its full potential, you'll want
to install your operating system on it.
• Most BIOS don't support booting from NVMe
at this time.
• It's still possible to get one that does, but it
might mean replacing your entire main board.
Hard drive types
• Pros:
• Fastest disk type on the market
• Cons:
• Extremely expensive
• Available for desktop PCs only
• May require replacing main board to get full benefit

Buffering blocks
• When several blocks need to be transferred from
disk to main memory and all the block addresses
are known, several buffers can be reserved in main
memory to speed up the transfer.
• While one buffer is being read or written, the CPU
can process data in the other buffer.
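
One way to picture double buffering is a background reader that fills the next buffer while the current one is being processed. The sketch below is only an illustration under that assumption: the "disk" is a Python list, read_block is a stand-in for a real disk read, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def read_block(disk, address):
    """Stand-in for a disk read; here the 'disk' is just a list of blocks."""
    return disk[address]


def process_with_double_buffering(disk, addresses, process):
    """Process blocks in order while the next block is fetched in the background."""
    results = []
    if not addresses:
        return results
    with ThreadPoolExecutor(max_workers=1) as reader:
        # Fill the first buffer before any processing starts.
        pending = reader.submit(read_block, disk, addresses[0])
        for next_address in list(addresses[1:]) + [None]:
            current = pending.result()          # buffer that is ready
            if next_address is not None:
                # Start filling the other buffer while we process this one.
                pending = reader.submit(read_block, disk, next_address)
            results.append(process(current))
    return results


disk = [[1, 2, 3], [4, 5], [6]]
totals = process_with_double_buffering(disk, [0, 1, 2], sum)
print(totals)
```

At any moment at most two blocks are held: the one being processed and the one being read.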

Placing File Records on Disk

Records and Record Types
• Data is usually stored in the form of records.
• Each record consists of a collection of related
data values or items, where each value is
formed of one or more bytes and
corresponds to a particular field of the record.
• Records usually describe entities and their
attributes.

Records and Record Types
• A collection of field names and their
corresponding data types constitutes a record
type or record format definition.
• A data type, associated with each field, specifies
the types of values a field can take.
• The data type of a field is usually one of the
standard data types used in programming, such as
numeric, string of characters, Boolean, and
sometimes specially coded date and time data
types.
Records and Record Types
• In some database applications, the need may
arise for storing data items that consist of
large unstructured objects, which represent
images, digitized video or audio streams, or
free text. These are referred to as BLOBs (binary large
objects).
• A BLOB data item is typically stored separately
from its record in a pool of disk blocks, and a
pointer to the BLOB is included in the record.

Fixed and variable length records
• A file is a sequence of records.
• In many cases, all records in a file are of the
same record type.
• If every record in the file has exactly the same
size (in bytes), the file is said to be made up of
fixed-length records.
• If different records in the file have different
sizes, the file is said to be made up of
variable-length records.

Fixed and variable length records
• A file may have variable-length records for several
reasons:
– The file records are of the same record type, but one or
more of the fields are of varying size (variable-length
fields). For example, the Name field of EMPLOYEE can be
a variable-length field.
– The file records are of the same record type, but one or
more of the fields may have multiple values for individual
records; such a field is called a repeating field and a group
of values for the field is often called a repeating group.
– The file records are of the same record type, but one or
more of the fields are optional; that is, they may have
values for some but not all of the file records (optional
fields).

Fixed and variable length records
• Records contain fields which have values of a
particular type
– E.g., amount, date, time, age
• Fields themselves may be fixed length or
variable length
• Variable length fields can be mixed into one
record:
– Separator characters or length fields are needed
so that the record can be “parsed.”
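
A minimal sketch of the length-field approach is shown below. The layout (a 2-byte big-endian length in front of each field, UTF-8 text) is an assumption chosen for the example, not a format prescribed by any particular DBMS, and the sample field values are made up.

```python
import struct


def encode_record(fields):
    """Store each field as a 2-byte length followed by its UTF-8 bytes."""
    out = bytearray()
    for value in fields:
        data = value.encode("utf-8")
        out += struct.pack(">H", len(data)) + data
    return bytes(out)


def decode_record(raw):
    """Parse a record by reading a length, then that many bytes, repeatedly."""
    fields, pos = [], 0
    while pos < len(raw):
        (length,) = struct.unpack_from(">H", raw, pos)
        pos += 2
        fields.append(raw[pos:pos + length].decode("utf-8"))
        pos += length
    return fields


record = encode_record(["Smith, John", "5", ""])
print(len(record), decode_record(record))
```

A length field also handles an empty (optional) field cleanly, which a separator character can only do if the separator never appears inside the data.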

Record Blocking
• Blocking:
– The records of a file must be allocated to disk
blocks because a block is the unit of data transfer
between disk and memory.
– Blocking refers to storing a number of records in
one block on the disk.

Record Blocking
• Suppose that the block size is B bytes:
– For a file of fixed-length records of size R bytes, with B
≥ R, the blocking factor (bfr) is the number of
records per block: bfr = ⌊B/R⌋, where ⌊x⌋ (floor
function) rounds x down to the nearest integer.
• In general, R may not divide B exactly, so we have
some unused space in each block equal to B − (bfr
* R) bytes.
• In other words, there is empty space in a block whenever
a whole number of records does not fill the block exactly.

Record Blocking
• To utilize this unused space, we can store part
of a record on one block and the rest on
another.
• A pointer at the end of the first block points
to the block containing the remainder of the
record in case it is not the next consecutive
block on disk.
• This organization is called spanned because
records can span more than one block.

Record Blocking
• Spanned records:
– Records are allowed to cross block boundaries, so a
record (including one larger than a block) can span a
number of blocks.
• Unspanned records:
– Records are not allowed to cross block
boundaries.
– This is used with fixed-length records.

Record Blocking
• For variable-length records, either a spanned
or an unspanned organization can be used,
and each block may store a different number
of records.
• In this case, the blocking factor bfr represents
the average number of records per block for
the file.
• We can use bfr to calculate the number of
blocks b needed for a file of r records:

b = ⌈r/bfr⌉

where ⌈x⌉ (ceiling function) rounds the
value x up to the next integer.
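
The blocking formulas can be checked with a few lines of arithmetic. The sketch below covers the fixed-length, unspanned case; the function names and the sample values of B, R, and r are made up for illustration.

```python
import math


def blocking_factor(block_size: int, record_size: int) -> int:
    """bfr = floor(B / R) for fixed-length, unspanned records (requires B >= R)."""
    if record_size > block_size:
        raise ValueError("unspanned records need B >= R")
    return block_size // record_size


def unused_bytes_per_block(block_size: int, record_size: int) -> int:
    """Space left over in each block: B - (bfr * R)."""
    return block_size - blocking_factor(block_size, record_size) * record_size


def blocks_needed(record_count: int, bfr: int) -> int:
    """b = ceil(r / bfr)."""
    return math.ceil(record_count / bfr)


B, R, r = 512, 100, 30_000   # made-up sizes
bfr = blocking_factor(B, R)
print(bfr, unused_bytes_per_block(B, R), blocks_needed(r, bfr))
```

With these numbers each block holds 5 records and wastes 12 bytes, which is the space a spanned organization would try to reclaim.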

Allocating File Blocks on Disk
• There are several standard techniques for
allocating the blocks of a file on disk.
• Contiguous allocation: the file blocks are allocated
to consecutive disk blocks. This makes reading the
whole file very fast using double buffering, but it
makes expanding the file difficult.

Allocating File Blocks on Disk
• Linked allocation: each file block contains a
pointer to the next file block. This makes it
easy to expand the file but makes it slow to
read the whole file.

Allocating File Blocks on Disk
• A combination of contiguous allocation and
linked allocation: allocates clusters of
consecutive disk blocks, and the clusters are
linked.

Allocating File Blocks on Disk
• Indexed allocation: one or more index
blocks contain pointers to the actual file
blocks.
• It is also common to use combinations of
these techniques.
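
The difference between linked and indexed allocation shows up in how a particular file block is reached. The toy model below is a sketch only: the "disk" is a dictionary keyed by block number, and the block numbers and contents are made up.

```python
# A toy "disk": block number -> (data, pointer to next file block or None).
linked_disk = {
    7: ("rec A", 2),
    2: ("rec B", 9),
    9: ("rec C", None),
}


def read_linked(disk, first_block):
    """Linked allocation: follow the pointer stored in each block."""
    data, block = [], first_block
    while block is not None:
        contents, block = disk[block]
        data.append(contents)
    return data


# Indexed allocation: an index block lists the file's blocks in order.
indexed_disk = {7: "rec A", 2: "rec B", 9: "rec C"}
index_block = [7, 2, 9]


def read_indexed(disk, index, position):
    """Indexed allocation: jump straight to the i-th file block via the index."""
    return disk[index[position]]


print(read_linked(linked_disk, 7))
print(read_indexed(indexed_disk, index_block, 2))
```

Reaching the third block takes three block reads with linked allocation, but only the index block plus one data block with indexed allocation.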

File Headers
• A file header or file descriptor contains information
about a file that is needed by the system programs
that access the file records.
• The header includes information to determine the
disk addresses of the file blocks, as well as record
format descriptions. These may include field lengths
and the order of fields within a record for fixed-length
unspanned records, and field type codes,
separator characters, and record type codes for
variable-length records.
File Headers
• To search for a record on disk, one or more blocks
are copied into main memory buffers.
• Programs then search for the desired record or
records within the buffers, using the information
in the file header.
• If the address of the block that contains the
desired record is not known, the search programs
must do a linear search through the file blocks.
• The goal of a good file organization is to locate
the block that contains a desired record with a
minimal number of block transfers.
Operations on Files
• Operations on files are usually grouped into
retrieval operations and update operations.
• Retrieval operations do not change any data in the
file, but only locate certain records so that
their field values can be examined and
processed.
• Update operations change the file by insertion or
deletion of records or by modification of field
values.

Operations on Files
• Search operations on files are generally based
on simple selection conditions.
• A complex condition must be decomposed by
the DBMS (or the programmer) to extract a
simple condition that can be used to locate
the records on disk.

Operations on Files
• Actual operations for locating and accessing
file records vary from system to system.
• What follows is a set of representative operations.
• Typically, high-level programs, such as DBMS
software programs, access records by using
these commands, so we sometimes refer to
program variables in the following
descriptions:

Operations on Files
• Typical file operations include:
– OPEN: Readies the file for access, and allocates appropriate
buffers (typically at least two) to hold file blocks from disk,
and retrieves the file header. Sets the file pointer to the
beginning of the file.
– FIND: Searches for the first file record that satisfies a certain
condition, and makes it the current file record.
– FINDNEXT: Searches for the next file record (from the
current record) that satisfies a certain condition, and makes
it the current file record.
– READ: Reads the current file record into a program variable.
– INSERT: Inserts a new record into the file and makes it the
current file record.

Operations on Files
– DELETE: Removes the current file record from the file,
usually by marking the record to indicate that it is no
longer valid.
– MODIFY: Changes the values of some fields of the current
file record.
– CLOSE: Terminates access to the file.
– REORGANIZE: Reorganizes the file records.
• For example, the records marked deleted are physically
removed from the file or a new organization of the file
records is created.
– READ_ORDERED: Reads the file blocks in order of a specific
field of the file.
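
The record-at-a-time style of these operations can be sketched with a small in-memory class. This is an illustration only: the class and method names are hypothetical, records are Python dictionaries, and only FIND, FINDNEXT, and READ are modelled.

```python
class RecordFile:
    """In-memory stand-in for a file of records (each record is a dict)."""

    def __init__(self, records):
        self._records = list(records)
        self._current = None          # index of the current record

    def find(self, condition):
        """FIND: make the first record satisfying the condition current."""
        return self._scan(0, condition)

    def find_next(self, condition):
        """FINDNEXT: continue the search after the current record."""
        start = 0 if self._current is None else self._current + 1
        return self._scan(start, condition)

    def read(self):
        """READ: return a copy of the current record."""
        return dict(self._records[self._current])

    def _scan(self, start, condition):
        for i in range(start, len(self._records)):
            if condition(self._records[i]):
                self._current = i
                return True
        return False


f = RecordFile([{"name": "Ali", "dept": 5}, {"name": "Sara", "dept": 4},
                {"name": "Omid", "dept": 5}])
in_dept_5 = lambda rec: rec["dept"] == 5
names = []
found = f.find(in_dept_5)
while found:
    names.append(f.read()["name"])
    found = f.find_next(in_dept_5)
print(names)
```

The loop at the end is the typical pattern: FIND once, then READ and FINDNEXT until no further record satisfies the condition.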

Unordered Files
• In this simplest and most basic type of
organization, records are placed in the file in the
order in which they are inserted
• Also called a heap or a pile file.
• New records are inserted at the end of the file.
• A linear search through the file records is
necessary to search for a record.
– This requires reading and searching half the file blocks
on the average, and is hence quite expensive.

Unordered Files
• Record insertion is quite efficient.
• The last disk block of the file is copied into a
buffer, the new record is added, and the block
is then rewritten back to disk.
• The address of the last file block is kept in the
file header.
• Reading the records in order of a particular
field requires sorting the file records.

Unordered Files (cont.)
• To delete a record:
– a program must first find its block,
– copy the block into a buffer,
– delete the record from the buffer,
– and finally rewrite the block back to the disk.
• This leaves unused space in the disk block.
• Deleting a large number of records in this way
results in wasted storage space.

Unordered Files (cont.)
• Another technique used for record deletion is
to have an extra byte or bit, called a deletion
marker, stored with each record.
• A record is deleted by setting the deletion
marker to a certain value.
• A different value for the marker indicates a
valid (not deleted) record.
• Search programs consider only valid records in
a block when conducting their search.

Unordered Files (cont.)
• Both of these deletion techniques require
periodic reorganization of the file to reclaim
the unused space of deleted records.
• During reorganization, the file blocks are
accessed consecutively, and records are
packed by removing deleted records.
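
The deletion-marker technique and the later reorganization can be sketched as follows. This is a simplified in-memory model, assuming each record has an "id" field; blocks and disk I/O are left out, and the class name is hypothetical.

```python
class HeapFile:
    """Unordered file: records are appended; deletion only sets a marker."""

    def __init__(self):
        self.slots = []               # each slot is [deleted_flag, record]

    def insert(self, record):
        self.slots.append([False, record])      # new records go at the end

    def find(self, key):
        """Linear search that skips records marked as deleted."""
        for deleted, record in self.slots:
            if not deleted and record["id"] == key:
                return record
        return None

    def delete(self, key):
        for slot in self.slots:
            if not slot[0] and slot[1]["id"] == key:
                slot[0] = True        # set the deletion marker
                return True
        return False

    def reorganize(self):
        """Pack the file by dropping every record marked as deleted."""
        self.slots = [slot for slot in self.slots if not slot[0]]


heap = HeapFile()
for i in (10, 20, 30):
    heap.insert({"id": i})
heap.delete(20)
print(len(heap.slots), heap.find(20))
heap.reorganize()
print(len(heap.slots))
```

Until reorganize runs, the deleted record still occupies a slot; it is simply invisible to searches.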

Ordered Files (Sorted Files)
• Also called a sequential file.
• File records are kept sorted by the values of an
ordering field. If the ordering field is guaranteed to
have a unique value for each record, it is called the
ordering key for the file.

Ordered Files (cont.)
• Insertion is expensive: records must be
inserted in the correct order.
– It is common to keep a separate unordered overflow file
for new records to improve insertion efficiency; this is
periodically merged with the main ordered file.
• A binary search can be used to search for a
record on its ordering field value.
– For a file of b blocks, this requires reading and searching about
log2(b) blocks on the average, an improvement over linear search.
• Reading the records in order of the ordering
field is quite efficient.
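
A binary search over the blocks of an ordered file might look like the sketch below. It assumes each block is a non-empty list of records sorted on an "id" field and counts one block read per probe; the function name and the sample file are made up.

```python
def binary_search_blocks(blocks, key):
    """Return (record, blocks_read) for a file sorted on the 'id' field.

    `blocks` is a list of blocks; each block is a list of records in order.
    """
    low, high, blocks_read = 0, len(blocks) - 1, 0
    while low <= high:
        mid = (low + high) // 2
        block = blocks[mid]            # one block transfer
        blocks_read += 1
        if key < block[0]["id"]:
            high = mid - 1
        elif key > block[-1]["id"]:
            low = mid + 1
        else:
            # The key can only be in this block; search it in memory.
            for record in block:
                if record["id"] == key:
                    return record, blocks_read
            return None, blocks_read
    return None, blocks_read


# 8 blocks of 4 records each, ids 0..31.
file_blocks = [[{"id": b * 4 + i} for i in range(4)] for b in range(8)]
record, reads = binary_search_blocks(file_blocks, 29)
print(record, reads)
```

Each probe halves the range of candidate blocks, so the number of block reads grows with log2(b) rather than with b.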

Ordered Files (cont.)
• Ordered files have some advantages over
unordered files.
– First, reading the records in order of the ordering key
values becomes extremely efficient because no
sorting is required.
– Second, finding the next record from the current one
in order of the ordering key usually requires no
additional block accesses because the next record is in
the same block as the current one (unless the current
record is the last one in the block).

Ordered Files (cont.)
– Third, using a search condition based on the value
of an ordering key field results in faster access
when the binary search technique is used, which
constitutes an improvement over linear searches,
although it is not often used for disk files.
– Ordered files are blocked and stored on
contiguous cylinders to minimize the seek time.

Average Access Times
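• The following table shows the average access
time, in block accesses, to reach a specific record
for a given type of file with b blocks:

Type of organization | Access/search method | Average blocks accessed
Heap (unordered) | Sequential scan (linear search) | b/2
Ordered | Sequential scan | b/2
Ordered | Binary search | log2 b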

Hashed Files
• Another type of primary file organization is
based on hashing, which provides very fast
access to records under certain search
conditions.
• A file organized this way is usually called a hash file.
• The search condition must be an equality
condition on a single field, called the hash
field. If the hash field is also a key field of the
file, it is called the hash key.

Hashed Files (cont.)
• A hash function (or randomizing function) is
applied to the hash field value to find the address
of the disk block in which the record is stored.
• For most records, we need only a single-block
access to retrieve that record.
• Hashing is also used as an internal search
structure within a program whenever a group
of records is accessed exclusively by using the
value of one field

Internal Hashing
• For internal files, hashing is typically
implemented as a hash table through the use
of an array of records.
• Suppose that the array index range is from 0
to M – 1, as shown in Figure 17.8(a); then we
have M slots whose addresses correspond to
the array indexes.

Internal Hashing (cont.)
• We choose a hash function that transforms
the hash field value into an integer between 0
and M − 1.
• One common hash function is the h(K) = K
mod M function, which returns the remainder
of an integer hash field value K after division
by M; this value is then used for the record
address.

Internal Hashing (cont.)
• Other hashing functions can be used.
• One technique, called folding, involves applying
an arithmetic function such as addition or a
logical function such as exclusive or to different
portions of the hash field value to calculate the
hash address
• Another technique involves picking some digits of
the hash field value—for instance, the third, fifth,
and eighth digits—to form the hash address.
Internal Hashing (cont.)
• For example, storing 1,000 employees with
Social Security numbers of 10 digits into a
hash file with 1,000 positions would give the
Social Security number 301-67-89232 a hash
value of 172 when the third, fifth, and eighth
digits are picked.
• The problem with most hashing functions is
that they do not guarantee that distinct values
will hash to distinct addresses.
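
The hash functions described above can be written out in a few lines. This is a sketch under stated assumptions: the folding variant adds groups of three decimal digits and then takes the result mod M to stay inside the address range, which is one possible choice rather than the only one, and the function names are hypothetical.

```python
def mod_hash(key: int, m: int) -> int:
    """h(K) = K mod M."""
    return key % m


def digit_selection_hash(key: str, positions=(3, 5, 8)) -> int:
    """Form the address from the digits at the given 1-based positions."""
    digits = [ch for ch in key if ch.isdigit()]
    return int("".join(digits[p - 1] for p in positions))


def folding_hash(key: int, m: int, chunk: int = 3) -> int:
    """Add up fixed-size groups of digits, then reduce to the range 0..M-1."""
    text = str(key)
    parts = [int(text[i:i + chunk]) for i in range(0, len(text), chunk)]
    return sum(parts) % m


print(mod_hash(1234, 10),
      digit_selection_hash("301-67-89232"),
      folding_hash(123456789, 1000))
```

Two different keys can easily produce the same address with any of these functions, for example 1234 and 44 under mod_hash with M = 10.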

Internal Hashing (cont.)
• A collision occurs when the hash field value
of a record that is being inserted hashes to an
address that already contains a different
record.
• The process of finding another position is
called collision resolution.
• There are numerous methods for collision
resolution, including the following:

Internal Hashing (cont.)
• Open addressing: Proceeding from the
occupied position specified by the hash
address, the program checks the subsequent
positions in order until an unused (empty)
position is found.
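
Open addressing can be sketched with a fixed-size array and the h(K) = K mod M function. Wrapping around to position 0 after the last slot is an assumption made here so that every slot can be reached; deletion is not handled, and the function names are hypothetical.

```python
def insert_open_addressing(table, key):
    """Place key at h(K) = K mod M, or at the next empty slot after it."""
    m = len(table)
    start = key % m
    for step in range(m):
        slot = (start + step) % m      # wrap around at the end of the array
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


def search_open_addressing(table, key):
    """Probe the same sequence of slots; an empty slot means 'not present'."""
    m = len(table)
    start = key % m
    for step in range(m):
        slot = (start + step) % m
        if table[slot] is None:
            return None
        if table[slot] == key:
            return slot
    return None


table = [None] * 7
slots = [insert_open_addressing(table, k) for k in (10, 17, 24, 6)]
print(slots, search_open_addressing(table, 24))
```

Keys 10, 17, and 24 all hash to position 3, so the second and third are pushed into the following positions.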

Internal Hashing (cont.)
• Chaining: For this method, various overflow locations are
kept, usually by extending the array with a number of
overflow positions.
• Additionally, a pointer field is added to each record location.
• A collision is resolved by placing the new record in an unused
overflow location and setting the pointer of the occupied hash
address location to the address of that overflow location.
• A linked list of overflow records for each hash address is thus
maintained, as shown in Figure 17.8(b).
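
A rough sketch of chaining with an overflow area is given below. It assumes M main positions followed by a fixed number of overflow positions in the same array, uses -1 as the null pointer, and appends each new overflow record to the end of its chain; these details and all names are choices made for the example.

```python
M, OVERFLOW = 5, 5
# Each location holds [key, next_pointer]; -1 means "no next location".
table = [[None, -1] for _ in range(M + OVERFLOW)]
next_free_overflow = M                 # overflow positions follow the M main slots


def insert_chained(key):
    global next_free_overflow
    home = key % M
    if table[home][0] is None:
        table[home][0] = key
        return home
    if next_free_overflow >= len(table):
        raise RuntimeError("overflow area is full")
    # Walk to the end of the chain that starts at the home address.
    last = home
    while table[last][1] != -1:
        last = table[last][1]
    new_location = next_free_overflow
    next_free_overflow += 1
    table[new_location][0] = key
    table[last][1] = new_location      # link the new overflow record
    return new_location


def search_chained(key):
    location = key % M
    while location != -1 and table[location][0] is not None:
        if table[location][0] == key:
            return location
        location = table[location][1]
    return None


locations = [insert_chained(k) for k in (12, 7, 22, 3)]
print(locations, search_chained(22), search_chained(99))
```

Keys 12, 7, and 22 share hash address 2, so 7 and 22 end up in overflow positions linked from location 2.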

Internal Hashing (cont.)
• Multiple hashing: The program applies a second
hash function if the first results in a collision.
• If another collision results, the program uses
open addressing or applies a third hash function
and then uses open addressing if necessary.
• The goal of a good hashing function is to
distribute the records uniformly over the address
space so as to minimize collisions while not
leaving many unused locations.

Internal Hashing (cont.)
• Simulation and analysis studies have shown
that it is usually best to keep a hash table
between 70 and 90 percent full so that the
number of collisions remains low and we do
not waste too much space.

External Hashing for Disk Files
• Hashing for disk files is called external hashing.
• To suit the characteristics of disk storage, the target address
space is made of buckets, each of which holds multiple
records.
• A bucket is either one disk block or a cluster of contiguous
disk blocks.
• The hashing function maps a key into a relative bucket
number, rather than assigning an absolute block address to
the bucket.
• A table maintained in the file header converts the bucket
number into the corresponding disk block address, as
illustrated in Figure 17.9.

External Hashing (cont.)
• The collision problem is less severe with buckets,
because as many records as will fit in a bucket
can hash to the same bucket without causing
problems.
• However, we must make provisions for the case
where a bucket is filled to capacity and a new
record being inserted hashes to that bucket.
• We can use a variation of chaining in which a
pointer is maintained in each bucket to a linked
list of overflow records for the bucket, as shown
in Figure 17.10.
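
The bucket scheme, including the header table and bucket overflow, can be modelled as below. This is a simplified sketch: the block addresses and bucket capacity are made up, and a Python list stands in for the linked list of overflow records.

```python
BUCKET_CAPACITY = 2     # records per bucket (made-up, tiny on purpose)
M = 3                   # number of buckets

# Header table: relative bucket number -> disk block address (made-up addresses).
bucket_directory = {0: 400, 1: 112, 2: 530}
# "Disk": block address -> {"records": [...], "overflow": [...]}.
disk = {addr: {"records": [], "overflow": []} for addr in bucket_directory.values()}


def insert_record(key):
    bucket = disk[bucket_directory[key % M]]
    if len(bucket["records"]) < BUCKET_CAPACITY:
        bucket["records"].append(key)          # no collision problem yet
    else:
        bucket["overflow"].append(key)         # bucket full: chain to overflow


def find_record(key):
    bucket = disk[bucket_directory[key % M]]
    return key in bucket["records"] or key in bucket["overflow"]


for k in (3, 6, 9, 4):
    insert_record(k)
print(disk[400], find_record(9), find_record(5))
```

Keys 3 and 6 share bucket 0 without any problem; only the third key hashing there, 9, has to go to the overflow list.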

External Hashing (cont.)
• Hashing provides the fastest possible access
for retrieving an arbitrary record given the
value of its hash field.
• Although most good hash functions do not
maintain records in order of hash field values,
some functions called order preserving do.

External Hashing (cont.)
• One order-preserving approach is to use an
integer hash key directly as an index
to a relative file, if the hash key values fill up a
particular interval; for example, if employee
numbers in a company are assigned as 1, 2, 3,
... up to the total number of employees, we
can use the identity hash function, which
maintains order.
• Unfortunately, this only works if keys are
generated in order by some application.

External Hashing (cont.)
• The hashing scheme described so far is called
static hashing because a fixed number of
buckets M is allocated.
• This can be a serious drawback for dynamic
files.

