
Saveetha School of Engineering

Saveetha Institute of Medical and Technical Sciences

Department of Computer Science and Engineering

Subject Code : CSA10

Subject Name: Database Management Systems

Sem / Year: III / II

Academic Year: 2018-19

UNIT 1

A database-management system (DBMS) is a collection of interrelated data and a set of
programs to access those data. The collection of data, usually referred to as the database,
contains information relevant to an enterprise. The primary goal of a DBMS is to provide a way
to store and retrieve database information that is both convenient and efficient.

Database System Applications

● Banking: customer information, accounts, and loans, and banking transactions.


● Airlines: reservations and schedule information. Airlines were among the first to use
databases in a geographically distributed manner—terminals situated around the world
accessed the central database system through phone lines and other data networks.
● Universities: student information, course registrations, and grades.
● Credit card transactions: purchases on credit cards and generation of monthly
statements.
● Telecommunication: keeping records of calls made, generating monthly bills,
maintaining balances on prepaid calling cards, and storing information about the
communication networks.
● Finance: storing information about holdings, sales, and purchases of financial
instruments such as stocks and bonds.
● Sales: customer, product, and purchase information.
● Manufacturing: management of supply chain and for tracking production of items in
factories, inventories of items in warehouses/stores, and orders for items.
● Human resources: information about employees, salaries, payroll taxes and benefits,
and for generation of paychecks.

Purpose of Database Systems

Consider part of a savings-bank enterprise that keeps information about all customers and
savings accounts. One way to keep the information on a computer is to store it in operating
system files. To allow users to manipulate the information, the system has a number of
application programs that manipulate the files, including

● A program to debit or credit an account


● A program to add a new account
● A program to find the balance of an account
● A program to generate monthly statements

System programmers wrote these application programs to meet the needs of the bank.

This typical file-processing system is supported by a conventional operating system. The system
stores permanent records in various files, and it needs different application programs to extract
records from, and add records to, the appropriate files. Before database management systems
(DBMSs) came along, organizations usually stored information in such systems. Keeping
organizational information in a file-processing system has a number of major disadvantages:

● Data redundancy and inconsistency.


● Difficulty in accessing data.
● Data isolation.
● Integrity problems.
● Atomicity problems
● Concurrent-access anomalies
● Security problems.

View of Data

A major purpose of a database system is to provide users with an abstract view of the
data. That is, the system hides certain details of how the data are stored and maintained.

Data Abstraction
The need for efficiency has led designers to use complex data structures to represent data in the
database. Since many database-systems users are not computer trained, developers hide the
complexity from users through several levels of abstraction, to simplify users’ interactions with
the system.

• Physical level. The lowest level of abstraction describes how the data are actually stored. The
physical level describes complex low-level data structures in detail.

• Logical level. The next-higher level of abstraction describes what data are stored in the
database, and what relationships exist among those data. The logical level thus describes the
entire database in terms of a small number of relatively simple structures. Database
administrators, who must decide what information to keep in the database, use the logical level
of abstraction.

• View level. The highest level of abstraction describes only part of the entire database. Many
users of the database system do not need all this information; instead, they need to access only a
part of the database. The view level of abstraction exists to simplify their interaction with the
system.

Instances and Schemas

Databases change over time as information is inserted and deleted. The collection of information
stored in the database at a particular moment is called an instance of the database. The overall
design of the database is called the database schema. Database systems have several schemas,
partitioned according to the levels of abstraction.

The physical schema describes the database design at the physical level, while the logical
schema describes the database design at the logical level. A database may also have several
schemas at the view level, sometimes called subschemas, that describe different views of the
database.

Data Models

Underlying the structure of a database is the data model: a collection of conceptual tools for
describing data, data relationships, data semantics, and consistency constraints.

The Entity-Relationship Model


The entity-relationship (E-R) data model is based on a perception of a real world that consists of
a collection of basic objects, called entities, and of relationships among these objects. An entity
is a “thing” or “object” in the real world that is distinguishable from other objects. For example,
each person is an entity, and bank accounts can be considered as entities. Entities are described
in a database by a set of attributes. For example, the attributes account-number and balance may
describe one particular account in a bank, and they form attributes of the account entity set.
Similarly, attributes customer-name, customer-street address and customer-city may describe a
customer entity.

A relationship is an association among several entities. For example, a depositor relationship
associates a customer with each account that she has. The set of all entities of the same type and
the set of all relationships of the same type are termed an entity set and a relationship set, respectively.

The overall logical structure (schema) of a database can be expressed graphically by an E-R
diagram, which is built up from the following components:
• Rectangles, which represent entity sets
• Ellipses, which represent attributes
• Diamonds, which represent relationships among entity sets
• Lines, which link attributes to entity sets and entity sets to relationships

Such an E-R diagram indicates that there are two entity sets, customer and account, with attributes as
outlined earlier. The diagram also shows a relationship depositor between customer and account.
In addition to entities and relationships, the E-R model represents certain constraints to which the
contents of a database must conform. One important constraint is mapping cardinalities, which
express the number of entities to which another entity can be associated via a relationship set.
For example, if each account must belong to only one customer, the E-R model can express that
constraint.

Relational Model

The relational model uses a collection of tables to represent both data and the relationships
among those data. Each table has multiple columns, and each column has a unique name.
One table may show details of bank customers, and another may show which accounts belong to which
customers. The customer table shows, for example, that the customer identified
by customer-id 192-83-7465 is named Johnson and lives at 12 Alma St. in Palo Alto.

The relational model is an example of a record-based model. Record-based models are so named
because the database is structured in fixed-format records of several types. Each table contains
records of a particular type. Each record type defines a fixed number of fields, or attributes.
Other Data Models

The object-oriented model can be seen as extending the E-R model with notions of
encapsulation, methods (functions), and object identity. The object-relational data model
combines features of the object-oriented data model and relational data model.

Semistructured data models permit the specification of data where individual data items of the
same type may have different sets of attributes, in contrast with the earlier data models, in which
every data item of a particular type must have the same set of attributes. The extensible markup
language (XML) is widely used to represent semistructured data.

Two other data models, the network data model and the hierarchical data model, preceded the
relational data model. These models were tied closely to the underlying implementation, and
complicated the task of modeling data.

Database Languages

A database system provides a data definition language to specify the database schema and a
data manipulation language to express database queries and updates. In practice, the data definition and
data manipulation languages are not two separate languages; instead they simply form parts of a
single database language, such as the widely used SQL language.

Data-Definition Language
We specify a database schema by a set of definitions expressed by a special language called a
data-definition language (DDL). For instance, the following statement in the SQL language
defines the account table:
create table account
(account-number char(10),
balance integer)

Execution of the above DDL statement creates the account table. In addition, it updates a special
set of tables called the data dictionary or data directory.

A data dictionary contains metadata—that is, data about data. The schema of a table is an
example of metadata. A database system consults the data dictionary before reading or
modifying actual data. We specify the storage structure and access methods used by the database
system by a set of statements in a special type of DDL called a data storage and definition
language. These statements define the implementation details of the database schemas, which are
usually hidden from the users. The data values stored in the database must satisfy certain
consistency constraints. For example, suppose the balance on an account should not fall below
$100. The DDL provides facilities to specify such constraints. The database systems check these
constraints every time the database is updated.
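
As a minimal sketch of how such a constraint might be declared, the account definition above can be
extended with a check clause. (Underscores replace the hyphens used in the text, since most SQL
systems do not allow hyphens in identifiers; the exact syntax accepted varies by system.)

create table account
(account_number char(10),
balance integer,
check (balance >= 100))  -- rejects any insert or update that would leave the balance below $100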

Data-Manipulation Language
Data manipulation means:

• The retrieval of information stored in the database
• The insertion of new information into the database
• The deletion of information from the database
• The modification of information stored in the database

A data-manipulation language (DML) is a language that enables users to access or manipulate
data as organized by the appropriate data model. There are basically two types:

• Procedural DMLs require a user to specify what data are needed and how to
get those data.
• Declarative DMLs (also referred to as nonprocedural DMLs) require a user to specify what
data are needed without specifying how to get those data. Declarative DMLs are usually easier to
learn and use than are procedural DMLs. However, since a user does not have to specify how to
get the data, the database system has to figure out an efficient means of accessing data. The
DML component of the SQL language is nonprocedural.
A query is a statement requesting the retrieval of information. The portion of a DML that
involves information retrieval is called a query language.

The following query in the SQL language finds the name of the customer whose customer-id is 192:

select customer.customer-name
from customer
where customer.customer-id = 192

The query specifies that those rows from the table customer where the customer-id is 192 must
be retrieved, and the customer-name attribute of these rows must be displayed.
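
The insertion, deletion, and modification operations listed at the start of this section can also be written
in SQL. A minimal sketch against the account table defined earlier (underscore identifiers; the values
are purely illustrative):

-- insert a new account
insert into account (account_number, balance)
values ('A-777', 500);

-- raise the balance of an existing account
update account
set balance = balance + 100
where account_number = 'A-777';

-- delete the account again
delete from account
where account_number = 'A-777';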

Database Access from Application Programs

Application programs are programs that are used to interact with the database. Application
programs are usually written in a host language, such as Cobol, C, C++, or Java. Examples in a
banking system are programs that generate payroll checks, debit accounts, credit accounts, or
transfer funds between accounts. To access the database, DML statements need to be executed
from the host language.

Database Users and Administrators

A primary goal of a database system is to retrieve information from and store new information in
the database. People who work with a database can be categorized as database users or database
administrators.

Database Users and User Interfaces


There are four different types of database-system users, differentiated by the way they
expect to interact with the system. Different types of user interfaces have been designed for the
different types of users.
• Naive users are users who interact with the system by invoking one of the application programs
that have been written previously. As an example, consider a user who wishes to find her account
balance over the World Wide Web. Such a user may access a form, where she enters her
account number. Naive users may also simply read reports generated from the database.

• Application programmers are computer professionals who write application programs.


Application programmers can choose from many tools to develop user interfaces. Rapid
application development (RAD) tools are tools that enable an application programmer to
construct forms and reports without writing a program.

• Sophisticated users interact with the system without writing programs. Instead, they form their
requests in a database query language. They submit each such query to a query processor,
whose function is to break down DML statements into instructions that the storage manager
understands.

• Specialized users are sophisticated users who write specialized database applications that do
not fit into the traditional data-processing framework. Among these applications are knowledge-base and expert systems, systems that store
data with complex data types (for example, graphics data and audio data), and environment-
modeling systems.

Database Administrator

One of the main reasons for using DBMSs is to have central control of both the data and the
programs that access those data. A person who has such central control over the system is called
a database administrator (DBA).

The functions of a DBA are:

• Schema definition. The DBA creates the original database schema by executing a set of data
definition statements in the DDL.

• Storage structure and access-method definition.

• Schema and physical-organization modification. The DBA carries out changes to the schema
and physical organization to reflect the changing needs of the organization, or to alter the
physical organization to improve performance.

• Granting of authorization for data access. By granting different types of authorization, the
database administrator can regulate which parts of the database various users can access. The
authorization information is kept in a special system structure that the database system consults
whenever someone attempts to access the data in the system.

• Routine maintenance. Examples of the database administrator’s routine maintenance activities
are:
– Periodically backing up the database, to prevent loss of data in case of disasters, flooding, etc.
– Ensuring that enough free disk space is available, and upgrading disk space if required.
– Monitoring jobs running on the database and ensuring that performance is not degraded.

Database Architecture

A database system has several subsystems.

Storage Manager
A storage manager is a program module that provides the interface between the low-level data
stored in the database and the application programs and queries submitted to the system. The
storage manager is responsible for the interaction with the file manager. The raw data are stored
on the disk using the file system, which is usually provided by a conventional operating system.
The storage manager translates the various DML statements into low-level file-system
commands. Thus, the storage manager is responsible for storing, retrieving, and updating data in
the database.

The storage manager components include:

• Authorization and integrity manager, which tests for the satisfaction of integrity constraints
and checks the authority of users to access data.

• Buffer manager, which is responsible for fetching data from disk storage into main memory,
and deciding what data to cache in main memory. The buffer manager is a critical part of the
database system, since it enables the database to handle data sizes that are much larger than the
size of main memory.
• Transaction manager, which ensures that the database remains in a consistent (correct) state
despite system failures, and that concurrent transaction executions proceed without conflicting.
• File manager, which manages the allocation of space on disk storage and the data structures
used to represent information stored on disk.

The storage manager implements several data structures as part of the physical system
implementation:

• Data files, which store the database itself.


• Data dictionary, which stores metadata about the structure of the database, in particular the
schema of the database.
• Indices, which provide fast access to data items that hold particular values.

The Query Processor


The query processor components include:

• DDL interpreter, which interprets DDL statements and records the definitions in the data
dictionary.
• DML compiler, which translates DML statements in a query language into an evaluation plan
consisting of low-level instructions that the query evaluation engine understands. A query can
usually be translated into any of a number of alternative evaluation plans that all give the same
result. The DML compiler also performs query optimization, that is, it picks the lowest cost
evaluation plan from among the alternatives.

• Query evaluation engine, which executes low-level instructions generated by
the DML compiler.

Application Architectures

Database applications are typically broken up into a front-end part that runs at client machines
and a part that runs at the back end. In two-tier architectures, the front-end directly
communicates with a database running at the back end. In three-tier architectures, the back end
part is itself broken up into an application server and a database server.
Entity-Relationship Model

The entity-relationship (E-R) data model perceives the real world as consisting of basic objects,
called entities, and relationships among these objects. It was developed to facilitate database
design by allowing specification of an enterprise schema, which represents the overall logical
structure of a database. The E-R data model is one of several semantic data models; the
semantic aspect of the model lies in its representation of the meaning of the data. The E-R model
is very useful in mapping the meanings and interactions of real-world enterprises onto a
conceptual schema.

Basic Concepts
The E-R data model employs three basic notions: entity sets, relationship sets, and attributes.

Entity Sets
An entity is a “thing” or “object” in the real world that is distinguishable from all other objects.
For example, each person in an enterprise is an entity. An entity has a set of properties, and the
values for some set of properties may uniquely identify an entity. For instance, a person may
have a person-id property whose value uniquely identifies that person. Thus, the value 677-89-
9011 for person-id would uniquely identify one particular person in the enterprise. An entity may
be concrete, such as a person or a book, or it may be abstract, such as a loan, or a holiday, or a
concept.

An entity set is a set of entities of the same type that share the same properties, or attributes. The
set of all persons who are customers at a given bank, for example, can be defined as the entity set
customer. Similarly, the entity set loan might represent the set of all loans awarded by a
particular bank.

Attributes
An entity is represented by a set of attributes. Attributes are descriptive properties possessed by
each member of an entity set. Possible attributes of the customer entity set are customer-id,
customer-name, customer-street, and customer-city. Possible attributes of the loan entity set are
loan-number and amount.

• Simple and composite attributes. In our examples thus far, the attributes have been simple; that
is, they are not divided into subparts. Composite attributes, on the other hand, can be divided into
subparts (that is, other attributes). For example, an attribute name could be structured as a
composite attribute consisting of first-name, middle-initial, and last-name.

• Single-valued and multivalued attributes. The attributes in our examples all have a single
value for a particular entity. For instance, the loan-number attribute for a specific loan entity
refers to only one loan number. Such attributes are said to be single valued. There may be
instances where an attribute has
a set of values for a specific entity. Consider an employee entity set with the attribute phone-
number. An employee may have zero, one, or several phone numbers, and different employees
may have different numbers of phones. This type of attribute is said to be multivalued.

• Derived attribute. The value for this type of attribute can be derived from the values of other
related attributes or entities. As an example, suppose that the customer entity set has an attribute age,
which indicates the customer’s age. If the customer entity set also has an attribute date-of-birth,
we can calculate age from date-of-birth and the current date. Thus, age is a derived attribute. In
this case, date-of-birth may be referred to as a base attribute, or a stored attribute. The value of a
derived attribute is not stored, but is computed when required.

Relationship Sets
A relationship is an association among several entities. For example, we can define a
relationship that associates customer Hayes with loan L-15. This relationship specifies that
Hayes is a customer with loan number L-15.
A relationship set is a set of relationships of the same type.

As an example, consider the entity sets employee, branch, and job. Examples of job entities
could include manager, teller, auditor, and so on. Job entities may have the attributes title and
level. The relationship set works-on among employee, branch, and job is an example of a ternary
relationship. A ternary relationship among Jones, Perryridge, and manager indicates that Jones
acts as a manager at the Perryridge branch. Jones could also act as auditor at the Downtown
branch, which would be represented by another relationship.

Constraints
An E-R enterprise schema may define certain constraints to which the contents of a database
must conform. In this section, we examine mapping cardinalities and participation constraints,
which are two of the most important types of constraints.

Mapping Cardinalities

Mapping cardinalities, or cardinality ratios, express the number of entities to which another
entity can be associated via a relationship set. Mapping cardinalities are most useful in
describing binary relationship sets, although they can contribute to the description of relationship
sets that involve more than two entity sets. In this section, we shall concentrate on only binary
relationship sets.

For a binary relationship set R between entity sets A and B, the mapping cardinality must be one
of the following (an SQL sketch follows the list):
• One to one. An entity in A is associated with at most one entity in B, and an entity in B is
associated with at most one entity in A
• One to many. An entity in A is associated with any number (zero or more) of entities in B. An
entity in B, however, can be associated with at most one entity in A.
• Many to one. An entity in A is associated with at most one entity in B. An entity in B, however,
can be associated with any number (zero or more) of entities in A.
• Many to many. An entity in A is associated with any number (zero or more) of
entities in B, and an entity in B is associated with any number (zero or more)
of entities in A.
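
In a relational schema, a one-to-many (or many-to-one) cardinality is commonly enforced with a
foreign key on the "many" side. A sketch for the earlier constraint that each account must belong to
exactly one customer (hypothetical table definitions, underscore identifiers):

create table customer
(customer_id char(11) primary key,
customer_name varchar(30));

create table account
(account_number char(10) primary key,
balance integer,
customer_id char(11) not null references customer(customer_id));
-- each account row refers to exactly one customer,
-- while one customer may be referred to by many accounts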

Participation Constraints


The participation of an entity set E in a relationship set R is said to be total if every entity in E
participates in at least one relationship in R. If only some entities in E participate in relationships
in R, the participation of entity set E in relationship set R is said to be partial. For example, we
expect every loan entity to be related to at least one customer through the borrower relationship.
Therefore the participation of loan in the relationship set borrower is total. In contrast, an
individual can be a bank customer whether or not she has a loan with the bank. Hence, it is
possible that only some of the customer entities are related to the loan entity set through the
borrower relationship, and the participation of customer in the borrower relationship set is
therefore partial.

Entity-Relationship Diagram
An E-R diagram can express the overall logical structure of a database graphically. E-R
diagrams are simple and clear—qualities that may well account in large part for the widespread
use of the E-R model. Such a diagram consists of the following major components:
• Rectangles, which represent entity sets
• Ellipses, which represent attributes
• Diamonds, which represent relationship sets
• Lines, which link attributes to entity sets and entity sets to relationship sets
• Double ellipses, which represent multivalued attributes
• Dashed ellipses, which denote derived attributes
• Double lines, which indicate total participation of an entity in a relationship set
• Double rectangles, which represent weak entity sets

E-R diagram Example - Online Bookstore


E-R diagram – Banking
Database Model
A Database model defines the logical design of data. The model describes the relationships between
different parts of the data. Historically, in database design, three models are commonly used. They
are,

● Hierarchical Model
● Network Model
● Relational Model

Hierarchical Model
In this model, each entity has only one parent but can have several children. At the top of the hierarchy
there is only one entity, which is called the root.

Network Model
In the network model, entities are organised in a graph, in which some entities can be accessed
through several paths.
Relational Model

In this model, data is organised in two-dimensional tables called relations. The tables, or relations, are
related to each other.
UNIT-II

Introduction to Relational Databases

A relational database consists of a collection of tables, each of which is assigned a unique name. A row
in a table represents a relationship among a set of values.

Consider the account table shown below. It has three column headers: account-number, branch-name, and
balance. For each attribute, there is a set of permitted values, called the domain of that attribute. For
the attribute branch-name, for example, the domain is the set of all branch names. Let D1 denote the
set of all account numbers, D2 the set of all branch names, and D3 the set of all balances. Any row of
account must consist of a 3-tuple (v1, v2, v3), where v1 is an account number (that is, v1 is in domain
D1), v2 is a branch name (that is, v2 is in domain D2), and v3 is a balance (that is, v3 is in domain D3).
In general, account will contain only a subset of the set of all possible
rows. Therefore, account is a subset of
D1 × D2 × D3
In general, a table of n attributes must be a subset of
D1 × D2 × ... × Dn−1 × Dn

The account relation:

account-number   branch-name   balance
A-101            Downtown      500
A-102            Perryridge    400
A-201            Brighton      900
A-215            Mianus        700
A-217            Brighton      750
A-222            Redwood       700
A-305            Round Hill    350
Functional Dependencies (FD)

A functional dependency can be described as follows:

1. An attribute is functionally dependent if its value is determined by
another attribute which is a key.
2. That is, if we know the value of one (or several) data items, then we can
find the value of another (or several).
3. Functional dependencies are expressed as X → Y, where X is the
determinant and Y is the functionally dependent attribute (see the example below).
4. If A → (B,C) then A → B and A → C.
5. If (A,B) → C, then it is not necessarily true that A → C and B → C.
6. If A → B and B → A, then A and B are in a 1-1 relationship.
7. If A → B then for A there can only ever be one value for B.
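
For example, in the account relation shown above, account-number uniquely identifies each row, so
account-number → branch-name, account-number → balance, and account-number → (branch-name,
balance) all hold. By contrast, branch-name → balance does not hold, because two accounts at the same
branch (A-201 and A-217 at Brighton) have different balances.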

Transitive Dependencies (TD)

A transitive dependency can be described as follows:

1. An attribute is transitively dependent if its value is determined by another
attribute which is not a key.
2. If X → Y and X is not a key then this is a transitive dependency.
3. A transitive dependency exists when A → B → C but NOT A → C.

Multi-Valued Dependencies (MVD)

A multi-valued dependency can be described as follows:

1. A table involves a multi-valued dependency if it may contain multiple
values for an entity.
2. A multi-valued dependency may arise as a result of enforcing 1st normal
form.
3. X →→ Y, i.e. X multi-determines Y, when for each value of X we can have
more than one value of Y.
4. If A →→ B and A →→ C then we have a single attribute A which multi-
determines two other independent attributes, B and C.
5. If A →→ (B,C) then we have an attribute A which multi-determines a set of
associated attributes, B and C.
Join Dependencies (JD)

A join dependency can be described as follows:

1. If a table can be decomposed into three or more smaller tables, it must be
capable of being joined again on common keys to form the original table.

Modification Anomalies

A major objective of data normalization is to avoid modification anomalies.


These come in two flavors:

1. An insertion anomaly is a failure to place information about a new


database entry into all the places in the database where information
about that new entry needs to be stored. In a properly normalized
database, information about a new entry needs to be inserted into only
one place in the database. In an inadequately normalized database,
information about a new entry may need to be inserted into more than
one place, and, human fallibility being what it is, some of the needed
additional insertions may be missed.

2. A deletion anomaly is a failure to remove information about an existing


database entry when it is time to remove that entry. In a properly
normalized database, information about an old, to-be-gotten-rid-of entry
needs to be deleted from only one place in the database. In an
inadequately normalized database, information about that old entry may
need to be deleted from more than one place, and, human fallibility being
what it is, some of the needed additional deletions may be missed.

An update of a database involves modifications that may be additions,


deletions, or both. Thus 'update anomalies' can be either of the kinds of
anomalies discussed above.

All three kinds of anomalies are highly undesirable, since their occurrence
constitutes corruption of the database. Properly normalized databases are
much less susceptible to corruption than are unnormalized databases.
RELATIONAL ALGEBRA

NORMALIZATION

Relational database theory, and the principles of normalization, were first


constructed by people with a strong mathematical background. They wrote
about databases using terminology which was not easily understood outside
those mathematical circles. Below is an attempt to provide understandable
explanations.

Data normalization is a set of rules and techniques concerned with:

● Identifying relationships among attributes.


● Combining attributes to form relations.
● Combining relations to form a database.

It follows a set of rules worked out by E F Codd in 1970. A normalized


relational database provides several benefits:

● Elimination of redundant data storage.


● Close modeling of real world entities, processes, and their relationships.
● Structuring of data so that the model is flexible.

NORMAL FORMS

1. First Normal Form (1NF)


2. Second Normal Form (2NF)
3. Third Normal Form (3NF)
4. Boyce-Codd Normal Form (BCNF)
5. Fourth Normal Form (4NF)
6. Fifth Normal Form (5NF)

UNNORMALIZED RELATIONS

• FIRST STEP IN NORMALIZATION IS TO CONVERT THE DATA INTO A


TWO-DIMENSIONAL TABLE

• IN UNNORMALIZED RELATIONS DATA CAN REPEAT WITHIN A


COLUMN
FIRST NORMAL FORM

• TO MOVE TO FIRST NORMAL FORM A RELATION MUST CONTAIN


ONLY ATOMIC VALUES AT EACH ROW AND COLUMN.

– NO REPEATING GROUPS

– A COLUMN OR SET OF COLUMNS IS CALLED A CANDIDATE KEY


WHEN ITS VALUES CAN UNIQUELY IDENTIFY THE ROW IN THE
RELATION.
1NF STORAGE ANOMALIES

• INSERTION: A NEW PATIENT HAS NOT YET UNDERGONE SURGERY --


HENCE NO SURGEON # -- SINCE SURGEON # IS PART OF THE KEY
WE CAN’T INSERT.

• INSERTION: IF A SURGEON IS NEWLY HIRED AND HASN’T OPERATED


YET -- THERE WILL BE NO WAY TO INCLUDE THAT PERSON IN THE
DATABASE.

• UPDATE: IF A PATIENT COMES IN FOR A NEW PROCEDURE, AND HAS


MOVED, WE NEED TO CHANGE MULTIPLE ADDRESS ENTRIES.

• DELETION (TYPE 1): DELETING A PATIENT RECORD MAY ALSO


DELETE ALL INFO ABOUT A SURGEON.

• DELETION (TYPE 2): WHEN THERE ARE FUNCTIONAL DEPENDENCIES


(LIKE SIDE EFFECTS AND DRUG) CHANGING ONE ITEM ELIMINATES
OTHER INFORMATION.
SECOND NORMAL FORM

• A RELATION IS SAID TO BE IN SECOND NORMAL FORM WHEN EVERY


NONKEY ATTRIBUTE IS FULLY FUNCTIONALLY DEPENDENT ON THE
PRIMARY KEY.

– THAT IS, EVERY NONKEY ATTRIBUTE NEEDS THE FULL


PRIMARY KEY FOR UNIQUE IDENTIFICATION
1NF STORAGE ANOMALIES REMOVED

• INSERTION: CAN NOW ENTER NEW PATIENTS WITHOUT SURGERY.

• INSERTION: CAN NOW ENTER SURGEONS WHO HAVEN’T OPERATED.

• DELETION (TYPE 1): IF CHARLES BROWN DIES THE CORRESPONDING


TUPLES FROM PATIENT AND SURGERY TABLES CAN BE DELETED
WITHOUT LOSING INFORMATION ON DAVID ROSEN.

• UPDATE: IF JOHN WHITE COMES IN FOR THIRD TIME, AND HAS


MOVED, WE ONLY NEED TO CHANGE THE PATIENT TABLE

2NF STORAGE ANOMALIES

• INSERTION: CANNOT ENTER THE FACT THAT A PARTICULAR DRUG


HAS A PARTICULAR SIDE EFFECT UNLESS IT IS GIVEN TO A PATIENT.
• DELETION: IF JOHN WHITE RECEIVES SOME OTHER DRUG BECAUSE
OF THE PENICILLIN RASH, AND A NEW DRUG AND SIDE EFFECT ARE
ENTERED, WE LOSE THE INFORMATION THAT PENICILLIN CAN
CAUSE A RASH

• UPDATE: IF DRUG SIDE EFFECTS CHANGE (A NEW FORMULA) WE


HAVE TO UPDATE MULTIPLE OCCURRENCES OF SIDE EFFECTS.

THIRD NORMAL FORM

• A RELATION IS SAID TO BE IN THIRD NORMAL FORM IF THERE IS NO


TRANSITIVE FUNCTIONAL DEPENDENCY BETWEEN NONKEY
ATTRIBUTES

– WHEN ONE NONKEY ATTRIBUTE CAN BE DETERMINED WITH


ONE OR MORE NONKEY ATTRIBUTES THERE IS SAID TO BE A
TRANSITIVE FUNCTIONAL DEPENDENCY.

• THE SIDE EFFECT COLUMN IN THE SURGERY TABLE IS DETERMINED


BY THE DRUG ADMINISTERED

– SIDE EFFECT IS TRANSITIVELY FUNCTIONALLY DEPENDENT ON


DRUG SO SIDE EFFECT IS NOT 3NF
2NF STORAGE ANOMALIES REMOVED

• INSERTION: WE CAN NOW ENTER THE FACT THAT A PARTICULAR


DRUG HAS A PARTICULAR SIDE EFFECT IN THE DRUG RELATION.

• DELETION: IF JOHN WHITE RECEIVES SOME OTHER DRUG AS A
RESULT OF THE RASH FROM PENICILLIN, THE INFORMATION ON
PENICILLIN AND RASH IS STILL MAINTAINED.

• UPDATE: THE SIDE EFFECTS FOR EACH DRUG APPEAR ONLY ONCE.

BOYCE-CODD NORMAL FORM

• MOST 3NF RELATIONS ARE ALSO BCNF RELATIONS.

• A 3NF RELATION IS NOT IN BCNF IF:

– CANDIDATE KEYS IN THE RELATION ARE COMPOSITE KEYS


(THEY ARE NOT SINGLE ATTRIBUTES)

– THERE IS MORE THAN ONE CANDIDATE KEY IN THE RELATION,


AND

– THE KEYS ARE NOT DISJOINT, THAT IS, SOME ATTRIBUTES IN


THE KEYS ARE COMMON
MOST 3NF RELATIONS ARE ALSO BCNF – IS THIS ONE?

BCNF RELATIONS

FOURTH AND FIFTH NORMAL FORM

When attributes in a relation have multivalued dependencies, further
normalization to 4NF and 5NF is required. We will illustrate this with an
example. Consider a vendor supplying many items to many projects in an
organization.

The following are the assumptions:

1. A vendor is capable of supplying many items.


2. A project uses many items.
3. A vendor supplies to many projects.
4. An item may be supplied by many vendors.

Table 8 gives a relation for this problem and Figure 10 gives the dependency
diagram(s).

Figure 10: dependency diagram for the vendor-supply-project relation

The relation given in Table 8 has a number of problems. For example:

⮚ If vendor V1 has to supply to project P2, but the item is not yet decided,
then a row with a blank item code has to be introduced.
⮚ The information about item 1 is stored twice for vendor V3.

Observe that the relation given in Table 8 is in 3NF and also in BCNF, yet it still
has the problems mentioned above. The problem is reduced by expressing this
relation as two relations in Fourth Normal Form (4NF). A relation is in 4NF
if it has no more than one independent multivalued dependency, or one
independent multivalued dependency together with a functional dependency. Table 8
can be expressed as the two 4NF relations given in Table 9. The fact that
vendors are capable of supplying certain items, and the fact that they are assigned to
supply to certain projects, are specified independently in the 4NF relations.

These relations still have a problem. Even though vendor V1's capability to
supply items and its assignment to specified projects are recorded, they do not
tell us which item V1 actually supplies to which project. We thus need another
relation which specifies this; this is what the Fifth Normal Form (5NF) captures. The
5NF relations are the relations in Tables 9(a) and 9(b) together with the relation
given in Table 10.
JOIN DEPENDENCIES:
The normal forms discussed so far required that a relation R which is
not in the given normal form be decomposed into two relations to meet the
requirements of that normal form. In some cases a relation can still have problems
such as redundant information and the resulting update anomalies, yet cannot be
decomposed into two relations to remove them. In such cases it may
be possible to decompose the relation into three or more relations using 5NF.

The fifth normal form deals with join dependencies, which are a
generalization of the MVD. The aim of fifth normal form is to have relations
that cannot be decomposed further. A relation in 5NF cannot be constructed
from several smaller relations. A relation R satisfies the join dependency (R1, R2,
..., Rn) if and only if R is equal to the join of R1, R2, ..., Rn, where the Ri are
subsets of the set of attributes of R.
UNIT – 3 & 4
DATA STORAGE AND QUERY PROCESSING

Classification of Physical Storage Media


n Speed with which data can be accessed
n Cost per unit of data
n Reliability
H data loss on power failure or system crash
H physical failure of the storage device
n Can differentiate storage into:
H volatile storage: loses contents when power is switched off
H non-volatile storage:
4 Contents persist even when power is switched off.
4 Includes secondary and tertiary storage, as well as battery-backed-up main
memory.
PHYSICAL STORAGE MEDIA
n Cache – fastest and most costly form of storage; volatile; managed by the computer
system hardware.
n Main memory:
H fast access (10s to 100s of nanoseconds; 1 nanosecond = 10^–9 seconds)
H generally too small (or too expensive) to store the entire database
4 capacities of up to a few Gigabytes widely used currently
4 Capacities have gone up and per-byte costs have decreased steadily and
rapidly (roughly factor of 2 every 2 to 3 years)
H Volatile — contents of main memory are usually lost if a power failure or system
crash occurs.
n Flash memory
H Data survives power failure
H Data can be written at a location only once, but location can be erased and written
to again
4 Can support only a limited number of write/erase cycles.
4 Erasing of memory has to be done to an entire bank of memory
H Reads are roughly as fast as main memory
H But writes are slow (few microseconds), erase is slower
H Cost per unit of storage roughly similar to main memory
H Widely used in embedded devices such as digital cameras
H also known as EEPROM (Electrically Erasable Programmable Read-Only
Memory)
n Magnetic-disk
H Data is stored on spinning disk, and read/written magnetically
H Primary medium for the long-term storage of data; typically stores entire
database.
HData must be moved from disk to main memory for access, and written back for
storage
4 Much slower access than main memory (more on this later)
H direct-access – possible to read data on disk in any order, unlike magnetic tape
H Hard disks vs floppy disks
H Capacities range up to roughly 100 GB currently
4 Much larger capacity and cost/byte than main memory/flash memory
4 Growing constantly and rapidly with technology improvements (factor of
2 to 3 every 2 years)
H Survives power failures and system crashes
4 disk failure can destroy data, but is very rare
n Optical storage
H non-volatile, data is read optically from a spinning disk using a laser
H CD-ROM (640 MB) and DVD (4.7 to 17 GB) most popular forms
H Write-once, read-many (WORM) optical disks used for archival storage (CD-R
and DVD-R)
H Multiple write versions also available (CD-RW, DVD-RW, and DVD-RAM)
H Reads and writes are slower than with magnetic disk
H Juke-box systems, with large numbers of removable disks, a few drives, and a
mechanism for automatic loading/unloading of disks available for storing large
volumes of data
n Tape storage
H non-volatile, used primarily for backup (to recover from disk failure), and for
archival data
H sequential-access – much slower than disk
H very high capacity (40 to 300 GB tapes available)
H tape can be removed from drive ⇒ storage costs much cheaper than disk, but
drives are expensive
H Tape jukeboxes available for storing massive amounts of data
4 hundreds of terabytes (1 terabyte = 10^12 bytes) to even a petabyte (1
petabyte = 10^15 bytes)
STORAGE HIERARCHY
n primary storage: Fastest media but volatile (cache, main memory).
n secondary storage: next level in hierarchy, non-volatile, moderately fast access time
H also called on-line storage
H E.g. flash memory, magnetic disks
n tertiary storage: lowest level in hierarchy, non-volatile, slow access time
H also called off-line storage
H E.g. magnetic tape, optical storage
MAGNETIC DISKS

n Read-write head
H Positioned very close to the platter surface (almost touching it)
H Reads or writes magnetically encoded information.
n Surface of platter divided into circular tracks
H Over 16,000 tracks per platter on typical hard disks
n Each track is divided into sectors.
H A sector is the smallest unit of data that can be read or written.
H Sector size typically 512 bytes
H Typical sectors per track: 200 (on inner tracks) to 400 (on outer tracks)
n To read/write a sector
H disk arm swings to position head on right track
H platter spins continually; data is read/written as sector passes under head
n Head-disk assemblies
H multiple disk platters on a single spindle (typically 2 to 4)
H one head per platter, mounted on a common arm.
n Cylinder i consists of ith track of all the platters
n Earlier generation disks were susceptible to head-crashes
n Surface of earlier generation disks had metal-oxide coatings which would disintegrate on
head crash and damage all data on disk
n Current generation disks are less susceptible to such disastrous failures, although
individual sectors may get corrupted
n Disk controller – interfaces between the computer system and the disk drive hardware.
n accepts high-level commands to read or write a sector
n initiates actions such as moving the disk arm to the right track and actually reading or
writing the data
n Computes and attaches checksums to each sector to verify that data is read back correctly
n If data is corrupted, with very high probability stored checksum won’t match recomputed
checksum
n Ensures successful writing by reading back sector after writing it
n Performs remapping of bad sectors

DISK SUBSYSTEM

n Multiple disks connected to a computer system through a controller


H Controllers functionality (checksum, bad sector remapping) often carried out by
individual disks; reduces load on controller
n Disk interface standards families
H ATA (AT adaptor) range of standards
H SCSI (Small Computer System Interconnect) range of standards
H Several variants of each standard (different speeds and capabilities)

Performance Measures of Disks


n Access time – the time it takes from when a read or write request is issued to
when data transfer begins. Consists of:
H Seek time – time it takes to reposition the arm over the correct track.
4 Average seek time is 1/2 the worst case seek time.
– Would be 1/3 if all tracks had the same number of sectors, and we
ignore the time to start and stop arm movement
4 4 to 10 milliseconds on typical disks
H Rotational latency – time it takes for the sector to be accessed to appear under the
head.
4 Average latency is 1/2 of the worst case latency.
4 4 to 11 milliseconds on typical disks (5400 to 15000 r.p.m.)
n Data-transfer rate – the rate at which data can be retrieved from or stored to the disk.
H 4 to 8 MB per second is typical
H Multiple disks may share a controller, so rate that controller can handle is also
important
4 E.g. ATA-5: 66 MB/second, SCSI-3: 40 MB/s
4 Fiber Channel: 256 MB/s
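As a rough illustration of how seek time, rotational latency and transfer rate combine into access time, the sketch below estimates the time to read a single 512-byte sector. The specific numbers are assumptions picked from the typical ranges quoted above, not measured values.

# Rough access-time estimate for one sector read (assumed typical values).
seek_ms = 8.0                                    # average seek: 4-10 ms typical
rpm = 7200
rotational_latency_ms = 0.5 * (60_000 / rpm)     # half a revolution on average
transfer_rate_bytes_s = 4_000_000                # 4-8 MB/s typical
sector_bytes = 512
transfer_ms = sector_bytes / transfer_rate_bytes_s * 1000
print(round(seek_ms + rotational_latency_ms + transfer_ms, 2), "ms")
# Dominated by seek time plus rotational latency.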
n Mean time to failure (MTTF) – the average time the disk is expected to run
continuously without any failure.
H Typically 3 to 5 years
H Probability of failure of new disks is quite low, corresponding to a
“theoretical MTTF” of 30,000 to 1,200,000 hours for a new disk
4 E.g., an MTTF of 1,200,000 hours for a new disk means that given 1000
relatively new disks, on an average one will fail every 1200 hours
H MTTF decreases as disk ages

RAID (REDUNDANT ARRAY OF INDEPENDENT DISK):

Improvement of Reliability via Redundancy


Redundancy – store extra information that can be used to rebuild information lost in a
disk failure
E.g., Mirroring (or shadowing)
H Duplicate every disk. Logical disk consists of two physical disks.
H Every write is carried out on both disks
4 Reads can take place from either disk
H If one disk in a pair fails, data still available in the other
4 Data loss would occur only if a disk fails, and its mirror disk also fails
before the system is repaired
– Probability of combined event is very small
» Except for dependent failure modes such as fire or building
collapse or electrical power surges
Mean time to data loss depends on mean time to failure,
and mean time to repair
H E.g. MTTF of 100,000 hours, mean time to repair of 10 hours gives mean time to
data loss of 500 * 10^6 hours (or 57,000 years) for a mirrored pair of disks (ignoring
dependent failure modes)
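The 500 * 10^6 hour figure follows from the usual approximation for a mirrored pair, mean time to data loss ≈ MTTF^2 / (2 × MTTR), under the independent-failure assumption noted above. A quick check:

# Mean time to data loss for a mirrored pair (independent failures assumed).
mttf_hours = 100_000
mttr_hours = 10
mtdl_hours = mttf_hours ** 2 / (2 * mttr_hours)
print(mtdl_hours)                      # 500,000,000 hours
print(round(mtdl_hours / (24 * 365)))  # roughly 57,000 years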
Improvement in Performance via Parallelism
Two main goals of parallelism in a disk system:
1. Load balance multiple small accesses to increase throughput
2. Parallelize large accesses to reduce response time.
Improve transfer rate by striping data across multiple disks.
Bit-level striping – split the bits of each byte across multiple disks
In an array of eight disks, write bit i of each byte to disk i.
Each access can read data at eight times the rate of a single disk.
But seek/access time worse than for a single disk
Bit level striping is not used much any more
Block-level striping – with n disks, block i of a file goes to disk (i mod n) + 1
Requests for different blocks can run in parallel if the blocks reside on different
disks
A request for a long sequence of blocks can utilize all disks in parallel
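A small sketch of the block-to-disk mapping just described, with disks numbered 1..n as in the text:

# Block-level striping: block i of a file goes to disk (i mod n) + 1.
def disk_for_block(i, n):
    return (i % n) + 1

n_disks = 4
for block in range(8):
    print("block", block, "-> disk", disk_for_block(block, n_disks))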
RAID Levels
Schemes to provide redundancy at lower cost by using disk striping combined with parity
bits
Different RAID organizations, or RAID levels, have differing cost, performance
and reliability characteristics
RAID Level 0: Block striping; non-redundant.
Used in high-performance applications where data lost is not critical.
RAID Level 1: Mirrored disks with block striping
Offers best write performance.
Popular for applications such as storing log files in a database system.

RAID Level 3: Bit-Interleaved Parity
Faster data transfer than with a single disk, but fewer I/Os per second since every
disk has to participate in every I/O.
Subsumes Level 2 (provides all its benefits, at lower cost).
RAID Level 4: Block-Interleaved Parity; uses block-level striping, and keeps a parity
block on a separate disk for corresponding blocks from N other disks.
When writing data block, corresponding block of parity bits must also be
computed and written to parity disk
To find value of a damaged block, compute XOR of bits from corresponding blocks
(including parity block) from other disks
Provides higher I/O rates for independent block reads than Level 3
block read goes to a single disk, so blocks stored on different disks can be
read in parallel
Provides higher transfer rates for reads of multiple blocks than a non-striped array
Before writing a block, parity data must be computed
Can be done by using old parity block, old value of current block and new
value of current block (2 block reads + 2 block writes)
Or by recomputing the parity value using the new values of blocks
corresponding to the parity block
– More efficient for writing large amounts of data sequentially
Parity block becomes a bottleneck for independent block writes since every block
write also writes to parity disk
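
The parity arithmetic behind the "2 block reads + 2 block writes" update rule is plain XOR. A minimal sketch, with block contents invented for illustration:

# RAID level 4/5 parity: parity block = XOR of the corresponding data blocks.
# Updating one block needs only the old block, the new block and the old parity.
def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

old_block   = bytes([0b1010, 0b1100])
new_block   = bytes([0b0110, 0b0001])
other_block = bytes([0b0011, 0b1111])    # the remaining data block in this stripe

old_parity = xor_blocks(old_block, other_block)
new_parity = xor_blocks(xor_blocks(old_parity, old_block), new_block)
assert new_parity == xor_blocks(new_block, other_block)  # same as recomputing from scratch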

RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all
N + 1 disks, rather than storing data in N disks and parity in 1 disk.

E.g., with 5 disks, parity block for nth set of blocks is stored on disk (n mod 5) +
1, with the data blocks stored on the other 4 disks.
Higher I/O rates than Level 4.
4 Block writes occur in parallel if the blocks and their parity blocks are on
different disks.
Subsumes Level 4: provides same benefits, but avoids bottleneck of parity disk.
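
The placement rule quoted above can be written out directly: with 5 disks, the parity block for the nth set of blocks goes to disk (n mod 5) + 1 and the data blocks of that set go to the remaining four disks. A small sketch:

# RAID level 5 layout with 5 disks numbered 1..5, following the rule in the text.
def parity_disk(n, disks=5):
    return (n % disks) + 1

for n in range(5):
    p = parity_disk(n)
    data = [d for d in range(1, 6) if d != p]
    print("block set", n, ": parity on disk", p, ", data on disks", data)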

RAID Level 6: P+Q Redundancy scheme; similar to Level 5, but stores extra redundant
information to guard against multiple disk failures.
H Better reliability than Level 5 at a higher cost; not used as widely.
FILE ORGANIZATION
n The database is stored as a collection of files. Each file is a sequence of records. A
record is a sequence of fields.

n One approach:
H assume record size is fixed
H each file has records of one particular type only
H different files are used for different relations
This case is easiest to implement; will consider variable length records
later.
Fixed-Length Records
n Simple approach:
H Store record i starting from byte n * (i – 1), where n is the size of each record.
H Record access is simple but records may cross blocks
4 Modification: do not allow records to cross block boundaries
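For this simple layout the byte offset of any record can be computed directly. A small sketch (the record size of 40 bytes is an arbitrary example):

# Fixed-length records: record i (counting from 1) starts at byte n * (i - 1),
# where n is the record size in bytes.
record_size = 40
def record_offset(i):
    return record_size * (i - 1)

print(record_offset(1))   # 0
print(record_offset(7))   # 240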
n Deletion of record i — alternatives:
H move records i + 1, . . ., n to i, . . ., n – 1
H move record n to i
H do not move records, but link all free records on a free list

Free Lists
n Store the address of the first deleted record in the file header.
n Use this first record to store the address of the second deleted record, and so on
n Can think of these stored addresses as pointers since they “point” to the location of a
record.
n More space efficient representation: reuse space for normal attributes of free records to
store pointers. (No pointers stored in in-use records.)

Variable-Length Records
n Variable-length records arise in database systems in several ways:
H Storage of multiple record types in a file.
H Record types that allow variable lengths for one or more fields.
H Record types that allow repeating fields (used in some older data models).
n Byte string representation
H Attach an end-of-record (⊥) control character to the end of each record
H Difficulty with deletion
H Difficulty with growth
Variable-Length Records: Slotted Page Structure

n Slotted page header contains:


H number of record entries
H end of free space in the block
H location and size of each record
n Records can be moved around within a page to keep them contiguous with no empty
space between them; entry in the header must be updated.
n Pointers should not point directly to record — instead they should point to the entry for
the record in header.
n Fixed-length representation:
H reserved space
H pointers
n Reserved space – can use fixed-length records of a known maximum length; unused
space in shorter records filled with a null or end-of-record symbol.
Pointer Method
H A variable-length record is represented by a list of fixed-length records, chained
together via pointers.
H Can be used even if the maximum record length is not known

n Disadvantage of the pointer structure: space is wasted in all records except the first in a
chain.
n Solution is to allow two kinds of block in the file:
H Anchor block – contains the first records of chains
H Overflow block – contains records other than those that are the first records of
chains.

ORGANIZATION OF RECORDS IN FILES


n Heap – a record can be placed anywhere in the file where there is space
n Sequential – store records in sequential order, based on the value of the search key of
each record
n Hashing – a hash function computed on some attribute of each record; the result specifies
in which block of the file the record should be placed
n Records of each relation may be stored in a separate file. In a clustering file
organization records of several different relations can be stored in the same file
H Motivation: store related records on the same block to minimize I/O
Sequential File Organization
n Suitable for applications that require sequential processing of the entire file
n The records in the file are ordered by a search-key

n Deletion – use pointer chains


n Insertion –locate the position where the record is to be inserted
H if there is free space insert there
H if no free space, insert the record in an overflow block
H In either case, pointer chain must be updated
n Need to reorganize the file from time to time to restore sequential order

Clustering File Organization


n Simple file structure stores each relation in a separate file
n Can instead store several relations in one file using a clustering file organization
n E.g., clustering organization of customer and depositor:
H good for queries involving the join of depositor and customer, and for queries
involving one single customer and his accounts
H bad for queries involving only customer
H results in variable size records
INDEXING
Basic Concepts
n Indexing mechanisms used to speed up access to desired data.
n E.g., author catalog in library
n Search key – an attribute or set of attributes used to look up records in a file.
n An index file consists of records (called index entries) of the form

⟨search-key, pointer⟩

n Index files are typically much smaller than the original file
n Two basic kinds of indices:
n Ordered indices: search keys are stored in sorted order
n Hash indices: search keys are distributed uniformly across “buckets” using a “hash
function”.
Ordered Indices
Indexing techniques are evaluated on the basis of the access types supported, access time, insertion time, deletion time, and space overhead.
n In an ordered index, index entries are stored sorted on the search key value. E.g., author
catalog in library.
n Primary index: in a sequentially ordered file, the index whose search key specifies the
sequential order of the file.
⮚ Also called clustering index
⮚ The search key of a primary index is usually but not necessarily the primary key.
n Secondary index: an index whose search key specifies an order different from the
sequential order of the file. Also called
non-clustering index.
n Index-sequential file: ordered sequential file with a primary index.
Dense Index Files
n Dense index — Index record appears for every search-key value in the file.

Sparse Index Files


n Sparse Index: contains index records for only some search-key values.
⮚ Applicable when records are sequentially ordered on search-key
n To locate a record with search-key value K we:
⮚ Find the index record with the largest search-key value ≤ K
⮚ Search the file sequentially starting at the record to which the index record points
n Less space and less maintenance overhead for insertions and deletions.
n Generally slower than dense index for locating records.
n Good tradeoff: sparse index with an index entry for every block in file, corresponding to
least search-key value in the block.
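
The two-step lookup described above can be sketched as follows. The index is modelled here as a sorted in-memory list of (search-key, block-number) pairs, an assumption made purely for illustration rather than the actual on-disk layout.

import bisect

# Sparse index: one (search_key, block) entry per block, sorted by search key.
index = [("Brighton", 0), ("Mianus", 1), ("Redwood", 2)]
keys = [k for k, _ in index]

def locate_block(k):
    # Find the index entry with the largest search-key value <= k,
    # then scan the file sequentially starting from that block.
    pos = bisect.bisect_right(keys, k) - 1
    return index[max(pos, 0)][1]

print(locate_block("Perryridge"))   # 1 -> start the sequential scan at block 1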

Multilevel Index

n If primary index does not fit in memory, access becomes expensive.


n To reduce number of disk accesses to index records, treat primary index kept on disk as a
sequential file and construct a sparse index on it.
⮚ outer index – a sparse index of primary index
⮚ inner index – the primary index file
n If even outer index is too large to fit in main memory, yet another level of index can be
created, and so on.
n Indices at all levels must be updated on insertion or deletion from the file.

Index Update: Deletion


n If deleted record was the only record in the file with its particular search-key value, the
search-key is deleted from the index also.
n Single-level index deletion:
⮚ Dense indices – deletion of search-key is similar to file record deletion.
⮚ Sparse indices – if an entry for the search key exists in the index, it is deleted by
replacing the entry in the index with the next search-key value in the file (in
search-key order). If the next search-key value already has an index entry, the
entry is deleted instead of being replaced.
Index Update: Insertion
n Single-level index insertion:
⮚ Perform a lookup using the search-key value appearing in the record to be
inserted.
⮚ Dense indices – if the search-key value does not appear in the index, insert it.
⮚ Sparse indices – if index stores an entry for each block of the file, no change
needs to be made to the index unless a new block is created. In this case, the first
search-key value appearing in the new block is inserted into the index.
n Multilevel insertion (as well as deletion) algorithms are simple extensions of the single-
level algorithms
Secondary Indices
n Frequently, one wants to find all the records whose values in a certain field (which is not
the search-key of the primary index) satisfy some condition.
⮚ Example 1: In the account database stored sequentially by account number, we
may want to find all accounts in a particular branch
⮚ Example 2: as above, but where we want to find all accounts with a specified
balance or range of balances
n We can have a secondary index with an index record for each search-key value; index
record points to a bucket that contains pointers to all the actual records with that
particular search-key value.

Primary and Secondary Indices
n Secondary indices have to be dense.
n Indices offer substantial benefits when searching for records.
n When a file is modified, every index on the file must be updated. Updating indices
imposes overhead on database modification.
n Sequential scan using primary index is efficient, but a sequential scan using a secondary
index is expensive
⮚ each record access may fetch a new block from disk
B+ TREE INDEX FILE

B+-tree indices are an alternative to indexed-sequential files.


n Disadvantage of indexed-sequential files: performance degrades as file grows, since
many overflow blocks get created. Periodic reorganization of entire file is required.
n Advantage of B+-tree index files: automatically reorganizes itself with small, local,
changes, in the face of insertions and deletions. Reorganization of entire file is not
required to maintain performance.
n Disadvantage of B+-trees: extra insertion and deletion overhead, space overhead.
n Advantages of B+-trees outweigh disadvantages, and they are used extensively.
A B+-tree is a rooted tree satisfying the following properties:
n All paths from root to leaf are of the same length
n Each node that is not a root or a leaf has between ⎡n/2⎤ and n children.
n A leaf node has between ⎡(n–1)/2⎤ and n–1 values
n Special cases:
⮚ If the root is not a leaf, it has at least 2 children.
⮚ If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0
and (n–1) values.
B+-Tree Node Structure
Typical node:

⮚ Ki are the search-key values


⮚ Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of
records (for leaf nodes).
n The search-keys in a node are ordered
K1 < K2 < K3 < . . . < Kn–1
Leaf Nodes in B+-Trees
Properties of a leaf node:
n For i = 1, 2, . . ., n–1, pointer Pi either points to a file record with search-key value Ki, or
to a bucket of pointers to file records, each record having search-key value Ki. The bucket
structure is needed only if the search-key does not form a primary key.
n If Li, Lj are leaf nodes and i < j, Li’s search-key values are less than Lj’s search-key values
n Pn points to next leaf node in search-key order
Non-Leaf Nodes in B+-Trees
n Non leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node
with m pointers:
⮚ All the search-keys in the subtree to which P1 points are less than K1
⮚ For 2 ≤ i ≤ m – 1, all the search-keys in the subtree to which Pi points have values
greater than or equal to Ki–1 and less than Ki
⮚ All the search-keys in the subtree to which Pm points have values greater than or
equal to Km–1

B+-tree for account file (n = 3)

B+-tree for account file (n = 5)

n Leaf nodes must have between 2 and 4 values
(⎡(n–1)/2⎤ and n–1, with n = 5).
n Non-leaf nodes other than the root must have between 3 and 5 children (⎡n/2⎤ and n, with
n = 5).
n Root must have at least 2 children.
Observations about B+-trees
n Since the inter-node connections are done by pointers, “logically” close blocks need not
be “physically” close.
n The non-leaf levels of the B+-tree form a hierarchy of sparse indices.
n The B+-tree contains a relatively small number of levels (logarithmic in the size of the
main file), thus searches can be conducted efficiently.
n Insertions and deletions to the main file can be handled efficiently, as the index can be
restructured in logarithmic time (as we shall see).
Queries on B+-Trees
n Find all records with a search-key value of k.
1. Start with the root node
a. Examine the node for the smallest search-key value > k.
b. If such a value exists, assume it is Ki. Then follow Pi to the child node.
c. Otherwise k ≥ Km–1, where there are m pointers in the node. Then follow
Pm to the child node.
2. If the node reached by following the pointer above is not a leaf node, repeat the
above procedure on the node, and follow the corresponding pointer.
3. Eventually reach a leaf node. If for some i, key Ki = k, follow pointer Pi to the
desired record or bucket. Else no record with search-key value k exists.
n In processing a query, a path is traversed in the tree from the root to some leaf node.
n If there are K search-key values in the file, the path is no longer than ⎡ log⎡n/2⎤(K)⎤.
n A node is generally the same size as a disk block, typically 4 kilobytes, and n is typically
around 100 (40 bytes per index entry).
n With 1 million search key values and n = 100, at most
log50(1,000,000) = 4 nodes are accessed in a lookup.
n Contrast this with a balanced binary tree with 1 million search key values — around 20
nodes are accessed in a lookup
⮚ The above difference is significant since every node access may need a disk I/O,
costing around 20 milliseconds!
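
The "at most 4 node accesses" claim follows directly from the ⎡log⎡n/2⎤(K)⎤ bound. A quick check, together with the balanced-binary-tree comparison, under the n = 100 assumption used above:

import math

# Height bound for a B+-tree lookup: ceil(log base ceil(n/2) of K).
def bptree_path_bound(K, n):
    return math.ceil(math.log(K, math.ceil(n / 2)))

print(bptree_path_bound(1_000_000, 100))   # 4 node accesses
print(math.ceil(math.log2(1_000_000)))     # about 20 for a balanced binary tree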
Updates on B+-Trees: Insertion
n Find the leaf node in which the search-key value would appear
n If the search-key value is already there in the leaf node, record is added to file and if
necessary a pointer is inserted into the bucket.
n If the search-key value is not there, then add the record to the main file and create a
bucket if necessary. Then:
⮚ If there is room in the leaf node, insert (key-value, pointer) pair in the leaf node
⮚ Otherwise, split the node (along with the new (key-value, pointer) entry) as
discussed below.
n Splitting a node:
⮚ take the n(search-key value, pointer) pairs (including the one being inserted) in
sorted order. Place the first ⎡ n/2 ⎤ in the original node, and the rest in a new
node.
⮚ let the new node be p, and let k be the least key value in p. Insert (k,p) in the
parent of the node being split. If the parent is full, split it and propagate the split
further up.
n The splitting of nodes proceeds upwards till a node that is not full is found. In the worst
case the root node may be split increasing the height of the tree by 1.
Result of splitting node containing Brighton and Downtown on inserting Clearview
B+ tree Before
inserting

B+ tree After inserting


Updates on B+-Trees: Deletion
n Find the record to be deleted, and remove it from the main file and from the bucket (if
present)
n Remove (search-key value, pointer) from the leaf node if there is no bucket or if the
bucket has become empty
n If the node has too few entries due to the removal, and the entries in the node and a
sibling fit into a single node, then
⮚ Insert all the search-key values in the two nodes into a single node (the one on the
left), and delete the other node.
⮚ Delete the pair (Ki–1, Pi), where Pi is the pointer to the deleted node, from its
parent, recursively using the above procedure.
n Otherwise, if the node has too few entries due to the removal, but the entries in the node
and a sibling do not fit into a single node, then
⮚ Redistribute the pointers between the node and a sibling such that both have more
than the minimum number of entries.
⮚ Update the corresponding search-key value in the parent of the node.
n The node deletions may cascade upwards till a node which has ⎡n/2 ⎤ or more pointers is
found. If the root node has only one pointer after deletion, it is deleted and the sole child
becomes the root.
Before deleting

After deleting
n The removal of the leaf node containing “Downtown” did not result in its parent having
too few pointers. So the cascaded deletions stopped with the deleted leaf node’s parent.

Deletion of “Perryridge” from result of previous example


n Node with “Perryridge” becomes underfull (actually empty, in this special case) and is
merged with its sibling.
n As a result “Perryridge” node’s parent became underfull, and was merged with its sibling
(and an entry was deleted from their parent)
n Root node then had only one child, and was deleted and its child became the new root
node

B+-TREE FILE ORGANIZATION


n Index file degradation problem is solved by using B+-Tree indices. Data file degradation
problem is solved by using B+-Tree File Organization.
n The leaf nodes in a B+-tree file organization store records, instead of pointers.
n Since records are larger than pointers, the maximum number of records that can be stored
in a leaf node is less than the number of pointers in a nonleaf node.
n Leaf nodes are still required to be half full.
n Insertion and deletion are handled in the same way as insertion and deletion of entries in
a B+-tree index.
B-Tree Index Files
n Similar to B+-tree, but B-tree allows search-key values to appear only once; eliminates
redundant storage of search keys.
n Search keys in nonleaf nodes appear nowhere else in the B-tree; an additional pointer
field for each search key in a nonleaf node must be included.
n Generalized B-tree leaf node
n Nonleaf node – pointers Bi are the bucket or file record pointers.

n Advantages of B-Tree indices:


⮚ May use less tree nodes than a corresponding B+-Tree.
⮚ Sometimes possible to find search-key value before reaching leaf node.
n Disadvantages of B-Tree indices:
⮚ Only small fraction of all search-key values are found early
⮚ Non-leaf nodes are larger, so fan-out is reduced. Thus B-Trees typically have
greater depth than corresponding
B+-Tree
⮚ Insertion and deletion more complicated than in B+-Trees
⮚ Implementation is harder than B+-Trees.

HASHING
Static Hashing
n A bucket is a unit of storage containing one or more records (a bucket is typically a disk
block).
n In a hash file organization we obtain the bucket of a record directly from its search-key
value using a hash function.
n Hash function h is a function from the set of all search-key values K to the set of all
bucket addresses B.
n Hash function is used to locate records for access, insertion as well as deletion.
n Records with different search-key values may be mapped to the same bucket; thus entire
bucket has to be searched sequentially to locate a record.
Example of Hash File Organization
Hash file organization of account file, using branch-name as key
There are 10 buckets,
n The binary representation of the ith character is assumed to be the integer i.
n The hash function returns the sum of the binary representations of the characters modulo
10
⮚ E.g. h(Perryridge) = 5 h(Round Hill) = 3 h(Brighton) = 3
n Hash file organization of account file, using branch-name as key
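
A sketch of this hash function, with the same simplifying assumption as the text (the "binary representation" of the ith letter of the alphabet is taken to be the integer i); it reproduces the example values above:

# Example hash function: sum the letter positions and take the result modulo 10.
def h(branch_name, buckets=10):
    return sum(ord(c) - ord('a') + 1 for c in branch_name.lower() if c.isalpha()) % buckets

for name in ["Perryridge", "Round Hill", "Brighton"]:
    print(name, "->", h(name))   # 5, 3, 3 as in the example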

Hash Functions
n The worst hash function maps all search-key values to the same bucket; this makes access
time proportional to the number of search-key values in the file.
n An ideal hash function is uniform, i.e., each bucket is assigned the same number of
search-key values from the set of all possible values.
n Ideal hash function is random, so each bucket will have the same number of records
assigned to it irrespective of the actual distribution of search-key values in the file.
n Typical hash functions perform computation on the internal binary representation of the
search-key.
⮚ For example, for a string search-key, the binary representations of all the
characters in the string could be added and the sum modulo the number of buckets
could be returned.
Handling of Bucket Overflows
n Bucket overflow can occur because of
⮚ Insufficient buckets
⮚ Skew in distribution of records. This can occur due to two reasons:
★ multiple records have same search-key value
★ chosen hash function produces non-uniform distribution of key values
n Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is
handled by using overflow buckets.
n Overflow chaining – the overflow buckets of a given bucket are chained together in a
linked list.
n Above scheme is called closed hashing.
⮚ An alternative, called open hashing, which does not use overflow buckets, is not
suitable for database applications.

Hash Indices
n Hashing can be used not only for file organization, but also for index-structure creation.
n A hash index organizes the search keys, with their associated record pointers, into a hash
file structure.
n Strictly speaking, hash indices are always secondary indices
⮚ if the file itself is organized using hashing, a separate primary hash index on it
using the same search-key is unnecessary.
⮚ However, we use the term hash index to refer to both secondary index structures
and hash organized files.

Disadvantages of Static Hashing


n In static hashing, function h maps search-key values to a fixed set B of bucket
addresses.
⮚ Databases grow with time. If the initial number of buckets is too small, performance
will degrade due to too many overflows.
⮚ If file size at some point in the future is anticipated and number of buckets
allocated accordingly, significant amount of space will be wasted initially.
⮚ If database shrinks, again space will be wasted.
⮚ One option is periodic re-organization of the file with a new hash function, but it
is very expensive.
n These problems can be avoided by using techniques that allow the number of buckets to
be modified dynamically.

Dynamic Hashing

n Good for database that grows and shrinks in size


n Allows the hash function to be modified dynamically
n Extendable hashing – one form of dynamic hashing
⮚ Hash function generates values over a large range — typically b-bit integers, with
b = 32.
⮚ At any time use only a prefix of the hash function to index into a table of bucket
addresses.
⮚ Let the length of the prefix be i bits, 0 ≤ i ≤ 32.
⮚ Bucket address table size = 2^i. Initially i = 0.
⮚ Value of i grows and shrinks as the size of the database grows and shrinks.
⮚ Multiple entries in the bucket address table may point to the same bucket.
⮚ Thus, the actual number of buckets is < 2^i
★ The number of buckets also changes dynamically due to coalescing and
splitting of buckets.

n Each bucket j stores a value ij; all the entries that point to the same bucket have the same
values on the first ij bits.
n To locate the bucket containing search-key Kj:
Compute h(Kj) = X
Use the first i high order bits of X as a displacement into bucket address table, and follow
the pointer to appropriate bucket
n To insert a record with search-key value Kj
⮚ follow same procedure as look-up and locate the bucket, say j.
⮚ If there is room in the bucket j insert record in the bucket.
⮚ Else the bucket must be split and insertion re-attempted
⮚ Overflow buckets used instead in some cases
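
The bucket-address-table lookup described above amounts to using the first i bits of h(K) as an index. A minimal sketch; the 8-bit hash width, the value of i and the table contents are invented purely for illustration:

# Extendable hashing lookup sketch (hash width, i, and table contents assumed).
HASH_BITS = 8
i = 2                                             # current prefix length
bucket_address_table = ["B0", "B0", "B1", "B2"]   # 2**i entries; entries may share buckets

def h(key):
    return hash(key) & ((1 << HASH_BITS) - 1)     # b-bit hash value

def locate_bucket(key):
    prefix = h(key) >> (HASH_BITS - i)            # first i high-order bits of h(key)
    return bucket_address_table[prefix]

print(locate_bucket("Perryridge"))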

UNIT-IV

QUERY PROCESSING
Basic Steps in Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation

n Parsing and translation


★ translate the query into its internal form. This is then translated into relational
algebra.
★ Parser checks syntax, verifies relations
n Evaluation
★ The query-execution engine takes a query-evaluation plan, executes that plan, and
returns the answers to the query.
Optimization
n A relational algebra expression may have many equivalent expressions
★ E.g., σbalance<2500(∏balance(account)) is equivalent to
∏balance(σbalance<2500(account))
n Each relational algebra operation can be evaluated using one of several different
algorithms
★ Correspondingly, a relational-algebra expression can be evaluated in many ways.
n Annotated expression specifying detailed evaluation strategy is called an evaluation-
plan.
★ E.g., can use an index on balance to find accounts with balance < 2500,
★ or can perform complete relation scan and discard accounts with balance ≥ 2500
n Query Optimization: Amongst all equivalent evaluation plans choose the one with
lowest cost.
★ Cost is estimated using statistical information from the
database catalog
n e.g. number of tuples in each relation, size of tuples, etc.
Measures of Query Cost
n Cost is generally measured as total elapsed time for answering query
★ Many factors contribute to time cost
è disk accesses, CPU, or even network communication
n Typically disk access is the predominant cost, and is also relatively easy to estimate.
Measured by taking into account
★ Number of seeks * average-seek-cost
★ Number of blocks read * average-block-read-cost
★ Number of blocks written * average-block-write-cost
è Cost to write a block is greater than cost to read a block
– data is read back after being written to ensure that the write was
successful
n Costs depends on the size of the buffer in main memory
★ Having more memory reduces need for disk access
★ Amount of real memory available to buffer depends on other concurrent OS
processes, and hard to determine ahead of actual execution
★ We often use worst case estimates, assuming only the minimum amount of
memory needed for the operation is available
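
Putting the three terms together, a sketch of the disk-cost estimate described above; all unit costs are assumed placeholder values rather than figures from the text:

# Disk cost of a plan = seeks * avg seek cost + blocks read * read cost
#                       + blocks written * write cost.
def disk_cost(seeks, blocks_read, blocks_written,
              avg_seek_ms=8.0, read_ms=1.0, write_ms=1.5):
    # Writing a block costs more than reading it because the block is read
    # back after the write to verify it, as noted above.
    return seeks * avg_seek_ms + blocks_read * read_ms + blocks_written * write_ms

print(disk_cost(seeks=10, blocks_read=1000, blocks_written=200), "ms")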
