
DATABASE MANAGEMENT SYSTEM

COURSE: DM UNIT: 3 Pg. 1


Database Design: A Historical Perspective

From the earliest days of computing, storing and manipulating data have been a
major application focus. Historically, the first computer applications focused
on clerical tasks, for example, employee payroll calculation, work scheduling
in manufacturing, order-entry processing, and so on. On request from users,
such applications accessed data stored in computer files, converted the stored
data into information, and generated various reports useful to the organization.
These were called file-based systems. Decades of evolution in computer
technology, data processing, and information management have resulted in the
sophisticated modern database systems of today.



File Systems versus a DBMS
 File System :
  The file system is basically a way of arranging files on a storage medium such as
a hard disk. The file system organizes the files and helps retrieve them when
they are required. A file system consists of files grouped into directories. The
directories may in turn contain other directories and files. The file system
performs basic operations such as file management, file naming, and access control. 
Example: NTFS (New Technology File System), ext (Extended File System).



Cont.,

 DBMS (Database Management System): 

A Database Management System is software that manages a collection of related
data. It is used to store data and retrieve it efficiently when needed. It also
provides security measures to protect the data from unauthorized access. In a
DBMS, data can be fetched using SQL queries and relational algebra. It also
provides mechanisms for data backup and recovery. 
Example: 
Oracle, MySQL, MS SQL Server.
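The store-and-retrieve role described above can be sketched with Python's built-in sqlite3 module, a lightweight embedded DBMS (the table and column names here are purely illustrative):

```python
import sqlite3

# Open an in-memory database (a file path would make the data persistent).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Store data ...
cur.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")
cur.execute("INSERT INTO employee VALUES (1, 'Asha', 52000.0)")
conn.commit()

# ... and retrieve it with an SQL query.
row = cur.execute("SELECT name, salary FROM employee WHERE id = 1").fetchone()
print(row)  # ('Asha', 52000.0)
```

The same statements work, with minor dialect differences, in the full DBMS products named above.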



Cont.,
Basis | File System | DBMS
Structure | The file system is software that manages and organizes files in a storage medium within a computer. | DBMS is software for managing the database.
Data Redundancy | Redundant data can be present in a file system. | In a DBMS, data redundancy is minimized.
Backup and Recovery | It does not provide backup and recovery of data if it is lost. | It provides backup and recovery of data even if it is lost.
Query processing | There is no efficient query processing in the file system. | Efficient query processing is available in a DBMS.
Consistency | There is less data consistency in the file system. | There is more data consistency because of the process of normalization.
Complexity | It is less complex as compared to a DBMS. | It has more complexity in handling as compared to the file system.
Example | COBOL, C++ | Oracle, SQL Server



The Data Model

A data model in a Database Management System (DBMS) is a collection of conceptual
tools developed to summarize the description of the database.
  It is classified into 3 types:



Cont.,
1. Conceptual Data Model:
The conceptual data model describes the database at a very high level and is useful
for understanding the needs or requirements of the database. It is this model that
is used in the requirement-gathering process, i.e., before the database designers
start building a particular database.
2. Representational Data Model:
This type of data model represents only the logical part of the database and
not the physical structure of the database. The representational data model
allows us to focus primarily on the design part of the database.
3. Physical Data Model:
Ultimately, all data in a database is stored physically on secondary storage
devices such as disks and tapes, in the form of files, records, and other data
structures. The physical data model holds all the information about the format in
which the files are present, the structure of the databases, the presence of
external data structures, and their relation to each other.



Levels of Abstraction in a DBMS
Data abstraction refers to the process of hiding irrelevant details from the user. 
There are three levels of data abstraction, and we divide the view of data into
these levels in order to achieve data independence, which means that users do not
interact directly with the stored data.



Cont.,

View Level or External Schema


This level describes how the data should be shown to the user. 
Example: If we have a login id and password in a university system, then as a
student we can view our marks, attendance, fee structure, etc. But the faculty
of the university will have a different view, with options such as salary,
editing a student's marks, and entering students' attendance. So the student
and the faculty have different views, and by providing separate views the
security of the system also increases. In this example, the student can't edit
his marks, but the faculty member who is authorized to edit the marks can do
so. Similarly, the dean of the college or university will have some more
authorization and, accordingly, his own view. So, different users have
different views according to the authorization they hold.



Cont.,

Conceptual Level or Logical Level


This level describes what data is stored in the database and how that data is
structured. We have different data models by which we can structure the data.
Example: Suppose we use the relational model for storing the data. To store
the data of a student, the columns in the student table will be student_name,
age, mail_id, roll_no, etc. We have to define all of these at this level while
we are creating the database. Though the data itself is stored in the database,
the structure of the tables, such as the student table, teacher table, and books
table, is defined here at the conceptual or logical level, as is how the tables
are related to each other. Overall, we can say that we are creating a blueprint
of the data at the conceptual level.



Cont.,

Physical Level or Internal Schema


As the name suggests, the physical level tells us where the data is actually
stored, i.e., the actual location of the data saved by the user.
The database administrators (DBAs) decide which data should be kept on which
disk drive, how the data is to be fragmented, where it is to be stored, and so
on. They decide whether the data is to be centralized or distributed. Though we
see the data in the form of tables at the view level, the data is actually
stored in the form of files. It depends entirely on the DBA how he or she
manages the database at the physical level.



Database Languages
Database languages are used to read, update, manipulate, and store data in a database.
The following are the database languages −
 Data Definition Language
 Data Manipulation Language
 Data Control Language
 Transaction Control Language

Data Definition Language:


DDL is used to create databases and tables, alter them, and so on. With it, you can
also rename a database or drop it. It specifies the database schema.
The DDL statements include −
  CREATE:  Create a new database, table, etc.
  ALTER: Alter an existing database, table, etc.
  DROP: Drop a database or table.
  RENAME: Set a new name for a table.
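The DDL statements above can be tried through Python's sqlite3 module; a minimal sketch, noting that SQLite spells RENAME as ALTER TABLE ... RENAME TO (the table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE: define a new table (part of the schema).
cur.execute("CREATE TABLE student (roll_no INTEGER, name TEXT)")

# ALTER: change an existing table, e.g. add a column.
cur.execute("ALTER TABLE student ADD COLUMN age INTEGER")

# RENAME: give the table a new name (SQLite's form of RENAME).
cur.execute("ALTER TABLE student RENAME TO learner")

# DROP: remove the table and all its data.
cur.execute("DROP TABLE learner")

# The schema catalog is now empty again.
tables = [r[0] for r in cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)  # []
```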



Cont.,
Data Manipulation Language:
The language used to manipulate the database, such as inserting data, updating a
table, or retrieving records from a table, is known as Data Manipulation Language −
  SELECT: Retrieve data from the database.
  INSERT: Insert data.
  UPDATE:  Update data.
  DELETE: Delete records.
Data Control Language:
Privileges are granted to a user with the GRANT statement and taken back with the
REVOKE statement. Both of these statements come under the Data Control
Language (DCL) −
  GRANT: Give privileges to access the database.
  REVOKE: Take back privileges to access the database.
Transaction Control Language:
Transactions in the database are managed using the Transaction Control Language −
  COMMIT: Save the work done.
  SAVEPOINT: Set a point in the transaction to which one can later roll back.
  ROLLBACK: Restore the database to the state of the last COMMIT.
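The DML and TCL statements can be sketched with sqlite3 (SQLite has no GRANT/REVOKE, so only data manipulation and transaction control are shown; the account table and its values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")

# INSERT / UPDATE / DELETE are the core DML statements.
conn.execute("INSERT INTO account VALUES (1, 100)")
conn.execute("INSERT INTO account VALUES (2, 50)")
conn.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
conn.execute("DELETE FROM account WHERE id = 2")
conn.commit()  # COMMIT: save the work

# ROLLBACK: undo uncommitted changes, restoring the last committed state.
conn.execute("UPDATE account SET balance = 0 WHERE id = 1")
conn.rollback()

row = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()
print(row)  # (70,)
```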



Data Independence

Data independence is the ability to modify the schema at one level without
requiring the programs and applications to be rewritten. Data is separated from
the programs, so that changes made to the data will not affect program execution
or the application.
We know the main purpose of the three levels of data abstraction is to achieve
data independence. If the database changes and expands over time, it is very
important that a change at one level does not affect the data at the other
levels. This saves the time and cost involved in changing the database.
There are two levels of data independence based on three levels of abstraction.
These are as follows −
 Physical Data Independence
 Logical Data Independence



Cont.,

Physical Data Independence:


All the schemas are logical, and the actual data is stored in bit format on the
disk. Physical data independence is the ability to change the physical storage
of data without impacting the schema or logical data.
For example, if we want to change or upgrade the storage system itself −
suppose we want to replace hard disks with SSDs − it should not have any
impact on the logical data or schemas.
Logical Data Independence:
Logical data is data about the database, that is, it records how the data is
managed inside − for example, a table (relation) stored in the database and all
the constraints applied to that relation.
Logical data independence is the ability to change this logical schema without
having to change the external views or applications. If we make changes to the
table format, it should not change the data residing on the disk.



Structure of a DBMS (Database Engine)

A DBMS is software that allows access to data stored in a database and provides an
easy and effective method of – 
  Defining the information.
  Storing the information.
  Manipulating the information.
  Protecting the information from system crashes or data theft.
  Differentiating access permissions for different users.
The structure of a Database Management System is also referred to as the Overall
System Structure or Database Architecture, but it is different from the tier
architecture of a database. 
The database system is divided into three components: the Query Processor, the
Storage Manager, and Disk Storage.

[Figure: overall structure of a DBMS]


Cont.,

1. Query Processor : 
It interprets the requests (queries) received from end users via application
programs into instructions, and it executes those requests using the low-level
instructions produced by the DML compiler. 
Query Processor contains the following components – 
 DML Compiler – 
It processes DML statements into low-level instructions (machine language) so
that they can be executed. 
 DDL Interpreter – 
It processes DDL statements into a set of tables containing metadata (data
about data). 
 Embedded DML Pre-compiler – 
It processes DML statements embedded in an application program into
procedural calls. 
 Query Optimizer – 
It optimizes the low-level instructions generated by the DML compiler, choosing
an efficient plan for evaluating the query.



Cont.,

2. Storage Manager :
 The Storage Manager is a program that provides an interface between the data stored in the database
and the queries received. It is also known as the Database Control System. It maintains the consistency
and integrity of the database by applying the constraints and executing the DCL statements. It is
responsible for updating, storing, deleting, and retrieving data in the database.
Authorization Manager – 
It enforces role-based access control, i.e., it checks whether a particular user is privileged to
perform the requested operation or not. 
Integrity Manager – 
It checks the integrity constraints when the database is modified. 
Transaction Manager – 
It controls concurrent access by scheduling the operations of the transactions it receives. Thus,
it ensures that the database remains in a consistent state before and after the execution of each
transaction. 
File Manager – 
It manages the file space and the data structure used to represent information in the database. 
Buffer Manager – 
It is responsible for cache memory and the transfer of data between the secondary storage and
main memory. 
Cont.,

3. Disk Storage: It contains the following components – 


  Data Files – 
They store the data. 
  Data Dictionary – 
It contains information about the structure of every database object; it is the
repository of metadata (data about data).
  Indices – 
They provide faster retrieval of data items.



Database and Application Architecture

The overall design of a Database Management System (DBMS) depends on its
architecture. Large amounts of data on web servers, personal computers (PCs), and
other elements are linked over networks using a basic client/server
architecture.
PCs and workstations form the client side of the architecture and are connected
over the network. The architecture of a DBMS depends on how the users are linked
to the database.
There are three kinds of DBMS architecture, which are as follows −
1-tier architecture
2-tier architecture
3-tier architecture



Cont.,

In 1-tier architecture, the DBMS is the only entity: the user works directly on the
DBMS itself, and any changes made are applied directly to the DBMS. This
architecture does not provide handy tools for end users. Database designers and
programmers normally prefer single-tier architecture.
If the architecture of DBMS is 2-tier, then it must have an application through which
the DBMS can be accessed. Programmers use 2-tier architecture where they access
the DBMS by means of an application. Here the application tier is entirely
independent of the database in terms of operation, design, and programming.
3-tier Architecture:
A 3-tier architecture separates its tiers from each other based on the complexity of
the users and how they use the data present in the database. It is the most widely
used architecture to design a DBMS.



Cont.,

Database (Data) Tier − At this tier, the database resides along


with its query processing languages. We also have the relations
that define the data and their constraints at this level.
Application (Middle) Tier − At this tier reside the application
server and the programs that access the database. For a user,
this application tier presents an abstracted view of the
database. End-users are unaware of any existence of the
database beyond the application. At the other end, the
database tier is not aware of any other user beyond the
application tier. Hence, the application layer sits in the middle
and acts as a mediator between the end-user and the database.
User (Presentation) Tier − End-users operate on this tier and
they know nothing about any existence of the database beyond
this layer. At this layer, multiple views of the database can be
provided by the application. All views are generated by
applications that reside in the application tier.
Database Users and Administrators
Database Users:
Database users are the ones who really use and take the benefits of the database. There will be different
types of users depending on their needs and way of accessing the database.
1) Application Programmers – They are the developers who interact with the database by means of 
DML queries. These DML queries are written in the application programs like C, C++, JAVA, Pascal,
etc. These queries are converted into object code to communicate with the database.
2) Sophisticated Users – They are database developers, who write SQL queries to
select/insert/delete/update data. They do not use any application or programs to request the
database. They directly interact with the database by means of a query language like SQL. These
users will be scientists, engineers, analysts who thoroughly study SQL and DBMS to apply the
concepts in their requirements.
3) Specialized Users – These are also sophisticated users, but they write special database application
programs. They are the developers who develop complex programs to meet specific requirements.
4) Stand-alone Users – These users have a stand-alone database for their personal use. Such databases
come as ready-made packages with menus and graphical interfaces.
5) Naive Users – These are users who use existing applications to interact with the database. For
example, online library systems, ticket booking systems, and ATMs all have existing applications, and
users interact with the database through them to fulfill their requests.



Cont.,
Database Administrators:
The life cycle of a database runs from design and implementation through to its administration. A
database for any kind of requirement needs to be designed carefully so that it works without any issues.
1) Administrative DBA – This DBA is mainly concerned with installing, and maintaining DBMS servers. His
prime tasks are installing, backups, recovery, security, replications, memory management,
configurations, and tuning. He is mainly responsible for all administrative tasks of a database.
2) Development DBA – He is responsible for creating queries and procedures for the requirement.
Basically, his task is similar to any database developer.
3) Database Architect – Database architect is responsible for creating and maintaining the users, roles,
access rights, tables, views, constraints, and indexes. He is mainly responsible for designing the
structure of the database depending on the requirement. These structures will be used by developers
and development DBA to code.
4) Data Warehouse DBA – This DBA maintains the data and procedures drawn from various sources (files,
COBOL programs, or other systems) in the data warehouse. Since the data and programs come from
different sources, a good DBA must keep their performance and functionality in step so that the data
warehouse works as a whole.
5) Application DBA – He acts as a bridge between the application programs and the database. He makes
sure the application programs are optimized to interact with the database, and he ensures that all
activities, from installation, upgrades, and patching to maintenance, backup, and recovery, run
without any issues.
6) OLAP DBA – He is responsible for installing and maintaining the database in OLAP systems. He maintains
only OLAP databases.
Overview of the Design Process
Database design can be generally defined as the collection of tasks or processes
that support the design, development, implementation, and maintenance of an
enterprise data management system. Designing a proper database reduces
maintenance cost, improves data consistency, and makes cost-effective use of
disk storage space. There therefore has to be a sound concept behind the design
of a database: the designer should follow the constraints and decide how the
elements correlate and what kind of data must be stored.
The main objective of database design is to produce the logical and physical
design models of the proposed database system. The logical model concentrates on
the data requirements, considered as a whole and independently of physical
conditions: the data to be stored is described without regard to how it will be
physically stored. The physical database design model, on the other hand,
translates the logical design model onto the physical media, keeping control of
the hardware resources and software systems such as the Database Management
System (DBMS).



Cont.,

Life Cycle

Database Designing
The next step involves designing the database considering the user requirements and
splitting them into various models, so that a heavy load or dependency is not imposed
on any single aspect. The approach is therefore model-centric, and this is where the
logical and physical models play a crucial role.
Logical Model - This stage is primarily concerned with developing a model based on the
proposed requirements. The entire model is designed on paper, without any implementation
or DBMS-specific considerations.
Physical Model - The physical model is concerned with the practices and implementation of
the logical model.



The Entity-Relationship Model
A high-level data model diagram is the entity-relationship model or ER Model. We depict the real-
world problem in visual form in this model to make it easier for stakeholders to comprehend. The
developers can also quickly grasp the system by simply looking at the ER Diagram.
Features of an Entity-Relationship Model
1. Graphical Representation for Better Understanding – It is really straightforward and easy to
comprehend, so developers can use it to interact with stakeholders.
2. Database Design – This approach is extensively used in database design and aids database
designers in the creation of databases.
3. ER Diagram – The ER diagram is a visual representation of the model.
Components of an ER Diagram
The ER diagram is a visual representation of an ER Model. The three components of an ER diagram
are as follows:
1. Entities: An entity is a real-life concept. It could be a person, a location, or even an idea. A school
management system, for example, has entities such as teachers, students, courses, buildings,
departments, and so on.
2. Attributes: An attribute is a real-world property that exists in an entity. For example, the entity
teacher has properties such as teacher salary, id, age, and so on.
3. Relationships: A relationship describes how two entities are linked. A teacher, for
example, works for a department.
[Figure: example ER diagram]


Cont.,

Relationship Set – A relationship set is a set of relationships of the same type.

[Figure: example relationship set]

Degree of a relationship set = number of entity sets participating in the relationship set



Conceptual Design with the ER Model

Conceptual database design


– A high-level description of the data and the constraints
– This step can use the ER model or similar high-level models
Conceptual database design with the ER model is one of the most important topics in
databases. Whenever a programmer wants to develop a database, the ER model is
the basic thing to start with.



Primary Key

A primary key is a field in a table which uniquely identifies each row/record in a


database table. Primary keys must contain unique values. A primary key column
cannot have NULL values.
A table can have only one primary key, which may consist of single or multiple fields.
When multiple fields are used as a primary key, they are called a composite key.
If a table has a primary key defined on any field(s), then you cannot have two
records having the same value of that field(s).

  Create Primary Key


  Example: Here is the syntax to define the ID attribute as a primary key in a
CUSTOMERS table.

  CREATE TABLE CUSTOMERS (
      ID      INT            NOT NULL,
      NAME    VARCHAR(20)    NOT NULL,
      AGE     INT            NOT NULL,
      ADDRESS CHAR(25),
      SALARY  DECIMAL(18, 2),
      PRIMARY KEY (ID)
  );
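The uniqueness rule of the primary key can be seen in action with sqlite3, using the CUSTOMERS table from the syntax above (the sample row values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE CUSTOMERS (
    ID      INT NOT NULL,
    NAME    VARCHAR(20) NOT NULL,
    AGE     INT NOT NULL,
    ADDRESS CHAR(25),
    SALARY  DECIMAL(18, 2),
    PRIMARY KEY (ID)
)""")
conn.execute("INSERT INTO CUSTOMERS VALUES (1, 'Ravi', 32, 'Ahmedabad', 2000.00)")

# A second row with the same primary key value is rejected by the DBMS.
try:
    conn.execute("INSERT INTO CUSTOMERS VALUES (1, 'Kiran', 25, 'Delhi', 1500.00)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
print(duplicate_allowed)  # False
```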



Removing Redundant Attributes in Entity Sets

Example: The attribute dept_name appears in both entity sets. Since it is the primary key
for the entity set department, it is redundant in the entity set instructor and needs to be
removed. Removing the attribute dept_name from the instructor entity set may appear
rather unintuitive, since the relation instructor that we used in the earlier chapters had
an attribute dept_name. As we shall see later, when we create a relational schema from
the E-R diagram, the attribute dept_name in fact gets added to the relation instructor, but
only if each instructor has at most one associated department. If an instructor has more
than one associated department, the relationship between instructors and departments is
recorded in a separate relation inst_dept. Treating the connection between instructors and
departments uniformly as a relationship, rather than as an attribute of instructor, makes
the logical relationship explicit and helps avoid a premature assumption that each
instructor is associated with only one department. Similarly, the student entity set is
related to the department entity set through the relationship set student_dept, and thus
there is no need for a dept_name attribute in student.



Reducing E-R Diagrams to Relational Schemas

Let there be ‘m’ number of entities and ‘n’ number of relationships in a given
E-R diagram.
The number of relations, after converting the E-R diagram to relations, is m+n.
 Example:

[Figure: E-R diagram with entity sets INSTRUCTOR and STUDENT connected by the relationship set ADVISOR]

In this example, m = 2 and n = 1 (two entities and one relationship), giving 3 relations.


The attributes in the relationship table are the combination of primary keys of
the entities that are connected through this relationship.
Table-1 :  INSTRUCTOR   :       ID, I-name, salary
Table-2 : STUDENT         :       ID, S-name, tot_credits
Table-3 : ADVISOR         :       I-ID, S-ID
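Applying the m + n rule, the diagram maps to three SQL tables; a sketch via sqlite3, with the column names following the tables listed above and the data types assumed for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE INSTRUCTOR (ID INTEGER PRIMARY KEY, I_name TEXT, salary REAL);
CREATE TABLE STUDENT    (ID INTEGER PRIMARY KEY, S_name TEXT, tot_credits INTEGER);
-- The relationship table combines the primary keys of the connected entities.
CREATE TABLE ADVISOR (
    I_ID INTEGER REFERENCES INSTRUCTOR(ID),
    S_ID INTEGER REFERENCES STUDENT(ID),
    PRIMARY KEY (I_ID, S_ID)
);
""")

# m = 2 entities + n = 1 relationship = 3 relations.
tables = sorted(r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"))
print(tables)  # ['ADVISOR', 'INSTRUCTOR', 'STUDENT']
```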



Additional Features of the ER Model

The basic E-R concepts can model most database features, but some aspects of a database may be more
aptly expressed by certain extensions to the basic E-R model. The extended E-R features are
specialization, generalization, higher- and lower-level entity sets, attribute inheritance, and
aggregation.
Specialization – An entity set is broken down into sub-entities that are distinct in some way from other
entities in the set. For instance, a subset of entities within an entity set may have attributes that are
not shared by all the entities in the entity set. The E-R model provides a means for representing these
distinctive entity groupings. Specialization is a "top-down approach" in which a high-level entity is
specialized into two or more lower-level entities.
Generalization – This is the process of extracting common properties from a set of entities and creating a
generalized entity from them. Generalization is a "bottom-up approach" in which two or more entities
are combined to form a higher-level entity if they have some attributes in common. In
generalization, subclasses are combined to make a superclass.
Inheritance – An entity that is a member of a subclass inherits all the attributes it has as a
member of the superclass, and it also inherits all the relationships the superclass participates
in. Inheritance is an important feature of generalization and specialization: it allows lower-level
entities to inherit the attributes of higher-level entities.
Aggregation – In aggregation, a relationship between two entities is treated as a single entity: the
relationship with its corresponding entities is aggregated into a higher-level entity.



Integrity constraint over relations
There are different types of data integrity constraints commonly found in relational databases,
including the following −
o Required data − Some columns must contain a valid data value in every row; they are not allowed to
contain NULL values. In the sample database, every order has an associated customer who placed the order.
The DBMS can be asked to prevent NULL values in this column.
o Validity checking − Every column in a database has a domain, a set of data values that are legal for that
column. The DBMS can be asked to prevent other data values in these columns.
o Entity integrity − The primary key of a table must contain a unique value in each row, different from the
values in all other rows. Duplicate values are illegal because they would prevent the database from
differentiating one entity from another. The DBMS can be asked to enforce this unique-values constraint.
o Referential integrity − A foreign key in a relational database links each row in the child table containing the
foreign key to the row of the parent table containing the matching primary key value. The DBMS can be asked
to enforce this foreign key/primary key constraint.
o Other data relationships − The real-world situation modeled by a database often has additional
constraints that govern the legal data values which may appear in the database. The DBMS can be asked to
check modifications to the tables to make sure that their values satisfy these constraints.
o Business rules − Updates to a database may be constrained by the business rules governing the real-world
transactions that the updates represent.
o Consistency − Many real-world transactions cause multiple updates to a database. The DBMS can be asked
to enforce this type of consistency rule, or to support applications that implement such rules.



Enforcing integrity constraints

Integrity constraints are a set of rules used to maintain the quality of the
information in the database.
Integrity constraints ensure that data insertion, updating, and other processes
are performed in such a way that data integrity is not affected.
Thus, integrity constraints guard against accidental damage to the database.
Types of Integrity Constraint



Cont.,

Domain constraints
 Domain constraints specify the valid set of values for an attribute.
 The data types of a domain include string, character, integer, time, date, currency,
etc. The value of the attribute must come from the corresponding domain.
Entity integrity constraints
 The entity integrity constraint states that a primary key value can't be null.
 This is because the primary key value is used to identify individual rows in a
relation; if the primary key had a null value, we could not identify those rows.
 Fields other than the primary key may contain null values.
Referential Integrity Constraints
 A referential integrity constraint is specified between two tables: a foreign key
value in one table must match a primary key value in the referenced table (or be null).
Key constraints
 Keys are the attributes used to uniquely identify an entity within its entity set.
 An entity set can have multiple keys, but one of them is chosen as the primary
key. A primary key must contain a unique value and cannot be null.
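The entity and referential integrity constraints above can be demonstrated with sqlite3 (note that SQLite enforces foreign keys only when the foreign_keys pragma is enabled; the department/employee tables are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT NOT NULL);
CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,                    -- entity integrity: unique, not null
    name    TEXT NOT NULL,                          -- required data
    dept_id INTEGER REFERENCES department(dept_id)  -- referential integrity
);
""")
conn.execute("INSERT INTO department VALUES (10, 'Sales')")
conn.execute("INSERT INTO employee VALUES (1, 'Meena', 10)")

# A row pointing at a non-existent department violates referential integrity.
try:
    conn.execute("INSERT INTO employee VALUES (2, 'Raj', 99)")
    fk_violated = False
except sqlite3.IntegrityError:
    fk_violated = True
print(fk_violated)  # True
```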
Querying relational data

Relational algebra is used to break down user requests and instruct the DBMS how to
execute them. A relational query language is used by the user to communicate with the
database. Query languages are generally at a higher level than ordinary programming languages.
This is further divided into two types
Procedural Query Language
Non-Procedural Language
 
Procedural Query Language
The user instructs the system to perform a set of operations on the database to determine the
desired results.
Non-Procedural Language 
The user outlines the desired information without giving a specific procedure for attaining the
information.



Cont.,

Relational Algebra
The query language ‘Relational Algebra’ defines a set of operations on relations.
Consider STUD, DEPT and FACULTY DATABASES.
There are five basic operators:
  Select (σ): Returns the rows of the input relation that satisfy the predicate.
  Projection (π): Outputs the specified attributes from all rows of the input relation;
duplicate tuples are removed from the output.
  Natural Join (⋈): Outputs pairs of rows from the two input relations that have the same
value on all common attributes.
  r1 ⋈ r2
  Cartesian Product (×): Outputs all pairs of rows from the two input relations. 
  r1 × r2
  Union (∪): Outputs the union of tuples from the two relations. 
  r1 ∪ r2

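The operators above can be sketched directly on small in-memory relations. The following Python sketch (relation names and sample rows are invented, not taken from the slides) models a relation as a list of dicts and implements σ, ℼ, and natural join:

```python
# Sketch (not from the slides): relations modelled as lists of dicts,
# with sigma (select), pi (project) and natural join implemented directly.
STUD = [{"sid": 1, "name": "Asha", "dept": "CSE"},
        {"sid": 2, "name": "Ravi", "dept": "ECE"}]
DEPT = [{"dept": "CSE", "head": "Rao"},
        {"dept": "ECE", "head": "Mani"}]

def select(rel, pred):
    # sigma: keep only the rows satisfying the predicate
    return [r for r in rel if pred(r)]

def project(rel, attrs):
    # pi: keep only the named attributes, removing duplicate tuples
    seen, out = set(), []
    for r in rel:
        t = tuple((a, r[a]) for a in attrs)
        if t not in seen:
            seen.add(t)
            out.append(dict(t))
    return out

def natural_join(r1, r2):
    # pairs of rows agreeing on all common attributes
    common = set(r1[0]) & set(r2[0])
    return [{**a, **b} for a in r1 for b in r2
            if all(a[c] == b[c] for c in common)]

cse = select(STUD, lambda r: r["dept"] == "CSE")
names = project(STUD, ["name"])
joined = natural_join(STUD, DEPT)
```

Cartesian product is the same pairing without the common-attribute filter, and union is list concatenation with duplicate removal.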


Logical data base design
As a designer of logical databases, you develop a set of processes that serve a business or
organization. Logical database design is a duty of the database administrator and involves
gathering information about a business's organization and processes so the database can
accommodate them. After this data has been gathered, you create a logical model that maps the
functions and relationships between the processes and data. Once the logical model is
diagrammed, you can build a physical database model that implements the information gathered
for the logical model.
  Logical database design involves three phases:
1. Requirements analysis phase
2. Data modeling phase
3. Normalization phase
  Requirements Analysis Phase: The requirements analysis phase involves examining the business
being modeled, interviewing users and management to assess the current system and to analyze
future needs, and determining information requirements for the business as a whole. This
process is relatively straightforward.
  Data Modeling Phase: The data modeling phase involves modeling the database structure itself.
This involves using a data modeling method which provides a means of visually representing
various aspects of the database structure, such as the tables, table relationships, and
relationship characteristics.



Introduction to views

A view is a virtual or logical table that allows users to view or manipulate parts of the tables. To
reduce redundant data to the minimum possible, Oracle allows the creation of an object
called a VIEW. 
A view is mapped to a SELECT statement. The table on which the view is based is described in
the FROM clause of that SELECT statement. 
Creating view
  A view can be created using the CREATE VIEW statement. We can create a view from a single
table or multiple tables.
Syntax:
  CREATE VIEW view_name AS  
  SELECT column1, column2.....  
  FROM table_name  
  WHERE condition; 



Destroying/altering tables and views

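The slides leave this section blank, so as a hedged illustration: tables and views are typically destroyed with DROP TABLE / DROP VIEW and altered with ALTER TABLE. A small sketch using Python's sqlite3 module (the table, column and view names are made up):

```python
# Hedged sketch with Python's sqlite3: DROP destroys a table or view,
# ALTER TABLE changes a table's structure. All names here are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE VIEW v_names AS SELECT name FROM student")

con.execute("ALTER TABLE student ADD COLUMN age INTEGER")  # alter the structure
con.execute("DROP VIEW v_names")                           # destroy the view
con.execute("DROP TABLE student")                          # destroy the table

remaining = con.execute(
    "SELECT name FROM sqlite_master WHERE type IN ('table', 'view')"
).fetchall()
```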


Relational Algebra
The relational algebra is a theoretical procedural query language which takes an instance of relations
and does operations that work on one or more relations to describe another relation without altering
the original relation(s). Thus, both the operands and the outputs are relations. So the output from
one operation can turn into the input to another operation, which allows expressions to be nested in
the relational algebra, just as you nest arithmetic operations. This property is called closure: relations
are closed under the algebra, just as numbers are closed under arithmetic operations.
The relational algebra is a relation-at-a-time (or set) language: all tuples are processed in one
statement without the use of a loop. There are several variations of syntax for relational algebra
commands; here a common symbolic notation is used and presented informally.
The primary operations of relational algebra are as follows:
 Select
 Project
 Union
 Set difference
 Cartesian product
 Rename



Tuple relational Calculus

 Relational algebra specifies procedures and methods to fetch data and hence is called
a procedural query language, whereas relational calculus is a non-procedural
query language that focuses on just fetching the data rather than on how the query
will work or how the data will be fetched 
 Simply put, relational calculus focuses on what to do rather than on how to do it 
Relational calculus is present in two formats -
 Tuple relational calculus (TRC) 
 Domain relational calculus (DRC)



Cont.,
Tuple Relational Calculus (TRC) in DBMS :
In relational calculus, tuples are filtered based on a condition 
Syntax:  { T | Condition(T) }
  Relation part 
  Here T represents the tuple variable, which ranges over the rows of a table 
  It can be any variable, but for understanding we use the variable T, which stands for the table as per
our context
  For example, if our table is Student, we write it as Student(T)
  Condition part 
  The condition is specified using dot notation, T.column, naming the column we need to operate on
  Example  1:  T.age > 21
• where T is our tuple variable and age is a column that is used for filtering the records in the
relation 
• Now combine both the relation and condition parts and see how the final query will look:
{ T.name | Student(T) AND T.age > 21 }

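The declarative reading of a TRC expression can be mirrored almost verbatim by a Python set comprehension. A small sketch with invented sample data:

```python
# The TRC query { T.name | Student(T) AND T.age > 21 } written as a set
# comprehension; the Student rows below are invented sample data.
Student = [{"name": "Asha", "age": 22},
           {"name": "Ravi", "age": 20},
           {"name": "Meena", "age": 25}]

result = {T["name"] for T in Student if T["age"] > 21}
```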


Domain relational calculus

In domain relational calculus, filtering of records is done based on the domains of the
attributes rather than on whole tuple values 
• A domain is nothing but the set of allowed values in a column of a table 
  Syntax:  { <c1, c2, …> | F(c1, c2, …) }
  where c1, c2… etc represent domain variables for the attributes (columns) and F represents
the formula including the condition for fetching the data. 
  Example 1:  { <name, age> | <name, age> ∈ Student ∧ age > 21 }
  Again, the above query will return the names and ages of the students in the table
Student who are older than 21 



SQL Data Definition

 DDL stands for Data Definition Language.
 It is a language used for defining and modifying the data and its structure.
 It is used to build and modify the structure of your tables and other objects in the database.
DDL commands are as follows,
1) CREATE
2) DROP
3) ALTER
4) RENAME
5) TRUNCATE

 These commands can be used to add, remove or modify tables within a database.
 DDL has pre-defined syntax for describing the data.



SQL Data Types and Schemas

SQL Datatype
SQL Datatype is used to define the values that a column can contain.
Every column is required to have a name and data type in the database table.



Cont.,

Database Schema
A database schema is the skeleton structure that represents the logical view of the entire
database. It defines how the data is organized and how the relations among them are associated.
It formulates all the constraints that are to be applied on the data.
A database schema defines its entities and the relationship among them. It contains a descriptive
detail of the database, which can be depicted by means of schema diagrams. It’s the database
designers who design the schema to help programmers understand the database and make it
useful.

• Physical Database Schema − This schema
pertains to the actual storage of data and its
form of storage like files, indices, etc. It defines
how the data will be stored in secondary
storage.
• Logical Database Schema − This schema
defines all the logical constraints that need to
be applied on the data stored. It defines tables,
views, and integrity constraints.



Basic Structure of SQL Queries

The fundamental structure of SQL queries includes three clauses: select, from,
and where. What we want in the final result relation is specified in
the select clause. Which relations we need to access to get the result is specified
in the from clause. How the relations must be operated on to get the result is
specified in the where clause.

select A1, A2, . . . , An
from r1, r2, . . . , rm
where P;

• In the select clause, you specify the attributes that you want to see in the
result relation.
• In the from clause, you specify the list of relations that has to be accessed
for evaluating the query.
• The where clause involves a predicate over attributes of the relations that
are listed in the from clause.

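A runnable sketch of the select-from-where pattern, using Python's sqlite3 module with an invented instructor table:

```python
# select-from-where sketch via sqlite3; schema and rows are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE instructor (id INTEGER, name TEXT, salary REAL)")
con.executemany("INSERT INTO instructor VALUES (?, ?, ?)",
                [(1, "Kim", 60000.0), (2, "Lee", 90000.0)])

# select name (A1) from instructor (r1) where the predicate P holds
rows = con.execute(
    "SELECT name FROM instructor WHERE salary > 70000"
).fetchall()
```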


Additional Basic Operations



Set Operations
The SQL Set operation is used to combine two or more SQL SELECT statements.
Types of Set Operation: Union, Union All, Intersect, Minus
1. Union
 The SQL Union operation is used to combine the results of two or more SQL SELECT queries.
 In the union operation, the number of columns and their datatypes must be the same in both
the tables on which the UNION operation is being applied.
 The union operation eliminates duplicate rows from its result set.
2. Union All
 Union All is similar to the Union operation, but it returns the result set without removing
duplicates and without sorting the data.
3. Intersect
 It is used to combine two SELECT statements. The Intersect operation returns the common
rows from both SELECT statements.
 In the Intersect operation, the number of columns and their datatypes must be the same.
 It has no duplicates, and it arranges the data in ascending order by default.
4. Minus
 It combines the results of two SELECT statements. The Minus operator displays the rows
which are present in the first query but absent in the second query.
 It has no duplicates, and the data is arranged in ascending order by default.
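A hedged sketch of the set operations using Python's sqlite3. Note that SQLite spells the Minus operation EXCEPT, and ordering is only guaranteed with an explicit ORDER BY; the tables are invented:

```python
# Set operations in sqlite3 (SQLite spells Minus as EXCEPT); invented tables.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE first(id INTEGER);
INSERT INTO first VALUES (1), (2), (3);
CREATE TABLE second(id INTEGER);
INSERT INTO second VALUES (2), (3), (4);
""")
union = con.execute("SELECT id FROM first UNION SELECT id FROM second").fetchall()
inter = con.execute("SELECT id FROM first INTERSECT SELECT id FROM second").fetchall()
minus = con.execute("SELECT id FROM first EXCEPT SELECT id FROM second").fetchall()
```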
Null Values

SQL supports a special value called null, which is used to represent values
of attributes that are unknown or do not apply to a particular row
 For example, if the age of a particular student is not available in the age column of the student
table, then it is represented as null but not as zero
 It is important to know that a null value is always different from a zero value
 A null value is used to represent the following different interpretations
 Value unknown (value exists but is not known)
 Value not available (exists but is purposely withheld)
 Attribute not applicable (undefined for that row)

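The behaviour of null can be demonstrated with sqlite3: a null age matches neither 0 nor any other value under `=`, and must be tested with IS NULL. Sample data is invented:

```python
# null vs zero in sqlite3: age = 0 does not match a null age; IS NULL does.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE student (name TEXT, age INTEGER)")
con.executemany("INSERT INTO student VALUES (?, ?)",
                [("Asha", 22), ("Ravi", None)])  # None becomes SQL NULL

eq_zero = con.execute("SELECT name FROM student WHERE age = 0").fetchall()
is_null = con.execute("SELECT name FROM student WHERE age IS NULL").fetchall()
```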


Aggregate Functions

 SQL aggregation functions are used to perform calculations on multiple rows of a
single column of a table, returning a single value.
 It is also used to summarize the data.
 When using data in the form of numerical values, the following operations can be
used to perform DBMS aggregation:
 Average (AVG): This function provides the mean or average of the data values.
 Sum: This provides a total value after the data values have been added.
 Count: This provides the number of records.
 Maximum (Max): This function provides the maximum value of a given set of
data.
 Minimum (Min): This provides the minimum value of a given set of data.
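All five aggregate functions can be seen in one query; a sqlite3 sketch over an invented marks table:

```python
# The five aggregate functions over one invented column of marks.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE marks (score INTEGER)")
con.executemany("INSERT INTO marks VALUES (?)", [(60,), (70,), (80,)])

avg_, total, count, high, low = con.execute(
    "SELECT AVG(score), SUM(score), COUNT(*), MAX(score), MIN(score) FROM marks"
).fetchone()
```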



Nested Sub-queries

A nested query is a query that has another query embedded within it. The
embedded query is called a subquery.
A subquery typically appears within the WHERE clause of a query. It can
sometimes appear in the FROM clause or HAVING clause.
Example:
Find the names of employees who have regno=103
The query is as follows −
select E.ename from employee E where E.eid IN (select S.eid from salary S where
S.regno=103);
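The subquery above can be run as-is against small invented employee and salary tables using sqlite3:

```python
# The subquery example above, runnable against invented sample tables.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee(eid INTEGER, ename TEXT);
CREATE TABLE salary(eid INTEGER, regno INTEGER);
INSERT INTO employee VALUES (1, 'Asha'), (2, 'Ravi');
INSERT INTO salary VALUES (1, 103), (2, 104);
""")
names = con.execute(
    "SELECT E.ename FROM employee E "
    "WHERE E.eid IN (SELECT S.eid FROM salary S WHERE S.regno = 103)"
).fetchall()
```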



Modification of the Database
The modification of a database has three commands, namely:
 DELETE 
 INSERT
 UPDATE
Delete Command: This command helps us to remove rows from the table.
 Syntax :     DELETE from r where P;
Insert Command: This command helps us to insert rows into the table.
Syntax :     INSERT INTO relation-name VALUES (…);
Update Command: This command helps us to modify column values in table rows.
Syntax :     UPDATE <relation name>
                     SET <assignment>
                     WHERE <condition>;

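A runnable sketch of the three modification commands with sqlite3 (the account table is invented):

```python
# INSERT, UPDATE and DELETE on an invented account table via sqlite3.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (no INTEGER PRIMARY KEY, balance REAL)")

con.execute("INSERT INTO account VALUES (101, 500.0)")                    # insert a row
con.execute("UPDATE account SET balance = balance + 100 WHERE no = 101")  # modify it
balance = con.execute("SELECT balance FROM account WHERE no = 101").fetchone()[0]
con.execute("DELETE FROM account WHERE no = 101")                         # remove it
remaining = con.execute("SELECT COUNT(*) FROM account").fetchone()[0]
```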


Arithmetic & logical operations

  Arithmetic Operators
  These operators are used to perform operations such as addition, multiplication,
subtraction etc.



Cont.,

Logical Operators
The logical operators are used to perform operations such as ALL, ANY, NOT,
BETWEEN etc.



String conversion

The following scalar functions perform an operation on a string input value and return a
string or numeric value:
  ASCII, CHAR, CHARINDEX, CONCAT, CONCAT_WS, DIFFERENCE, FORMAT, LEFT, LEN,
  LOWER, LTRIM, NCHAR, PATINDEX, QUOTENAME, REPLACE, REPLICATE, REVERSE, RIGHT,
  RTRIM, SOUNDEX, SPACE, STR, STRING_AGG, STRING_ESCAPE, STRING_SPLIT, STUFF,
  SUBSTRING, TRANSLATE, TRIM, UNICODE, UPPER



Transactions
A transaction can be defined as a group of tasks. A single task is the minimum processing unit which cannot
be divided further.
ACID Properties
A transaction is a very small unit of a program and it may contain several low-level tasks. A transaction in a
database system must maintain Atomicity, Consistency, Isolation, and Durability − commonly known as the ACID
properties − in order to ensure accuracy, completeness, and data integrity.
Atomicity − This property states that a transaction must be treated as an atomic unit, that is, either all of its
operations are executed or none. There must be no state in a database where a transaction is left partially
completed. States should be defined either before the execution of the transaction or after the
execution/abortion/failure of the transaction.
Consistency − The database must remain in a consistent state after any transaction. No transaction should
have any adverse effect on the data residing in the database. If the database was in a consistent state before
the execution of a transaction, it must remain consistent after the execution of the transaction as well.
Durability − The database should be durable enough to hold all its latest updates even if the system fails or
restarts. If a transaction updates a chunk of data in a database and commits, then the database will hold the
modified data. If a transaction commits but the system fails before the data could be written on to the disk,
then that data will be updated once the system springs back into action.
Isolation − In a database system where more than one transaction are being executed simultaneously and in
parallel, the property of isolation states that all the transactions will be carried out and executed as if it is the
only transaction in the system. No transaction will affect the existence of any other transaction.
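Atomicity in particular can be demonstrated with sqlite3: if a transaction fails midway, the whole transaction is rolled back and no partial update survives. A sketch with an invented account table:

```python
# Atomicity sketch: the debit is rolled back when the transaction fails,
# so the committed balance is unchanged. Table and values are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (no INTEGER PRIMARY KEY, balance REAL)")
con.execute("INSERT INTO account VALUES (1, 100.0)")
con.commit()

try:
    with con:  # one transaction: commit on success, rollback on error
        con.execute("UPDATE account SET balance = balance - 50 WHERE no = 1")
        raise RuntimeError("simulated failure before the matching credit")
except RuntimeError:
    pass  # the partial debit was rolled back

balance = con.execute("SELECT balance FROM account WHERE no = 1").fetchone()[0]
```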



Triggers

Triggers
Triggers are SQL statements that are executed automatically when there is a change
in the database. Triggers are executed in response to certain events (INSERT, UPDATE
or DELETE) on a particular table. They help in maintaining the integrity of the data
by changing the data of the database in a systematic fashion.
  Syntax
  create trigger Trigger_name (before | after) [insert | update | delete] on [table_name]
[for each row] [trigger_body]
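Following that syntax, a minimal AFTER INSERT trigger in sqlite3 (the table, trigger and sample rows are made up):

```python
# AFTER INSERT trigger sketch in sqlite3; table, trigger and rows invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE student(id INTEGER, name TEXT);
CREATE TABLE audit(msg TEXT);
CREATE TRIGGER log_insert AFTER INSERT ON student
FOR EACH ROW
BEGIN
    INSERT INTO audit VALUES ('inserted ' || NEW.name);
END;
""")
con.execute("INSERT INTO student VALUES (1, 'Asha')")  # fires the trigger
log = con.execute("SELECT msg FROM audit").fetchall()
```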



Problems caused by redundancy

  Redundancy means having multiple copies of the same data in the database. This problem
arises when a database is not normalized. Suppose the attributes of a student details table are:
student id, student name, college name, college rank, course opted.
  1. Insertion Anomaly – 
If the details of a student whose course has not yet been decided have to be inserted, then
insertion will not be possible until the course is decided for the student. 
  2. Deletion Anomaly – 
If the details of students in this table are deleted then the details of college will also get
deleted which should not occur by common sense. 
This anomaly happens when deletion of a data record results in losing some unrelated
information that was stored as part of the record that was deleted from a table.  
It is not possible to delete some information without losing some other information in the
table as well.
  3. Updation Anomaly – 
Suppose the rank of the college changes; then the change will have to be made all over the
database, which will be time-consuming and computationally costly. 



Decompositions
Decomposition in DBMS is to break a relation into multiple relations to bring it into an appropriate normal
form. It helps to remove redundancy, inconsistencies, and anomalies from a database. The decomposition of
a relation R in a relational schema is the process of replacing the original relation R with two or more
relations in a relational schema. Each of these relations contains a subset of the attributes of R and together
they include all attributes of R.
Advantages of decomposition :
Decomposition offers several advantages, described below:
Easy reuse of code
Decomposition makes it easier for programs to copy and reuse important code for other
tasks in the DBMS. This not only saves a lot of time but also makes things convenient for the users.
Finding mistakes
Another reason programmers opt for decomposition is that it lets them complete complex
programs conveniently: mistakes are much easier to find with this style of design.
Problem-solving approach
It is a sound problem-solving strategy with which complex computer programs can be written
easily, since many small pieces of code can be joined together for the desired result.
Eliminating errors
The biggest advantage of decomposition in a DBMS is eliminating inconsistencies and duplication to
a great extent. Data can be identified easily once decomposition has been carried out.



Problems related to decomposition

There are many problems regarding the decomposition in DBMS mentioned below:
  Redundant Storage
  When the same information is stored in multiple places, it can
confuse programmers and takes up a lot of space in the system.
  Insertion Anomalies
  It may not be possible to store some information unless some other,
related information is stored with it.
  Deletion Anomalies
  It may not be possible to delete some details without eliminating other
information as well.



Reasoning about functional dependencies

Functional Dependency (FD) is a constraint that determines the relation of one
attribute to another attribute in a Database Management System (DBMS).
Functional Dependency helps to maintain the quality of data in the database. It plays
a vital role in telling the difference between good and bad database design.

A functional dependency is denoted by an arrow “→”. The functional dependency
X → Y means that the values of X determine the values of Y (Y is functionally
dependent on X).

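A functional dependency X → Y can be checked against a relation instance: it holds when no two rows agree on X but differ on Y. A small sketch (the sample rows are invented):

```python
# Check whether X -> Y holds in a relation instance: no two rows may
# agree on X yet differ on Y. Sample rows are invented.
def fd_holds(rows, X, Y):
    seen = {}
    for r in rows:
        key = tuple(r[a] for a in X)
        val = tuple(r[a] for a in Y)
        if key in seen and seen[key] != val:
            return False  # same X value maps to two different Y values
        seen[key] = val
    return True

rows = [{"id": 1, "name": "Asha"},
        {"id": 2, "name": "Ravi"},
        {"id": 1, "name": "Asha"}]

holds = fd_holds(rows, ["id"], ["name"])   # id -> name holds here
fails = fd_holds([{"id": 1, "name": "A"},
                  {"id": 2, "name": "A"}], ["name"], ["id"])  # name -> id fails
```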


Normalization

 Normalization is the process of organizing the data in the database.
 Normalization is used to minimize redundancy in a relation or set of
relations. It is also used to eliminate undesirable characteristics like Insertion,
Update, and Deletion Anomalies.
 Normalization divides a larger table into smaller tables and links them using
relationships.
 Normal forms are used to reduce redundancy in database tables.
Types of Normal Forms:
Normalization works through a series of stages called normal forms. The normal
forms apply to individual relations. A relation is said to be in a particular normal
form if it satisfies the constraints of that form.



Cont.,

[Figure: types of normal forms]


Lossless join decomposition

Lossless-join decomposition is a process in which a relation is decomposed into two or
more relations. This property guarantees that the extra or missing tuple generation
problem does not occur and no information is lost from the original relation during the
decomposition. It is also known as non-additive join decomposition.
When the sub relations combine again then the new relation must be the same as the
original relation was before decomposition.
When the sub relations combine again then the new relation must be the same as the
original relation was before decomposition.
Consider a relation R if we decomposed it into sub-parts relation R1 and relation R2.
The decomposition is lossless when it satisfies the following statement −
 If we union the sub Relation R1 and R2 then it must contain all the attributes that
are available in the original relation R before decomposition.
 The intersection of R1 and R2 cannot be empty: the sub-relations must share a common
attribute, and that common attribute must contain unique values (i.e., be a key) in at least one of them.

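Both conditions can be seen in a tiny sketch: R(sid, name, dept) is decomposed on the common attribute sid, which is unique in both parts, so the natural join reproduces R exactly. Names and data are invented:

```python
# R(sid, name, dept) decomposed into R1(sid, name) and R2(sid, dept);
# sid is common to both and unique, so the natural join restores R.
R = [{"sid": 1, "name": "Asha", "dept": "CSE"},
     {"sid": 2, "name": "Ravi", "dept": "ECE"}]

R1 = [{"sid": r["sid"], "name": r["name"]} for r in R]
R2 = [{"sid": r["sid"], "dept": r["dept"]} for r in R]

# natural join on the common attribute sid
rejoined = [{**a, **b} for a in R1 for b in R2 if a["sid"] == b["sid"]]
```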


Multivalued dependencies

A multivalued dependency occurs when the existence of one or more rows in a table
implies the existence of one or more other rows in the same table.
If a table has attributes P, Q and R, and Q and R are multi-valued facts about P,
this is represented by a double arrow:

P →→ Q and P →→ R

In the above case, a multivalued dependency exists only if Q and R are independent
attributes.
A table with a multivalued dependency violates 4NF.



Concurrent Executions

Concurrency Control is the management procedure that is required for controlling concurrent
execution of the operations that take place on a database.
But before knowing about concurrency control, we should know about concurrent execution.
Concurrent Execution in DBMS
 In a multi-user system, multiple users can access and use the same database at one time, which
is known as the concurrent execution of the database. It means that the same database is
executed simultaneously on a multi-user system by different users.
 While working on database transactions, there arises the requirement of using the
database by multiple users to perform different operations, and in that case concurrent
execution of the database is performed.
 The key point is that this simultaneous execution should be done in an interleaved
manner, and no operation should affect the other executing operations, thus maintaining the
consistency of the database. Concurrent execution of transaction operations therefore
raises several challenging problems that need to be solved.



Lock based Concurrency
Lock-Based Protocols- 
  It is a mechanism in which a transaction cannot read or write data unless the appropriate lock is
acquired. This helps in eliminating the concurrency problem by locking a particular transaction to a
particular user. The lock is a variable that denotes those operations that can be executed on the
particular data item.  
  The various types of lock include 
 Binary lock– It ensures that the data item can be in either locked or unlocked state 
 Shared Lock– A shared lock is also called a read-only lock because it grants no permission to update
data on the data item. With this lock the data item can be easily shared between different transactions.
For example, if two teams are working on employee payment accounts, they would be able to access it
but wouldn't be able to modify the data on the payment account. 
 Exclusive Lock– With an exclusive lock, the data item can be both read and written 
 Simplistic Lock Protocol– this lock protocol allows transactions to get lock on every object at the start of
operation. Transactions are able to unlock the data item after completing the write operations 
 Pre-claiming locking– This protocol evaluates the operations and builds a list of the necessary data items
which are required to initiate the execution of the transaction. As soon as the locks are acquired, the
execution of the transaction takes place. When the operations are over, all the locks are released. 
 Starvation- It is the condition where a transaction has to wait for an indefinite period for acquiring a
lock. 
 Deadlock- It is the condition when two or more processes are waiting for each other to get a resource
released 



Dealing with Deadlocks

A deadlock is a condition where two or more transactions are waiting indefinitely for
one another to give up locks. Deadlock is said to be one of the most feared
complications in DBMS as no task ever gets finished and is in waiting state forever.
For example: In the student table, transaction T1 holds a lock on some rows and
needs to update some rows in the grade table. Simultaneously, transaction T2 holds
locks on some rows in the grade table and needs to update the rows in the Student
table held by Transaction T1.



Deadlock Handling

Deadlock Detection
The resource scheduler can detect a deadlock as it keeps track of all the resources that
are allocated to different processes. After a deadlock is detected, it can be resolved
using the following methods:
 All the processes involved in the deadlock are terminated. This is not a good
approach as all the progress made by the processes is destroyed.
 Resources can be preempted from some processes and given to others till the
deadlock is resolved.
Deadlock Prevention
It is imperative to prevent a deadlock before it can occur. So, the system rigorously
checks each transaction before it is executed to make sure it does not lead to deadlock.
If there is even a chance that a transaction may lead to deadlock, it is never allowed to
execute.
Deadlock Avoidance
It is better to avoid a deadlock rather than take measures after the deadlock has
occurred. The wait-for graph can be used for deadlock avoidance. This is however only
useful for smaller databases as it can get quite complex in larger databases.
Use of Lock Conversions

Changing the mode of a lock that is already held is called lock conversion.
Lock conversion occurs when a process accesses a data object on which it already
holds a lock, and the access mode requires a more restrictive lock than the one
already held. A process can hold only one lock on a data object at any given time,
although it can request a lock on the same data object many times indirectly
through a query.
Some lock modes apply only to tables, others only to rows, blocks, or data
partitions. For rows or blocks, conversion usually occurs if an X lock is needed and an
S or U lock is held.



Crash Recovery ARIES algorithm

 ARIES stands for Algorithms for Recovery and Isolation Exploiting Semantics. It was
developed at IBM Research in the early 1990s.
Not all systems implement ARIES exactly as defined in the original paper, but they
are similar enough.
Main ideas of the ARIES recovery protocol:
  • Write Ahead Logging: Any change is recorded in log on stable storage before
the database change is written to disk (STEAL + NO-FORCE).
  • Repeating History During Redo: On restart, retrace actions and restore
database to exact state before crash.
  • Logging Changes During Undo: Record undo actions to log to ensure action is
not repeated in the event of repeated failures.



Cont.,

Goals of crash recovery


  – Either transaction commits and is correct or aborts
  – Commit means all actions of transaction have been executed
Error model:
  – contents of main memory are lost
  – disk contents remain intact and correct
Crash recovery requirements
  • If transaction has committed then still have results (on disk)
  • If transaction in process, either
  1. Transaction completely aborts OR
  2. Transaction can continue after restore as if no crash
  • Get serializable schedule such that transactions that committed before crash still commit
and in same order
  => NEED LOG



Cont.,

  ARIES algorithm
Assumptions
  – Strict 2PL => no cascaded aborts
  – “in place” disk updates: data overwritten on disk
  • Page read into buffer, changed in buffer, written out again
  • Write of page to disk is atomic
Log:
  – Sequential writes on separate disk
  – Write differences only
  • Multiple updates on single log page
  • Each log record has unique Log Sequence Number
  – LSN strictly sequential.



The Write Ahead log Protocol

Write-ahead logging (WAL) is a family of techniques for providing atomicity and durability (two of the
ACID properties) in database systems. A write-ahead log is an append-only auxiliary disk-resident
structure used for crash and transaction recovery. The changes are first recorded in the log, which must
be written to stable storage, before the changes are written to the database.
The main functionality of a write-ahead log can be summarized as:
 Allow the page cache to buffer updates to disk-resident pages while ensuring durability semantics in
the larger context of a database system.
 Persist all operations on disk until the cached copies of pages affected by these operations are
synchronized on disk. Every operation that modifies the database state has to be logged on disk before
the contents on the associated pages can be modified
 Allow lost in-memory changes to be reconstructed from the operation log in case of a crash.
In a system using WAL, all modifications are written to a log before they are applied. Usually both redo
and undo information is stored in the log.
The purpose of this can be illustrated by an example. Imagine a program that is in the middle of
performing some operation when the machine it is running on loses power. Upon restart, that program
might need to know whether the operation it was performing succeeded, succeeded partially, or failed. If
a write-ahead log is used, the program can check this log and compare what it was supposed to be doing
when it unexpectedly lost power to what was actually done. On the basis of this comparison, the program
could decide to undo what it had started, complete what it had started, or keep things as they are.
ARIES is a popular algorithm in the WAL family.
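The log-before-data rule and the redo pass can be sketched with a toy in-memory model (purely illustrative, not a real WAL implementation):

```python
# Toy write-ahead log (illustrative only, not a real WAL implementation):
# every change is appended to the log before the cached page is modified,
# so after a crash the lost in-memory state can be rebuilt by replaying.
log = []   # stands in for the stable, append-only log on disk
db = {}    # stands in for buffered database pages in memory

def update(key, value):
    log.append((key, value))  # 1. log record reaches stable storage first
    db[key] = value           # 2. only then is the page modified

update("A", 10)
update("B", 20)

db = {}                       # simulated crash: in-memory changes are lost
for key, value in log:        # redo pass: repeat history from the log
    db[key] = value
```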
System Crash Recovery

Crash recovery is the operation through which the database is brought back to a
consistent and operational condition. In a DBMS, this is performed by rolling back
incomplete transactions and completing committed transactions that were still
in memory when the crash took place.
With many transactions being executed each second, a DBMS can be a tremendously
complex system. Its ability to sustain the robustness and resilience of the software,
on top of the underlying hardware, depends on careful design. The system is expected
to follow some methodology or technique to restore lost data when it fails or crashes
in the middle of transactions.

System Crash: External problems may stop the system unexpectedly and
cause it to crash. For example, interruptions or
interference in the power supply may cause the underlying hardware or
software to fail or crash.



File organization

File Organization refers to the logical relationships among the various records that
constitute the file, particularly with respect to the means of identification and access
to any specific record. In simple terms, storing the files in a certain order is called file
organization. File Structure refers to the format of the label and data blocks and of
any logical control record. 



Hash based and Tree based Indexing

Hash index
This technique is widely used for creating indices in main memory because of its fast
retrieval by nature. It has average O(1) operation complexity and O(n) storage
complexity.
In many books, people use the term bucket to denote a unit of storage that stores one
or more records

There are two things to discuss when it comes to hashing:
 Hash function: maps search keys (as its input) to an integer representing that key in
the bucket.
 Hashing scheme: how to deal with key collisions after hashing.

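A minimal bucketed hash index sketch: the hash function chooses a bucket, and collisions are handled by chaining inside the bucket (the bucket count, keys and record ids are arbitrary examples):

```python
# Bucketed hash index sketch: hash() picks the bucket, chaining handles
# collisions. Bucket count, keys and record ids are arbitrary examples.
N_BUCKETS = 8

class HashIndex:
    def __init__(self):
        self.buckets = [[] for _ in range(N_BUCKETS)]

    def insert(self, key, rid):
        self.buckets[hash(key) % N_BUCKETS].append((key, rid))

    def lookup(self, key):
        # expected O(1): only one bucket's chain is scanned
        return [rid for k, rid in self.buckets[hash(key) % N_BUCKETS]
                if k == key]

idx = HashIndex()
idx.insert("1315", "row#7")
idx.insert("2040", "row#9")
hit = idx.lookup("1315")
miss = idx.lookup("9999")
```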


Cont.,

•Hash index is suitable for equality or primary key lookup. Queries can benefit from hash
index to get amortized O(1) lookup cost.
For example: SELECT name, id FROM student WHERE id = '1315';



Cont.,

B+Tree:
This is a self-balancing tree data structure that keeps data in sorted order and allows fast search
within each node, typically using binary search.
B+Tree is a standard index implementation in almost all relational database systems.
B+Tree is basically an M-way search tree that has the following structure:
 perfectly balanced: leaf nodes are always at the same height.
 every inner node other than the root is at least half full (M/2 − 1 <= num of keys <= M − 1).
 every inner node with k keys has k+1 non-null children.
Every node of the tree has an array of sorted key-value pairs. The key-value pair is constructed
from (search-key value, pointer) for root and inner nodes. Leaf node values can be 2
possibilities:
 the actual record
 the pointer to actual record
Lookup a value v
 Start with root node
 While node is not a leaf node, we do:
 Find the smallest Ki where Ki >= v
 If Ki == v: set current node to the node pointed by Pi+1
 Otherwise, set current node to node pointed by Pi
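
The lookup procedure above can be sketched as follows. The Node layout is a simplified in-memory stand-in for on-disk pages; real B+Trees also handle node splits, merges, and leaf chaining, which are omitted here.

```python
import bisect

class Node:
    def __init__(self, keys, children=None, records=None):
        self.keys = keys          # sorted search-key values K1..Kn
        self.children = children  # child pointers P1..Pn+1 (inner nodes)
        self.records = records    # values stored at the leaves

    @property
    def is_leaf(self):
        return self.children is None

def lookup(root, v):
    node = root
    while not node.is_leaf:
        # Find the smallest Ki such that Ki >= v (binary search in-node).
        i = bisect.bisect_left(node.keys, v)
        if i < len(node.keys) and node.keys[i] == v:
            node = node.children[i + 1]   # Ki == v: follow P(i+1)
        else:
            node = node.children[i]       # otherwise: follow Pi
    # At the leaf, scan for the exact key.
    for k, r in zip(node.keys, node.records):
        if k == v:
            return r
    return None

# A height-1 example: a root over two leaves.
root = Node([10], children=[Node([3, 7], records=['a', 'b']),
                            Node([10, 15], records=['c', 'd'])])
lookup(root, 7)    # → 'b'
lookup(root, 10)   # → 'c'
```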

Duplicate keys
In general, search keys can be duplicated. To solve this, most database implementations
use a composite search key. For example, to create an index
on student_name, the composite search key should be (student_name, Ap), where
Ap is the primary key of the table.
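
A small sketch of why the composite key works: tuples sort lexicographically, so all index entries for one student_name stay adjacent while the appended primary key keeps each entry unique. The sample rows below are hypothetical.

```python
# Hypothetical (primary key, student_name) rows from the student table.
rows = [('1315', 'Kim'), ('2048', 'Lee'), ('3072', 'Kim')]

# Composite search key: (student_name, primary key) — unique even
# though 'Kim' appears twice.
index = sorted((name, pk) for pk, name in rows)
# index == [('Kim', '1315'), ('Kim', '3072'), ('Lee', '2048')]

# An equality lookup on the name becomes a contiguous scan.
kims = [pk for name, pk in index if name == 'Kim']   # → ['1315', '3072']
```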
Pros
The B+Tree offers two major features:
•Minimized I/O operations through reduced height: a B+Tree has quite a large
branching factor (values between 50 and 2000 are often used), which makes the
tree fat and short. In a B+Tree of height 2, for instance, the keys are spread
across wide nodes, so it takes only a few node accesses to traverse down to a
leaf. The cost of looking up a single value is the height of the tree, plus 1 for
the random access to the table.
•Scalability: performance is predictable for all cases, O(log(n)) in particular. For
databases, this is usually more important than having a better best or average
case.


Conclusion:

Although a hash index performs better for exact-match queries, the B+Tree is
arguably the most widely used index structure in RDBMSs, thanks to its consistent
overall performance and high scalability.

                 B+Tree       Hash
Lookup Time      O(log(n))    O(1) amortized
Insertion Time   O(log(n))    O(1) amortized
Deletion Time    O(log(n))    O(1) amortized

Recently, the log-structured merge tree (LSM-tree) has attracted significant interest as
a contender to the B+Tree, because its data structure can enable better storage space
efficiency.
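
For orientation only, here is a toy sketch of the LSM-tree write path: writes land in an in-memory table, which is flushed as an immutable sorted run when full, and reads check the memtable first and then the runs from newest to oldest. The flush threshold and the absence of background compaction are simplifying assumptions, not how a production engine behaves.

```python
class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}   # recent writes, mutable
        self.runs = []       # immutable sorted runs, newest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # Flush: freeze the memtable into a sorted, immutable run.
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:          # newest data wins
            return self.memtable[key]
        for run in self.runs:             # then newest run to oldest
            for k, v in run:
                if k == key:
                    return v
        return None
```

A real LSM-tree would binary-search each sorted run and merge runs in the background (compaction) to keep read amplification bounded.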
