
Introduction to DBMS

Database: A database is a collection of inter-related data which supports efficient retrieval, insertion and deletion of data, and organizes the data in the form of tables, views, schemas, reports, etc. For example, a university database organizes the data about students, faculty, admin staff, etc.

DDL - Data Definition Language, which deals with database schemas and descriptions of how the data should reside in the database.

CREATE - creates a database and its objects (table, index, view, stored procedure, function, trigger)
ALTER - alters the structure of an existing database object
DROP - deletes objects from the database
TRUNCATE - removes all records from a table, including all space allocated for the records
COMMENT - adds comments to the data dictionary
RENAME - renames an object
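A minimal sketch of these DDL statements in action; the table and column names are illustrative, not from the text:

CREATE TABLE emp_demo                          -- CREATE: define a new table
(empno NUMBER(4),
ename VARCHAR2(20));
ALTER TABLE emp_demo ADD (sal NUMBER(8,2));    -- ALTER: change its structure
COMMENT ON TABLE emp_demo IS 'demo table';     -- COMMENT: annotate it in the data dictionary
RENAME emp_demo TO staff_demo;                 -- RENAME: give the object a new name
TRUNCATE TABLE staff_demo;                     -- TRUNCATE: remove all rows and their space
DROP TABLE staff_demo;                         -- DROP: remove the object itself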

DML - Data Manipulation Language, which deals with data manipulation and includes the most common SQL statements such as SELECT, INSERT, UPDATE, DELETE, etc. It is used to store, modify, retrieve, delete and update data in a database.

SELECT - retrieves data from a database
INSERT - inserts data into a table
UPDATE - updates existing data within a table
DELETE - deletes records from a table
MERGE - UPSERT operation (insert or update)
CALL - calls a PL/SQL or Java subprogram
EXPLAIN PLAN - shows the data access path
LOCK TABLE - controls concurrency
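A minimal sketch of the core DML statements, assuming the dept table (deptno, dname, loc) created later in these notes:

INSERT INTO dept (deptno, dname, loc) VALUES (10, 'SALES', 'DELHI');   -- add a row
UPDATE dept SET loc = 'MUMBAI' WHERE deptno = 10;                      -- modify it
SELECT deptno, dname, loc FROM dept WHERE deptno = 10;                 -- read it back
DELETE FROM dept WHERE deptno = 10;                                    -- remove it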

Database Management System: The software which is used to manage databases is called a Database Management System (DBMS). For example, MySQL, Oracle, etc. are popular commercial DBMSs used in different applications. A DBMS allows users to perform the following tasks:

Data Definition: It helps in the creation, modification and removal of definitions that define the organization of data in the database.
Data Updation: It helps in insertion, modification and deletion of the actual data in
the database.
Data Retrieval: It helps in retrieval of data from the database which can be used
by applications for various purposes.
User Administration: It helps in registering and monitoring users, enforcing data
security, monitoring performance, maintaining data integrity, dealing with
concurrency control and recovering information corrupted by unexpected failure.

Paradigm Shift from File System to DBMS:

A file system manages data using files on a hard disk. Users are allowed to create, delete, and update files according to their requirements. The issues with this system:
• Redundancy of data
• Inconsistency of Data
• Difficult Data Access
• Unauthorized Access
• No Concurrent Access
• No Backup and Recovery

DBMS 3-tier Architecture:

DBMS 3-tier architecture divides the complete system into three inter-related but independent modules, as described below:

1. Physical Level: Physical level of a database describes how the data is being
stored in secondary storage devices like disks and tapes and also gives insights
on additional storage details. Various users of DBMS are unaware of the
locations of these objects.
2. Conceptual Level: At the conceptual level, data is represented in the form of various database tables. Also referred to as the logical schema, it describes what kind of data is to be stored in the database.
3. External Level: An external level specifies a view of the data in terms of
conceptual level tables. Each external level view is used to cater to the needs
of a particular category of users. So, different views can be generated for
different users. The main focus of external level is data abstraction.
Data Independence
Data independence means a change of data at one level should not affect another
level. Two types of data independence:
1. Physical Data Independence: Any change in the physical location of tables
and indexes should not affect the conceptual level or external view of data.
This data independence is easy to achieve and is implemented by most DBMSs.
2. Conceptual Data Independence: This means a change in conceptual
schema should not affect external schema. But this type of independence is
difficult to achieve as compared to physical data independence because the
changes in conceptual schema are reflected in the user’s view.

Phases of database design

Conceptual Design: The requirements of the database are captured using a high-level conceptual data model. For example, the ER model is used for the conceptual design of the database.
Logical Design: Logical design represents the data in the form of the relational model. The ER diagram produced in the conceptual design phase is used to convert the data into the relational model.
Physical Design: In physical design, the data in the relational model is implemented using a commercial DBMS such as Oracle or DB2.

Advantages of DBMS
• Minimized redundancy and data inconsistency
• Simplified Data Access
• Multiple data views
• Data Security
• Concurrent access to data
• Backup and Recovery mechanism
Disadvantages of DBMS
• Increased Cost: Cost of Hardware and Software, Cost of Staff Training & Cost
of Data Conversion
• Complexity
• Currency Maintenance
• Slow performance in case of small databases
• Frequent upgrade/replacement cycles
DBMS Architecture
Two-tier architecture:
The two-tier architecture is similar to a basic client-server model. The application
at the client end directly communicates with the database at the server-side. APIs
like ODBC, JDBC are used for this interaction. The server side is responsible for
providing query processing and transaction management functionalities. On the
client-side, the user interfaces and application programs are run. The application
on the client-side establishes a connection with the server-side in order to
communicate with the DBMS.
Advantages: maintenance and understanding are easier, and it is compatible with existing systems. However, this model gives poor performance when there is a large number of users.

Three-tier architecture:

In this type, there is another layer between the client and the server. The client does
not directly communicate with the server. Instead, it interacts with an application
server which further communicates with the database system and then the query
processing and transaction management takes place. This intermediate layer acts
as a medium for the exchange of partially processed data between server and client.
This type of architecture is used in the case of large web applications.
Advantages:
• Enhanced scalability due to the distributed deployment of application servers; individual connections need not be made between client and server.
• Data integrity is maintained. Since there is a middle layer between client and server, data corruption can be avoided/removed.
• Security is improved. This type of model prevents direct interaction of the client with the server, thereby reducing access to unauthorized data.
Disadvantages:
Increased complexity of implementation and communication. It becomes difficult for
this sort of interaction to take place due to the presence of middle layers.

Data Abstraction and Data Independence

There are mainly 3 levels of data abstraction:
Physical: This is the lowest level of data abstraction. It tells us how the data is actually stored in memory. Access methods like sequential or random access, and file organization methods like B+ trees and hashing, are used at this level. Usability, the size of memory, and the number of times the records are accessed are factors we need to consider while designing the database.
Suppose we need to store the details of an employee. Blocks of storage and the
amount of memory used for these purposes are kept hidden from the user.
Logical: This level comprises the information that is actually stored in the
database in the form of tables. It also stores the relationship among the data
entities in relatively simple structures. At this level, the information available to the
user at the view level is unknown.
We can store the various attributes of an employee, and relationships such as the one with the manager can also be stored.
View: This is the highest level of abstraction. Only a part of the actual database is
viewed by the users. This level exists to ease the accessibility of the database by
an individual user. Users view data in the form of rows and columns. Tables and
relations are used to store data. Multiple views of the same database may exist.
Users can just view the data and interact with the database, storage and
implementation details are hidden from them.

The main purpose of data abstraction is to achieve data independence, in order to save the time and cost required when the database is modified or altered.
We have namely two levels of data independence arising from these levels of abstraction:
Physical level data independence: It refers to the characteristic of being able to
modify the physical schema without any alterations to the conceptual or logical
schema, done for optimization purposes, e.g., Conceptual structure of the
database would not be affected by any change in storage size of the database
system server. Changing from sequential to random access files is one such
example. These alterations or modifications to the physical structure may include:

• Utilizing new storage devices.
• Modifying data structures used for storage.
• Altering indexes or using alternative file organization techniques, etc.
Logical level data independence: It refers to the characteristic of being able to modify the logical schema without affecting the external schema or application programs.
The user view of the data would not be affected by any changes to the conceptual
view of the data. These changes may include insertion or deletion of attributes,
altering table structures entities or relationships to the logical schema, etc.

Database Objects in DBMS

A database object is any defined object in a database that is used to store or reference data. Anything we make with the CREATE command is known as a database object. It can be used to hold and manipulate the data. Some examples of database objects are: views, sequences, indexes, etc.
• Table – Basic unit of storage; composed of rows and columns
• View – Logically represents subsets of data from one or more tables
• Sequence – Generates primary key values
• Index – Improves the performance of some queries
• Synonym – Alternative name for an object
Different database Objects :
1. Table – This database object is used to create a table in database.
Syntax :
CREATE TABLE [schema.]table
(column datatype [DEFAULT expr][, ...]);
Example :
CREATE TABLE dept
(deptno NUMBER(2),
dname VARCHAR2(14),
loc VARCHAR2(13));
Check the structure of the table with:
DESCRIBE dept;
2. View – This database object is used to create a view in database. A view is a
logical table based on a table or another view. A view contains no data of its
own but is like a window through which data from tables can be viewed or
changed. The tables on which a view is based are called base tables. The
view is stored as a SELECT statement in the data dictionary.
Syntax :
CREATE [OR REPLACE] [FORCE|NOFORCE] VIEW view
[(alias[, alias]...)]
AS subquery
[WITH CHECK OPTION [CONSTRAINT constraint]]
[WITH READ ONLY [CONSTRAINT constraint]];
Example :
CREATE VIEW salvu50
AS SELECT employee_id ID_NUMBER, last_name NAME,
salary*12 ANN_SALARY
FROM employees
WHERE department_id = 50;
Query the view with:
SELECT *
FROM salvu50;

3. Sequence – This database object is used to create a sequence in the database. A sequence is a user-created database object that can be shared by multiple users to generate unique integers. A typical use of sequences is to create a primary key value, which must be unique for each row. The sequence is generated and incremented (or decremented) by an internal Oracle routine.
Syntax :
CREATE SEQUENCE sequence
[INCREMENT BY n]
[START WITH n]
[{MAXVALUE n | NOMAXVALUE}]
[{MINVALUE n | NOMINVALUE}]
[{CYCLE | NOCYCLE}]
[{CACHE n | NOCACHE}];
Example :
CREATE SEQUENCE dept_deptid_seq
INCREMENT BY 10
START WITH 120
MAXVALUE 9999
NOCACHE
NOCYCLE;
Check if sequence is created by :
SELECT sequence_name, min_value, max_value,
increment_by, last_number
FROM user_sequences;
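A sketch of how such a sequence is typically used to supply primary key values; the departments table here is hypothetical and assumed wide enough to hold the generated values:

INSERT INTO departments (department_id, department_name)
VALUES (dept_deptid_seq.NEXTVAL, 'Support');   -- NEXTVAL draws 120, 130, ...

SELECT dept_deptid_seq.CURRVAL FROM dual;      -- CURRVAL returns the value just drawn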
4. Index – This database object is used to create an index in the database. An Oracle server index is a schema object that can speed up the retrieval of rows by using a pointer. Indexes can be created explicitly or automatically. If you do not have an index on a column, then a full table scan occurs.
An index provides direct and fast access to rows in a table. Its purpose is to
reduce the necessity of disk I/O by using an indexed path to locate data quickly.
The index is used and maintained automatically by the Oracle server. Once an
index is created, no direct activity is required by the user. Indexes are logically
and physically independent of the table they index. This means that they can be
created or dropped at any time and have no effect on the base tables or other
indexes.
Syntax :
CREATE INDEX index
ON table (column[, column]...);
Example :
CREATE INDEX emp_last_name_idx
ON employees(last_name);
5. Synonym – This database object is used to create a synonym (another name for an object) in the database. It simplifies access to objects. With synonyms, you can ease referring to a table owned by another user and shorten lengthy object names. To refer to a table owned by another user, you need to prefix the table name with the name of the user who created it, followed by a period. Creating a synonym eliminates the need to qualify the object name with the schema and provides you with an alternative name for a table, view, sequence, procedure, or other object. This method can be especially useful with lengthy object names, such as views.
In the syntax:
PUBLIC : creates a synonym accessible to all users
synonym : is the name of the synonym to be created
object : identifies the object for which the synonym is created
Syntax :
CREATE [PUBLIC] SYNONYM synonym FOR object;
Example :
CREATE SYNONYM d_sum FOR dept_sum_vu;

Multimedia Database
A multimedia database is a collection of interrelated multimedia data that includes text, graphics (sketches, drawings), images, animations, video, audio, etc., and has vast amounts of multisource multimedia data. The framework that manages different types of multimedia data which can be stored, delivered and utilized in different ways is known as a multimedia database management system. There are three classes of multimedia database: static media, dynamic media and dimensional media.
Content of Multimedia Database management system:
1. Media data – The actual data representing an object.
2. Media format data – Information such as sampling rate, resolution, encoding
scheme etc. about the format of the media data after it goes through the
acquisition, processing and encoding phase.
3. Media keyword data – Keywords description relating to the generation of
data. It is also known as content descriptive data. Example: date, time and
place of recording.
4. Media feature data – Content dependent data such as the distribution of
colors, kinds of texture and different shapes present in data.
Types of multimedia applications based on data management characteristic are:
1. Repository applications – A large amount of multimedia data, as well as meta-data (media format data, media keyword data, media feature data), is stored for retrieval purposes, e.g., repositories of satellite images, engineering drawings and radiology scanned pictures.
2. Presentation applications – They involve delivery of multimedia data
subject to temporal constraint. Optimal viewing or listening requires DBMS to
deliver data at certain rate offering the quality of service above a certain
threshold. Here data is processed as it is delivered. Example: Annotating of
video and audio data, real-time editing analysis.
3. Collaborative work using multimedia information – It involves executing a
complex task by merging drawings, changing notifications. Example:
Intelligent healthcare network.
There are still many challenges to multimedia databases, some of which are:
1. Modelling – Work in this area can draw on both database and information retrieval techniques; documents constitute a specialized area and deserve special consideration.
2. Design – The conceptual, logical and physical design of multimedia databases has not yet been fully addressed, as performance and tuning issues at each level are far more complex: the data consists of a variety of formats like JPEG, GIF, PNG and MPEG, which are not easy to convert from one form to another.
3. Storage – Storage of a multimedia database on any standard disk presents the problems of representation, compression, mapping to device hierarchies, archiving and buffering during input-output operations. In a DBMS, a "BLOB" (Binary Large Object) facility allows untyped bitmaps to be stored and retrieved.
4. Performance – For an application involving video playback or audio-video synchronization, physical limitations dominate. The use of parallel processing may alleviate some problems, but such techniques are not yet fully developed. Apart from this, multimedia databases consume a lot of processing time as well as bandwidth.
5. Queries and retrieval – For multimedia data like images, video and audio, accessing data through queries opens up many issues like efficient query formulation, query execution and optimization, which need to be worked upon.
Areas where multimedia database is applied are:
• Documents and record management: industries and businesses that keep detailed records and a variety of documents. Example: insurance claim records.
• Knowledge dissemination: Multimedia database is a very effective tool for
knowledge dissemination in terms of providing several resources. Example:
Electronic books.
• Education and training: Computer-aided learning materials can be designed
using multimedia sources which are nowadays very popular sources of
learning. Example: Digital libraries.
• Marketing, advertising, retailing, entertainment and travel. Example: a virtual
tour of cities.
• Real-time control and monitoring: coupled with active database technology, multimedia presentation of information can be a very effective means for monitoring and controlling complex tasks. Example: manufacturing operation control.

Interfaces in DBMS
A database management system (DBMS) interface is a user interface which allows users to input queries to a database without using the query language itself.
User-friendly interfaces provided by a DBMS may include the following:
1. Menu-Based Interfaces for Web Clients or Browsing –
These interfaces present the user with lists of options (called menus) that lead the user through the formulation of a request. The basic advantage of menus is that they remove the burden of remembering the specific commands and syntax of a query language; instead, the query is composed step by step by picking options from menus displayed by the system. Pull-down menus are a very popular technique in Web-based interfaces. They are also often used in browsing interfaces, which allow a user to look through the contents of a database in an exploratory and unstructured manner.
2. Forms-Based Interfaces –
A forms-based interface displays a form to each user. Users can fill out all of the form entries to insert new data, or they can fill out only certain entries, in which case the DBMS will retrieve matching data for the remaining entries. Such forms are usually designed and programmed for users who have no expertise in operating systems. Many DBMSs have forms specification languages, which are special languages that help specify such forms.
Example: SQL*Forms is a form-based language that specifies queries using a form designed in conjunction with the relational database schema.
3. Graphical User Interfaces –
A GUI typically displays a schema to the user in diagrammatic form. The user can then specify a query by manipulating the diagram. In many cases, GUIs utilize both menus and forms. Most GUIs use a pointing device, such as a mouse, to pick a certain part of the displayed schema diagram.
4. Natural Language Interfaces –
These interfaces accept requests written in English or some other language and attempt to understand them. A natural language interface has its own schema, which is similar to the database conceptual schema, as well as a dictionary of important words. The natural language interface refers to the words in its schema, as well as to the set of standard words in its dictionary, to interpret the request. If the interpretation is successful, the interface generates a high-level query corresponding to the natural language request and submits it to the DBMS for processing; otherwise, a dialogue is started with the user to clarify the request. The main disadvantage of this type of interface is that its capabilities are not very advanced.
5. Speech Input and Output –
Limited use of speech, whether for a query, an answer to a question, or the result of a request, is becoming commonplace. Applications with limited vocabularies, such as inquiries for telephone directories, flight arrival/departure, and bank account information, allow speech for input and output so that ordinary people can access this information. Speech input is detected using a predefined set of words, which is used to set up the parameters supplied to the queries. For output, a similar conversion from text or numbers into speech takes place.
6. Interfaces for the DBA –
Most database systems contain privileged commands that can be used only by the DBA's staff. These include commands for creating accounts, setting system parameters, granting account authorization, changing a schema, and reorganizing the storage structures of a database.

Categories of End Users in DBMS

End users are basically those people whose jobs require access to the database for querying, updating, and generating reports. The database primarily exists for their use. There are several categories of end users, as follows:
1. Casual End Users –
These are users who occasionally access the database but may require different information each time. They use a sophisticated database query language to specify their requests and are typically middle- or high-level managers or other occasional browsers. They learn only a few of the many facilities provided by the DBMS, which they use repeatedly.
2. Naive or Parametric End Users –
These users make up a sizeable portion of database end users. Their main job function revolves around constantly querying and updating the database, using standard types of queries known as canned transactions that have been programmed and tested. They need to learn very little about the facilities provided by the DBMS; they mainly have to understand the user interfaces of the standard transactions designed and implemented for their use.
The following tasks are typically performed by naive end users:
1. Bank tellers check account balances and post withdrawals and deposits.
2. Reservation clerks for airlines, railways, hotels and car rental companies check availability for a given request and make reservations.
3. Clerks working at the receiving end of shipping companies enter package identifications via barcodes and descriptive information through buttons, to update a central database of received and in-transit packages.

3. Sophisticated End Users –
These users include engineers, scientists, business analysts, and others who thoroughly familiarize themselves with the facilities of the DBMS in order to implement applications that meet their complex requirements. They try to learn most of the DBMS facilities in order to achieve this.
4. Standalone Users –
These users maintain personal databases using ready-made program packages that provide easy-to-use menu-based or graphics-based interfaces. An example is the user of a tax package that stores a variety of personal financial data for tax purposes. These users become very proficient in using a specific software package.

Economic Factors (Choice of DBMS)

The choice of a DBMS is governed by a number of factors, such as:
1. Technical
2. Economic
3. Politics of the organization
We will concentrate on discussing the economic and organizational factors that affect the choice of DBMS. The following costs are considered while choosing a DBMS:
o Software acquisition cost –
This is the up-front cost of buying the software, including language options and different types of interfaces. The correct DBMS version for a specific OS must be selected. Development tools, design tools, and additional language support are usually not included in the basic pricing.

o Maintenance cost –
This is the recurring cost of receiving standard maintenance service from the vendor and keeping the DBMS version up to date.

o Hardware acquisition cost –
We may need hardware components such as memory, disk drives, controllers and archival storage; the cost of these also needs to be considered while choosing a DBMS.

o Personnel cost –
Acquisition of DBMS software for the first time by an organization is often accompanied by a reorganization of the data processing department. Positions of DBA and staff exist in most companies that have adopted a DBMS.

o Training cost –
DBMSs are often complex systems, so personnel must often be trained to use and program for the DBMS. Training is required at all levels, including programming, application development and database administration.

o Operating cost –
Cost of operating database also needs to be considered while choosing DBMS.
Introduction to the ER Model
The ER Model is used to model the logical view of the system from the data perspective, and consists of the following components:
Entity, Entity Type, Entity Set –
An Entity may be an object with a physical existence – a particular person, car, house, or employee – or it may be an object with a conceptual existence – a company, a job, or a university course.
An Entity is an object of an Entity Type, and the set of all entities of a type is called an entity set. e.g.; E1 is an entity having Entity Type Student, and the set of all students is called an Entity Set. In an ER diagram, an Entity Type is represented by a rectangle.
Attribute(s):
Attributes are the properties which define the entity type. For example, Roll_No, Name, DOB, Age, Address and Mobile_No are the attributes which define the entity type Student. In an ER diagram, an attribute is represented by an oval.

1. Key Attribute –
The attribute which uniquely identifies each entity in the entity set is called the key attribute. For example, Roll_No will be unique for each student. In an ER diagram, a key attribute is represented by an oval with its name underlined.

2. Composite Attribute –
An attribute composed of several other attributes is called a composite attribute. For example, the Address attribute of the Student entity type consists of Street, City, State and Country. In an ER diagram, a composite attribute is represented by an oval comprising other ovals.

3. Multivalued Attribute –
An attribute that can take more than one value for a given entity is called a multivalued attribute. For example, Phone_No (there can be more than one for a given student). In an ER diagram, a multivalued attribute is represented by a double oval.

4. Derived Attribute –
An attribute which can be derived from other attributes of the entity type is known as a derived attribute. e.g.; Age (can be derived from DOB). In an ER diagram, a derived attribute is represented by a dashed oval.

The complete entity type Student can then be drawn with all of its attributes.
Relationship Type and Relationship Set:
A relationship type represents the association between entity types. For example, 'Enrolled in' is a relationship type that exists between the entity types Student and Course. In an ER diagram, a relationship type is represented by a diamond connected to the participating entities with lines.

A set of relationships of the same type is known as a relationship set. The following relationship set depicts that S1 is enrolled in C2, S2 is enrolled in C1 and S3 is enrolled in C3.

Degree of a relationship set:

The number of different entity sets participating in a relationship set is called the degree of the relationship set.
1. Unary Relationship –
When there is only ONE entity set participating in a relation, the relationship is called a unary relationship. For example, one person is married to only one person.
2. Binary Relationship –
When there are TWO entity sets participating in a relation, the relationship is called a binary relationship. For example, Student is enrolled in Course.
3. n-ary Relationship –
When there are n entity sets participating in a relation, the relationship is called an n-ary relationship.
Cardinality:
The number of times an entity of an entity set participates in a relationship set is
known as cardinality. Cardinality can be of different types:
1. One to one – When each entity in each entity set can take part only once in the relationship, the cardinality is one to one. Let us assume that a male can marry one female and a female can marry one male. So the relationship will be one to one.

Using Sets, it can be represented as:

2. Many to one – When entities in one entity set can take part only once in the relationship set and entities in the other entity set can take part more than once, the cardinality is many to one. Let us assume that a student can take only one course but one course can be taken by many students. So the cardinality will be n to 1: for one course there can be n students, but for one student there will be only one course.
Using Sets, it can be represented as:

In this case, each student is taking only 1 course but 1 course has been taken by many
students.
3. Many to many – When entities in all entity sets can take part more than once in
the relationship cardinality is many to many. Let us assume that a student can take
more than one course and one course can be taken by many students. So the
relationship will be many to many.

Using sets, it can be represented as:

In this example, student S1 is enrolled in C1 and C3, and course C3 is enrolled in by S1, S3 and S4. So it is a many-to-many relationship.
Participation Constraint:
Participation Constraint is applied on the entity participating in the relationship set.
1. Total Participation – Each entity in the entity set must participate in the
relationship. If each student must enroll in a course, the participation of student will be
total. Total participation is shown by double line in ER diagram.
2. Partial Participation – The entity in the entity set may or may NOT participate in the relationship. If some courses are not enrolled in by any student, the participation of Course will be partial.
The diagram depicts the ‘Enrolled in’ relationship set with Student Entity set having
total participation and Course Entity set having partial participation.

Using set, it can be represented as,

Every student in Student Entity set is participating in relationship but there exists a
course C4 which is not taking part in the relationship.
Weak Entity Type and Identifying Relationship:
As discussed before, an entity type has a key attribute which uniquely identifies each entity in the entity set. But there exist some entity types for which a key attribute can't be defined. These are called weak entity types.
For example, a company may store the information of dependents (parents, children, spouse) of an employee. But the dependents can't exist without the employee. So Dependent will be a weak entity type and Employee will be the identifying entity type for Dependent.
A weak entity type is represented by a double rectangle. The participation of weak
entity type is always total. The relationship between weak entity type and its
identifying strong entity type is called identifying relationship and it is represented by
double diamond.

Generalization, Specialization and Aggregation in ER Model

Generalization, Specialization and Aggregation in the ER model are used for data abstraction, in which the abstraction mechanism is used to hide details of a set of objects.
Generalization –
Generalization is the process of extracting common properties from a set of entities and creating a generalized entity from them. It is a bottom-up approach in which two or more entities can be generalized to a higher-level entity if they have some attributes in common. For example, STUDENT and FACULTY can be generalized to a higher-level entity called PERSON, as shown in Figure 1. In this case, common attributes like P_NAME and P_ADD become part of the higher entity (PERSON), and specialized attributes like S_FEE become part of the specialized entity (STUDENT).

Specialization –
In specialization, an entity is divided into sub-entities based on its characteristics. It is a top-down approach where a higher-level entity is specialized into two or more lower-level entities. For example, the EMPLOYEE entity in an employee management system can be specialized into DEVELOPER, TESTER, etc., as shown in Figure 2. In this case, common attributes like E_NAME and E_SAL become part of the higher entity (EMPLOYEE), and specialized attributes like TES_TYPE become part of the specialized entity (TESTER).

Aggregation –
An ER diagram is not capable of representing a relationship between an entity and a relationship, which may be required in some scenarios. In those cases, a relationship with its corresponding entities is aggregated into a higher-level entity. Aggregation is an abstraction through which we can represent relationships as higher-level entity sets.
For example, an employee working on a project may require some machinery. So, a REQUIRE relationship is needed between the relationship WORKS_FOR and the entity MACHINERY. Using aggregation, the WORKS_FOR relationship with its entities EMPLOYEE and PROJECT is aggregated into a single entity, and the relationship REQUIRE is created between the aggregated entity and MACHINERY.
Representing aggregation via schema –
To represent aggregation, create a schema containing:
1. the primary key of the aggregated relationship,
2. the primary key of the associated entity set, and
3. descriptive attributes, if any.
A relational sketch of such a schema follows.
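A minimal sketch of the aggregation example above in SQL; the key columns (E_Id, P_Id, M_Id) and their types are assumptions:

CREATE TABLE WORKS_FOR
(E_Id NUMBER,                      -- key of EMPLOYEE
P_Id NUMBER,                       -- key of PROJECT
PRIMARY KEY (E_Id, P_Id));

CREATE TABLE REQUIRE
(E_Id NUMBER,                      -- primary key of the aggregated WORKS_FOR relationship
P_Id NUMBER,
M_Id NUMBER,                       -- primary key of the associated MACHINERY entity
PRIMARY KEY (E_Id, P_Id, M_Id));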

Relational Model
Relational Model: Relational model represents data in the form of relations or
tables.
Relational Schema: Schema represents structure of a relation. e.g.; Relational
Schema of STUDENT relation can be represented as:
STUDENT (STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE,
STUD_COUNTRY, STUD_AGE)
Relational Instance: The set of values present in a relation at a particular instance
of time is known as relational instance as shown in Table 1 and Table 2.
Attribute: Each relation is defined in terms of some properties, each of which is known as an attribute. For example, STUD_NO, STUD_NAME, etc. are attributes of the relation STUDENT.
Domain of an attribute: The set of possible values an attribute can take in a relation is called its domain. For example, the domain of STUD_AGE can be from 18 to 40.
Tuple: Each row of a relation is known as a tuple. e.g.; the STUDENT relation given below has 4 tuples.
NULL values: Values of some attribute for some tuples may be unknown, missing
or undefined which are represented by NULL. Two NULL values in a relation are
considered different from each other.
Table 1 and Table 2 represent relational model having two relations STUDENT and
STUDENT_COURSE.
Relational Model in DBMS
The Relational Model was proposed by E.F. Codd to model data in the form of relations or tables. After designing the conceptual model of the database using an ER diagram, we need to convert the conceptual model into the relational model, which can be implemented using any RDBMS language like Oracle SQL, MySQL, etc. So we will see what the Relational Model is.
What is Relational Model?
Relational Model represents how data is stored in Relational Databases. A relational
database stores data in the form of relations (tables). Consider a relation STUDENT
with attributes ROLL_NO, NAME, ADDRESS, PHONE and AGE shown in Table 1.

IMPORTANT TERMINOLOGIES
• Attribute: Attributes are the properties that define a relation. e.g.; ROLL_NO, NAME
• Relation Schema: A relation schema represents the name of the relation along with its attributes. e.g.; STUDENT (ROLL_NO, NAME, ADDRESS, PHONE, AGE) is the relation schema for STUDENT. If a schema has more than 1 relation, it is called a relational schema.
• Tuple: Each row in the relation is known as a tuple. The above relation contains 4 tuples.

• Relation Instance: The set of tuples of a relation at a particular instance of time is called a relation instance. Table 1 shows the relation instance of STUDENT at a particular time. It can change whenever there is insertion, deletion or updation in the database.
• Degree: The number of attributes in the relation is known as degree
of the relation. The STUDENT relation defined above has degree 5.
• Cardinality: The number of tuples in a relation is known as
cardinality. The STUDENT relation defined above has cardinality 4.
• Column: Column represents the set of values for a particular
attribute. The column ROLL_NO is extracted from relation STUDENT.
• NULL Values: The value which is not known or unavailable is called NULL value. It
is represented by blank space. e.g.; PHONE of STUDENT having ROLL_NO 4 is NULL.
Constraints in Relational Model
While designing the relational model, we define some conditions which must hold for the data present in the database; these are called constraints. These constraints are checked before performing any operation (insertion, deletion or updation) on the database. If any constraint is violated, the operation fails.
Domain Constraints: These are attribute-level constraints. An attribute can only take values which lie inside its domain range. e.g., if a constraint AGE > 0 is applied on the STUDENT relation, inserting a negative value of AGE will result in failure.
Key Integrity: Every relation in the database should have at least one set of attributes which defines a tuple uniquely. Such a set of attributes is called a key. e.g., ROLL_NO in STUDENT is a key. No two students can have the same roll number. So, a key has two properties:
• It should be unique for all tuples.
• It can't have NULL values.
Referential Integrity: When one attribute of a relation can only take values from another attribute of the same relation or of another relation, this is called referential integrity. Let us suppose we have 2 relations:

BRANCH_CODE of STUDENT can only take values which are present in BRANCH_CODE of BRANCH; this is a referential integrity constraint. The relation which references the other relation is called the REFERENCING RELATION (STUDENT in this case), and the relation to which other relations refer is called the REFERENCED RELATION (BRANCH in this case).
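A sketch of how these three constraint types might be declared in SQL, assuming simplified BRANCH and STUDENT tables:

CREATE TABLE BRANCH
(BRANCH_CODE VARCHAR2(3) PRIMARY KEY,            -- key integrity
BRANCH_NAME VARCHAR2(30));

CREATE TABLE STUDENT
(ROLL_NO NUMBER PRIMARY KEY,
NAME VARCHAR2(30),
AGE NUMBER CHECK (AGE > 0),                      -- domain constraint
BRANCH_CODE VARCHAR2(3)
  REFERENCES BRANCH(BRANCH_CODE));               -- referential integrity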

Relational Database Management System (RDBMS) –

RDBMS is the basis for SQL and for all modern database systems like MS SQL Server, IBM DB2, Oracle, MySQL, Amazon Redshift and Microsoft Access. A relational database management system (RDBMS) is a database management system (DBMS) that is based on the relational model as introduced by E. F. Codd. An RDBMS is a type of DBMS with a row-based table structure that connects related data elements and includes functions that maintain the security, accuracy, integrity and consistency of the data. The most basic RDBMS functions are the create, read, update and delete operations. The key difference, compared to a DBMS, is that an RDBMS stores data in the form of a collection of tables, and relations can be defined between the common fields of these tables.
ANOMALIES
An anomaly is an irregularity, or something which deviates from the expected or
normal state. When designing databases, we identify three types of
anomalies: Insert, Update and Delete.
Insertion Anomaly in Referencing Relation:
We can't insert a row into the REFERENCING RELATION if the referencing attribute's value is not present among the referenced attribute's values. e.g., insertion of a student with BRANCH_CODE 'ME' in the STUDENT relation will result in an error, because 'ME' is not present in BRANCH_CODE of BRANCH.
Deletion/Updation Anomaly in Referenced Relation:
We can't delete or update a row of the REFERENCED RELATION if the value of the REFERENCED ATTRIBUTE is used in a value of the REFERENCING ATTRIBUTE. e.g., if we try to delete the tuple from BRANCH having BRANCH_CODE 'CS', it will result in an error because 'CS' is referenced by BRANCH_CODE of STUDENT; but if we try to delete the row from BRANCH with BRANCH_CODE 'CV', it will be deleted, as that value is not being used by the referencing relation. This can be handled by the following methods:
ON DELETE CASCADE: It will delete the tuples from the REFERENCING RELATION if the value used by the REFERENCING ATTRIBUTE is deleted from the REFERENCED RELATION. e.g., if we delete the row from BRANCH with BRANCH_CODE 'CS', the rows in the STUDENT relation with BRANCH_CODE 'CS' (ROLL_NO 1 and 2 in this case) will be deleted.

ON UPDATE CASCADE: It will update the REFERENCING ATTRIBUTE in the REFERENCING RELATION if the attribute value used by the REFERENCING ATTRIBUTE is updated in the REFERENCED RELATION. e.g., if we update the row of BRANCH with BRANCH_CODE 'CS' to 'CSE', the rows in the STUDENT relation with BRANCH_CODE 'CS' (ROLL_NO 1 and 2 in this case) will be updated with BRANCH_CODE 'CSE'.
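A sketch of how these cascade rules can be declared; this form follows MySQL (Oracle supports ON DELETE CASCADE but not ON UPDATE CASCADE), and the column types are assumptions:

CREATE TABLE STUDENT
(ROLL_NO INT PRIMARY KEY,
NAME VARCHAR(30),
BRANCH_CODE VARCHAR(3),
FOREIGN KEY (BRANCH_CODE) REFERENCES BRANCH(BRANCH_CODE)
  ON DELETE CASCADE      -- deleting 'CS' from BRANCH deletes the matching STUDENT rows
  ON UPDATE CASCADE);    -- updating 'CS' to 'CSE' propagates to STUDENT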

SUPER KEYS:
Any set of attributes that allows us to identify unique rows (tuples) in a given relation is known as a super key. Among these super keys, the minimal ones (those from which no attribute can be removed without losing uniqueness) are known as candidate keys, and one of them is chosen as the primary key. If a combination of two or more attributes is used as the primary key, we call it a composite key.
Types of Keys in the Relational Model

Candidate Key: The minimal set of attributes that can uniquely identify a tuple is
known as a candidate key. For Example, STUD_NO in STUDENT relation.
• The value of the Candidate Key is unique and non-null for every tuple.
• There can be more than one candidate key in a relation. For Example, STUD_NO
is the candidate key for relation STUDENT.
• The candidate key can be simple (having only one attribute) or composite as well.
For Example, {STUD_NO, COURSE_NO} is a composite candidate key for
relation STUDENT_COURSE.
• The maximum possible number of candidate keys in a relation with n attributes is C(n, floor(n/2)). For example, if a relation has 5 attributes, i.e., R(A, B, C, D, E), then the maximum number of candidate keys is C(5, 2) = 10.
Note – In SQL Server, a unique constraint on a nullable column allows the value NULL in that column only once. That is why STUD_PHONE can be a candidate key here, but NULL values are never allowed in a primary key attribute.

Super Key: The set of attributes that can uniquely identify a tuple is known as
Super Key. For Example, STUD_NO, (STUD_NO, STUD_NAME), etc.
• Adding zero or more attributes to the candidate key generates the super key.
• A candidate key is a super key but vice versa is not true.

Primary Key: There can be more than one candidate key in relation out of which
one can be chosen as the primary key. For Example, STUD_NO, as well as
STUD_PHONE both, are candidate keys for relation STUDENT but STUD_NO can
be chosen as the primary key (only one out of many candidate keys).

Alternate Key: The candidate key other than the primary key is called an alternate
key. For Example, STUD_NO, as well as STUD_PHONE both, are candidate keys
for relation STUDENT but STUD_PHONE will be an alternate key (only one out of
many candidate keys).
Foreign Key: If an attribute can only take the values which are present as values of some other attribute, it is a foreign key to the attribute to which it refers. The relation being referenced is called the referenced relation and the corresponding attribute the referenced attribute; the relation which refers to the referenced relation is called the referencing relation and the corresponding attribute the referencing attribute. The referenced attribute of the referenced relation should be its primary key. For example, STUD_NO in STUDENT_COURSE is a foreign key referring to STUD_NO in the STUDENT relation.
It may be worth noting that, unlike the primary key of a relation, a foreign key can be NULL and may contain duplicate values, i.e., it need not follow the uniqueness constraint.
For example, STUD_NO in the STUDENT_COURSE relation is not unique: it is repeated in the first and third tuples. However, STUD_NO in the STUDENT relation is a primary key, so it must always be unique and cannot be NULL.
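A sketch tying these key types together, assuming simplified STUDENT and STUDENT_COURSE tables:

CREATE TABLE STUDENT
(STUD_NO NUMBER PRIMARY KEY,        -- candidate key chosen as the primary key
STUD_NAME VARCHAR2(30),
STUD_PHONE VARCHAR2(10) UNIQUE);    -- the other candidate key becomes an alternate key

CREATE TABLE STUDENT_COURSE
(STUD_NO NUMBER REFERENCES STUDENT(STUD_NO),   -- foreign key: values may repeat across rows
COURSE_NO NUMBER,
PRIMARY KEY (STUD_NO, COURSE_NO));             -- composite candidate key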

Mapping from ER Model to Relational Model

Case 1: Binary relationship with 1:1 cardinality and total participation of an entity
A person has 0 or 1 passport number, and a passport is always owned by 1 person. So it is 1:1 cardinality with a total participation constraint from Passport.

Case 2: Binary relationship with 1:1 cardinality and partial participation of both entities
A male marries 0 or 1 female, and vice versa. So it is 1:1 cardinality with partial participation constraints from both.

Case 3: Binary relationship with n:1 cardinality
In this scenario, every student can enrol in only one elective course, but for an elective course there can be more than one student.
Case 4: Binary relationship with m:n cardinality
In this scenario, every student can enrol in more than one compulsory course, and for a compulsory course there can be more than one student.

Case 5: Binary relationship with a weak entity
In this scenario, an employee can have many dependents and one dependent can depend on one employee. A dependent does not have any existence without an employee (e.g., you as a child can be a dependent of your father in his company). So it will be a weak entity and its participation will always be total. A weak entity does not have a key of its own, so its key will be the combination of the key of its identifying entity (E-Id of Employee in this case) and its partial key (D-Name). A relational sketch of this case follows.
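A minimal sketch of the resulting tables for Case 5; the column types are assumptions:

CREATE TABLE EMPLOYEE
(E_Id NUMBER PRIMARY KEY,
E_Name VARCHAR2(30));

CREATE TABLE DEPENDENT
(E_Id NUMBER REFERENCES EMPLOYEE(E_Id),   -- key of the identifying entity
D_Name VARCHAR2(30),                      -- partial key of the weak entity
D_Age NUMBER,
PRIMARY KEY (E_Id, D_Name));              -- combined key of the weak entity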

Strategies for Schema Design in DBMS

1. Top-down strategy –
In this strategy, we start with a schema containing a high level of abstraction and then apply successive top-down refinement. For example, we may specify only a few high-level entity types and then, as we specify their attributes, split them into lower-level entity types and relationships. The process of specialization, refining an entity type into subclasses, is also an example of this strategy.

2. Bottom-up strategy –
In this type of strategy, we basically start with basic abstraction and then go on
adding to this abstraction. For example, we may start with attributes and group
these into entity types and relationships. We can also add a new relationship
among entity types as the design goes ahead. The basic example is the process
of generalizing entity types into the higher-level generalized superclass.

3. Inside-Out Strategy –
This is a special case of a bottom-up strategy when attention is basically
focused on a central set of concepts that are most evident. Modeling then
basically spreads outward by considering new concepts in the vicinity of existing
ones. We could specify a few clearly evident entity types in the schema and
continue by adding other entity types and relationships that are related to each
other.

4. Mixed strategy –
Instead of following any particular strategy throughout the design, the requirements are partitioned according to a top-down strategy, part of the schema is designed for each partition according to a bottom-up strategy, and then the various schemas are combined.
Schema Integration in DBMS
Schema integration is divided into the following subtasks:
1. Identifying correspondences and conflicts among the schemas:
As the schemas are designed individually, it is necessary to identify constructs in the schemas that represent the same real-world concept. We must identify these correspondences before proceeding with the integration. During this process, several types of conflicts may occur, such as:

1. Naming conflicts –
Naming conflicts are of two types: synonyms and homonyms. A synonym occurs when two schemas use different names to describe the same concept; for example, an entity type CUSTOMER in one schema may describe the same concept as an entity type CLIENT in another schema. A homonym occurs when two schemas use the same name to describe different concepts; for example, an entity type CLASSES may represent train classes in one schema and aeroplane classes in another schema.

2. Type conflicts –
A similar concept may be represented in two schemas by different
modeling constructs. For example, DEPARTMENT may be an entity type
in one schema and an attribute in another.

3. Domain conflicts –
A single attribute may have different domains in different schemas. For
example, we may declare Ssn as an integer in one schema and a
character string in another. A conflict of the unit of measure could occur if
one schema represented weight in pounds and the other used kgs.

4. Conflicts among constraints –
Two schemas may impose different constraints; for example, the KEY of an entity type may be different in each schema.

2. Modifying views to conform to one another:
Some schemas are modified so that they conform to other schemas more closely. Some of the conflicts identified in the first subtask are resolved in this step.
3. Merging of views and restructuring:
The global schema is created by merging the individual schemas. Corresponding concepts are represented only once in the global schema, and mappings between the views and the global schema are specified. This is the hardest step to achieve in real-world databases, which involve hundreds of entities and relations. It requires a considerable amount of human intervention and negotiation to resolve conflicts and to settle on the most reasonable and acceptable solution for the global schema.
Restructuring – As a final optional step, the global schema may be analyzed and restructured to remove any redundancies or unnecessary complexity.

Star Schema in Data Warehouse Modeling

The star schema is the simplest and most fundamental of the data mart schemas. It is widely used to develop or build data warehouses and dimensional data marts. It includes one or more fact tables referencing any number of dimension tables. The star schema is the basis of the snowflake schema. It is also efficient for handling basic queries.
It is said to be a star because its physical model resembles a star shape, with a fact table at its center and the dimension tables at its periphery representing the star's points.
Sales price, sale quantity, distance, speed, and weight measurements are a few examples of fact data in a star schema.
In this demonstration, SALES is a fact table having attributes (Product ID, Order ID, Customer ID, Employer ID, Total, Quantity, Discount) which reference the dimension tables. The Employee dimension table contains the attributes Emp ID, Emp Name, Title, Department and Region. The Product dimension table contains the attributes Product ID, Product Name, Product Category and Unit Price. The Customer dimension table contains the attributes Customer ID, Customer Name, Address, City and Zip. The Time dimension table contains the attributes Order ID, Order Date, Year, Quarter and Month.
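A sketch of a typical star-schema query against these tables; the exact table and column spellings are assumptions based on the description above:

SELECT p.product_category, t.quarter, SUM(s.quantity) AS total_qty
FROM sales s
JOIN product p ON s.product_id = p.product_id    -- one join per dimension table
JOIN time_dim t ON s.order_id = t.order_id
GROUP BY p.product_category, t.quarter;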
Advantages of Star Schema :
1. Simpler Queries –
The join logic of a star schema is quite simple in comparison to the join logic needed to fetch data from a highly normalized transactional schema.
2. Simplified Business Reporting Logic –
In comparison to a transactional schema that is highly normalized, the star
schema makes simpler common business reporting logic, such as as-of
reporting and period-over-period.
3. Feeding Cubes –
Star schema is widely used by all OLAP systems to design OLAP cubes
efficiently. In fact, major OLAP systems deliver a ROLAP mode of operation
which can use a star schema as a source without designing a cube structure.
Disadvantages of Star Schema –
1. Data integrity is not enforced well, since the schema is in a highly de-normalized state.
2. Not as flexible in terms of analytical needs as a normalized data model.
3. Star schemas don't support many-to-many relationships between business entities – at least not naturally.

Snowflake Schema in Data Warehouse Model

The snowflake schema is a variant of the star schema. Here, the centralized fact table is connected to multiple dimensions. In the snowflake schema, dimensions are present in normalized form in multiple related tables. The snowflake structure materializes when the dimensions of a star schema are detailed and highly structured, having several levels of relationship, and the child tables have multiple parent tables. The snowflake effect affects only the dimension tables; it does not affect the fact tables.

The Employee dimension table now contains the attributes EmployeeID, EmployeeName, DepartmentID, Region and Territory. The DepartmentID attribute links the Employee dimension table with the Department dimension table.
The Department dimension is used to provide detail about each department, such
as the Name and Location of the department. The Customer dimension table now
contains the attributes: CustomerID, CustomerName, Address, CityID. The CityID
attributes link the Customer dimension table with the City dimension table.
The City dimension table has details about each city such as CityName, Zipcode,
State, and Country.
The main difference between star schema and snowflake schema is that the
dimension table of the snowflake schema is maintained in the normalized form to
reduce redundancy. The advantage here is that such tables (normalized) are easy
to maintain and save storage space. However, it also means that more joins will be
needed to execute the query. This will adversely impact system performance.
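A sketch of the extra join a snowflaked dimension introduces, using the Customer and City tables described above (column spellings are assumptions):

SELECT ci.country, COUNT(*) AS num_customers
FROM customer cu
JOIN city ci ON cu.cityid = ci.cityid   -- extra join: city details were split out of Customer
GROUP BY ci.country;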
What is snowflaking?
The snowflake design is the result of further expansion and normalization of the dimension tables. In other words, a dimension table is said to be snowflaked if its low-cardinality attributes have been moved into separate normalized tables. These tables are then joined to the original dimension table with referential constraints (foreign key constraints). Generally, snowflaking is not recommended for dimension tables, as it hampers the understandability and performance of the dimensional model, since more tables must be joined to satisfy queries.
Characteristics of the snowflake schema: A dimensional model is snowflaked under the following conditions:
• The snowflake schema uses little disk space.
• It is easy to implement new dimensions added to the schema.
• There are multiple tables, so performance is reduced.
• The dimension table consists of two or more sets of attributes that define information at different grains.
• The sets of attributes of the same dimension table are populated by different source systems.
Advantages: There are two main advantages of the snowflake schema, given below:
• It provides structured data, which reduces the problem of data integrity.
• It uses little disk space because the data is highly structured.
Disadvantages:
• Snowflaking reduces the space consumed by dimension tables, but compared with the entire data warehouse the saving is usually insignificant.
• Avoid snowflaking or normalization of a dimension table unless it is required and appropriate.
• Do not snowflake hierarchies of a dimension table into separate tables; hierarchies should belong to the dimension table only and should never be snowflaked.
• Multiple hierarchies that belong to the same dimension should be designed at the lowest possible detail.

Relational Algebra
Relational algebra is a procedural query language which takes relations as input and generates relations as output. Relational algebra mainly provides the theoretical foundation for relational databases and SQL.
Operators in Relational Algebra
Projection (π)
Projection is used to project required column data from a relation. By default, projection removes duplicate data.
Selection (σ)
Selection is used to select required tuples of a relation. For a relation R, σ(c>3)(R) will select the tuples which have a value of c greater than 3.
Note: the selection operator only selects the required tuples but does not display them. For displaying, the projection operator is used.

Union (U)
Union operation in relational algebra is same as union operation in set theory, only
constraint is for union of two relation both relations must have same set of Attributes.

Set Difference (-)


Set Difference in relational algebra is same set difference operation as in set theory
with the constraint that both relations should have same set of attributes.

Rename (ρ)
Rename is a unary operation used for renaming attributes of a relation.
ρ(a/b)(R) renames the attribute ‘b’ of relation R to ‘a’.

Cross Product (X)


The cross product between two relations, say A and B, written A X B, results in all
the attributes of A followed by all the attributes of B. Each record of A pairs with
every record of B.
Note: if A has ‘n’ tuples and B has ‘m’ tuples, then A X B will have ‘n*m’ tuples.
Natural Join (⋈)
Natural join is a binary operator. A natural join between two or more relations results
in the set of all combinations of tuples in which the common attributes have equal values.
Conditional Join
Conditional join works similarly to natural join. In natural join, the default condition
is equality between the common attributes, while in conditional join we can specify any
condition, such as greater than, less than, or not equal.
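
These operators are easy to model with Python sets of tuples. The sketch below is an
illustrative model only (the relation contents and names are made up), not how a real
DBMS implements them:

def select(rel, pred):
    # sigma: keep the tuples satisfying pred
    schema, rows = rel
    return (schema, {r for r in rows if pred(dict(zip(schema, r)))})

def project(rel, cols):
    # pi: keep only the named columns; duplicates vanish because rows form a set
    schema, rows = rel
    idx = [schema.index(c) for c in cols]
    return (cols, {tuple(r[i] for i in idx) for r in rows})

def union(r1, r2):
    # U: both relations must have the same set of attributes
    assert r1[0] == r2[0]
    return (r1[0], r1[1] | r2[1])

def difference(r1, r2):
    # -: tuples of r1 that are not in r2
    assert r1[0] == r2[0]
    return (r1[0], r1[1] - r2[1])

def cross(r1, r2):
    # X: every record of r1 paired with every record of r2 (n*m tuples)
    return (r1[0] + r2[0], {a + b for a in r1[1] for b in r2[1]})

# pi(a)(sigma(c>3)(R)) on a sample relation R(a, b, c):
R = (("a", "b", "c"), {(1, 2, 4), (2, 2, 3), (3, 1, 5)})
print(project(select(R, lambda t: t["c"] > 3), ("a",)))  # (('a',), {(1,), (3,)})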

Basic Operators in Relational Algebra


Selection operator (σ): The selection operator is used to select tuples from a relation
based on some condition. Syntax: σ(condition)(Relation)
To extract students whose age is greater than 18 from the STUDENT relation given in
Table 3: σ(AGE>18)(STUDENT)

Projection Operator (∏): The projection operator is used to project particular columns
from a relation. Syntax: ∏(column1, column2, …)(Relation)
To extract ROLL_NO and NAME from the STUDENT relation given in Table 3:
∏(ROLL_NO, NAME)(STUDENT)

Note: If resultant relation after projection has duplicate rows, it will be removed.
For Example: ∏(ADDRESS)(STUDENT) will remove one duplicate row with value DELHI
and return three rows.
Cross Product (X): Cross product is used to join two relations. For every row of
Relation1, each row of Relation2 is concatenated. If Relation1 has m tuples and
Relation2 has n tuples, the cross product of Relation1 and Relation2 will have m X n
tuples. Syntax: Relation1 X Relation2
To apply cross product on the STUDENT relation given in Table 1 and the
STUDENT_SPORTS relation given in Table 2: STUDENT X STUDENT_SPORTS

Union (U): Union on two relations R1 and R2 can only be computed if R1 and R2 are
union compatible (the two relations must have the same number of attributes, and
corresponding attributes in the two relations must have the same domain). The union
operator, when applied on two relations R1 and R2, gives a relation with the tuples
which are either in R1 or in R2. Tuples which are in both R1 and R2 appear only once
in the result relation. Syntax:
Relation1 U Relation2
To find persons who are either students or employees, we can use the union operator
like:
STUDENT U EMPLOYEE

Minus (-): Minus on two relations R1 and R2 can only be computed if R1 and R2
are union compatible. The minus operator, when applied as R1 - R2, gives a relation
with the tuples which are in R1 but not in R2. Syntax:
Relation1 - Relation2
To find persons who are students but not employees, we can use the minus operator
like:
STUDENT - EMPLOYEE

Rename(ρ): Rename operator is used to give another name to a relation. Syntax:


ρ(Relation2, Relation1)
To rename STUDENT relation to STUDENT1, we can use rename operator like:
ρ(STUDENT1, STUDENT)
If you want to create a relation STUDENT_NAMES with ROLL_NO and NAME
from STUDENT, it can be done using rename operator as:
ρ(STUDENT_NAMES, ∏(ROLL_NO, NAME)(STUDENT))

Inner Join vs Outer Join


What is Join?
An SQL Join is used to combine data from two or more tables, based on a common
field between them. For example, consider two tables, Student and StudentCourse
(contents not reproduced here).

Following is a join query that shows the names of students enrolled in different
courseIDs.
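
The query itself appears as an image in the source; a representative reconstruction,
assuming the two tables share a ROLL_NO column and StudentCourse has a COURSE_ID
column, is:

SELECT StudentCourse.COURSE_ID, Student.NAME
FROM Student
INNER JOIN StudentCourse
ON Student.ROLL_NO = StudentCourse.ROLL_NO;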

Note: INNER is optional above. A simple JOIN is also considered an INNER JOIN.

What is the difference between inner join and outer join?
An inner join returns only the rows that have matching values in both tables, while an
outer join also returns the rows for which there is no match.
Outer Join is of 3 types
1) Left outer join
2) Right outer join
3) Full Join
1) Left outer join returns all rows of the table on the left side of the join. For the
rows for which there is no matching row on the right side, the result contains NULL on
the right side.

Note: OUTER is optional above. A simple LEFT JOIN is also considered a LEFT
OUTER JOIN.
2) Right Outer Join is similar to Left Outer Join (Right replaces Left everywhere)
3) Full Outer Join Contains results of both Left and Right outer joins.

Tuple Relational Calculus (TRC) in DBMS


Tuple Relational Calculus is a non-procedural query language, unlike relational
algebra. Tuple calculus provides only a description of the query; it does not provide
the method to solve it. Thus, it explains what to do but not how to do it.
In Tuple Calculus, a query is expressed as
{t| P(t)}
where t = resulting tuples,
P(t) = known as Predicate and these are the conditions that are used to fetch t
Thus, it generates set of all tuples t, such that Predicate P(t) is true for t.
P(t) may have various conditions logically combined with OR (∨), AND (∧), NOT(¬).

It also uses quantifiers:


∃ t ∈ r (Q(t)) = “there exists” a tuple t in relation r such that predicate Q(t) is true.
∀ t ∈ r (Q(t)) = Q(t) is true “for all” tuples in relation r.
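For example, {t | t ∈ STUDENT ∧ t.AGE > 18} denotes the set of all tuples of the
STUDENT relation whose AGE attribute is greater than 18 (relation and attribute
names here are illustrative).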

Difference between Row oriented and Column oriented data stores in DBMS
A data store is basically a place for storing collections of data, such as a
database, a file system or a directory. In Database system they can be stored in
two ways. These are as follows:
1. Row Oriented Data Stores
2. Column-Oriented Data Stores
The best example of a row-oriented data store is a relational database, which is a
structured data storage with a sophisticated query engine. It incurs a big performance
penalty as the data size increases.
The best example of a column-oriented data store is the HBase database, which is
designed from the ground up to provide scalability and partitioning to enable efficient
data structure serialization, storage, and retrieval.
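
A toy sketch of the two layouts in Python (illustrative only; real systems add pages,
compression, and indexes):

# Row-oriented: each record is stored contiguously -- good for fetching whole rows.
row_store = [
    (1, "Alice", 30),
    (2, "Bob", 25),
]

# Column-oriented: each attribute is stored contiguously -- good for scanning one column.
column_store = {
    "id":   [1, 2],
    "name": ["Alice", "Bob"],
    "age":  [30, 25],
}

# Aggregating one column touches only that array in a column store:
avg_age = sum(column_store["age"]) / len(column_store["age"])
print(avg_age)  # 27.5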
Functional Dependency and Attribute Closure
Functional Dependency
A functional dependency A->B in a relation holds if two tuples having same value of
attribute A also have same value for attribute B. For Example, in relation STUDENT
shown in table 1, Functional Dependencies

STUD_NO->STUD_NAME and STUD_NO->STUD_PHONE hold, but
STUD_NAME->STUD_ADDR does not hold.

How to find functional dependencies for a relation?


Functional Dependencies in a relation are dependent on the domain of the relation.
Consider the STUDENT relation given in Table 1.

• We know that STUD_NO is unique for each student. So STUD_NO->STUD_NAME,
STUD_NO->STUD_PHONE, STUD_NO->STUD_STATE, STUD_NO->STUD_COUNTRY
and STUD_NO->STUD_AGE will all be true.
• Similarly, STUD_STATE->STUD_COUNTRY will be true as if two records have
same STUD_STATE, they will have same STUD_COUNTRY as well.
• For relation STUDENT_COURSE, COURSE_NO->COURSE_NAME will be true as
two records with same COURSE_NO will have same COURSE_NAME.

Functional Dependency Set: The functional dependency set or FD set of a relation is
the set of all FDs present in the relation. For example, the FD set for relation
STUDENT shown in table 1 is:

{ STUD_NO->STUD_NAME, STUD_NO->STUD_PHONE, STUD_NO->STUD_STATE,
STUD_NO->STUD_COUNTRY, STUD_NO->STUD_AGE, STUD_STATE->STUD_COUNTRY }
Attribute Closure: Attribute closure of an attribute set can be defined as set of
attributes which can be functionally determined from it.
How to find attribute closure of an attribute set?
To find attribute closure of an attribute set:
• Add elements of attribute set to the result set.
• Recursively add elements to the result set which can be functionally determined
from the elements of the result set.
Using FD set of table 1, attribute closure can be determined as:

(STUD_NO)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE, STUD_COUNTRY, STUD_AGE}
(STUD_STATE)+ = {STUD_STATE, STUD_COUNTRY}
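
This procedure is mechanical enough to sketch in a few lines of Python. The function
below is an illustrative implementation of the two steps above, with each FD written
as a (LHS, RHS) pair of attribute sets:

def closure(attrs, fds):
    # Attribute closure: start from attrs and keep applying FDs until a fixpoint.
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the whole LHS is already determined, the RHS is determined too.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# FD set of the STUDENT relation above:
FDS = [
    ({"STUD_NO"}, {"STUD_NAME", "STUD_PHONE", "STUD_STATE",
                   "STUD_COUNTRY", "STUD_AGE"}),
    ({"STUD_STATE"}, {"STUD_COUNTRY"}),
]
print(closure({"STUD_STATE"}, FDS))  # {'STUD_STATE', 'STUD_COUNTRY'}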

How to find Candidate Keys and Super Keys using Attribute Closure?
• If attribute closure of an attribute set contains all attributes of relation, the attribute
set will be super key of the relation.
• If no subset of this attribute set can functionally determine all attributes of the
relation, the set will be candidate key as well. For Example, using FD set of table 1,
(STUD_NO, STUD_NAME)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE,
STUD_COUNTRY, STUD_AGE}
(STUD_NO)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE, STUD_COUNTRY,
STUD_AGE}
(STUD_NO, STUD_NAME) will be super key but not candidate key because its subset
(STUD_NO)+ is equal to all attributes of the relation. So, STUD_NO will be a candidate
key.

How to check whether an FD can be derived from a given FD set?


To check whether an FD A->B can be derived from an FD set F:
1. Find (A)+ using FD set F.
2. If B is a subset of (A)+, then A->B holds; otherwise it does not.
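
With the closure function and FDS defined in the sketch above, this check is a
one-liner (illustrative):

def implies(fds, lhs, rhs):
    # True if lhs -> rhs is derivable, i.e., rhs lies inside the closure of lhs.
    return rhs <= closure(lhs, fds)

print(implies(FDS, {"STUD_NO"}, {"STUD_COUNTRY"}))  # True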

Prime and non-prime attributes


Attributes which are part of any candidate key of a relation are called prime
attributes; the others are non-prime attributes. For example, STUD_NO in the STUDENT
relation is a prime attribute, and the others are non-prime attributes.

Finding Attribute Closure and Candidate Keys using Functional Dependencies
What is Functional Dependency?
A functional dependency X->Y in a relation holds if two tuples having the same value
for X also have the same value for Y, i.e., X uniquely determines Y.
In EMPLOYEE relation given in Table 1,
• FD E-ID->E-NAME holds because for each E-ID, there is a unique value of E-NAME.
• FD E-ID->E-CITY and E-CITY->E-STATE also holds.
• FD E-NAME->E-ID does not hold because E-NAME ‘John’ is not uniquely
determining E-ID. There are 2 E-IDs corresponding to John (E001 and E003).
EMPLOYEE
E-ID E-NAME E-CITY E-STATE

E001 John Delhi Delhi

E002 Mary Delhi Delhi

E003 John Noida U.P.

Table 1
The FD set for EMPLOYEE relation given in Table 1 are:
{E-ID->E-NAME, E-ID->E-CITY, E-ID->E-STATE, E-CITY->E-STATE}
Trivial versus Non-Trivial Functional Dependency: A trivial functional dependency
is the one which will always hold in a relation.
X->Y will always hold if X ⊇ Y
In the example given above, E-ID, E-NAME->E-ID is a trivial functional dependency and
will always hold because {E-ID,E-NAME} ⊃ {E-ID}. You can also see from the table that
for each value of {E-ID, E-NAME}, value of E-ID is unique, so {E-ID, E-NAME} functionally
determines E-ID.
If a functional dependency is not trivial, it is called Non-Trivial Functional
Dependency. Non-Trivial functional dependency may or may not hold in a relation.
e.g; E-ID->E-NAME is a non-trivial functional dependency which holds in the above
relation.
Properties of Functional Dependencies
Let X, Y, and Z be sets of attributes in a relation R. There are several properties of
functional dependencies which always hold in R, also known as Armstrong's Axioms.
1. Reflexivity: If Y is a subset of X, then X → Y. e.g.; Let X represents {E-ID, E-NAME} and
Y represents {E-ID}. {E-ID, E-NAME}->E-ID is true for the relation.
2. Augmentation: If X → Y, then XZ → YZ. e.g.; Let X represents {E-ID}, Y represents {E-
NAME} and Z represents {E-CITY}. As {E-ID}->E-NAME is true for the relation, so { E-
ID,E-CITY}->{E-NAME,E-CITY} will also be true.
3. Transitivity: If X → Y and Y → Z, then X → Z. e.g.; Let X represents {E-ID}, Y
represents {E-CITY} and Z represents {E-STATE}. As {E-ID} ->{E-CITY} and {E-CITY}-
>{E-STATE} is true for the relation, so { E-ID }->{E-STATE} will also be true.
4. Attribute Closure: The set of attributes that are functionally dependent on the
attribute A is called Attribute Closure of A and it can be represented as A+.

Steps to Find the Attribute Closure of A


Q. Given the FD set of a relation R, the attribute closure set S of an attribute set A
is computed as:
1. Add A to S.
2. Recursively add attributes which can be functionally determined from attributes of
the set S until done.
From Table 1, FDs are
Given R(E-ID, E-NAME, E-CITY, E-STATE)
FDs = { E-ID->E-NAME, E-ID->E-CITY, E-ID->E-STATE, E-CITY->E-STATE }
The attribute closure of E-ID can be calculated as:
1. Add E-ID to the set {E-ID}
2. Add attributes which can be derived from any attribute of the set. In this case,
E-NAME, E-CITY and E-STATE can be derived from E-ID, so these are also part of the closure.
3. As there is no other attribute remaining in the relation to be derived from E-ID,
the result is:
(E-ID)+ = {E-ID, E-NAME, E-CITY, E-STATE }
Similarly,
(E-NAME)+ = {E-NAME}
(E-CITY)+ = {E-CITY, E-STATE}

Q. Find the attribute closures for R(A, B, C, D, E) with FDs {AB->C, B->D, C->E,
D->A}. To find (B)+, we add attributes to the set using the various FDs, as shown in
the table below.
Attributes Added in Closure FD used

{B} Triviality

{B,D} B->D

{B,D,A} D->A

{B,D,A,C} AB->C

{B,D,A,C,E} C->E

▪ We can find (C,D)+ by adding C and D into the set (triviality), then E using (C->E),
and then A using (D->A); the set becomes:
(C,D)+ = {C,D,E,A}
▪ Similarly, we can find (B,C)+ by adding B and C into the set (triviality), then D
using (B->D), then E using (C->E), and then A using (D->A); the set becomes:
(B,C)+ = {B,C,D,E,A}
Candidate Key
Candidate Key is minimal set of attributes of a relation which can be used to identify a
tuple uniquely. For Example, each tuple of EMPLOYEE relation given in Table 1 can be
uniquely identified by E-ID and it is minimal as well. So it will be Candidate key of the
relation.
A candidate key may or may not be a primary key.
Super Key
Super Key is set of attributes of a relation which can be used to identify a tuple
uniquely. For example, each tuple of EMPLOYEE relation given in Table 1 can be
uniquely identified by E-ID or (E-ID, E-NAME) or (E-ID, E-CITY) or (E-ID, E-STATE)
or (E_ID, E-NAME, E-STATE) etc. So, all of these are super keys of EMPLOYEE relation.
Note: A candidate key is always a super key but vice versa is not true.

Q. Finding Candidate Keys and Super Keys of a Relation using an FD set. The set of
attributes whose attribute closure is the set of all attributes of the relation is
called a super key of the relation. For example, the EMPLOYEE relation shown in Table 1
has the following FD set:
{E-ID->E-NAME, E-ID->E-CITY, E-ID->E-STATE, E-CITY->E-STATE}
Let us calculate the attribute closures of different sets of attributes:
(E-ID)+ = {E-ID, E-NAME,E-CITY,E-STATE}
(E-ID,E-NAME)+ = {E-ID, E-NAME,E-CITY,E-STATE}
(E-ID,E-CITY)+ = {E-ID, E-NAME,E-CITY,E-STATE}
(E-ID,E-STATE)+ = {E-ID, E-NAME,E-CITY,E-STATE}
(E-ID,E-CITY,E-STATE)+ = {E-ID, E-NAME,E-CITY,E-STATE}
(E-NAME)+ = {E-NAME}
(E-CITY)+ = {E-CITY,E-STATE}
As (E-ID)+, (E-ID, E-NAME)+, (E-ID, E-CITY)+, (E-ID, E-STATE)+, (E-ID, E-CITY, E-
STATE)+ give set of all attributes of relation EMPLOYEE. So all of these are super keys of
relation.
The minimal set of attributes whose attribute closure is set of all attributes of relation
is called candidate key of relation. As shown above, (E-ID)+ is set of all attributes of
relation and it is minimal. So E-ID will be candidate key. On the other hand (E-ID, E-
NAME)+ also is set of all attributes but it is not minimal because its subset (E-ID)+ is equal
to set of all attributes. So (E-ID, E-NAME) is not a candidate key.
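
This search can be automated. The sketch below is illustrative (and exponential in the
number of attributes, which is fine for small examples); closure() is the same fixpoint
routine shown earlier, repeated here so the snippet runs on its own:

from itertools import combinations

def closure(attrs, fds):
    result = set(attrs)
    while True:
        new = {a for lhs, rhs in fds if lhs <= result for a in rhs}
        if new <= result:
            return result
        result |= new

def candidate_keys(all_attrs, fds):
    # All minimal attribute sets whose closure covers the whole relation.
    keys = []
    for size in range(1, len(all_attrs) + 1):
        for subset in combinations(sorted(all_attrs), size):
            s = set(subset)
            # Skip supersets of keys already found; they are super keys, not minimal.
            if closure(s, fds) == all_attrs and not any(k < s for k in keys):
                keys.append(s)
    return keys

R = {"E-ID", "E-NAME", "E-CITY", "E-STATE"}
EMP_FDS = [({"E-ID"}, {"E-NAME", "E-CITY", "E-STATE"}),
           ({"E-CITY"}, {"E-STATE"})]
print(candidate_keys(R, EMP_FDS))  # [{'E-ID'}]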

Armstrong’s Axioms in Functional Dependency in DBMS


The term Armstrong axioms refer to the sound and complete set of inference rules
or axioms, introduced by William W. Armstrong, that is used to test the logical
implication of functional dependencies. If F is a set of functional dependencies
then the closure of F, denoted as F+, is the set of all functional dependencies
logically implied by F. Armstrong’s Axioms are a set of rules, that when applied
repeatedly, generates a closure of functional dependencies.
Axioms –

1. Axiom of reflexivity –
If A is a set of attributes and B is a subset of A, then A → B holds: if B ⊆ A, then
A → B. This is a trivial property.
2. Axiom of augmentation –
If A → B holds and Y is attribute set, then AY → BY also holds. That is adding
attributes in dependencies, does not change the basic dependencies. If A → B,
then AC → BC for any C.
3. Axiom of transitivity –
Same as the transitive rule in algebra: if A → B holds and B → C holds, then A
→ C also holds. A → B is read as: A functionally determines B.

Secondary Rules –

These rules can be derived from the above axioms.


1. Union –
If A → B holds and A → C holds, then A → BC holds.
2. Composition –
If A → B and X → Y holds, then AX → BY holds.
3. Decomposition –
If A → BC holds then A → B and A → C hold.
4. Pseudo Transitivity –
If A → B holds and BC → D holds, then AC → D holds.
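
As a sample derivation, the union rule follows from the primary axioms alone: from
A → B, augmentation with A gives A → AB; from A → C, augmentation with B gives
AB → BC; transitivity on A → AB and AB → BC then yields A → BC.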

Why are Armstrong's axioms said to be Sound and Complete?


By sound, we mean that given a set of functional dependencies F specified on a
relation schema R, any dependency that we can infer from F by using the primary
rules of Armstrong axioms holds in every relation state r of R that satisfies the
dependencies in F.
By complete, we mean that using primary rules of Armstrong axioms repeatedly to
infer dependencies until no more dependencies can be inferred results in the
complete set of all possible dependencies that can be inferred from F.

Equivalence of Functional Dependencies


Given a relation with different FD sets for that relation, we have to find out whether
one FD set is a subset of the other or whether both are equal.
How to find relationship between two FD sets?
Let FD1 and FD2 are two FD sets for a relation R.
1. If all FDs of FD1 can be derived from FDs present in FD2, we can say that FD2 ⊃ FD1.
2. If all FDs of FD2 can be derived from FDs present in FD1, we can say that FD1 ⊃ FD2.
3. If 1 and 2 both are true, FD1=FD2.
All three cases can be shown using a Venn diagram (not reproduced here).

Q. Let us take an example to show the relationship between two FD sets. A relation
R(A,B,C,D) having two FD sets FD1 = {A->B, B->C, AB->D} and FD2 = {A->B, B->C, A-
>C, A->D}
Step 1. Checking whether all FDs of FD1 are present in FD2
• A->B in set FD1 is present in set FD2.
• B->C in set FD1 is also present in set FD2.
• AB->D is present in set FD1 but not directly in FD2, so we check whether
we can derive it. For set FD2, (AB)+ = {A,B,C,D}. It means that AB can
functionally determine A, B, C and D. So AB->D will also hold in set FD2.
As all FDs in set FD1 also hold in set FD2, FD2 ⊃ FD1 is true.
Step 2. Checking whether all FDs of FD2 are present in FD1
• A->B in set FD2 is present in set FD1.
• B->C in set FD2 is also present in set FD1.
• A->C is present in FD2 but not directly in FD1, so we check whether we can
derive it. For set FD1, (A)+ = {A,B,C,D}. It means that A can functionally
determine A, B, C and D. So A->C will also hold in set FD1.
• A->D is present in FD2 but not directly in FD1, so we check whether we can
derive it. For set FD1, (A)+ = {A,B,C,D}. It means that A can functionally
determine A, B, C and D. So A->D will also hold in set FD1.
As all FDs in set FD2 also hold in set FD1, FD1 ⊃ FD2 is true.
Step 3. As FD2 ⊃ FD1 and FD1 ⊃ FD2 both are true FD2 =FD1 is true. These two FD sets
are semantically equivalent.

Q. Let us take another example to show the relationship between two FD sets. A
relation R2(A,B,C,D) having two FD sets FD1 = {A->B, B->C,A->C} and FD2 = {A->B,
B->C, A->D}
Step 1. Checking whether all FDs of FD1 are present in FD2
• A->B in set FD1 is present in set FD2.
• B->C in set FD1 is also present in set FD2.
• A->C is present in FD1 but not directly in FD2, so we check whether we
can derive it. For set FD2, (A)+ = {A,B,C,D}. It means that A can
functionally determine A, B, C and D. So A->C will also hold in set FD2.
As all FDs in set FD1 also hold in set FD2, FD2 ⊃ FD1 is true.
Step 2. Checking whether all FDs of FD2 are present in FD1
• A->B in set FD2 is present in set FD1.
• B->C in set FD2 is also present in set FD1.
• A->D is present in FD2 but not directly in FD1, so we check whether we
can derive it. For set FD1, (A)+ = {A,B,C}. It means that A can't
functionally determine D. So A->D will not hold in FD1.
As not all FDs in set FD2 hold in set FD1, FD1 ⊅ FD2.
Step 3. In this case, FD2 ⊃ FD1 but FD1 ⊅ FD2, so these two FD sets are not
semantically equivalent.
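
The same comparison can be sketched in Python. covers(f1, f2) below checks whether
every FD of f2 is derivable from f1, using the same closure fixpoint as before
(illustrative; FD1 and FD2 are the sets from the second example):

def closure(attrs, fds):
    result = set(attrs)
    while True:
        new = {a for lhs, rhs in fds if lhs <= result for a in rhs}
        if new <= result:
            return result
        result |= new

def covers(f1, f2):
    # True if every FD of f2 is implied by f1.
    return all(rhs <= closure(lhs, f1) for lhs, rhs in f2)

FD1 = [({"A"}, {"B"}), ({"B"}, {"C"}), ({"A"}, {"C"})]
FD2 = [({"A"}, {"B"}), ({"B"}, {"C"}), ({"A"}, {"D"})]
print(covers(FD1, FD2))  # False: A->D is not derivable from FD1
print(covers(FD2, FD1))  # True:  FD2 implies every FD of FD1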
Database Normalization
Database normalization is the process of organizing the attributes of the database
to reduce or eliminate data redundancy (having the same data but at different
places).
Problems because of data redundancy
Data redundancy unnecessarily increases the size of the database as the same
data is repeated in many places. Inconsistency problems also arise during insert,
delete and update operations.
Functional Dependency
Functional Dependency is a constraint between two sets of attributes of a relation in
a database. A functional dependency is denoted by an arrow (→). If an attribute A
functionally determines B, then it is written as A → B.
For example, employee_id → name means employee_id functionally determines
the name of the employee. As another example in a timetable database,
{student_id, time} → {lecture_room}, student ID and time determine the lecture
room where the student should be.
What does functionally dependent mean?
A functional dependency A → B means that for all instances of a particular value of A,
there is the same value of B.
For example, in the below table A → B is true, but B → A is not true as there are
different values of A for B = 3.
A B
------
1 3
2 3
4 0
1 3
4 0
Trivial Functional Dependency
X → Y is trivial only when Y is subset of X.
Examples:
ABC → AB
ABC → A
ABC → ABC
Non-Trivial Functional Dependencies
X → Y is a non-trivial functional dependency when Y is not a subset of X.
X → Y is called completely non-trivial when X ∩ Y is empty.

Example:
Id → Name,
Name → DOB
Semi Non-Trivial Functional Dependencies
X → Y is called semi non-trivial when X ∩ Y is not empty.
Examples:

AB → BC,
AD → DC

Normal Forms in DBMS


Normalization is the process of minimizing redundancy from a relation or set of
relations. Redundancy in relation may cause insertion, deletion, and update
anomalies. So, it helps to minimize the redundancy in relations. Normal forms are
used to eliminate or reduce redundancy in database tables.

1. First Normal Form –

A relation is in first normal form if it does not contain any composite or multi-valued
attribute; that is, every attribute in the relation is a single-valued attribute. A
relation that contains a composite or multi-valued attribute violates first normal form.
Example 1 – Relation STUDENT in table 1 is not in 1NF because of multi-valued
attribute STUD_PHONE. Its decomposition into 1NF has been shown in table 2.

Example 2 –
ID Name Courses
------------------
1 A c1, c2
2 E c3
3 M C2, c3
In the above table Course is a multi-valued attribute so it is not in 1NF.
Below Table is in 1NF as there is no multi-valued attribute
ID Name Course
------------------
1 A c1
1 A c2
2 E c3
3 M c2
3 M c3

2. Second Normal Form –

To be in second normal form, a relation must be in first normal form and relation
must not contain any partial dependency. A relation is in 2NF if it has No Partial
Dependency, i.e., no non-prime attribute (attributes which are not part of any
candidate key) is dependent on any proper subset of any candidate key of the
table.
Partial Dependency – If the proper subset of candidate key determines non-prime
attribute, it is called partial dependency.
Example 1 – Consider table-3 as following below.
STUD_NO COURSE_NO COURSE_FEE
1 C1 1000
2 C2 1500
1 C4 2000
4 C3 1000
4 C1 1000
2 C5 2000
{Note that, there are many courses having the same course fee.}
Here,
COURSE_FEE cannot alone decide the value of COURSE_NO or STUD_NO;
COURSE_FEE together with STUD_NO cannot decide the value of COURSE_NO;
COURSE_FEE together with COURSE_NO cannot decide the value of STUD_NO;
Hence,
COURSE_FEE would be a non-prime attribute, as it does not belong to the only
candidate key {STUD_NO, COURSE_NO};
But, COURSE_NO -> COURSE_FEE, i.e., COURSE_FEE is dependent on
COURSE_NO, which is a proper subset of the candidate key. Non-prime attribute
COURSE_FEE is dependent on a proper subset of the candidate key, which is a
partial dependency and so this relation is not in 2NF.
To convert the above relation to 2NF, we need to split the table into two tables such
as:
Table 1: STUD_NO, COURSE_NO
Table 2: COURSE_NO, COURSE_FEE
Table 1:
STUD_NO COURSE_NO
1 C1
2 C2
1 C4
4 C3
4 C1
2 C5

Table 2:
COURSE_NO COURSE_FEE
C1 1000
C2 1500
C3 1000
C4 2000
C5 2000
NOTE: 2NF tries to reduce the redundant data getting stored in memory. For
instance, if there are 100 students taking C1 course, we don’t need to store its Fee
as 1000 for all the 100 records, instead, once we can store it in the second table as
the course fee for C1 is 1000.
Example 2 – Consider following functional dependencies in relation R (A, B, C, D)
AB -> C [A and B together determine C]
BC -> D [B and C together determine D]
In the above relation, AB is the only candidate key and there is no partial
dependency, i.e., any proper subset of AB doesn’t determine any non-
prime attribute.

3. Third Normal Form –

A relation is in third normal form if it is in second normal form and there is no
transitive dependency for non-prime attributes.
A relation is in 3NF if at least one of the following conditions holds in every
non-trivial functional dependency X –> Y:
a) X is a super key.
b) Y is a prime attribute (each element of Y is part of some candidate key).

Transitive dependency – If A->B and B->C are two FDs then A->C is called
transitive dependency.
Example 1 – In relation STUDENT given in Table 4,
FD set: {STUD_NO -> STUD_NAME, STUD_NO -> STUD_STATE, STUD_STATE
-> STUD_COUNTRY, STUD_NO -> STUD_AGE}
Candidate Key: {STUD_NO}
For this relation in table 4, STUD_NO -> STUD_STATE and STUD_STATE ->
STUD_COUNTRY are true. So STUD_COUNTRY is transitively dependent on
STUD_NO. It violates the third normal form. To convert it in third normal form, we
will decompose the relation STUDENT (STUD_NO, STUD_NAME, STUD_PHONE,
STUD_STATE, STUD_COUNTRY, STUD_AGE) as:
STUDENT (STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE,
STUD_AGE)
STATE_COUNTRY (STATE, COUNTRY)

Example 2 – Consider relation R(A, B, C, D, E)


A -> BC,
CD -> E,
B -> D,
E -> A
All possible candidate keys of the above relation are {A, E, CD, BC}. Since every
attribute is part of some candidate key, all attributes are prime, so the relation
is in 3NF.

4. Boyce-Codd Normal Form (BCNF) –

A relation R is in BCNF if R is in third normal form and, for every FD, the LHS is a
super key. Equivalently, a relation is in BCNF iff in every non-trivial functional
dependency X –> Y, X is a super key.
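
Under the same (LHS, RHS) representation used earlier, the BCNF test is a direct
translation of this definition. The sketch below is illustrative and repeats the
closure fixpoint so it runs on its own:

def closure(attrs, fds):
    result = set(attrs)
    while True:
        new = {a for lhs, rhs in fds if lhs <= result for a in rhs}
        if new <= result:
            return result
        result |= new

def is_bcnf(all_attrs, fds):
    # Every non-trivial FD must have a super-key LHS.
    for lhs, rhs in fds:
        if rhs <= lhs:                      # trivial FD, ignore
            continue
        if closure(lhs, fds) != all_attrs:  # LHS is not a super key
            return False
    return True

# R(A,B,C,D,E) with {BC->D, AC->BE, B->E} from Example 1 below:
R = {"A", "B", "C", "D", "E"}
F = [({"B", "C"}, {"D"}), ({"A", "C"}, {"B", "E"}), ({"B"}, {"E"})]
print(is_bcnf(R, F))  # False: neither BC nor B is a super key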
Example 1 – Find the highest normal form of a relation R(A,B,C,D,E) with FD set
as {BC->D, AC->BE, B->E}
Step 1. As we can see, (AC)+ ={A,C,B,E,D} but none of its subset can determine
all attribute of relation, So AC will be candidate key. A or C can’t be derived from
any other attribute of the relation, so there will be only 1 candidate key {AC}.
Step 2. Prime attributes are those attributes that are part of candidate key {A, C} in
this example and others will be non-prime {B, D, E} in this example.
Step 3. The relation R is in 1st normal form, as a relational DBMS does not allow
multi-valued or composite attributes.
The relation is in 2nd normal form because BC->D is in 2nd normal form (BC is not
a proper subset of candidate key AC) and AC->BE is in 2nd normal form (AC is
candidate key) and B->E is in 2nd normal form (B is not a proper subset of
candidate key AC).
The relation is not in 3rd normal form because in BC->D (neither BC is a super key
nor D is a prime attribute) and in B->E (neither B is a super key nor E is a prime
attribute); to satisfy 3rd normal form, either the LHS of an FD should be a super key
or the RHS should be a prime attribute.
So the highest normal form of relation will be 2nd Normal form.

Example 2 –For example consider relation R(A, B, C)


A -> BC,
B -> A
A and B both are super keys, so the above relation is in BCNF.
Key Points –
1. BCNF is free from redundancy.
2. If a relation is in BCNF, then 3NF is also satisfied.
3. If all attributes of relation are prime attribute, then the relation is always in 3NF.
4. A relation in a Relational Database is always and at least in 1NF form.
5. Every Binary Relation ( a Relation with only 2 attributes ) is always in BCNF.
6. If a Relation has only singleton candidate keys( i.e. every candidate key
consists of only 1 attribute), then the Relation is always in 2NF( because no
Partial functional dependency possible).
7. Sometimes going for BCNF form may not preserve functional dependency. In
that case go for BCNF only if the lost FD(s) is not required, else normalize till
3NF only.
8. There are many more Normal forms that exist after BCNF, like 4NF and more.
But in real world database systems it’s generally not required to go beyond
BCNF.

Exercise 1: Find the highest normal form of R(A, B, C, D, E) under the following
functional dependencies.
ABC --> D
CD --> AE
Important points for solving this type of question:
1) It is always a good idea to start checking from BCNF, then 3NF, and so on.
2) If a functional dependency satisfies a normal form, there is no need to
check it for lower normal forms. For example, ABC –> D is in BCNF (note that ABC is
a super key), so there is no need to check this dependency for lower normal forms.
Candidate keys in the given relation are {ABC, BCD}
BCNF: ABC -> D is in BCNF. Let us check CD -> AE, CD is not a super key so this
dependency is not in BCNF. So, R is not in BCNF.
3NF: ABC -> D we don’t need to check for this dependency as it already satisfied
BCNF. Let us consider CD -> AE. Since E is not a prime attribute, so the relation is
not in 3NF.
2NF: In 2NF, we need to check for partial dependency. CD is a proper subset of a
candidate key and it determines E, which is non-prime attribute. So, given relation
is also not in 2 NF. So, the highest normal form is 1 NF.

4th and 5th Normal form in DBMS


If two or more independent relations are kept in a single relation, a multivalued
dependency can arise: the presence of one or more rows in a table implies the
presence of one or more other rows in that same table. Put another way, two
attributes (or columns) in a table are independent of one another, but both depend
on a third attribute. A multivalued dependency therefore always requires at least
three attributes, because it consists of at least two attributes that are dependent
on a third.
For a dependency A -> B, if for a single value of A multiple values of B exist, then
the table may have a multivalued dependency. For a multivalued dependency A ->> B,
the table should have at least 3 attributes, and B and the remaining attribute C
should be independent of each other.
For example,

Person Mobile Food_Likes

Mahesh 9893/9424 Burger / pizza

Ramesh 9191 Pizza

Person->-> mobile,
Person ->-> food_likes
This is read as “person multidetermines mobile” and “person multidetermines
food_likes.”
Note that a functional dependency is a special case of multivalued dependency. In
a functional dependency X -> Y, every x determines exactly one y, never more
than one.
Fourth normal form (4NF):
Fourth normal form (4NF) is a level of database normalization where there are no
non-trivial multivalued dependencies other than a candidate key. It builds on the
first three normal forms (1NF, 2NF and 3NF) and the Boyce-Codd Normal Form
(BCNF). It states that, in addition to a database meeting the requirements of
BCNF, it must not contain any non-trivial multivalued dependency.
Properties – A relation R is in 4NF if and only if the following conditions are
satisfied:
1. It should be in the Boyce-Codd Normal Form (BCNF).
2. The table should not have any multivalued dependency.
A table with a multivalued dependency violates the normalization standard of
Fourth Normal Form (4NF) because it creates unnecessary redundancies and can
contribute to inconsistent data. To bring it up to 4NF, it is necessary to break this
information into two tables.
Example – Consider the database of a class which has two relations: R1 contains
student ID (SID) and student name (SNAME), and R2 contains course id (CID) and
course name (CNAME).
Table – R1(SID, SNAME) Table – R2(CID, CNAME)
When their cross product is taken, it results in multivalued dependencies:
Table – R1 X R2 (contents not reproduced here)

Multivalued dependencies (MVD) are:


SID->->CID; SID->->CNAME; SNAME->->CNAME

Join dependency – Join decomposition is a further generalization of multivalued
dependencies. If the join of R1 and R2 over C is equal to relation R, then we can say
that a join dependency (JD) exists, where R1 and R2 are the decompositions R1(A, B, C)
and R2(C, D) of a given relation R(A, B, C, D). Alternatively, R1 and R2 are a
lossless decomposition of R. A JD ⋈{R1, R2, …, Rn} is said to hold over a relation R
if R1, R2, ….., Rn is a lossless-join decomposition of R. *(A, B, C), (C, D) will be
a JD of R if the join over the join attributes is equal to the relation R. Here,
*(R1, R2, R3) is used to indicate that relations R1, R2, R3 and so on are a JD of R.
Let R be a relation schema and R1, R2, ……, Rn be a decomposition of R. r(R) is
said to satisfy the join dependency *(R1, R2, ….., Rn) if and only if
∏R1(r) ⋈ ∏R2(r) ⋈ … ⋈ ∏Rn(r) = r

Example – Company ->-> Product, Agent ->-> Company, Agent ->-> Product
(Tables R1, R2 and R3, and their join R1 ⋈ R2 ⋈ R3, are not reproduced here; the
join again satisfies Agent ->-> Product.)
Fifth Normal Form / Projected Normal Form (5NF):
A relation R is in 5NF if and only if every join dependency in R is implied by the
candidate keys of R. A relation decomposed into two relations must have loss-less
join Property, which ensures that no spurious or extra tuples are generated, when
relations are reunited through a natural join.
Properties – A relation R is in 5NF if and only if it satisfies following conditions:
1. R should be already in 4NF.
2. It cannot be further non loss decomposed (join dependency)
Example – Consider the above schema, with a case as “if a company makes a
product and an agent is an agent for that company, then he always sells that
product for the company”. Under these circumstances, the ACP table is shown as:
Table – ACP (contents not reproduced here)
The relation ACP is again decomposed into 3 relations, R1, R2 and R3 (not
reproduced here). The natural join of all the three relations is then taken:

Result of Natural Join of R1 and R3 over ‘Company’ and then Natural Join of R13
and R2 over ‘Agent’ and ‘Product’ will be table ACP.
Hence, in this example, all the redundancies are eliminated, and the
decomposition of ACP is a lossless join decomposition. Therefore, the relation is in
5NF as it does not violate the property of lossless join.
Concurrency Control in DBMS
Concurrency Control deals with interleaved execution of more than one transaction. In
the next article, we will see what is serializability and how to find whether a schedule is
serializable or not.
What is Transaction?

A set of logically related operations is known as a transaction. The main operations
of a transaction are:

Read(A): Read operations Read(A) or R(A) reads the value of A from the database and
stores it in a buffer in main memory.

Write (A): Write operation Write(A) or W(A) writes the value back to the database from
the buffer.
(Note: the write does not always reach the database immediately; it may only update
the buffer, which is how dirty reads come into the picture.)

Let us take a debit transaction from an account which consists of following operations:

1. R(A);
2. A=A-1000;
3. W(A);

Assume A’s value before starting of transaction is 5000.


• The first operation reads the value of A from database and stores it in a buffer.
• Second operation will decrease its value by 1000. So buffer will contain 4000.
• Third operation will write the value from buffer to database. So A’s final value will be
4000.

But it may also be possible that transaction may fail after executing some of its
operations. The failure can be because of hardware, software or power etc. For
example, if debit transaction discussed above fails after executing operation 2, the value
of A will remain 5000 in the database which is not acceptable by the bank. To avoid this,
Database has two important operations:

Commit: After all instructions of a transaction are successfully executed, the changes
made by transaction are made permanent in the database.

Rollback: If a transaction is not able to execute all operations successfully, all the
changes made by transaction are undone.
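
A minimal sketch of commit and rollback using Python's sqlite3 (the table name and
amounts are invented for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('A', 5000)")
conn.commit()

try:
    # Debit transaction: R(A), A = A - 1000, W(A)
    conn.execute("UPDATE account SET balance = balance - 1000 WHERE id = 'A'")
    conn.commit()        # make the change permanent
except Exception:
    conn.rollback()      # undo every change made by the failed transaction

print(conn.execute("SELECT balance FROM account WHERE id = 'A'").fetchone())
# (4000,)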
Properties of a transaction (ACID)

Atomicity:
By this, we mean that either the entire transaction takes place at once or doesn’t
happen at all. There is no midway i.e. transactions do not occur partially. Each
transaction is considered as one unit and either runs to completion or is not executed
at all. It involves the following two operations.
—Abort: If a transaction aborts, changes made to database are not visible.
—Commit: If a transaction commits, changes made are visible.
Atomicity is also known as the ‘All or nothing rule’.
Consider the following transaction T consisting of T1 and T2: Transfer of 100 from
account X to account Y.

If the transaction fails after completion of T1 but before completion of T2 (say,
after write(X) but before write(Y)), then the amount has been deducted from X but not
added to Y. This results in an inconsistent database state. Therefore, the
transaction must be executed in its entirety in order to ensure the correctness of
the database state.

As a transaction is set of logically related operations, either all of them should be


executed or none. A debit transaction discussed above should either execute all three
operations or none. If debit transaction fails after executing operation 1 and 2 then its
new value 4000 will not be updated in the database which leads to inconsistency.

Consistency:
This means that integrity constraints must be maintained so that the database is
consistent before and after the transaction. It refers to the correctness of a
database. Referring to the example above,
The total amount before and after the transaction must be maintained.
Total before T occurs = 500 + 200 = 700.
Total after T occurs = 400 + 300 = 700.
Therefore, database is consistent. Inconsistency occurs in case T1 completes
but T2 fails. As a result T is incomplete.

If operations of debit and credit transactions on same account are executed concurrently,
it may leave database in an inconsistent state.
• For Example, T1 (debit of Rs. 1000 from A) and T2 (credit of 500 to A) executing
concurrently, the database reaches inconsistent state.
• Let us assume Account balance of A is Rs. 5000. T1 reads A(5000) and stores the
value in its local buffer space. Then T2 reads A(5000) and also stores the value in its
local buffer space.
• T1 performs A=A-1000 (5000-1000=4000) and 4000 is stored in T1 buffer space.
Then T2 performs A=A+500 (5000+500=5500) and 5500 is stored in T2 buffer space.
T1 writes the value from its buffer back to database.
• A’s value is updated to 4000 in database and then T2 writes the value from its buffer
back to database. A’s value is updated to 5500 which shows that the effect of debit
transaction is lost and database has become inconsistent.
• To maintain consistency of database, we need concurrency control
protocols which will be discussed in next article. The operations of T1 and T2 with
their buffers and database have been shown in Table 1.

Isolation:
This property ensures that multiple transactions can occur concurrently without
leading to the inconsistency of database state. Transactions occur independently
without interference. Changes occurring in a particular transaction will not be
visible to any other transaction until that particular change in that transaction is
written to memory or has been committed. This property ensures that the execution
of transactions concurrently will result in a state that is equivalent to a state
achieved if they were executed serially in some order.
Let X = 50,000 and Y = 500.
Consider two transactions T and T”.

Suppose T has been executed till Read(Y) and then T’’ starts. As a result,
interleaving of operations takes place, due to which T’’ reads the correct value of X
but an incorrect value of Y, and the sum computed by
T’’: (X+Y = 50,000 + 500 = 50,500)
is thus not consistent with the sum at the end of transaction
T: (X+Y = 50,000 + 450 = 50,450).
This results in database inconsistency, due to a loss of 50 units. Hence,
transactions must take place in isolation, and changes should be visible only after
they have been made to the main memory.

Result of a transaction should not be visible to others before transaction is


committed. For example, let us assume that A’s balance is Rs. 5000 and T1 debits
Rs. 1000 from A. A’s new balance will be 4000. If T2 credits Rs. 500 to A’s new
balance, A will become 4500 and after this T1 fails. Then we have to rollback T2 as
well because it is using the value produced by T1. So, a transaction's results are not
made visible to other transactions before it commits.

Durability:
This property ensures that once the transaction has completed execution, the
updates and modifications to the database are stored in and written to disk and they
persist even if a system failure occurs. These updates now become permanent and
are stored in non-volatile memory. The effects of the transaction, thus, are never
lost.

Once database has committed a transaction, the changes made by the transaction
should be permanent. e.g.; If a person has credited $500000 to his account, bank
can’t say that the update has been lost. To avoid this problem, multiple copies of
database are stored at different locations.

What is a Schedule?
A schedule is a series of operations from one or more transactions. A schedule
can be of two types:
• Serial Schedule: When one transaction completely executes before starting
another transaction, the schedule is called serial schedule. A serial schedule is
always consistent. e.g.; if a schedule S has a debit transaction T1 and a credit
transaction T2, the possible serial schedules are T1 followed by T2 (T1->T2) or T2
followed by T1 (T2->T1). A serial schedule has low throughput and less resource
utilization.
• Concurrent Schedule: When operations of a transaction are interleaved with
operations of other transactions of a schedule, the schedule is called Concurrent
schedule. e.g.; Schedule of debit and credit transaction shown in Table 1 is
concurrent in nature. But concurrency can lead to inconsistency in the
database. The above example of a concurrent schedule is also inconsistent.

Implementation of Locking in DBMS


Locking protocols are used in database management systems as a means of
concurrency control. Multiple transactions may request a lock on a data item
simultaneously. Hence, we require a mechanism to manage the locking requests
made by transactions. Such a mechanism is called a Lock Manager. It relies on
the process of message passing where transactions and lock manager exchange
messages to handle the locking and unlocking of data items.
Data structure used in Lock Manager –
The data structure required for implementation of locking is called as Lock table.
1. It is a hash table where name of data items are used as hashing index.
2. Each locked data item has a linked list associated with it.
3. Every node in the linked list represents the transaction which requested
the lock, the mode of lock requested (shared/exclusive) and the current status
of the request (granted/waiting).
4. Every new lock request for the data item will be added in the end of linked
list as a new node.
5. Collisions in hash table are handled by technique of separate chaining.

Consider the following example of lock table:


Explanation: In the figure (not reproduced here), the locked data items present in the
lock table are 5, 47, 167 and 15. The transactions which have requested a lock are
represented by a linked list shown below them using a downward arrow. Each node in
a linked list has the name of the transaction which requested the data item, like T33,
T1, T27 etc. The colour of a node represents the status, i.e., whether the lock has
been granted or is waiting.
Note that a collision has occurred for data items 5 and 47. It has been resolved by
separate chaining, where each data item belongs to a linked list and acts as the
header for the linked list containing the locking requests.
Working of Lock Manager –
1. Initially the lock table is empty as no data item is locked.
2. Whenever the lock manager receives a lock request from a transaction Ti on
a particular data item Qi, one of the following cases may arise:
a) If Qi is not already locked, a linked list will be created and the lock will be
granted to the requesting transaction Ti.
b) If the data item is already locked, a new node will be added at the end of its
linked list containing the information about the request made by Ti.
3. If the lock mode requested by Ti is compatible with the lock mode of the
transaction currently holding the lock, Ti will acquire the lock too and its
status will be changed to ‘granted’. Else, the status of Ti’s lock will be
‘waiting’.
4. If a transaction Ti wants to unlock the data item it is currently holding, it
will send an unlock request to the lock manager. The lock manager will
delete Ti’s node from this linked list, and the lock will be granted to the next
transaction in the list.
5. Sometimes a transaction Ti may have to be aborted. In such a case all the
waiting requests made by Ti will be deleted from the linked lists present in the
lock table. Once abortion is complete, locks held by Ti will also be
released.
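
A toy sketch of such a lock table in Python (illustrative only; real lock managers add
hashing, latching, deadlock detection, and fairness policies):

from collections import defaultdict

class LockManager:
    def __init__(self):
        # data item -> list of [txn, mode, status]; the list plays the role of the
        # linked list of requests, kept in arrival order.
        self.table = defaultdict(list)

    def _compatible(self, item, mode):
        granted = [r for r in self.table[item] if r[2] == "granted"]
        if not granted:
            return True
        # Any number of shared locks may coexist; exclusive excludes everything.
        return mode == "S" and all(r[1] == "S" for r in granted)

    def lock(self, txn, item, mode):
        status = "granted" if self._compatible(item, mode) else "waiting"
        self.table[item].append([txn, mode, status])
        return status

    def unlock(self, txn, item):
        self.table[item] = [r for r in self.table[item] if r[0] != txn]
        # Grant the waiting requests that have become compatible, in order.
        for r in self.table[item]:
            if r[2] == "waiting" and self._compatible(item, r[1]):
                r[2] = "granted"

lm = LockManager()
print(lm.lock("T1", "Q", "S"))   # granted
print(lm.lock("T2", "Q", "S"))   # granted (S is compatible with S)
print(lm.lock("T3", "Q", "X"))   # waiting
lm.unlock("T1", "Q"); lm.unlock("T2", "Q")
print(lm.table["Q"][0])          # ['T3', 'X', 'granted']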

Lock Based Concurrency Control Protocol in DBMS


Concurrency-control protocols: allow concurrent schedules, but ensure that the
schedules are conflict/view serializable, and are recoverable and maybe even
cascadeless.
These protocols do not examine the precedence graph as it is being created,
instead a protocol imposes a discipline that avoids non-serializable schedules.
Different concurrency control protocols provide different trade-offs between the
amount of concurrency they allow and the amount of overhead they impose.
Different categories of protocols:
o Lock Based Protocol
▪ Basic 2-PL
▪ Conservative 2-PL
▪ Strict 2-PL
▪ Rigorous 2-PL
o Graph Based Protocol
o Time-Stamp Ordering Protocol
o Multiple Granularity Protocol
o Multi-version Protocol
Lock Based Protocols –
A lock is a variable associated with a data item that describes a status of data item
with respect to possible operation that can be applied to it. They synchronize the
access by concurrent transactions to the database items. It is required in this
protocol that all the data items must be accessed in a mutually exclusive manner.
Let me introduce you to two common locks which are used and some terminology
followed in this protocol.
a) Shared Lock (S): also known as Read-only lock. As the name suggests it can
be shared between transactions because while holding this lock the transaction
does not have the permission to update data on the data item. S-lock is
requested using lock-S instruction.
b) Exclusive Lock (X): The data item can be both read and written. This lock is
exclusive and cannot be held simultaneously on the same data item. An X-lock is
requested using the lock-X instruction.
Lock Compatibility Matrix –

      S     X
S    Yes    No
X    No     No

• A transaction may be granted a lock on an item if the requested lock is
compatible with locks already held on the item by other transactions.
• Any number of transactions can hold shared locks on an item, but if any
transaction holds an exclusive(X) on the item no other transaction may hold any
lock on the item.
• If a lock cannot be granted, the requesting transaction is made to wait till all
incompatible locks held by other transactions have been released. Then the lock
is granted.
o Upgrade / Downgrade locks: A transaction that holds a lock on an
item A is allowed under certain conditions to change the lock state from
one state to another.
Upgrade: A S(A) can be upgraded to X(A) if Ti is the only transaction
holding the S-lock on element A.
Downgrade: We may downgrade X(A) to S(A) when we feel that we no
longer want to write on data-item A. As we were holding X-lock on A, we
need not check any conditions.

So, by now we are introduced to the types of locks and how to apply them. But
wait, if our problems could have been avoided just by applying locks, life would
have been so simple! If you have done Process Synchronization under OS you must
be familiar with two persistent problems: Starvation and Deadlock! We'll be
discussing them shortly, but just so you know, we have to apply locks and they
must follow a set of protocols to avoid such undesirable problems. Shortly we'll
use 2-Phase Locking (2-PL), which will use the concept of locks to avoid deadlock.
So, applying simple locking, we may not always produce serializable results; it
may lead to deadlock or inconsistency.
Problem With Simple Locking…
Consider a partial schedule (not reproduced here) in which T1 holds an
exclusive lock over B and T2 holds a shared lock over A.
Deadlock – at statement 7, T2 requests a lock on B, while at statement 8, T1
requests a lock on A. As you may notice, this imposes a deadlock as neither
can proceed with its execution.
Starvation – is also possible if concurrency control manager is badly
designed. For example: A transaction may be waiting for an X-lock on an
item, while a sequence of other transactions request and are granted an
S-lock on the same item. This may be avoided if the concurrency control
manager is properly designed.

Graph Based Concurrency Control Protocol in DBMS


As we know, the prime problems with lock-based protocols have been avoiding
deadlocks and ensuring a strict schedule. We've seen that strict schedules are
possible with Strict or Rigorous 2-PL, and that deadlocks can be avoided if we follow
Conservative 2-PL, but the problem with that protocol is that it cannot be used
practically. Graph-based protocols are used as an alternative to 2-PL, and the
tree-based protocol is a simple implementation of a graph-based protocol.
A prerequisite of this protocol is that we know the order in which to access database
items. For this we impose a partial ordering on the set of database items D = {d1,
d2, d3, ….., dn}. The protocol following this partial ordering is stated as:
• If di –> dj, then any transaction accessing both di and dj must
access di before accessing dj.
• This implies that the set D may now be viewed as a directed
acyclic graph (DAG), called a database graph.

Image – Database Graph

Tree Based Protocol –


• Partial Order on Database items determines a tree like structure.
• Only Exclusive Locks are allowed.
• The first lock by Ti may be on any data item. Subsequently, a data item Q can be
locked by Ti only if the parent of Q is currently locked by Ti.
• Data items can be unlocked at any time.
Following the Tree based Protocol ensures Conflict Serializability and Deadlock
Free schedule. We need not wait for unlocking a Data item as we did in 2-PL
protocol, thus increasing the concurrency.
Now, let us see an example. The following database graph (not reproduced here) is
used as a reference for locking the items.
We have three transactions in this schedule, and this is a skeleton example, i.e., we
will only see how locking and unlocking work; let's keep this simple and not make it
complex by adding operations on data.
From the example, first see that the schedule is conflict serializable. Serializability
for the locks can be written as T2 –> T1 –> T3. Data items are locked and unlocked
following the rules given above and the database graph.

Advantage –
• Ensures Conflict Serializable Schedule.
• Ensures Deadlock Free Schedule
• Unlocking can be done anytime
With these advantages come some disadvantages as well.
Disadvantage –
• Unnecessary locking overheads may happen sometimes; for example, if we want both D
and E, we at least have to lock B to follow the protocol.
• Cascading rollbacks are still a problem, since the protocol imposes no rule on when
the unlock operation may occur.
Overall this protocol is mostly known and used for its unique way of implementing
Deadlock Freedom.
DBMS IMPORTANT QUESTIONS
1. What is Database?

A database is an organized collection of data, stored and retrieved digitally from a remote or local
computer system. Databases can be vast and complex, and such databases are developed using
fixed design and modeling approaches.

2. What is DBMS?

DBMS stands for Database Management System. DBMS is a system software responsible for the
creation, retrieval, updation, and management of the database. It ensures that our data is
consistent, organized, and is easily accessible by serving as an interface between the database and
its end-users or application software.

3. What is RDBMS? How is it different from DBMS?

RDBMS stands for Relational Database Management System. The key difference here, compared to
DBMS, is that RDBMS stores data in the form of a collection of tables, and relations can be defined
between the common fields of these tables. Most modern database management systems like
MySQL, Microsoft SQL Server, Oracle, IBM DB2, and Amazon Redshift are based on RDBMS.

4. What is SQL?

SQL stands for Structured Query Language. It is the standard language for relational database
management systems. It is especially useful in handling organized data comprised of entities
(variables) and relations between different entities of the data.

5. What is the difference between SQL and MySQL?

SQL is a standard language for retrieving and manipulating structured databases. On the contrary,
MySQL is a relational database management system, like SQL Server, Oracle or IBM DB2, that is
used to manage SQL databases.

6. What are Tables and Fields?

A table is an organized collection of data stored in the form of rows and columns. Columns can be
categorized as vertical and rows as horizontal. The columns in a table are called fields while the
rows can be referred to as records.

7. What are Constraints in SQL?

Constraints are used to specify the rules concerning data in the table. It can be applied for single
or multiple fields in an SQL table during the creation of the table or after creating using the ALTER
TABLE command. The constraints are:

• NOT NULL - Restricts NULL value from being inserted into a column.
• CHECK - Verifies that all values in a field satisfy a condition.
• DEFAULT - Automatically assigns a default value if no value has been specified for the field.
• UNIQUE - Ensures unique values to be inserted into the field.
• INDEX - Indexes a field providing faster retrieval of records.
• PRIMARY KEY - Uniquely identifies each record in a table.
• FOREIGN KEY - Ensures referential integrity for a record in another table.

8. What is a Join? List its different types.

The SQL Join clause is used to combine records (rows) from two or more tables in a SQL
database based on a related column between the two.

There are four different types of JOINs in SQL:

• (INNER) JOIN: Retrieves records that have matching values in both tables involved in the join.
This is the widely used join for queries.

SELECT *
FROM Table_A A
JOIN Table_B B
ON A.col = B.col;
SELECT *
FROM Table_A A
INNER JOIN Table_B B
ON A.col = B.col;

• LEFT (OUTER) JOIN: Retrieves all the records/rows from the left and the matched records/rows
from the right table.

SELECT *
FROM Table_A A
LEFT JOIN Table_B B
ON A.col = B.col;

• RIGHT (OUTER) JOIN: Retrieves all the records/rows from the right and the matched
records/rows from the left table.

SELECT *
FROM Table_A A
RIGHT JOIN Table_B B
ON A.col = B.col;

• FULL (OUTER) JOIN: Retrieves all the records where there is a match in either the left or right
table.

SELECT *
FROM Table_A A
FULL JOIN Table_B B
ON A.col = B.col;

9. What is a Self-Join?

A self JOIN is a case of regular join where a table is joined to itself based on some relation
between its own column(s). Self-join uses the INNER JOIN or LEFT JOIN clause and a table alias is
used to assign different names to the table within the query.

SELECT A.emp_id AS "Emp_ID", A.emp_name AS "Employee",
       B.emp_id AS "Sup_ID", B.emp_name AS "Supervisor"
FROM employee A
INNER JOIN employee B
ON A.emp_sup = B.emp_id;

10. What is a Cross-Join?

Cross join can be defined as a cartesian product of the two tables included in the join. The table
after join contains the same number of rows as in the cross-product of the number of rows in the
two tables. If a WHERE clause is used in cross join then the query will work like an INNER JOIN.

SELECT stu.name, sub.subject


FROM students AS stu
CROSS JOIN subjects AS sub;

11. What is an Index? Explain its different types.

A database index is a data structure that provides a quick lookup of data in a column or columns
of a table. It enhances the speed of operations accessing data from a database table at the cost
of additional writes and memory to maintain the index data structure.

CREATE INDEX index_name /* Create Index */


ON table_name (column_1, column_2);
DROP INDEX index_name; /* Drop Index */

There are different types of indexes that can be created for different purposes:

• Unique and Non-Unique Index:

Unique indexes are indexes that help maintain data integrity by ensuring that no two rows of
data in a table have identical key values. Once a unique index has been defined for a table,
uniqueness is enforced whenever keys are added or changed within the index.

CREATE UNIQUE INDEX myIndex
ON students (enroll_no);

Non-unique indexes, on the other hand, are not used to enforce constraints on the tables with
which they are associated. Instead, non-unique indexes are used solely to improve query
performance by maintaining a sorted order of data values that are used frequently.

• Clustered and Non-Clustered Index:

A clustered index is an index whose order determines the physical order of the rows in the table,
i.e., the rows are stored in the same order as the corresponding index keys. This is why only one
clustered index can exist for a given table, whereas multiple non-clustered indexes can exist for
the same table.

The key difference between clustered and non-clustered indexes is that the database manager
attempts to keep the data in the database in the same order as the corresponding keys appear in
the clustered index.

Clustering indexes can improve the performance of most query operations because they provide
a linear-access path to data stored in the database.

12. What is the difference between Clustered and Non-clustered index?

As explained above, the differences can be broken down into three small factors -

• A clustered index modifies the way records are stored in a database based on the indexed column.
A non-clustered index creates a separate structure that references the original table.
• Clustered index is used for easy and speedy retrieval of data from the database, whereas, fetching
records from the non-clustered index is relatively slower.
• In SQL, a table can have a single clustered index whereas it can have multiple non-clustered
indexes.
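
As a sketch (assuming SQL Server's T-SQL syntax, where the keywords are explicit; index, table, and column names are illustrative):

CREATE CLUSTERED INDEX ix_students_id
ON students (student_id);      /* determines the physical row order; only one per table */

CREATE NONCLUSTERED INDEX ix_students_name
ON students (name);            /* separate structure referencing the rows; many allowed */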

13. What is Data Integrity?

Data Integrity is the assurance of accuracy and consistency of data over its entire life-cycle and is
a critical aspect of the design, implementation, and usage of any system which stores, processes,
or retrieves data. It also defines integrity constraints to enforce business rules on the data when
it is entered into an application or a database.

14. What is a Query?

A query is a request for data or information from a database table or combination of tables. A
database query can be either a select query or an action query.

SELECT fname, lname    /* select query */
FROM myDB.students
WHERE student_id = 1;

UPDATE myDB.students    /* action query */
SET fname = 'Captain', lname = 'America'
WHERE student_id = 1;

15. What is a Subquery? What are its types?

A subquery is a query within another query, also known as a nested query or inner query. It is
used to restrict or enhance the data queried by the main query, thereby restricting or enhancing
the output of the main query. For example, here we fetch the contact information for students
who have enrolled for the maths subject:

SELECT name, email, mob, address
FROM myDB.contacts
WHERE roll_no IN (
SELECT roll_no
FROM myDB.students
WHERE subject = 'Maths');

There are two types of subquery - Correlated and Non-Correlated.

• A correlated subquery cannot be executed independently of the main query, because it refers to
a column of a table listed in the FROM clause of the main query, as shown in the sketch after this list.
• A non-correlated subquery can be considered an independent query, and its output is
substituted into the main query.
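
For example, a correlated version of the earlier query could be sketched with EXISTS (one common pattern; the tables reuse the hypothetical schema above):

SELECT name, email, mob, address
FROM myDB.contacts c
WHERE EXISTS (
    SELECT 1
    FROM myDB.students s
    WHERE s.roll_no = c.roll_no    /* refers to the outer query's table, so it cannot run on its own */
    AND s.subject = 'Maths');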

16. What is the SELECT statement?

The SELECT statement in SQL is used to select data from a database. The data returned is stored in a
result table, called the result-set.

SELECT * FROM myDB.students;


17. What are some common clauses used with SELECT query in SQL?

Some common SQL clauses used in conjunction with a SELECT query are as follows:

• WHERE clause in SQL is used to filter records that are necessary, based on specific conditions.
• ORDER BY clause in SQL is used to sort the records based on some field(s) in ascending (ASC) or
descending order (DESC).

SELECT *
FROM myDB.students
WHERE graduation_year = 2019
ORDER BY studentID DESC;

• GROUP BY clause in SQL is used to group records with identical data and can be used in
conjunction with some aggregation functions to produce summarized results from the database.
• HAVING clause in SQL is used to filter records in combination with the GROUP BY clause. It is
different from WHERE, since the WHERE clause cannot filter aggregated records.

SELECT COUNT(studentId), country
FROM myDB.students
WHERE country != 'INDIA'
GROUP BY country
HAVING COUNT(studentID) > 5;

18. What are UNION, MINUS and INTERSECT commands?

The UNION operator combines and returns the result-sets retrieved by two or more SELECT
statements, removing duplicate records (UNION ALL retains them).
The MINUS operator returns the rows retrieved by the first SELECT query that are not present in
the result-set of the second SELECT query.
The INTERSECT clause returns only the rows that are common to the result-sets of the two
SELECT statements.

Certain conditions need to be met before executing either of the above statements in SQL -

• Each SELECT statement within the clause must have the same number of columns
• The columns must also have compatible data types
• The columns in each SELECT statement should necessarily have the same order

SELECT name FROM Students    /* Fetch the union of queries */
UNION
SELECT name FROM Contacts;

SELECT name FROM Students    /* Fetch the union of queries with duplicates */
UNION ALL
SELECT name FROM Contacts;

SELECT name FROM Students    /* Fetch names from students */
MINUS                        /* that aren't present in contacts */
SELECT name FROM Contacts;

SELECT name FROM Students    /* Fetch names from students */
INTERSECT                    /* that are present in contacts as well */
SELECT name FROM Contacts;

19. What is a Cursor? How do you use a Cursor?

A database cursor is a control structure that allows for the traversal of records in a database.
In addition, cursors facilitate processing after traversal, such as retrieval, addition, and deletion
of database records. They can be viewed as a pointer to one row in a set of rows.

Working with SQL Cursor:

1. DECLARE a cursor after any variable declaration. The cursor declaration must always be associated
with a SELECT Statement.
2. OPEN the cursor to initialize the result set. The OPEN statement must be called before fetching
rows from the result set.
3. FETCH statement to retrieve and move to the next row in the result set.
4. Call the CLOSE statement to deactivate the cursor.
5. Finally use the DEALLOCATE statement to delete the cursor definition and release the associated
resources.

DECLARE @name VARCHAR(50)      /* Declare All Required Variables */

DECLARE db_cursor CURSOR FOR   /* Declare Cursor Name */
SELECT name
FROM myDB.students
WHERE parent_name IN ('Sara', 'Ansh')

OPEN db_cursor                 /* Open cursor and Fetch data into @name */
FETCH NEXT
FROM db_cursor
INTO @name

CLOSE db_cursor                /* Close the cursor and deallocate the resources */
DEALLOCATE db_cursor

20. What are Entities and Relationships?

Entity: An entity can be a real-world object, either tangible or intangible, that can be easily
identified. For example, in a college database, students, professors, workers, departments, and
projects can be referred to as entities. Each entity has some associated properties that give it
an identity.
Relationships: Relationships are associations or links between entities that are related to each
other. For example, the employees table in a company's database can be associated with the
salary table in the same database.

21. List the different types of relationships in SQL.

• One-to-One - This can be defined as the relationship between two tables where each record in
one table is associated with at most one record in the other table.
• One-to-Many & Many-to-One - This is the most commonly used relationship where a record in a
table is associated with multiple records in the other table.
• Many-to-Many - This is used in cases when multiple instances on both sides are needed for
defining a relationship.
• Self-Referencing Relationships - This is used when a table needs to define a relationship with
itself.
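
A minimal sketch of how these relationships are typically modelled with keys (hypothetical tables; standard SQL syntax):

CREATE TABLE departments (
    dept_id INT PRIMARY KEY,
    dept_name VARCHAR(50)
);

CREATE TABLE employees (
    emp_id INT PRIMARY KEY,
    emp_name VARCHAR(50),
    dept_id INT REFERENCES departments(dept_id),   /* one-to-many: one department, many employees */
    emp_sup INT REFERENCES employees(emp_id)       /* self-referencing: an employee's supervisor */
);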

22. What is an Alias in SQL?

An alias is a feature of SQL that is supported by most, if not all, RDBMSs. It is a temporary name
assigned to the table or table column for the purpose of a particular SQL query. In addition,
aliasing can be employed as an obfuscation technique to secure the real names of database
fields. A table alias is also called a correlation name.

An alias is represented explicitly by the AS keyword but in some cases, the same can be
performed without it as well. Nevertheless, using the AS keyword is always a good practice.

SELECT A.emp_name AS "Employee",    /* Alias using AS keyword */
B.emp_name AS "Supervisor"
FROM employee A, employee B         /* Table aliases without AS keyword */
WHERE A.emp_sup = B.emp_id;

23. What is a View?

A view in SQL is a virtual table based on the result-set of an SQL statement. A view contains rows
and columns, just like a real table. The fields in a view are fields from one or more real tables in
the database.
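
For example, a view over the hypothetical tables used earlier could be created and queried like this:

CREATE VIEW student_contacts AS      /* virtual table; no data is stored separately */
SELECT s.name, c.email
FROM students s
JOIN contacts c ON c.roll_no = s.roll_no;

SELECT * FROM student_contacts;      /* queried just like a real table */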

24. What is Normalization?

Normalization is the process of organizing structured data in a database efficiently. It
includes the creation of tables, establishing relationships between them, and defining rules for
those relationships. These rules keep inconsistency and redundancy in check, adding flexibility
to the database.

25. What is Denormalization?

Denormalization is the inverse process of normalization, where the normalized schema is
converted into a schema that holds redundant information. Performance is improved by using
this redundancy while keeping the redundant data consistent. Denormalization is performed to
reduce the overhead produced in the query processor by an over-normalized structure.

26. What are the various forms of Normalization?

Normal Forms are used to eliminate or reduce redundancy in database tables. The different
forms are as follows:

• First Normal Form:

A relation is in first normal form if every attribute in that relation is a single-valued attribute. If a
relation contains a composite or multi-valued attribute, it violates the first normal form. Let's
consider the following Students table. Each student in the table has a name, an address, and
the books they issued from the public library -

Students Table

Student | Address | Books Issued | Salutation
Sara | Amanora Park Town 94 | Until the Day I Die (Emily Carpenter), Inception (Christopher Nolan) | Ms.
Ansh | 62nd Sector A-10 | The Alchemist (Paulo Coelho), Inferno (Dan Brown) | Mr.
Sara | 24th Street Park Avenue | Beautiful Bad (Annie Ward), Woman 99 (Greer Macallister) | Mrs.
Ansh | Windsor Street 777 | Dracula (Bram Stoker) | Mr.

As we can observe, the Books Issued field has more than one value per record, and to convert it
into 1NF, this has to be resolved into separate individual records for each book issued. Check the
following table in 1NF form -

Students Table (1st Normal Form)

Student | Address | Books Issued | Salutation
Sara | Amanora Park Town 94 | Until the Day I Die (Emily Carpenter) | Ms.
Sara | Amanora Park Town 94 | Inception (Christopher Nolan) | Ms.
Ansh | 62nd Sector A-10 | The Alchemist (Paulo Coelho) | Mr.
Ansh | 62nd Sector A-10 | Inferno (Dan Brown) | Mr.
Sara | 24th Street Park Avenue | Beautiful Bad (Annie Ward) | Mrs.
Sara | 24th Street Park Avenue | Woman 99 (Greer Macallister) | Mrs.
Ansh | Windsor Street 777 | Dracula (Bram Stoker) | Mr.

• Second Normal Form:

A relation is in second normal form if it satisfies the conditions for the first normal form and does
not contain any partial dependency. A relation in 2NF has no partial dependency, i.e., it has no
non-prime attribute that depends on any proper subset of any candidate key of the table. Often,
specifying a single column Primary Key is the solution to the problem. Examples -

Example 1 - Consider the above example. As we can observe, the Students Table in 1NF
form has a candidate key in the form of [Student, Address] that can uniquely identify all records
in the table. The field Books Issued (a non-prime attribute) depends partially on the Student field.
Hence, the table is not in 2NF. To convert it into 2nd Normal Form, we will split the
table into two while specifying a new Primary Key attribute to identify the individual records in
the Students table. A Foreign Key constraint will be set on the other table to ensure referential
integrity.

Students Table (2nd Normal Form)

Student_ID | Student | Address | Salutation
1 | Sara | Amanora Park Town 94 | Ms.
2 | Ansh | 62nd Sector A-10 | Mr.
3 | Sara | 24th Street Park Avenue | Mrs.
4 | Ansh | Windsor Street 777 | Mr.

Books Table (2nd Normal Form)

Student_ID | Book Issued
1 | Until the Day I Die (Emily Carpenter)
1 | Inception (Christopher Nolan)
2 | The Alchemist (Paulo Coelho)
2 | Inferno (Dan Brown)
3 | Beautiful Bad (Annie Ward)
3 | Woman 99 (Greer Macallister)
4 | Dracula (Bram Stoker)

Example 2 - Consider the following dependencies in relation R(W, X, Y, Z):

WX -> Y   [W and X together determine Y]
XY -> Z   [X and Y together determine Z]

Here, WX is the only candidate key and there is no partial dependency, i.e., any proper subset of
WX doesn’t determine any non-prime attribute in the relation.

• Third Normal Form

A relation is said to be in third normal form if it satisfies the conditions for the second
normal form and there is no transitive dependency between the non-prime attributes, i.e., all
non-prime attributes are determined only by the candidate keys of the relation and not by any
other non-prime attribute.

Example 1 - Consider the Students Table in the above example. As we can observe, the Students
Table in 2NF form has a single candidate key Student_ID (primary key) that can uniquely
identify all records in the table. The field Salutation (a non-prime attribute), however, depends on
the Student field rather than on the candidate key. Hence, the table is not in 3NF. To convert it into
3rd Normal Form, we will once again split the table, specifying a new Foreign Key
constraint to identify the salutations for individual records in the Students table. The Primary Key
constraint for the same will be set on the Salutations table to identify each record uniquely.

Students Table (3rd Normal Form)

Student_ID | Student | Address | Salutation_ID
1 | Sara | Amanora Park Town 94 | 1
2 | Ansh | 62nd Sector A-10 | 2
3 | Sara | 24th Street Park Avenue | 3
4 | Ansh | Windsor Street 777 | 2

Books Table (3rd Normal Form)

Student_ID | Book Issued
1 | Until the Day I Die (Emily Carpenter)
1 | Inception (Christopher Nolan)
2 | The Alchemist (Paulo Coelho)
2 | Inferno (Dan Brown)
3 | Beautiful Bad (Annie Ward)
3 | Woman 99 (Greer Macallister)
4 | Dracula (Bram Stoker)

Salutations Table (3rd Normal Form)

Salutation_ID | Salutation
1 | Ms.
2 | Mr.
3 | Mrs.

Example 2 - Consider the following dependencies in relation R(P, Q, R, S, T):

P -> QR   [P determines Q and R]
RS -> T   [R and S together determine T]
Q -> S
T -> P

The candidate keys of the above relation are {P, RS, QR, T}. Since every attribute of the relation
is a prime attribute (part of some candidate key), the relation is in 3NF.

• Boyce-Codd Normal Form

A relation is in Boyce-Codd Normal Form if it satisfies the conditions for third normal form and,
for every functional dependency, the left-hand side is a super key. In other words, a relation in
BCNF has only non-trivial functional dependencies of the form X -> Y, such that X is always a
super key. For example - In the above example, Student_ID serves as the sole unique identifier
for the Students Table and Salutation_ID for the Salutations Table, so these tables are in BCNF.
The same cannot be said for the Books Table, as there can be several books with common book
names and the same Student_ID.

27. What are the TRUNCATE, DELETE and DROP statements?

The DELETE statement is used to delete rows from a table.

DELETE FROM Candidates
WHERE CandidateId > 1000;

The TRUNCATE command is used to delete all the rows from a table and free the space occupied
by those rows, while retaining the table structure.

TRUNCATE TABLE Candidates;

The DROP command is used to remove an object from the database. If you drop a table, all the
rows in the table are deleted and the table structure is removed from the database.

DROP TABLE Candidates;

28. What is the difference between DROP and TRUNCATE statements?

If a table is dropped, all things associated with the table are dropped as well. This includes the
relationships defined on the table with other tables, the integrity checks and constraints, and the
access privileges and other grants that the table has. To create and use the table again in its
original form, all these relations, checks, constraints, privileges and relationships need to be
redefined. However, if a table is truncated, none of these problems exists and the table retains
its original structure.

29. What is the difference between DELETE and TRUNCATE statements?

The TRUNCATE command deletes all the rows from the table and frees the space occupied by
them.
The DELETE command deletes only the rows from the table that match the condition given in the
WHERE clause, or all the rows if no condition is specified, but it does not free the space occupied
by the table.

30. What are Aggregate and Scalar functions?

An aggregate function performs operations on a collection of values to return a single scalar
value. Aggregate functions are often used with the GROUP BY and HAVING clauses of the
SELECT statement. Following are the widely used SQL aggregate functions:

• AVG() - Calculates the mean of a collection of values.
• COUNT() - Counts the total number of records in a specific table or view.
• MIN() - Calculates the minimum of a collection of values.
• MAX() - Calculates the maximum of a collection of values.
• SUM() - Calculates the sum of a collection of values.
• FIRST() - Fetches the first element in a collection of values.
• LAST() - Fetches the last element in a collection of values.

Note: All aggregate functions described above ignore NULL values except for the COUNT
function.
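
A sketch combining several aggregate functions with GROUP BY (the marks column is hypothetical):

SELECT country,
       COUNT(studentId) AS total_students,
       AVG(marks) AS average_marks,
       MIN(marks) AS lowest_marks,
       MAX(marks) AS highest_marks
FROM myDB.students
GROUP BY country;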

A scalar function returns a single value based on the input value. Following are the widely used
SQL scalar functions:

• LEN() - Calculates the total length of the given field (column).
• UCASE() - Converts string values to uppercase characters.
• LCASE() - Converts string values to lowercase characters.
• MID() - Extracts substrings from string values in a table.
• CONCAT() - Concatenates two or more strings.
• RAND() - Generates a random number.
• ROUND() - Rounds a numeric field to the specified number of decimal places.
• NOW() - Returns the current date & time.
• FORMAT() - Sets the format in which to display a collection of values.
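
A sketch using a few of these scalar functions (assuming MySQL, where all of the functions shown below are available; the percentage column is hypothetical):

SELECT UCASE(fname) AS first_upper,          /* uppercase conversion */
       MID(lname, 1, 3) AS lname_prefix,     /* substring extraction */
       CONCAT(fname, ' ', lname) AS full_name,
       ROUND(percentage, 1) AS rounded_pct,
       NOW() AS queried_at                   /* current date & time */
FROM myDB.students;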

31. What is a User-defined Function? What are its various types?

The user-defined functions in SQL are like functions in any other programming language that
accept parameters, perform complex calculations, and return a value. They are written to use the
logic repetitively whenever required. There are two types of SQL user-defined functions:

• Scalar Function: As explained earlier, user-defined scalar functions return a single scalar value.
• Table-Valued Functions: User-defined table-valued functions return a table as output.
o Inline: returns a table data type based on a single SELECT statement.
o Multi-statement: returns a tabular result-set but, unlike inline, multiple SELECT statements
can be used inside the function body.
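
A minimal sketch of a scalar user-defined function (assuming SQL Server's T-SQL syntax; the function name is illustrative):

CREATE FUNCTION dbo.FullName (@fname VARCHAR(50), @lname VARCHAR(50))
RETURNS VARCHAR(101)
AS
BEGIN
    RETURN CONCAT(@fname, ' ', @lname);   /* returns a single scalar value */
END;

It could then be reused in queries, e.g. SELECT dbo.FullName(fname, lname) FROM students;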

32. What is OLTP?

OLTP, or Online Transaction Processing, is a class of software applications capable of
supporting transaction-oriented programs. An essential attribute of an OLTP system is its ability
to maintain concurrency. To avoid single points of failure, OLTP systems are often decentralized.
These systems are usually designed for a large number of users who conduct short transactions.
Database queries are usually simple, require sub-second response times, and return relatively
few records.

33. What are the differences between OLTP and OLAP?

OLTP, or Online Transaction Processing, is a class of software applications capable of
supporting transaction-oriented programs. An important attribute of an OLTP system is its ability
to maintain concurrency. OLTP systems often follow a decentralized architecture to avoid single
points of failure. These systems are generally designed for a large audience of end-users who
conduct short transactions. Queries involved in such databases are generally simple, need fast
response times, and return relatively few records. The number of transactions per second acts as
an effective measure for such systems.

OLAP, or Online Analytical Processing, is a class of software programs characterized by a
relatively low frequency of online transactions. Queries are often complex and involve a number
of aggregations. For OLAP systems, the effectiveness measure relies heavily on response time.
Such systems are widely used for data mining or for maintaining aggregated, historical data,
usually in multi-dimensional schemas.

34. What is Collation? What are the different types of Collation Sensitivity?

Collation refers to a set of rules that determine how data is sorted and compared. Rules defining
the correct character sequence are used to sort the character data. It incorporates options for
specifying case sensitivity, accent marks, kana character types, and character width. Below are the
different types of collation sensitivity:

• Case sensitivity: A and a are treated differently.
• Accent sensitivity: a and á are treated differently.
• Kana sensitivity: Japanese kana characters Hiragana and Katakana are treated differently.
• Width sensitivity: The same character represented in single-byte (half-width) and double-byte
(full-width) forms is treated differently.

35. What is a Stored Procedure?

A stored procedure is a subroutine available to applications that access a relational database
management system (RDBMS). Such procedures are stored in the database data dictionary. A
disadvantage of stored procedures is that they can be executed only in the database and occupy
more memory on the database server. They also provide a sense of security and functionality, as
users who can't access the data directly can be granted access via stored procedures.

DELIMITER $$
CREATE PROCEDURE FetchAllStudents()
BEGIN
SELECT * FROM myDB.students;
END $$
DELIMITER ;

36. What is a Recursive Stored Procedure?

A stored procedure that calls itself until a boundary condition is reached is called a recursive
stored procedure. Recursion helps programmers deploy the same set of code
several times as and when required. Some SQL programming languages limit the recursion depth
to prevent an infinite loop of procedure calls from causing a stack overflow, which slows down
the system and may lead to system crashes.

DELIMITER $$ /* Set a new delimiter => $$ */
CREATE PROCEDURE calctotal( /* Create the procedure */
IN number INT, /* Set Input and Output variables */
OUT total INT
) BEGIN
DECLARE score INT DEFAULT NULL; /* Set the default value => "score" */
SELECT awards FROM achievements /* Update "score" via SELECT query */
WHERE id = number INTO score;
IF score IS NULL THEN SET total = 0; /* Termination condition */
ELSE
CALL calctotal(number+1, total); /* Recursive call */
SET total = total + score; /* Action after recursion */
END IF;
END $$ /* End of procedure */
DELIMITER ; /* Reset the delimiter */

37. How to create empty tables with the same structure as another table?

Creating an empty table with the same structure as another table can be done by fetching the
records of one table into a new table using the INTO operator while setting a WHERE clause to be
false for all records. SQL prepares the new table with a duplicate structure to accept the fetched
records, but since no records satisfy the WHERE clause, nothing is inserted into the new table.

SELECT * INTO Students_copy
FROM Students WHERE 1 = 2;
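
As an aside, SELECT ... INTO is the SQL Server pattern; in MySQL (assuming MySQL), a dedicated statement achieves the same goal:

CREATE TABLE Students_copy LIKE Students;   /* copies the table structure and indexes; no rows */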

38. What is Pattern Matching in SQL?

SQL pattern matching provides pattern search in data when you do not know the exact word you
are searching for. This kind of SQL query uses wildcards to match a string pattern rather than an
exact word. The LIKE operator is used in conjunction with SQL wildcards to fetch the
required information.

• Using the % wildcard to perform a simple search

The % wildcard matches zero or more characters of any type and can be used to define wildcards
both before and after the pattern. For example, to search for students whose first name begins
with the letter K:

SELECT *
FROM students
WHERE first_name LIKE 'K%'

• Omitting the patterns using the NOT keyword

Use the NOT keyword to select records that don't match the pattern. This query returns all
students whose first name does not begin with K.

SELECT *
FROM students
WHERE first_name NOT LIKE 'K%'

• Matching a pattern anywhere using the % wildcard twice

Search for a student in the database where he/she has a K in his/her first name.

SELECT *
FROM students
WHERE first_name LIKE '%K%'

• Using the _ wildcard to match pattern at a specific position

The _ wildcard matches exactly one character of any type. It can be used in conjunction with the %
wildcard. This query fetches all students with the letter K at the third position in their first name.

SELECT *
FROM students
WHERE first_name LIKE '__K%'

• Matching patterns for a specific length

Because the _ wildcard matches exactly one character, it can be used to limit the length and
position of the matched results. For example -

SELECT * /* Matches first names with three or more letters */
FROM students
WHERE first_name LIKE '___%'

SELECT * /* Matches first names with exactly four characters */
FROM students
WHERE first_name LIKE '____'

PostgreSQL Interview Questions

39. What is PostgreSQL?

PostgreSQL was first called Postgres and was developed by a team led by Computer Science
Professor Michael Stonebraker in 1986. It was developed to help developers build enterprise-
level applications by upholding data integrity and making systems fault-tolerant. PostgreSQL is
therefore an enterprise-level, flexible, robust, open-source, object-relational DBMS that
supports flexible workloads along with handling concurrent users. It has been consistently
supported by the global developer community. Due to its fault-tolerant nature, PostgreSQL has
gained widespread popularity among developers.

40. How do you define Indexes in PostgreSQL?

Indexes are built-in structures in PostgreSQL that queries use to search a table in the database
more efficiently. Consider a table with thousands of records and the below query, which only a
few records can satisfy. The engine has to scan every single row to check the condition, so it will
take a lot of time to return the matching rows. This is undoubtedly inefficient for a system
dealing with huge data. If this system had an index on the column being searched, it could use an
efficient method for identifying matching rows by walking through only a few levels of the index.
This is called indexing.

SELECT * FROM some_table WHERE table_col = 120;
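
An index on the searched column could then be created as follows (the index name is illustrative):

CREATE INDEX idx_some_table_col
ON some_table (table_col);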


41. How will you change the datatype of a column?

This can be done by using the ALTER TABLE statement as shown below:

Syntax:

ALTER TABLE tname
ALTER COLUMN col_name [SET DATA] TYPE new_data_type;
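
For example, assuming a hypothetical students table, the age column could be widened like this:

ALTER TABLE students
ALTER COLUMN age TYPE BIGINT;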

42. What is the command used for creating a database in PostgreSQL?

The first step of using PostgreSQL is to create a database. This is done by using the createdb
command as shown below:

createdb db_name

After running the above command, if the database creation was successful, then the below
message is shown:

CREATE DATABASE
43. How can we start, restart and stop the PostgreSQL server?

• To start the PostgreSQL server, we run:

service postgresql start

• Once the server is successfully started, we get the below message:

Starting PostgreSQL: ok

• To restart the PostgreSQL server, we run:

service postgresql restart

Once the server is successfully restarted, we get the message:

Restarting PostgreSQL: server stopped
ok

• To stop the server, we run the command:

service postgresql stop

Once stopped successfully, we get the message:

Stopping PostgreSQL: server stopped
ok

44. What are partitioned tables in PostgreSQL?

Partitioned tables are logical structures used to divide large tables into smaller structures called
partitions. This approach is used to effectively increase query performance while dealing with
large database tables. To create a partition, a key called the partition key, which is usually a table
column or an expression, and a partitioning method need to be defined. There are three types of
built-in partitioning methods provided by Postgres:

• Range Partitioning: This method partitions data based on a range of values. It is most commonly
used on date fields to get monthly, weekly or yearly data. For corner cases like a value lying on the
boundary of a range, for example: if the range of partition 1 is 10-20 and the range of partition 2 is
20-30, then the value 20 belongs to the second partition and not the first.
• List Partitioning: This method is used to partition based on a list of known values. Most
commonly used when we have a key with a categorical value. For example, getting sales data
based on regions divided as countries, cities, or states.
• Hash Partitioning: This method applies a hash function to the partition key. It is used when there
are no specific requirements for data division and rows need to be accessed individually. For
example, to access data for a specific product, hash partitioning directs the query to the partition
holding the required data.

The type of partition key and the partitioning method used determine the performance and
manageability of the partitioned table. A minimal range-partitioning example is sketched below.
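
The sketch below uses PostgreSQL's declarative partitioning syntax (available since PostgreSQL 10); the sales table is hypothetical:

CREATE TABLE sales (
    sale_id   BIGINT,
    sale_date DATE NOT NULL,
    amount    NUMERIC
) PARTITION BY RANGE (sale_date);

CREATE TABLE sales_2023 PARTITION OF sales               /* rows with 2023 dates land here */
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');    /* lower bound inclusive, upper exclusive */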

45. Define tokens in PostgreSQL?

A token in PostgreSQL is either a keyword, identifier, literal, constant, quoted identifier, or any
symbol that has a distinct meaning. Tokens may or may not be separated by a space, newline,
or tab. If the tokens are keywords, they are usually commands with useful meanings. Tokens are
known as the building blocks of any PostgreSQL code.

46. What is the importance of the TRUNCATE statement?

The TRUNCATE TABLE name_of_table statement removes the data efficiently and quickly from
the table.
The TRUNCATE statement can also be used to reset the values of identity columns along with
the data cleanup, as shown below:

TRUNCATE TABLE name_of_table
RESTART IDENTITY;

We can also use the statement for removing data from multiple tables all at once by mentioning
the table names separated by commas, as shown below:

TRUNCATE TABLE
table_1,
table_2,
table_3;

47. What is the capacity of a table in PostgreSQL?

The maximum size of a table in PostgreSQL is 32 TB.

48. Define sequence.

A sequence is a schema-bound, user-defined object that generates a sequence of integers. It is
most commonly used to generate values for identity columns in a table. We can create a
sequence by using the CREATE SEQUENCE statement as shown below:

CREATE SEQUENCE serial_num START 100;

To get the next number 101 from the sequence, we use the nextval() method as shown below:

SELECT nextval('serial_num');

We can also use this sequence while inserting new records using the INSERT command:

INSERT INTO ib_table_name VALUES (nextval('serial_num'), 'interviewbit');

49. What are string constants in PostgreSQL?

String constants are character sequences bound within single quotes. They are used when
inserting or updating character data in the database.
There are also special string constants that are dollar-quoted.
Syntax: $tag$<string_constant>$tag$. The tag in the constant is optional; when no tag is
specified, the constant is called a double-dollar string literal.
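
For example (the embedded apostrophe shows why dollar quoting is convenient):

SELECT 'InterviewBit''s guide';            /* single quotes; embedded quote must be doubled */
SELECT $$InterviewBit's guide$$;           /* double-dollar string literal */
SELECT $msg$InterviewBit's guide$msg$;     /* dollar-quoted constant with an optional tag */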

50. How can you get a list of all databases in PostgreSQL?

This can be done by using the command \l -> backslash followed by the lower-case letter L.

51. How can you delete a database in PostgreSQL?

This can be done by using the DROP DATABASE command as shown in the syntax below:

DROP DATABASE database_name;

If the database has been deleted successfully, then the following message would be shown:

DROP DATABASE

52. What are ACID properties? Is PostgreSQL compliant with ACID?

ACID stands for Atomicity, Consistency, Isolation, and Durability. These are database transaction
properties used to guarantee data validity in the event of errors and failures.

• Atomicity: This property ensures that a transaction is completed in an all-or-nothing way.
• Consistency: This ensures that updates made to the database are valid and follow the defined
rules and restrictions.
• Isolation: This property ensures that concurrent transactions do not interfere with each other;
the intermediate state of a transaction is not visible to other transactions.
• Durability: This property ensures that committed transactions are stored permanently in the
database.

PostgreSQL is compliant with ACID properties.

53. Can you explain the architecture of PostgreSQL?

• The architecture of PostgreSQL follows the client-server model.
• The server side comprises a background process manager, query processor, utilities, and shared
memory space, which work together to build PostgreSQL’s instance that has access to the data.
The client application does the task of connecting to this instance and requests data processing
from its services. The client can either be a GUI (Graphical User Interface) or a web application.
The most commonly used client for PostgreSQL is pgAdmin.

54. What do you understand by multi-version concurrency control?

MVCC, or Multi-Version Concurrency Control, is used to avoid unnecessary database locks when
two or more requests try to access or modify the same data at the same time. This ensures that
users do not have to wait for others' requests to complete before accessing the database. The
transactions are recorded whenever anyone tries to access the content.


55. What do you understand by the command enable-debug?

The command enable-debug is used to enable the compilation of all libraries and applications
with debugging support. When this is enabled, system processes are slowed down and the size
of the binary file generally increases. Hence, it is not recommended to switch this on in a
production environment. It is most commonly used by developers to debug their scripts and
spot issues.

56. What are the read phenomena and transaction isolation levels defined by the SQL standards?

The SQL standards state that the following three phenomena should be prevented while
transactions run concurrently, and define four levels of transaction isolation to deal with these
phenomena.

• Dirty reads: If a transaction reads data written by a concurrent uncommitted transaction, these
reads are called dirty reads.
• Phantom reads: This occurs when the same query, executed twice, returns different sets of rows.
For example, transaction A retrieves a set of rows matching some search criteria; a concurrent
transaction B inserts new rows; when transaction A repeats the query, it retrieves the new rows in
addition to the rows obtained earlier for the same search criteria. The results are different.
• Non-repeatable reads: This occurs when a transaction reads the same row multiple times and
gets different values each time due to concurrency. This happens when another transaction
updates that data and our current transaction fetches the updated data, resulting in different
values.

To tackle these, there are 4 standard isolation levels defined by SQL standards. They are as
follows:

• Read Uncommitted – The lowest of the isolation levels. Here, transactions are not isolated
and can read data that has not been committed by other transactions, resulting in dirty reads.
• Read Committed – This level guarantees that any data read was committed at the moment it is
read. Hence, dirty reads are avoided. This level uses read/write locks on the current rows, which
prevents other transactions from reading/writing/updating/deleting the row while the current
transaction operates on it.
• Repeatable Read – A more restrictive level of isolation. It holds read and write locks on all
rows it operates on. Due to this, non-repeatable reads are avoided, as other transactions cannot
update or delete these rows.
• Serializable – The highest of all isolation levels. It guarantees that the execution of concurrent
transactions appears as if they executed serially.

The following table summarizes which types of unwanted reads each level prevents:

Isolation levels | Dirty Reads | Phantom Reads | Non-repeatable Reads

Read Uncommitted | Might occur | Might occur | Might occur
Read Committed | Won’t occur | Might occur | Might occur
Repeatable Read | Won’t occur | Might occur | Won’t occur
Serializable | Won’t occur | Won’t occur | Won’t occur
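
In PostgreSQL, the isolation level can be set per transaction; a minimal sketch (the accounts table is hypothetical):

BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;   /* must be set before the first query */
SELECT balance FROM accounts WHERE acc_id = 1;     /* repeated reads see the same snapshot */
COMMIT;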

57. What can you tell about WAL (Write Ahead Logging)?

Write Ahead Logging is a feature that increases database reliability by logging changes before
any changes are applied to the database. It ensures that we have enough information when a
database crash occurs, helping to pinpoint how far the work had progressed and giving a
starting point from which to resume the work that was discontinued.


58. What is the main disadvantage of deleting data from an existing table using the DROP TABLE
command?

The DROP TABLE command deletes all the data in the table and also removes the table structure
from the database. If our requirement is only to remove the data, we would then need to
recreate the table to store future data. In such cases, it is advised to use the TRUNCATE command.

59. How do you perform case-insensitive searches using regular expressions in PostgreSQL?

To perform case-insensitive matches using a regular expression, we can use the POSIX (~*)
operator from the pattern matching operators. For example:

'interviewbit' ~* '.*INTervIewBit.*'
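
PostgreSQL also provides the ILIKE operator for simple case-insensitive pattern matching without regular expressions:

SELECT * FROM students
WHERE first_name ILIKE 'k%';    /* matches 'K...' and 'k...' */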

60. How will you take backup of the database in PostgreSQL?

We can achieve this by using the pg_dump tool for dumping all object contents in the database
into a single file. The steps are as follows:

Step 1: Navigate to the bin folder of the PostgreSQL installation path.

C:\>cd C:\Program Files\PostgreSQL\10.0\bin

Step 2: Execute pg_dump program to take the dump of data to a .tar folder as shown below:

pg_dump -U postgres -W -F t sample_data > C:\Users\admin\pgbackup\sample_data.tar

The database dump will be stored in the sample_data.tar file at the specified location.

61. Does PostgreSQL support full text search?

Full-Text Search is the method of searching for single or collections of documents stored on a
computer in a full-text database. It is best supported in advanced database systems like SOLR or
Elasticsearch; the feature is present in PostgreSQL as well, but in a comparatively basic form.
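
A basic full-text search sketch using PostgreSQL's built-in operators (the documents table is hypothetical):

SELECT title
FROM documents
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'database & index');
/* matches rows whose body contains both lexemes */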

62. What are parallel queries in PostgreSQL?

Parallel queries support is a feature provided in PostgreSQL for devising query plans capable of
exploiting multiple CPU processors to execute queries faster.
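
Whether a plan actually uses parallelism can be checked with EXPLAIN; a sketch (big_table is hypothetical, and max_parallel_workers_per_gather is the setting that caps workers per plan node):

SET max_parallel_workers_per_gather = 4;   /* allow up to 4 parallel workers */
EXPLAIN SELECT COUNT(*) FROM big_table;    /* the plan may show a Gather node with planned workers */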

63. Differentiate between commit and checkpoint.

The commit action ensures that the data consistency of the transaction is maintained, and it
ends the current transaction. Commit writes a new record describing the COMMIT to the
transaction log in memory. A checkpoint, on the other hand, writes all changes committed up to
a given SCN (System Change Number) to disk; that SCN is recorded in the datafile headers and
control files.

Conclusion:

SQL is the language of databases. It has a vast scope and a robust capability for creating and
manipulating a variety of database objects using commands like CREATE, ALTER, DROP, etc.,
and for loading database objects using commands like INSERT. It also provides options for data
manipulation using commands like DELETE and TRUNCATE, and enables effective retrieval of
data using cursor commands like FETCH, SELECT, etc. There are many such commands that give
the programmer a great deal of control to interact with the database efficiently without wasting
many resources. The popularity of SQL has grown so much that almost every programmer relies
on it to implement their application's storage functionality, making it an exciting language to
learn. Learning SQL gives developers the benefit of understanding the data structures used for
storing an organization's data, together with an additional level of control and an in-depth
understanding of the application.

PostgreSQL, being an open-source database system with extremely robust and sophisticated
ACID, indexing, and transaction support, has found widespread popularity among the developer
community.
