
Shabnam Sangwan, A.P.


Introduction: Architecture, Advantages, Disadvantages, Data models, relational algebra, SQL,
Normal forms.

Objective of the Unit:

• Understanding data, information, database and DBMS.
• Development and need of DBMS
• Understanding architecture of DBMS, People associated with the Database
• Necessity and benefits of E-R diagram
• Introduction to Relational Model, Integrity constraints over relations
• Querying relational data, logical database design, introduction to views and tables
• Relational algebra – projection and selection, relational calculus
• Domain relational calculus, expressive power of algebra and calculus

Introductory Concepts
data—a fact, something upon which an inference is based (information or knowledge has value,
data has cost)

data item—smallest named unit of data that has meaning in the real world (examples: last
name, address, ssn, political party)

data aggregate (or group) -- a collection of related data items that form a whole concept;
a simple group is a fixed collection, e.g. date (month, day, year); a repeating
group is a variable-length collection, e.g. a set of aliases.

record -- a group of related data items treated as a unit by an application program (examples:
presidents, elections, congresses)

file—collection of records of a single type (examples: president, election)

database—collection of interrelated stored data that serves the needs of multiple users within
one or more organizations; a collection of tables in the relational model.

database management system (DBMS) -- a generalized software system for storing and
manipulating databases. Includes logical view (schema, sub-schema), physical view (access
methods, clustering), data manipulation language, data definition language, utilities - security,
recovery, integrity, etc.

M.TECH (Computer Science & Engineering) 1

Shabnam Sangwan, A.P. in CSE, SKITM

database administrator (DBA) -- person or group responsible for the effective use of database
technology in an organization or enterprise.

A DBMS is often loosely defined as a collection of logically related data and a set of programs to
access the data. Strictly speaking, this is the definition of a “Database System”, which comprises
two components: (i) the Database and (ii) the DBMS.


Query Processing
Storage Management




Discuss the Types of Database

DATABASE: A database is a collection of logically related data that can be recorded. The
information stored in the database must have the following implicit properties:


(a) It must represent some real-world aspect; like a college or a company etc. The aspect
represented by the database is called its “Mini-world”.

(b) It must comprise a logically coherent collection of data, which should have well-
understood inherent meaning (semantics).

(c) The repository of data must be designed, developed and implemented for a specific
purpose. There must exist an intended group of users, who must have some pre-conceived
applications of the data.

A Database System will have the following major components:

- Sources of information, from where it derives its data.

- Some related real-world events, which influence its data.

- Some intended users, who would be interested in its data.

For example, in the college database, sources of information will be students, faculty, labs etc.
The real-world events affecting the information in the database will be admissions, exams,
results & placements etc. The set of intended users will be faculty, students, admin staff etc.

• Hierarchical database
A hierarchical data model is a data model in which the data is organized into a tree-like structure.
The structure allows repeating information using parent/child relationships: each parent can have
many children but each child has only one parent (also known as a 1:many relationship). All attributes
of a specific record are listed under an entity type.


In a database, an entity type is the equivalent of a table; each individual record is represented as a
row and an attribute as a column. Entity types are related to each other using 1:N mapping, also
known as one-to-many relationships. This model is recognized as the first database model,
created by IBM in the 1960s.

• Network database
The network model is a database model conceived as a flexible way of representing objects and
their relationships. Its distinguishing feature is that the schema, viewed as a graph in which
object types are nodes and relationship types are arcs, is not restricted to being a hierarchy.

The network model's original inventor was Charles Bachman, and it was developed into a
standard specification published in 1969 by the CODASYL Consortium.

• Relational database


A relational database matches data by using common characteristics found within the data set.
The resulting groups of data are organized and are much easier for many people to understand.

For example, a data set containing all the real-estate transactions in a town can be grouped by the
year the transaction occurred; or it can be grouped by the sale price of the transaction; or it can
be grouped by the buyer's last name; and so on.
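
The grouping described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name `sale`, its columns and the three sample rows are assumptions made up for the example, not data from these notes.

```python
import sqlite3

# Hypothetical table and sample rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (id INTEGER PRIMARY KEY, year INTEGER, "
    "price INTEGER, buyer_last_name TEXT)"
)
conn.executemany(
    "INSERT INTO sale (year, price, buyer_last_name) VALUES (?, ?, ?)",
    [(2019, 250000, "Rao"), (2019, 310000, "Singh"), (2020, 250000, "Rao")],
)

def count_by(column):
    """Count transactions per distinct value of one column."""
    # Column names cannot be bound as SQL parameters, so the name is
    # checked against a fixed whitelist before it goes into the query text.
    if column not in {"year", "price", "buyer_last_name"}:
        raise ValueError(f"unknown column: {column}")
    rows = conn.execute(
        f"SELECT {column}, COUNT(*) FROM sale GROUP BY {column} ORDER BY {column}"
    )
    return rows.fetchall()

by_year = count_by("year")              # same data, grouped by year
by_price = count_by("price")            # ... by sale price
by_buyer = count_by("buyer_last_name")  # ... by buyer's last name
```

The same stored rows answer all three questions; only the grouping column in the query changes.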

Such a grouping uses the relational model (a technical term for this is schema). Hence, such a
database is called a "relational database."

The software used to do this grouping is called a relational database management system
(RDBMS). The term "relational database" often refers to this type of software.

Relational databases are currently the predominant choice for storing financial records, medical
records, manufacturing and logistical information, personnel data and much more.

• Object-oriented database
An object database (also object-oriented database) is a database model in which information
is represented in the form of objects as used in object-oriented programming.


Object databases are a niche field within the broader DBMS market dominated by relational
database management systems (RDBMS). Object databases have been considered since the
1980s and 1990s, but they have made little impact on mainstream commercial data processing,
though there is some usage in specialized areas.

• Object-relational database
An object-relational database (ORD), or object-relational database management system
(ORDBMS), is a database management system (DBMS) similar to a relational database, but with
an object-oriented database model: objects, classes and inheritance are directly supported in
database schemas and in the query language. In addition, it supports extension of the data model
with custom data-types and methods.


An object-relational database can be said to provide a middle ground between relational
databases and object-oriented databases (OODBMS). In object-relational databases, the
approach is essentially that of relational databases: the data resides in the database and is
manipulated collectively with queries in a query language; at the other extreme are OODBMSes
in which the database is essentially a persistent object store for software written in an
object-oriented programming language, with a programming API for storing and retrieving objects, and
little or no specific support for querying.



A database is a structured collection of data, stored in a computer system, that describes the
activities of one or more related organizations [1]. This repository of data is tasked with maintaining and
presenting the data in a consistent and efficient fashion to the applications, and the users of such
applications, that use it.

1.2 DBMS:

A Database Management System is a set of software programs that enables users to create,
maintain, and use the data in a database. Typical DBMS functionality includes:

Defining a database: Specify the data types, structures and constraints for the data.

Constructing the database: Load the database on a secondary storage medium.

Manipulating the database: Querying to retrieve specific data, update to reflect changes,
deletion, and generating reports.

Other features include protection or security measures to prevent unauthorized access, and
presentation and visualization of data.

All manipulations of the structure of the database or the information must be done through the
DBMS as shown below:


Fig : Database Management System

Database Features:

Database Management Systems were developed to handle the following difficulties of typical
file-processing systems supported by conventional operating systems.

Databases provide consistency, concurrency, performance, security, reliability and


Database Administration: By providing a common umbrella for a large collection of data that
is stored by several users, a DBMS facilitates maintenance and data administration tasks. A good
DBA can efficiently shield end-users from the chores of fine-tuning the data representation,
periodic back-ups etc.

Data Abstraction:

The main purpose of a database system is to provide users with an abstract view of the system.
The system hides certain details of how data is stored and maintained from level to level.

Types of Data Abstraction

There are three levels of abstraction:

Physical Level: how and where data are actually stored, lowest level of abstraction with low
level data structures.


Conceptual Level: describes what data is stored, the relationships among the data, and the semantics
of the data. The database administrator works at this level.

View Level: Highest level of abstraction, describes a partial view of the database for a particular
group of users. There can be many different views of the same database at this level.



A Schema can be defined as a logical structure described in a formal language supported by the
DBMS [1]. In a relational database, the schema defines the tables, their fields, and the relationships between
fields and tables. A Schema is analogous to the type information of a variable in a program.


An Instance is the actual content of the database at a particular point in time. An Instance is
analogous to the value of the variable.
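
The schema/instance distinction can be sketched with Python's sqlite3 module. The `student` table and its rows are hypothetical; `sqlite_master` is SQLite's own catalog table and is specific to SQLite, so other DBMSs expose the schema differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT)")

def schema():
    # SQLite keeps the CREATE statement of each table in sqlite_master.
    return conn.execute(
        "SELECT sql FROM sqlite_master WHERE name = 'student'"
    ).fetchone()[0]

def instance():
    # The instance is whatever rows the table holds right now.
    return conn.execute(
        "SELECT roll_no, name FROM student ORDER BY roll_no"
    ).fetchall()

schema_before, instance_before = schema(), instance()
conn.execute("INSERT INTO student VALUES (1, 'Asha')")
conn.execute("INSERT INTO student VALUES (2, 'Ravi')")
schema_after, instance_after = schema(), instance()
```

After the two inserts the schema text is unchanged while the instance has grown from no rows to two, much as a variable keeps its type while its value changes.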

Data Model:

A Data Model is a Collection of tools or concepts for describing data, the meaning of data, data
relationships, and data constraints. There are three different Groups:

1. Object-based Logical Models

2. Record-based Logical Models

3. Physical Data Models

Object-based Logical Models:

In this model, the data is described at conceptual and view level. It provides fairly flexible
structuring capabilities. This model allows specifying data constraints explicitly. It includes:

Entity-relationship Model

Object-oriented Model


Entity-relationship Model:

The Entity-Relationship Model is based on a perception of the world as consisting of a collection
of basic objects (Entities) and relationships among these objects. An Entity is a distinguishable
object that exists. Each Entity has associated with it a set of Attributes describing it. A
Relationship is an association among several Entities. The set of all Entities or Relationships of
the same type is called an Entity Set or Relationship Set.

Object-Oriented Model:

The Object-oriented Model is based on a collection of objects, like the E-R Model. An object
contains values stored in instance variables within the object. Unlike record-based models, these
values are themselves objects. Objects contain objects to an arbitrarily deep level of nesting. An
object also contains bodies of code that operate on the object, which are called Methods. Objects
that contain the same types of values and the same methods are grouped into classes. A class can
be viewed as a type definition for Objects, compared to the concept of an abstract data type in a
programming language. The only way in which one object can access the data of another object
is by invoking the method of that other object, which is called sending a message to the object.

In other data models, changing the interest rate entails changing code in application programs. In
the object-oriented model, this only requires a change within the pay-interest method.

Unlike entities in the E-R Model, each object has its own unique identity, independent of the
values it contains. Two objects containing the same values are distinct. Distinction is maintained
in physical level by assigning distinct object identifiers.
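
A minimal Python sketch of these ideas follows. The `Account` class, its fields and the 5% rate are assumptions chosen for illustration; Python's `is` operator stands in here for comparing object identifiers.

```python
class Account:
    """An object bundles its values with the methods that operate on them."""

    def __init__(self, number, balance):
        self.number = number
        self.balance = balance

    def pay_interest(self, rate):
        # Callers only send the "pay-interest" message; how interest is
        # computed is known in this one place.
        self.balance += self.balance * rate

a = Account("A-101", 1000.0)
b = Account("A-101", 1000.0)

same_values = (a.number, a.balance) == (b.number, b.balance)
same_object = a is b       # identity is independent of the stored values

a.pay_interest(0.05)       # changes a only; b is a distinct object
```

Two objects holding equal values are still distinct, and changing how interest is paid means editing only the `pay_interest` method.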

Fig : An Example of Object-Oriented Data Model

Record-based Logical Models:


In these models, the data is described at conceptual and view levels. These models specify the
overall logical structure of the database. The database of this model is structured in fixed-format
records of several types. Each record type defines a fixed number of fields, or attributes, with
each field usually of fixed length. There are three different groups:

1. Relational Model

2. Network Model

3. Hierarchical Model

Relational Model:

This data model is based on first-order predicate logic. Its core idea is to describe a database as a
collection of predicates over a finite set of predicate variables, describing constraints on the
possible values and combination of values. Data and relationships are represented by a collection
of tables. Each table has a number of columns with unique names, e.g., customer, account. A
relational database allows the definition of data structures, storage and retrieval operations and
integrity constraints.
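
As a sketch of tables plus integrity constraints, the example below uses Python's sqlite3 module. The column names and sample rows are assumptions; only the table names customer and account come from the text above. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.execute(
    "CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE account ("
    " acc_no INTEGER PRIMARY KEY,"
    " cust_id INTEGER NOT NULL REFERENCES customer(cust_id),"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO customer VALUES (1, 'Asha')")
conn.execute("INSERT INTO account VALUES (101, 1, 500)")

def try_insert(sql):
    """Return True if the DBMS accepts the row, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

orphan_ok = try_insert("INSERT INTO account VALUES (102, 99, 10)")     # no customer 99
negative_ok = try_insert("INSERT INTO account VALUES (103, 1, -5)")    # breaks CHECK
duplicate_ok = try_insert("INSERT INTO account VALUES (101, 1, 20)")   # repeats the key
row_count = conn.execute("SELECT COUNT(*) FROM account").fetchone()[0]
```

All three bad rows are rejected by the DBMS itself, so no application program has to repeat these checks.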

Fig: Relational Data Model

Network Model:

This model organizes data using two fundamental constructs, called records and sets. Records
contain fields, and sets define one-to-many relationships between records. Data are represented
by collection of records. A set consists of an owner record type, a set name, and a member record
type. An owner record type can also be a member or owner in another set. Relationships among
data are represented by links.


Fig : An Example of a Network Data Model

Hierarchical Model:

In this model, data is organized into a tree-like structure, implying a single upward link in each
record to describe the nesting, and a sort field to keep the records in a particular order in each
same-level list [8]. A hierarchy of parent and child data segments exists, with repeating
information generally held in the child data segments. This model
collects all the instances of a specific record together as a record type. Links are created between
record types using parent-child relationships, and a 1:N mapping exists between record types.
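
The single upward link can be sketched in Python as follows. The `Segment` class and the college/department/student names are hypothetical and serve only to illustrate the tree shape.

```python
class Segment:
    """A record in a hierarchy: exactly one parent, any number of children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # the single upward link
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path(self):
        # Follow the upward links to the root, then reverse the order.
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return "/".join(reversed(names))

college = Segment("College")
cse = Segment("CSE", college)
ece = Segment("ECE", college)
student = Segment("Student-1", cse)

student_path = student.path()
# The name plays the role of the sort field for each same-level list.
departments = sorted(child.name for child in college.children)
```

Because each record has only one parent, there is exactly one path from the root to any record.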


Languages DBMS Speak:

The DBMS must provide appropriate languages once the design is completed. The conceptual and
internal schemas, and the mappings between the two, must be specified for the database (DDL). Once
the database schemas are compiled and the database is populated with data, users must have some means to
manipulate the database (DML).

Data Definition Language (DDL):

DDL is used to specify both conceptual and internal schemas as a set of definitions. DDL
statements are compiled, resulting in a set of tables stored in a special file called the Data Dictionary.
The Data Dictionary contains Metadata (data about data). DDL hides the implementation details
of the database schemas from the users.

Data Manipulation Language (DML):


A language that facilitates the retrieval of information from the database, the insertion of new
information into the database, the deletion of information from the database, and the modification of
information in the database.

There are two main types of DMLs:

Low-Level or Procedural:

Typically retrieves individual records or objects from the database and processes each separately.
Needs to use programming language constructs, such as looping, to retrieve and process each
record from a set of records.

High-level or Nonprocedural:

The user specifies what data is needed, not how to get it. Easier to use. May not generate code as efficient as that
produced by procedural languages.
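
The two styles can be contrasted in a short Python sketch using sqlite3. The `employee` table and its rows are assumptions made up for the example; both approaches compute the total salary of one department.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (eid INTEGER PRIMARY KEY, dept TEXT, sal INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "CSE", 40000), (2, "CSE", 50000), (3, "ECE", 45000)],
)

# Procedural style: fetch records one at a time and process each in a loop.
total_procedural = 0
for eid, dept, sal in conn.execute("SELECT eid, dept, sal FROM employee"):
    if dept == "CSE":
        total_procedural += sal

# Nonprocedural style: state what is wanted and let the DBMS decide how.
total_declarative = conn.execute(
    "SELECT SUM(sal) FROM employee WHERE dept = 'CSE'"
).fetchone()[0]
```

Both give the same answer; in the second form the filtering and summing are left to the DBMS.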

Database Administrator:

A Database Administrator (DBA) is a person having central control over data and programs
accessing that data and is responsible for the following tasks:

1. Schema definition / modification.

2. Storage structure definition / modification.

3. Authorization of data access.

4. Integrity constraints specification.

5. Monitoring performance.

6. Responding to changes in requirements.

Database Users:

Database users fall into different categories:

Application Programmers:


These people are computer professionals who interact with the system through DML calls
embedded in a program written in a host language (e.g. C, Java, Pascal). The DML precompiler
converts DML calls to normal procedure calls in the host language. The host language compiler
then generates the object code. Some special programming languages combine control structures with
data manipulation; these are sometimes called fourth-generation languages, and they
often include features to help generate forms and display data.

Sophisticated Users:

These users interact with the system without writing programs. They form requests by writing
queries in a database query language. These are submitted to a query processor that breaks a
DML statement down into instructions for the database manager module.

Specialized Users:

These users are sophisticated users who write specialized database application programs. These may be
knowledge-based and expert systems, complex data systems (audio/video), etc.

Naïve Users:

These users are unsophisticated users who interact with the system by using permanent
application programs (e.g. Automated Teller Machine).

Fig 5 Database System Structure [2]

1. What is meant by data independence


Data independence is the capacity to change the schema at one level of the architecture without
having to change the schema at the next higher level. We distinguish between logical and
physical data independence according to which two adjacent levels are involved. The former
refers to the ability to change the conceptual schema without changing the external schema. The
latter refers to the ability to change the internal schema without having to change the conceptual.

Logical Data Independence:

The capacity to change the conceptual schema without having to change the external schemas
and their associated application programs.

Physical Data Independence:

The capacity to change the internal schema without having to change the conceptual schema.
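
Logical data independence can be sketched with a view standing in for the external schema. This is an illustration under assumptions: the `employee` table, the `emp_names` view and the added column are hypothetical, and the example uses Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (eid INTEGER PRIMARY KEY, ename TEXT, sal INTEGER)")
conn.execute("INSERT INTO employee VALUES (1, 'Asha', 40000)")

# External schema: the application reads only through this view.
conn.execute("CREATE VIEW emp_names AS SELECT eid, ename FROM employee")

def application_query():
    return conn.execute("SELECT eid, ename FROM emp_names ORDER BY eid").fetchall()

before = application_query()

# Conceptual schema change: a new attribute is added to the base table.
conn.execute("ALTER TABLE employee ADD COLUMN hiredate TEXT")

after = application_query()
```

The application query and its result are unchanged even though the conceptual schema gained an attribute.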

2. What are advantages of views
Views are virtual tables or relations (not real, but real in effect) that are based on a user’s view
of a particular database.
3. What is relational schema
• Representation of a relational database's entities, the attributes within those entities,
and the relationships between those entities
• Represented as DDL or visually
• Example: Employee (Ename, Eid, sal, bdate, hiredate, sex) where primary key is
4. What is DDL
• DDL means Data Definition Language
• Used by the DBA and database designers to specify the conceptual schema of a database
• In many DBMSs, the DDL is also used to define internal and external schemas
• DDL commands are


5. What is Cartesian product
• This operation is used to combine tuples from two relations in a combinatorial fashion.
• Denoted by R(A1, A2, . . ., An) x S(B1, B2, . . ., Bm)
• Result is a relation Q with degree n + m attributes:
i. Q(A1, A2, . . ., An, B1, B2, . . ., Bm), in that order.
• The resulting relation state has one tuple for each combination of tuples: one
from R and one from S.
• Hence, if R has nR tuples (denoted as |R| = nR ), and S has nS tuples, then R x S
will have nR * nS tuples.
• The two operands do NOT have to be “type compatible”
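
A small Python sketch of the operation, using `itertools.product` from the standard library. The relations R and S and their tuples are made-up sample data.

```python
from itertools import product

R_attrs = ("A1", "A2")
R = [(1, "x"), (2, "y"), (3, "z")]   # nR = 3 tuples
S_attrs = ("B1",)
S = [("p",), ("q",)]                 # nS = 2 tuples

def cartesian_product(r, s):
    """Pair every tuple of r with every tuple of s, r's attributes first."""
    return [t_r + t_s for t_r, t_s in product(r, s)]

Q_attrs = R_attrs + S_attrs          # degree n + m
Q = cartesian_product(R, S)          # nR * nS tuples
```

Here Q has degree 2 + 1 = 3 and 3 * 2 = 6 tuples, matching the counts stated above.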

6. What is data model

A data model ---a collection of concepts that can be used to describe the
conceptual/logical structure of a database--- provides the necessary means to achieve this

By structure is meant the data types, relationships, and constraints that should hold for
the data. Most data models also include a set of basic operations for specifying
7. What is data redundancy
Data redundancy means storing the same data more than once. Data redundancy
(such as tends to occur in the "file processing" approach) leads to wasted storage space,
duplication of effort (when multiple copies of a datum need to be updated), and a higher
likelihood of the introduction of inconsistency.
8. Write about Naïve users
• Naive/Parametric end users: Typically the biggest group of users; they frequently
query/update the database using standard canned transactions that have been
carefully programmed and tested in advance. Examples:
ii. Bank tellers check account balances, post withdrawals/deposits


iii. Reservation clerks for airlines, hotels, etc., check availability of seats/rooms and make
iv. Shipping clerks (e.g., at UPS) who use buttons, bar code scanners, etc., to update status
of in-transit packages.
9. What is DBMS
A database management system is software, made up of a collection of programs, that performs
operations on data and manages the data.

Two basic operations performed by the DBMS are:

10. What is Relational algebra

• Relational algebra and relational calculus are formal languages associated
with the relational model.
• Informally, relational algebra is a (high-level) procedural language and
relational calculus a non-procedural language.
• Relational algebra operations work on one or more relations to define another
relation without changing the original relations.
11. Define catalog

The system catalog contains a description of the structure of each file, the type and storage
format of each field, and the various constraints on the data (i.e., conditions that the data must
satisfy).
The system catalog is used not only by users (e.g., who need to know the names of tables and
attributes, and sometimes data type information and other things), but also by the DBMS
software, which certainly needs to "know" how the data is structured/organized in order to
interpret it in a manner consistent with that structure.

12. Define Data Dictionary


Data dictionary / repository:

Used to store schema descriptions and other information such as design decisions, application
program descriptions, user information and usage standards. It contains all the information stored in the
catalog, but is accessed by users rather than by the DBMS.

• Describes the (logical) structure of the whole database for a community of users. Hides
physical storage details, concentrating upon describing entities, data types, relationships,
user operations, and constraints. Can be described using either a high-level or an
implementation data model.

1) Idea of Format of Data
2) Logical storage of Data
3) Structure of DBMS

Students Expected to Learn:

1) Structure of Data, Database, Database Management Systems
2) Views and Levels of Abstraction
3) Database Languages – DDL, DML

Previous paper long answer questions

1. Describe the three-schema architecture. Why do we need mappings between schema levels?
2. List the cases in which null values are appropriate, with examples.
3. Differentiate between a file-processing system (FPS) and a DBMS.
4. Design a conceptual database design for a health insurance system.
5. Compare and contrast the Relational model and the Hierarchical model.
6. Explain the basic operations of Relational Algebra with examples.
7. Draw and explain the DBMS component modules.
8. What are the advantages of DBMS?


9. Explain the difference among entity, entity type and relationship set.
10. What is an integrity constraint? Explain the different constraints in DBMS.
11. What are the functions of a DBA?
12. Write about the architecture of DBMS.
13. Explain the various database users.
14. What are the various capabilities of DBMS?
15. What is the difference between logical data independence and physical data independence?
16. Discuss the main types of constraints on specialization and generalization.
17. What is the E-R model? Explain the components of the E-R model.
18. What is SQL, and what are its various types of commands?
19. Explain the relational model and its advantages.

Important Questions:

Requirements Analysis – user needs; what must database do?

Conceptual Design – high level description (often done with ER model)

Logical Design – translate ER into DBMS data model(Relational model)

(NOW)Schema Refinement – consistency,normalization

Physical Design - indexes, disk layout

Security Design - who accesses what

Good Database Design

• no redundancy of FACT (!)

• no inconsistency
• no insertion, deletion or update anomalies
• no information loss
• no dependency loss
Informal Design Guidelines for Relational Databases

1. Semantics of the Relation Attributes

2. Redundant Information in Tuples and Update Anomalies


3. Null Values in Tuples

4. Spurious Tuples
1: Semantics of the Relation Attributes

GUIDELINE 1: Informally, each tuple in a relation should represent one entity or relationship
instance. (Applies to individual relations and their attributes).

• Attributes of different entities (EMPLOYEEs, DEPARTMENTs, PROJECTs)
should not be mixed in the same relation
• Only foreign keys should be used to refer to other entities
• Entity and relationship attributes should be kept apart as much as possible.

Design a schema that can be explained easily relation by relation. The semantics of
attributes should be easy to interpret.

2: Redundant Information in Tuples and Update Anomalies

• Information is stored redundantly
   – Wastes storage
   – Causes problems with update anomalies
      • Insertion anomalies
      • Deletion anomalies
      • Modification anomalies
Consider the relation:

EMP_PROJ(Emp#, Proj#, Ename, Pname, No_hours)

Insertion anomalies

Cannot insert a project unless an employee is assigned to it.

Deletion anomalies

a. When a project is deleted, it will result in deleting all the employees who work on
that project.


b. Alternately, if an employee is the sole employee on a project, deleting that
employee would result in deleting the corresponding project.

Modification anomalies

Changing the name of project number P1 from “Billing” to “Customer-Accounting” may
cause this update to be made for all 100 employees working on project P1.


• Design a schema that does not suffer from the insertion, deletion and update anomalies.
• If there are any anomalies present, then note them so that applications can be
made to take them into account.
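
The anomalies in EMP_PROJ can be sketched in plain Python. The three sample tuples (employee names, project P2 "Payroll", hours) are assumptions added for illustration; only the relation layout and the P1 rename come from the text.

```python
# EMP_PROJ(Emp#, Proj#, Ename, Pname, No_hours) held as one wide relation.
emp_proj = [
    ("E1", "P1", "Asha", "Billing", 10),
    ("E2", "P1", "Ravi", "Billing", 20),
    ("E3", "P2", "Meena", "Payroll", 15),
]

def rename_project(rows, proj_no, new_name):
    """Rename a project; return the new rows and how many tuples changed."""
    changed = 0
    result = []
    for emp, proj, ename, pname, hours in rows:
        if proj == proj_no:
            pname = new_name
            changed += 1
        result.append((emp, proj, ename, pname, hours))
    return result, changed

# Modification anomaly: one logical change touches every P1 tuple.
renamed, tuples_touched = rename_project(emp_proj, "P1", "Customer-Accounting")

# Deletion anomaly: removing the only employee on P2 also loses project P2.
after_delete = [row for row in emp_proj if row[0] != "E3"]
projects_left = {row[1] for row in after_delete}

# Decomposed design: the project name is stored once, so a rename is one change.
project = {"P1": "Billing", "P2": "Payroll"}
project["P1"] = "Customer-Accounting"
```

With the project name kept in its own relation, the rename touches a single entry and deleting an employee no longer removes the project.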
3: Null Values in Tuples


• Relations should be designed such that their tuples will have as few NULL values
as possible
• Attributes that are NULL frequently could be placed in separate relations (with
the primary key)
• Reasons for nulls:
   – Attribute not applicable or invalid
   – Attribute value unknown (may exist)
   – Value known to exist, but unavailable

1) Understanding the nature of DBMS
2) Relational Algebra and calculus

Students expected to learn:

Relational Algebra, relational calculus



Normalization is the process of decomposing unsatisfactory "bad" relations by breaking up their attributes into
smaller relations.

• Normalization is used to design a set of relation schemas that is optimal from the point of
view of database updating
• Normalization starts from a universal relation schema

Attributes must be atomic:


– they can be chars, ints, strings

– they can’t be
1. tuples
2. sets
3. relations
4. composite
5. multivalued

This is considered to be part of the definition of a relation.

Unorganized Relations

Name Paper List



• This is not ideal. Each person is associated with an unspecified
number of papers. The items in the PaperList column do not have a consistent form.

• Generally, RDBMS can’t cope with relations like this. Each entry in a table needs to have a
single data item in it.

• This is an unnormalised relation.

• All RDBMS require relations not to be like this - not to have multiple values in any column
(i.e. no repeating groups)

Name PaperList








This clearly contains the same information.

• And it has the property that we sought. It is in First Normal Form (1NF).

– A relation is in 1NF if no entry consists of more than one value
(i.e. it does not have repeating groups)

• So this will be the first requirement in designing our databases: Obtaining 1NF

1NF is obtained by

• Splitting composite attributes

• Splitting the relation and propagating the primary key to remove multivalued attributes

There are three approaches to removing repeating groups from unnormalized tables:

• Remove the repeating groups by entering appropriate data in the empty columns of rows
containing the repeating data.

• Remove the repeating group by placing the repeating data, along with a copy of the
original key attribute(s), in a separate relation. A primary key is identified for the new relation.

• Find the maximum possible number of values for the multivalued attribute and add that many
attributes to the relation.
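
The second approach can be sketched in Python for a Name/PaperList relation. The names and paper titles are invented for this sketch, since the original table contents are not reproduced in these notes.

```python
# Unnormalized rows: the second column holds a repeating group.
# Names and paper titles are made up for this sketch.
unnormalized = [
    ("Asha", ["Paper A", "Paper B"]),
    ("Ravi", ["Paper C"]),
]

def to_1nf(rows):
    """Emit one (name, paper) tuple per value in the repeating group."""
    return [(name, paper) for name, papers in rows for paper in papers]

author_paper = to_1nf(unnormalized)

# In the new relation the pair (name, paper) serves as the primary key.
key_is_unique = len(set(author_paper)) == len(author_paper)
```

Every entry now holds a single value, and the original key (the name) is copied into each tuple of the new relation.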



• The DEPARTMENT schema is not in 1NF because DLOCATION is not a single-valued attribute.
• The relation should be split into two relations. A new relation DEPT_LOCATIONS is
created and the primary key of DEPARTMENT, DNUMBER, becomes an attribute of
the new relation. The primary key of this relation is {DNUMBER, DLOCATION}.
• Alternative solution: Leave the DLOCATION attribute as it is. Instead, we have one
tuple for each location of a DEPARTMENT. Then, the relation is in 1NF, but redundancy exists.
• A super key of a relation schema R = {A1, A2, ...., An} is a set of attributes S subset-of
R with the property that no two tuples t1 and t2 in any legal relation state r of R will have
t1[S] = t2[S]
• A key K is a super key with the additional property that removal of any attribute from K
will cause K not to be a super key any more.
• If a relation schema has more than one key, each is called a candidate key.


 One of the candidate keys is arbitrarily designated to be the primary key, and the
others are called secondary keys.
 A Prime attribute must be a member of some candidate key
 A Nonprime attribute is not a prime attribute—that is, it is not a member of any
candidate key

Functional Dependencies (FDs)

 Definition of FD
 Inference Rules for FDs
 Equivalence of Sets of FDs
 Minimal Sets of FDs

Functional dependency describes the relationship between attributes in a relation.

If A and B are attributes of relation R, B is functionally dependent on A (denoted A → B)
if each value of A is associated with exactly one value of B. (A and B may each consist of
one or more attributes.)

A trivial functional dependency is one whose right-hand side is a subset (not necessarily a
proper subset) of the left-hand side.

Main characteristics of functional dependencies used in normalization. They:

• have a one-to-one relationship between the attribute(s) on the left- and right-hand side of a
dependency;
• hold for all time;
• are nontrivial.


The set of all functional dependencies that are implied by a given set of functional
dependencies X is called the closure of X, written X+. A set of inference rules is needed to
compute X+ from X.

Inference Rules (RATPUP)

1. Reflexivity: If B is a subset of A, then A → B
2. Augmentation: If A → B, then A,C → B,C
3. Transitivity: If A → B and B → C, then A → C
4. Projection: If A → B,C, then A → B and A → C
5. Union: If A → B and A → C, then A → B,C
6. Pseudotransitivity: If A → B and B,C → D, then A,C → D
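
In practice the closure of a set of dependencies is rarely enumerated in full. A common shortcut, sketched here and not taken from these notes, is to compute the closure of a set of attributes: a dependency A → B is implied by the given set exactly when B lies inside the attribute closure of A. The relation R(A, B, C, D) and its dependencies are hypothetical.

```python
def attribute_closure(attrs, fds):
    """Return the set of attributes functionally determined by attrs.

    fds is a list of (lhs, rhs) pairs, each side a set of attribute names.
    """
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Once the closure covers a left-hand side, reflexivity and
            # transitivity allow the right-hand side to be added as well.
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure


# Hypothetical dependencies over R(A, B, C, D): A -> B, B -> C, {C, D} -> A.
fds = [({"A"}, {"B"}), ({"B"}, {"C"}), ({"C", "D"}, {"A"})]

print(sorted(attribute_closure({"A"}, fds)))       # ['A', 'B', 'C']
print(sorted(attribute_closure({"A", "D"}, fds)))  # ['A', 'B', 'C', 'D']
```

Here A → C is implied (by transitivity), and {A, D} determines every attribute of R, so it is a super key.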



Full functional dependency: if A and B are attributes of a relation, B is fully functionally
dependent on A if B is functionally dependent on A, but not on any proper subset of A.

A functional dependency A → B is a partial dependency if some attribute can be removed
from A and the dependency still holds.



Second normal form (2NF): a relation that is in first normal form and in which every non-key
attribute is fully functionally dependent on the key.

The normalization of 1NF relations to 2NF involves the removal of partial dependencies. If a
partial dependency exists, we remove the functionally dependent attributes from the relation by
placing them in a new relation along with a copy of their determinant.

Obtaining 2NF

– If a nonprime attribute is dependent only on a proper part of a key, then we take the given
attribute as well as the key attributes that determine it and move them all to a new relation.

– We can bundle all attributes determined by the same subset of the key as a unit.
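
The check for partial dependencies can be sketched as follows. The attribute names and dependencies are assumptions modelled on a typical EMP_PROJ-style relation, and the helper names `closure` and `partial_dependencies` are hypothetical.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attributes functionally determined by attrs under the dependencies fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def partial_dependencies(key, nonprime, fds):
    """Return (key_part, attribute) pairs where a proper part of the key
    already determines a nonprime attribute, i.e. a 2NF violation."""
    found = []
    for size in range(1, len(key)):
        for part in combinations(sorted(key), size):
            determined = closure(part, fds)
            for attr in sorted(nonprime):
                if attr in determined:
                    found.append((part, attr))
    return found


# Assumed key, nonprime attributes and dependencies for the illustration.
key = {"SSN", "PNUMBER"}
nonprime = {"HOURS", "ENAME", "PNAME", "PLOCATION"}
fds = [
    ({"SSN", "PNUMBER"}, {"HOURS"}),
    ({"SSN"}, {"ENAME"}),
    ({"PNUMBER"}, {"PNAME", "PLOCATION"}),
]

violations = partial_dependencies(key, nonprime, fds)
for part, attr in violations:
    print(part, "->", attr)
```

Under these assumptions ENAME depends on SSN alone and PNAME, PLOCATION depend on PNUMBER alone, so each group would move to its own relation together with its determinant, while HOURS stays with the full key.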

Transitive dependency

A condition where A, B, and C are attributes of a relation such that if A → B and B → C,
then C is transitively dependent on A via B (provided that A is not functionally dependent
on B or C).

Third normal form (3NF)

A relation that is in first and second normal form, and in which no non-primary-key attribute is
transitively dependent on the primary key.

The normalization of 2NF relations to 3NF involves the removal of transitive dependencies by
placing the attribute(s) in a new relation along with a copy of the determinant


R is in 3NF if and only if, whenever a nontrivial functional dependency X → A holds in R,
either

– X is a superkey of R, or

– A is a key attribute of R

3NF: Alternative Definition

R is in 3NF if every nonprime attribute of R is

 fully functionally dependent on every key of R, and


 non transitively dependent on every key of R.

Obtaining 3NF

 Split off the attributes in the FD that causes trouble and move them, so there are two
relations for each such FD

 The determinant of the FD remains in the original relation

Fig: The normalization process. (a) Normalizing EMP_PROJ into 2NF relations. (b)
Normalizing EMP_DEPT into 3NF relations

Boyce-Codd normal form (BCNF)


A relation is in BCNF if and only if every determinant is a candidate key.

The difference between 3NF and BCNF is that, for a functional dependency A → B, 3NF allows
this dependency in a relation if B is a key attribute and A is not a super key, whereas BCNF
insists that for this dependency to remain in a relation, A must be a super key.

Fig: Boyce-Codd Normal form. (a) BCNF normalization with the dependency of FD2 being
“lost” in the decomposition.(b) A relation R in 3NF but not in BCNF


R is in Boyce-Codd Normal Form iff

– whenever a nontrivial functional dependency X → A holds in R, X is a superkey of R

– BCNF is more restrictive than 3NF and is preferable because it has fewer anomalies

Obtaining BCNF

– As usual, split the schema to move the attributes of the troublesome FD to another
relation, leaving its determinant in the original so they remain connected
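
A minimal sketch of the BCNF test: for each given nontrivial dependency, check whether its determinant is a superkey by computing its attribute closure. The relation TEACH(STUDENT, COURSE, INSTRUCTOR) and its two dependencies are assumed for illustration, and the helper names are hypothetical. Checking only the listed dependencies is enough for the relation they are stated on.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under the dependencies fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(relation, fds):
    """Return the dependencies X -> Y whose determinant X is not a superkey."""
    violations = []
    for lhs, rhs in fds:
        if rhs <= lhs:
            continue  # trivial dependencies never violate BCNF
        if not relation <= closure(lhs, fds):
            violations.append((lhs, rhs))
    return violations


# Assumed example relation and dependencies.
teach = {"STUDENT", "COURSE", "INSTRUCTOR"}
fds = [
    ({"STUDENT", "COURSE"}, {"INSTRUCTOR"}),  # FD1
    ({"INSTRUCTOR"}, {"COURSE"}),             # FD2
]

violations = bcnf_violations(teach, fds)
print(violations)
```

Only FD2 is reported: INSTRUCTOR is a determinant but not a superkey. Because COURSE is a key attribute, the same relation still satisfies 3NF.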



 The process of decomposing the universal relation schema R into a set of
relation schemas D = {R1, R2, …, Rm} that will become the relational
database schema, by using the functional dependencies.
 Attribute preservation condition:
 Each attribute in R will appear in at least one relation schema Ri in the
decomposition so that no attributes are “lost”.
 Dependency Preservation Property of a Decomposition:
 Definition: Given a set of dependencies F on R, the projection of F on Ri,
denoted by π_Ri(F) where Ri is a subset of R, is the set of dependencies X → Y in
F+ such that the attributes in X ∪ Y are all contained in Ri.
 Hence, the projection of F on each relation schema Ri in the decomposition D is
the set of functional dependencies in F+, the closure of F, such that all their left-
and right-hand-side attributes are in Ri.

 Dependency Preservation Property:

 A decomposition D = {R1, R2, ..., Rm} of R is dependency-preserving
with respect to F if the union of the projections of F on each Ri in D is
equivalent to F; that is
(π_R1(F) ∪ . . . ∪ π_Rm(F))+ = F+

 Lossless (Non-additive) Join Property of a Decomposition:

 Definition: Lossless join property: a decomposition D = {R1, R2, ..., Rm} of R
has the lossless (nonadditive) join property with respect to the set of
dependencies F on R if, for every relation state r of R that satisfies F, the
following holds, where * is the natural join of all the relations in D:

*(π_R1(r), ..., π_Rm(r)) = r

Multi-valued dependency (MVD)

An MVD represents a dependency between attributes (for example, A, B and C) in a relation,
such that for each value of A there is a set of values for B and a set of values for C.
However, the sets of values for B and C are independent of each other.

A multi-valued dependency can be further defined as being trivial or nontrivial. An MVD
A →→ B in relation R is defined as being trivial if

• B is a subset of A, or

• A ∪ B = R

An MVD is defined as being nontrivial if neither of the above two conditions is satisfied.

Fourth normal form (4NF)

A relation that is in Boyce-Codd normal form and contains no nontrivial multi-valued
dependencies.

 A relation schema R is in 4NF with respect to a set of dependencies F (that includes
functional dependencies and multivalued dependencies) if, for every nontrivial
multivalued dependency X →→ Y in F+, X is a superkey for R.



 A join dependency (JD), denoted by JD(R1, R2, ..., Rn), specified on relation schema R,
specifies a constraint on the states r of R.
 The constraint states that every legal state r of R should have a non-additive join
decomposition into R1, R2, ..., Rn; that is, for every such r we have
 *(π_R1(r), π_R2(r), ..., π_Rn(r)) = r
Note: an MVD is a special case of a JD where n = 2.

 A join dependency JD(R1, R2, ..., Rn), specified on relation schema R, is a trivial JD if
one of the relation schemas Ri in JD(R1, R2, ..., Rn) is equal to R.

Fifth normal form (5NF)


 A relation schema R is in fifth normal form (5NF) (or Project-Join Normal Form
(PJNF)) with respect to a set F of functional, multivalued, and join dependencies if,
 for every nontrivial join dependency JD(R1, R2, ..., Rn) in F+ (that is, implied by F),
 every Ri is a superkey of R.


 Each normal form is strictly stronger than the previous one

 Every 2NF relation is in 1NF
 Every 3NF relation is in 2NF
 Every BCNF relation is in 3NF
 Every 4NF relation is in BCNF
 Every 5NF relation is in 4NF

Diagrammatic notation of normal forms:-



 Normalization is a technique for producing a set of relations with desirable properties,
given the data requirements of an enterprise.

 UNF is a table that contains one or more repeating groups.

 1NF is a relation in which the intersection of each row and column contains one and only
one value.

 2NF is a relation that is in 1NF and in which every non-primary-key attribute is fully
functionally dependent on the primary key.

 3NF is a relation that is in 1NF and 2NF and in which no non-primary-key attribute is
transitively dependent on the primary key.

 BCNF is a relation in which every determinant is a candidate key.

 4NF is a relation that is in BCNF and contains no nontrivial multi-valued dependency.



Query Processing: General strategies for query processing, transformations, expected size, statistics
in estimation, query improvement, view processing, query processor


1) Objectives
2) Introduction
3) Query Processing Problem
4) Objectives of Query Processing
5) Characterization of Query Processors
6) Layers of Query Processing
7) Query Decomposition
8) Data Localization
9) Global Query Optimization
10) Local Query Optimization
Objectives: In this unit we give an overview of query processing in Distributed Database
Management Systems (DDBMSs). This is explained with the help of relational calculus and
relational algebra because of their generality and wide use in DDBMSs. We discuss

 Various problems of query processing

 An ideal query processor
 The concept of layering in query processing
 Some related examples of query processing
Introduction: The increasing success of relational database technology in data processing is
due, in part, to the availability of nonprocedural languages, which can significantly improve
application development and end-user productivity. By hiding the low-level details about the
physical organization of the data, relational database languages allow the expression of complex
queries in a concise and simple fashion. In particular, to construct the answer to the query, the
user does not specify the exact procedure to follow. This procedure is actually devised by a
DBMS module called the query processor. This relieves the user from query optimization, a
time-consuming task that is best handled by the query processor.

This issue is important in both centralized and distributed processing systems.
However, the query processing problem is much more difficult in distributed environments than
in conventional systems. In particular, the relations involved in distributed queries may be
fragmented and/or replicated, thereby inducing communication overhead costs.

So, in this unit we discuss the different issues of query processing, an ideal query
processor for a distributed environment and, finally, a layered software approach for distributed
query processing.

Query Processing Problem:

The main duty of a relational query processor is to transform a high-level query (in relational
calculus) into an equivalent lower-level query (in relational algebra). The distributed database
design is of major importance for query processing, since the definition of fragments is based on
the objective of increasing reference locality, and sometimes parallel execution, for the most
important queries. The role of a distributed query processor is to map a high-level query on a
distributed database (a set of global relations) into a sequence of database operations (of
relational algebra) on relational fragments. Several important functions characterize this
mapping:
 The calculus query must be decomposed into a sequence of relational operations called
an algebraic query
 The data accessed by the query must be localized so that the operations on relations are
translated to bear on local data (fragments)
 The algebraic query on fragments must be extended with communication operations and
optimized with respect to a cost function to be minimized. This cost function refers to
computing resources such as disk I/Os, CPUs, and communication networks.
The low-level query actually implements the execution strategy for the query. The
transformation must achieve both correctness and efficiency. The well-defined mapping with the
functional characteristics listed above makes the correctness issue easy, but producing an
efficient execution strategy is more complex. A relational calculus query may have many
equivalent and correct transformations into relational algebra. Since each equivalent execution
strategy can lead to different consumptions of computer resources, the main problem is to select
the execution strategy that minimizes the resource consumption.
Example: We consider the following subset of the engineering database schema given in fig. 6.0:
E (ENO, ENAME, TITLE) and G (ENO, JNO, RESP, DUR), and the simple user query: “Find the
names of employees who are managing a project”.


E
ENO  ENAME  TITLE
E1   A      Elect. Eng.
E2   B      Syst. Anal.
E3   C      Mech. Eng.
E4   D      Programmer
E5   E      Syst. Anal.
E6   F      Elect. Eng.
E7   G      Mech. Eng.
E8   H      Syst. Anal.

G
ENO  JNO  RESP        DUR
E1   J1   Manager     12
E2   J1   Analyst     24
E2   J2   Analyst     6
E3   J3   Consultant  10
E3   J4   Engineer    48
E4   J2   Programmer  18
E5   J2   Manager     24
E6   J4   Manager     48
E7   J3   Engineer    36
E8   J3   Manager     40

J
JNO  JNAME              BUDGET  LOC
J1   Instrumentation    150000  Montreal
J2   Database Develop.  135000  New York
J3   CAD/CAM            250000  New York
J4   Maintenance        310000  Paris

S
TITLE        SAL
Elect. Eng.  40000
Syst. Anal.  34000
Mech. Eng.   27000
Programmer   24000

Fig: Example Database

 The equivalent relational calculus query using SQL syntax is:
SELECT ENAME
FROM E, G
WHERE E.ENO = G.ENO
AND RESP = “Manager”
 Two equivalent relational algebra queries that are correct transformations of the above
query are:

PJ ENAME (SL RESP = “Manager” AND E.ENO = G.ENO (E CP G))

PJ ENAME (E JN ENO (SL RESP = “Manager” (G)))

NOTE: The following observations are made from the above example:

 The second query avoids the Cartesian product (CP) of E and G, consumes far fewer
computing resources than the first, and thus should be retained. That is, we should avoid
performing a Cartesian product on full tables.
 In a centralized environment, the role of the query processor is to choose the best
relational algebra query for a given query among all equivalent ones.
 In a distributed environment, relational algebra is not enough to express execution
strategies. It must be supported with operations for exchanging data between sites. The
distributed query processor has to select the best sites to process the data and the way in
which the data should be transformed with the choice of ordering the relations.
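
The difference in intermediate result sizes can be seen by running both strategies on the example data. This is only a sketch: relations are plain Python lists, only the columns needed by the query are copied from the example database, and the variable names are hypothetical.

```python
# E(ENO, ENAME) and G(ENO, JNO, RESP, DUR), copied from the example database.
E = [("E1", "A"), ("E2", "B"), ("E3", "C"), ("E4", "D"),
     ("E5", "E"), ("E6", "F"), ("E7", "G"), ("E8", "H")]
G = [("E1", "J1", "Manager", 12), ("E2", "J1", "Analyst", 24),
     ("E2", "J2", "Analyst", 6), ("E3", "J3", "Consultant", 10),
     ("E3", "J4", "Engineer", 48), ("E4", "J2", "Programmer", 18),
     ("E5", "J2", "Manager", 24), ("E6", "J4", "Manager", 48),
     ("E7", "J3", "Engineer", 36), ("E8", "J3", "Manager", 40)]

# Strategy 1: Cartesian product first, then selection and projection.
product = [(e, g) for e in E for g in G]
names_1 = sorted({e[1] for e, g in product
                  if e[0] == g[0] and g[2] == "Manager"})

# Strategy 2: selection on G first, then join with E, then projection.
managers = [g for g in G if g[2] == "Manager"]
joined = [(e, g) for g in managers for e in E if e[0] == g[0]]
names_2 = sorted({e[1] for e, g in joined})

print(len(product), len(managers), len(joined))  # 80 4 4
print(names_1 == names_2, names_2)               # True ['A', 'E', 'F', 'H']
```

Both strategies return the same names, but the first builds 80 intermediate tuples while the second never holds more than 4.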
Example: This example illustrates the importance of site selection and communication for a
chosen relational algebra query against a fragmented database. We consider the following query:

PJ ENAME (E JN ENO (SL RESP = “Manager” (G)))

This query is written considering the relations of the previous example. We assume that the
relations E and G are horizontally fragmented as follows:

E1 = SL ENO ≤ “E3” (E)

E2 = SL ENO > “E3” (E)

G1 = SL ENO ≤ “E3” (G)

G2 = SL ENO > “E3” (G)

Fragments G1, G2, E1 and E2 are stored at the sites 1,2,3, and 4, respectively, and the result is
expected at the site 5 as shown in the fig 6.1. For simplicity, we have ignored the project
operation here. In the figure two equivalent strategies for the above query are shown.

Some of the observations of the Strategies:

 An arrow from site i to site j labeled with R indicates that relation R is transferred from
site i to site j.
 Strategy A exploits the fact that relations E and G are fragmented in the same way in
order to perform the select and join operations in parallel.
 Strategy B centralizes all the operations and the data at the result site before processing
the query.
Resource consumption of these two strategies:

 Assumptions made:
 A tuple access, denoted as tupacc, is 1 unit.
 A tuple transfer, denoted as tuptrans, is 10 units.
 Relations E and G have 400 and 1000 tuples respectively.
 There are 20 managers in relation G.
 The data is uniformly distributed among sites.
 Relations G and E are locally clustered on attributes RESP and ENO, respectively.
 There is direct access to tuples of G (respectively, E) based on the value of attribute
RESP (respectively, ENO).

The Cost Analysis:

The cost of strategy A can be derived as follows:

1. Produce G' by selecting G requires 20 * tupacc = 20


2. Transfer G' to the sites of E requires 20 * tuptrans = 200

3. Produce E' by joining G' and E requires
(10*10)* tupacc*2 = 200

4. Transfer E' to result site requires 20* tuptrans = 200

The total cost is 620

The cost of strategy B can be derived as follows:

1. Transfer E to site 5 requires 400 * tuptrans = 4000

2. Transfer G to site 5 requires 1000 * tuptrans = 10000

3. Produce G' by selecting G requires 1000 * tupacc = 1000

4. Join E and G' requires 400 * 20 * tupacc = 8000

The total cost is 23000

Strategy A is better by a factor of 37, which is quite significant. It also provides a better
distribution of work among the sites. The difference would be even larger if we assumed slower
communication and/or a higher degree of fragmentation.
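
The arithmetic of the two cost derivations can be reproduced with a few lines of Python. The constant names are hypothetical, and the figures are the ones assumed in the analysis above.

```python
TUPACC = 1      # cost of one tuple access
TUPTRANS = 10   # cost of one tuple transfer
E_TUPLES, G_TUPLES, MANAGERS = 400, 1000, 20

cost_a = (
    MANAGERS * TUPACC           # 1. select the managers from G
    + MANAGERS * TUPTRANS       # 2. transfer G' to the sites of E
    + (10 * 10) * TUPACC * 2    # 3. join G' and E at two sites
    + MANAGERS * TUPTRANS       # 4. transfer E' to the result site
)

cost_b = (
    E_TUPLES * TUPTRANS             # 1. transfer E to site 5
    + G_TUPLES * TUPTRANS           # 2. transfer G to site 5
    + G_TUPLES * TUPACC             # 3. select the managers from G
    + E_TUPLES * MANAGERS * TUPACC  # 4. join E and G'
)

print(cost_a, cost_b, round(cost_b / cost_a))  # 620 23000 37
```

Changing TUPTRANS shows how quickly the gap widens when communication becomes slower relative to local access.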

(a) Strategy A
Site 1: G'1 = SL RESP = ‘Manager’ (G1), sent to site 3
Site 2: G'2 = SL RESP = ‘Manager’ (G2), sent to site 4
Site 3: E'1 = E1 JN ENO G'1, sent to site 5
Site 4: E'2 = E2 JN ENO G'2, sent to site 5
Site 5: Result = E'1 UN E'2

(b) Strategy B
Sites 1, 2, 3 and 4 send G1, G2, E1 and E2 to site 5
Site 5: Result = (E1 UN E2) JN ENO SL RESP = ‘Manager’ (G1 UN G2)

Fig. : Equivalent Distributed Execution Strategies

Objectives of Query Processing:

 The main objective of query processing in a distributed environment is to transform a
high-level query on a distributed database, which is seen as a single database by the users,
into an efficient execution strategy expressed in a low-level language on local databases.
 An important aspect of query processing is query optimization. Because many execution
strategies are correct transformations of the same high-level query, the one that optimizes
(minimizes) resource consumption should be retained.
 Good measures of resource consumption are:
 The total cost that will be incurred in processing the query. It is the sum of all times
incurred in processing the operations of the query at various sites and in inter-site
communication.
 The response time of the query. This is the time elapsed for executing the query.
Since operations can be executed in parallel at different sites, the response time of a
query may be significantly less than its total cost.
 Obviously the total cost should be minimized.
 In a distributed system, the total cost to be minimized includes CPU, I/O, and
communication costs. The I/O cost can be minimized by reducing the number of I/O
operations through fast access methods to the data and efficient use of main
memory. The communication cost is the time needed for exchanging data
between sites participating in the execution of the query. This cost is incurred in
processing the messages and transmitting the data on the communication network.
In distributed systems the communication cost often dominates the local
processing cost, so the other cost factors are frequently ignored.
 In centralized systems, only CPU and I/O costs have to be considered.

Characterization of Query Processors:

It is difficult to give characteristics that differentiate centralized and distributed
query processors. Still, some of them are listed here. The first four are common to both and
the next four are particular to distributed query processors.

 Languages: The input language to the query processor can be based on relational calculus
or relational algebra. The former requires an additional phase to decompose a query
expressed in relational calculus into relational algebra. In a distributed context, the output
language is generally some form of relational algebra augmented with communication
primitives. That is, the query processor must map the input language correctly onto the
output language.
 Types of optimization: Conceptually, query optimization aims to choose the point in the
solution space of all possible execution strategies that leads to the minimum cost. A
popular approach is exhaustive search of the solution space. Because exhaustive search can
be expensive, heuristic techniques are also used. In both centralized and distributed
systems a common heuristic is to minimize the size of intermediate relations. This can be
done by performing unary operations first and ordering the binary operations by the
increasing size of their intermediate relations.
 Optimization timing: A query may be optimized at different times relative to the actual
time of query execution. Optimization can be done statically before executing the query
or dynamically as the query is executed. The main advantage of the latter method is that
the actual sizes of the intermediate relations are available to the query processor, thereby
minimizing the probability of a bad choice. The main drawback of the dynamic method
is that query optimization, which is an expensive task, must be repeated for each
execution of the query. So hybrid optimization may be better in some situations.
 Statistics: The effectiveness of query optimization depends on statistics on the database.
Dynamic query optimization requires statistics in order to choose the operation that has
to be done first. Static query optimization requires statistics to estimate the size of
intermediate relations. The accuracy of the statistics can be improved by periodic
updating.
 Decision sites: Most systems use a centralized decision approach, in which a single
site generates the strategy. However, the decision process could be distributed among
various sites participating in the elaboration of the best strategy. The centralized
approach is simpler but requires knowledge of the complete distributed database,
whereas the distributed approach requires only local information. A hybrid approach, in
which the major decisions are taken at one particular site and other decisions are
taken locally, is often better.
 Exploitation of the network topology: The distributed query processor exploits the network
topology. With wide area networks, the cost function to be minimized can be restricted
to the data communication cost, which is the dominant factor. This simplifies
distributed query optimization, which can be divided into two separate problems: selection
of the global execution strategy, based on inter-site communication, and selection of
each local execution strategy, based on centralized query processing algorithms. With
local area networks, communication costs are comparable to I/O costs. Therefore, it is
reasonable for the distributed query processor to increase parallel execution at the cost of
increasing communication.
 Exploitation of replicated fragments: For reliability purposes it is useful to have fragments
replicated at different sites. Query processors have to exploit this information either
statically or dynamically for processing the query efficiently.
 Use of semi-joins: The semi-join operation reduces the size of the data that are exchanged
between the sites so that the communication cost can be reduced.
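
A small sketch of the semi-join idea, using the employee and manager data of the earlier example. The helper name `semi_join` and the two-site scenario are assumptions made for illustration: the site holding E receives only the ENO values of the managers, keeps the matching E tuples, and ships those instead of the whole relation.

```python
def semi_join(left, right, attr):
    """Return the tuples of left that have a matching attr value in right."""
    keys = {row[attr] for row in right}
    return [row for row in left if row[attr] in keys]


# E holds employees E1..E8 named A..H, as in the example database.
E = [{"ENO": f"E{i}", "ENAME": name}
     for i, name in enumerate("ABCDEFGH", start=1)]

# The selected manager tuples of G, reduced to the columns used here.
G_managers = [
    {"ENO": "E1", "JNO": "J1"},
    {"ENO": "E5", "JNO": "J2"},
    {"ENO": "E6", "JNO": "J4"},
    {"ENO": "E8", "JNO": "J3"},
]

reduced_E = semi_join(E, G_managers, "ENO")
print(len(E), len(reduced_E))            # 8 4
print([row["ENAME"] for row in reduced_E])  # ['A', 'E', 'F', 'H']
```

Only four of the eight E tuples need to cross the network, at the price of first sending four ENO values in the other direction.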

Layers of Query Processing:

The problem of query processing can itself be decomposed into several subproblems,
corresponding to various layers. In figure 6.2, a generic layering scheme for query processing is
shown where each layer solves a well-defined sub-problem. The input is a query on distributed
data expressed in relational calculus. This distributed query is posed on global (distributed)
relations, meaning that data distribution is hidden. Four main layers are involved to map the
distributed query into an optimized sequence of local operations, each acting on a local database.
These layers perform the functions of query decomposition, data localization, global query
optimization, and local query optimization. The first three layers are performed by a central site
and use global information; the fourth is done by the local sites.















Fig : Generic Layering Scheme for Distributed Query Processing


Query Decomposition: The first layer decomposes the distributed calculus query into an
algebraic query on global relations. The information needed for this transformation is found in
the global conceptual schema describing the global relations. However, the information about
data distribution is not used here but in the next layer. Thus the techniques used by this layer are
those of a centralized DBMS.

Query decomposition can be viewed as four successive steps:

 The calculus query is rewritten in a normalized form that is suitable for subsequent
manipulation. Normalization of a query generally involves the manipulation of the query
quantifiers and of the query qualification by applying logical operator priority.
 The normalized query is analyzed semantically so that incorrect queries are detected and
rejected as early as possible. Techniques to detect incorrect queries exist only for a subset
of relational calculus. Typically, they use some sort of graph that captures the semantics
of the query.
 The correct query (still expressed in relational calculus) is simplified. One way to
simplify a query is to eliminate redundant predicates.
 The calculus query is restructured as an algebraic query. The quality of an algebraic
query is defined in terms of expected performance. The traditional way to do this
transformation toward a "better" algebraic specification is to start with an initial algebraic
query and transform it in order to find a "good" one. The initial algebraic query is derived
immediately from the calculus query by translating the predicates and the target statement
into relational operations as they appear in the query. This directly translated algebraic
query is then restructured through transformation rules. The algebraic query generated by
this layer is good in the sense that the worst executions are avoided.

Data Localization:
The input to the second layer is an algebraic query on distributed relations. The main role of the
second layer is to localize the query’s data using data distribution information. Relations are
fragmented and stored in disjoint subsets called fragments, each being stored at a different site.
This layer determines which fragments are involved in the query and transforms the distributed
query into a fragment query. Fragmentation is defined through fragmentation rules that can be
expressed as relational operations. A distributed relation can be reconstructed by applying the
fragmentation rules to derive a program of relational algebra operations, called a localization
program, which then acts on fragments.
Generating a fragment query is done in two steps.
 The distributed query is mapped into a fragment query by substituting each distributed
relation by its reconstruction program (also called materialization program).
 The fragment query is simplified and restructured to produce another “good” query.
Simplification and restructuring may be done according to the same rules used in the
decomposition layer. As in the decomposition layer, the final fragment query is generally
far from optimal because information regarding fragments is not utilized.
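
The two steps can be sketched on the horizontally fragmented relation E of the earlier example, assuming the fragments are E1 (ENO ≤ "E3") and E2 (ENO > "E3"). The query SL ENO = "E5" (E) and the helper name `select_eno` are hypothetical. Dropping a fragment whose defining predicate contradicts the query predicate is one common simplification, shown here as an illustration.

```python
# E(ENO, ENAME) from the example database.
E = [("E1", "A"), ("E2", "B"), ("E3", "C"), ("E4", "D"),
     ("E5", "E"), ("E6", "F"), ("E7", "G"), ("E8", "H")]

# Fragmentation rules: E1 = SL ENO <= "E3" (E), E2 = SL ENO > "E3" (E).
E1 = [t for t in E if t[0] <= "E3"]
E2 = [t for t in E if t[0] > "E3"]

# Localization (reconstruction) program: E = E1 UN E2.
reconstructed = E1 + E2


def select_eno(fragment, eno):
    """Selection SL ENO = eno applied to one fragment."""
    return [t for t in fragment if t[0] == eno]


# Step 1: substitute E by its reconstruction program in SL ENO = "E5" (E).
generic = select_eno(E1, "E5") + select_eno(E2, "E5")

# Step 2: simplify. E1 only holds ENO <= "E3", which contradicts ENO = "E5",
# so the selection on E1 is always empty and E1 need not be accessed at all.
reduced = select_eno(E2, "E5")

print(sorted(reconstructed) == sorted(E))  # True
print(generic == reduced, reduced)         # True [('E5', 'E')]
```

The reduced query touches a single fragment, and therefore a single site, while returning the same answer as the generic one.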

Global Query Optimization:

The input to the third layer is a fragment query, that is, an algebraic query on fragments. The
goal of query optimization is to find an execution strategy for the query that is close to
optimal. An execution strategy for a distributed query can be described with relational algebra
operations and communication primitives (send/receive operations) for transferring data between
sites. The previous layers have already optimized the query, for example by eliminating
redundant expressions. However, this optimization is independent of fragment characteristics
such as cardinalities. In addition, communication operations are not yet specified. By permuting
the ordering of operations within one fragment query, many equivalent queries may be found.
Query optimization consists of finding the “best” ordering of operations in the fragment
query, including communication operations, that minimizes a cost function. The cost function,
often defined in terms of time units, refers to computing resources such as disk space, disk I/Os,
buffer space, CPU cost, communication cost and so on. An important aspect of query
optimization is join ordering, since permutations of the joins within the query may lead to
improvements of orders of magnitude. One basic technique for optimizing a sequence of
distributed join operations is through the semi-join operator. The main value of the semi-join in a
distributed system is to reduce the size of the join operands and thus the communication cost.
The output of the query optimization layer is an optimized algebraic query on fragments with
communication operations included.


Local Query Optimization:

The last layer is performed by all the sites having fragments involved in the query. Each
sub-query executing at one site, called a local query, is then optimized using the local schema of
the site. At this time, the algorithms to perform the relational operations may be chosen. Local
optimization uses the algorithms of centralized systems.

Answer the following: -

a) What is a Query processor?

b) State the Query processing problem.

c) Explain the different characteristics of Query processor.

d) Describe the layer architecture of query processing.

e) Discuss Query optimization.

UNIT-3 & 4
Recovery: Reliability, transactions, recovery in centralized DBMS, reflecting updates, buffer
management, logging schemes, disaster recovery

Concurrency: Introduction, serializability, concurrency control, locking schemes, and timestamp
based order, optimistic scheduling, multiversion techniques, and deadlocks

1) Concept of Transaction
2) ACID properties
3) Serializability
4) Locks – implementation

What is a Transaction?
 A transaction is a logical unit of work –
It may consist of a simple SELECT to generate a list of table contents, or a series of related
UPDATE command sequences.
A database request is the equivalent of a single SQL statement in an application program or transaction.

 Must be either entirely completed or aborted –

To sell a product to a customer, the transaction includes updating the inventory by
subtracting the number of units sold from the PRODUCT table’s available quantity on
hand, and updating the accounts receivable table in order to bill the customer later.
 No intermediate states are acceptable –
Updating only the inventory or only the accounts receivable is not acceptable.
 Example Transaction – (Refer Figure 1)
It illustrates a typical inventory transaction that updates a database table by subtracting 10
(units sold) from an already stored value of 40 (units in stock), which leaves 30 (units of
stock in inventory).
 A consistent database state is one in which all data integrity constraints are satisfied.
 While this transaction is taking place, the DBMS must ensure that no other transaction accesses the data being updated.
Evaluating Transaction Results
 Examine current account balance
WHERE ACC_NUM = ‘0908110638’;

- The SQL code represents a transaction because it accesses the database

- Consistent state after transaction
- No changes made to Database
 Register credit sale of 100 units of product X to customer Y for $500:
Reducing product X’s quantity on hand (QOH) by 100


Adding $500 to customer Y’s accounts receivable

- If both updates are not completely executed, the transaction yields an inconsistent database.
- The database is in a consistent state only if both updates are fully completed.
- The DBMS cannot guarantee that a transaction correctly represents the real-world event (for
instance, the accountant may input a wrong amount), but it must be able to recover the
database to a previous consistent state.

Transaction Properties
 All transactions must display atomicity, durability, serializability, and isolation.
 Atomicity –
 All transaction operations must be completed
 Incomplete transactions aborted
 Durability –
 Permanence of consistent database state
 Serializability –
 Concurrent execution of transactions yields the same result as if they had been executed in some serial order
 Important in multi-user and distributed databases
 Isolation –
 Data used by one transaction cannot be used by another transaction until the first one completes
 Consistency – (To preserve integrity of data, the database system must ensure: atomicity,
consistency, isolation, and durability (ACID).)
 Execution of a transaction in isolation preserves the consistency of the database.
 A single-user database system automatically ensures serializability and isolation of the
database because only one transaction is executed at a time.
 The atomicity and durability of transactions must be guaranteed by the single-user DBMS.

 The multi-user DBMS must implement controls to ensure serializability and isolation of
transactions – in addition to atomicity and durability – in order to guard the database’s
consistency and integrity.

Transaction State
 Active, the initial state; the transaction stays in this state while it is executing.
 Partially committed, after the final statement has been executed.
 Failed, after the discovery that normal execution can no longer proceed.
 Aborted, after the transaction has been rolled back and the database restored to its state
prior to the start of the transaction. Two options after it has been aborted:
 Restart the transaction – only if the abort was caused by a hardware or software failure
rather than an internal logical error.
 Kill the transaction – if the abort was caused by an internal logical error, such as incorrect data input.
 Committed, after successful completion. The transaction is terminated once it is aborted or
committed.

Transaction Management with SQL

 The ANSI SQL standard defines transaction support through two statements: COMMIT and ROLLBACK.
 A user-initiated transaction sequence must continue until one of the following four events occurs:
1. COMMIT statement is reached – all changes are permanently recorded within the database.
2. ROLLBACK statement is reached – all the changes are aborted and the database is rolled
back to its previous consistent state.
3. End of a program reached – all changes are permanently recorded within the database.
4. Program reaches abnormal termination – the changes made in the database are aborted
and the database is rolled back to its previous consistent state.
For example:

WHERE ACCT_NUM = ‘60120010’;


In fact, the COMMIT statement used in this example is not necessary if the UPDATE
statement is the application’s last action and the application terminates normally.
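
The COMMIT and ROLLBACK behaviour described above can be tried out with Python's built-in sqlite3 module. The sketch below is only an illustration of the credit sale discussed earlier (100 units of product X to customer Y for $500); the table layout, the column names and the starting quantity of 140 are assumptions made for this example, not taken from the original figures.

```python
import sqlite3

# In-memory database with a hypothetical schema (assumed for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (prod_code TEXT PRIMARY KEY, prod_qoh INTEGER)")
conn.execute("CREATE TABLE customer (cust_id TEXT PRIMARY KEY, cust_balance REAL)")
conn.execute("INSERT INTO product VALUES ('X', 140)")
conn.execute("INSERT INTO customer VALUES ('Y', 0)")
conn.commit()


def register_credit_sale(conn, prod_code, cust_id, units, amount):
    """Update inventory and accounts receivable as one logical unit of work."""
    try:
        cur = conn.execute(
            "UPDATE product SET prod_qoh = prod_qoh - ? WHERE prod_code = ?",
            (units, prod_code),
        )
        if cur.rowcount != 1:
            raise ValueError("unknown product")
        cur = conn.execute(
            "UPDATE customer SET cust_balance = cust_balance + ? WHERE cust_id = ?",
            (amount, cust_id),
        )
        if cur.rowcount != 1:
            raise ValueError("unknown customer")
        conn.commit()      # both updates become permanent together
        return True
    except (sqlite3.Error, ValueError):
        conn.rollback()    # neither update survives
        return False


def quantity_on_hand(conn, prod_code):
    row = conn.execute(
        "SELECT prod_qoh FROM product WHERE prod_code = ?", (prod_code,)
    ).fetchone()
    return row[0]


sale_ok = register_credit_sale(conn, "X", "Y", 100, 500.0)
# The second sale fails after the inventory update, so that update is rolled back.
sale_failed = register_credit_sale(conn, "X", "NO_SUCH_CUSTOMER", 10, 50.0)
print(sale_ok, sale_failed, quantity_on_hand(conn, "X"))
```

The second call shows that no intermediate state is left behind: the inventory update it had already performed is undone by the ROLLBACK.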

Transaction Log
 The DBMS uses a transaction log to track all transactions that update the database.
 The log is used by the ROLLBACK command when a recovery requirement is triggered.
 It is also used to recover from a system failure such as a network discrepancy or a disk crash.
 While the DBMS executes transactions that modify the database, it also updates the
transaction log. The log stores:
 Record for beginning of transaction
 Each SQL statement
- The type of operation being performed (update, delete, insert).
- The names of objects affected by the transaction (the name of the table).
- The “before” and “after” values for updated fields
- Pointers to previous and next entries for the same transaction.
 Commit Statement – the ending of the transaction.
Table 1 Transaction Log Example

Note: committed transactions are not rolled back.

1. If a system failure occurs, the DBMS will examine the transaction log for all uncommitted or
incomplete transactions, and it will restore (ROLLBACK) the database to its previous state
on the basis of this information.

2. If a ROLLBACK is issued before the termination of a transaction, the DBMS will restore the
database only for that particular transaction, rather than for all transactions, in order to
maintain the durability of the previous transactions.
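
A minimal sketch of such a log, assuming a toy in-memory "database" (a Python dictionary) and hypothetical helper names, is given below. It records the operation type, the affected object and the "before" and "after" values, and uses the "before" values to roll back a single uncommitted transaction. The pointers to previous and next entries are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    trx_id: int
    operation: str          # "START", "UPDATE", "COMMIT" or "ROLLBACK"
    table: str = ""
    row_id: str = ""
    attribute: str = ""
    before: object = None   # value before the update
    after: object = None    # value after the update


# Toy database and log (assumed for illustration only).
database = {("PRODUCT", "X", "PROD_QOH"): 40}
log = []


def start(trx_id):
    log.append(LogRecord(trx_id, "START"))


def update(trx_id, table, row_id, attribute, new_value):
    key = (table, row_id, attribute)
    # The log record is written before the data item itself is changed.
    log.append(LogRecord(trx_id, "UPDATE", table, row_id, attribute,
                         database[key], new_value))
    database[key] = new_value


def commit(trx_id):
    log.append(LogRecord(trx_id, "COMMIT"))


def rollback(trx_id):
    if any(r.trx_id == trx_id and r.operation == "COMMIT" for r in log):
        raise ValueError("committed transactions are not rolled back")
    # Undo only this transaction's updates, newest first, using "before" values.
    for rec in reversed(log):
        if rec.trx_id == trx_id and rec.operation == "UPDATE":
            database[(rec.table, rec.row_id, rec.attribute)] = rec.before
    log.append(LogRecord(trx_id, "ROLLBACK"))


start(101)
update(101, "PRODUCT", "X", "PROD_QOH", 30)
commit(101)

start(102)
update(102, "PRODUCT", "X", "PROD_QOH", 25)
rollback(102)   # restores 30, the value committed by transaction 101
```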

Concurrency Control
- Coordinates the simultaneous execution of transactions in a multiprocessing database system
- Ensures serializability of transactions in a multiuser database environment
- Potential problems in multiuser environments
- Three main problems: lost updates, uncommitted data, and inconsistent retrievals
Lost updates
 Assume that two concurrent transactions (T1, T2) occur in a PRODUCT table which records
a product’s quantity on hand (PROD_QOH). The transactions are:
Transaction              Computation
T1: Purchase 100 units   PROD_QOH = PROD_QOH + 100
T2: Sell 30 units        PROD_QOH = PROD_QOH - 30
Table 2 Normal Execution of Two Transactions

Note: this table shows the serial execution of these transactions under normal circumstances,
yielding the correct answer, PROD_QOH=105.

Table 3 Lost Updates

Note: the addition of 100 units is “lost” during the process.

1. Suppose that a transaction is able to read a product’s PROD_QOH value from the table
before a previous transaction has been committed.
2. The first transaction (T1) has not yet been committed when the second transaction (T2) is executed.
3. T2 still operates on the value 35, and its subtraction yields 5 in memory.
4. T1 writes the value 135 to disk, which is promptly overwritten by T2.
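
The four steps above can be replayed with a small simulation. The sketch below is an illustration only: each transaction works on a private in-memory copy of PROD_QOH, and a schedule is a hypothetical list of (transaction, action, amount) steps. The starting value of 35 is the one used in the steps above.

```python
def run_schedule(initial_qoh, schedule):
    """Replay read/compute/write steps and return the final value on disk."""
    disk = {"PROD_QOH": initial_qoh}
    memory = {}                                   # private working copy per transaction
    for trx, action, amount in schedule:
        if action == "read":
            memory[trx] = disk["PROD_QOH"]
        elif action == "compute":
            memory[trx] += amount
        elif action == "write":
            disk["PROD_QOH"] = memory[trx]
    return disk["PROD_QOH"]


# Serial execution: T1 finishes before T2 starts.
SERIAL = [
    ("T1", "read", 0), ("T1", "compute", 100), ("T1", "write", 0),
    ("T2", "read", 0), ("T2", "compute", -30), ("T2", "write", 0),
]

# Interleaved execution: T2 reads 35 before T1 has written 135.
INTERLEAVED = [
    ("T1", "read", 0), ("T2", "read", 0),
    ("T1", "compute", 100), ("T2", "compute", -30),
    ("T1", "write", 0), ("T2", "write", 0),
]

print(run_schedule(35, SERIAL))        # 105
print(run_schedule(35, INTERLEAVED))   # 5, the purchase of 100 units is lost
```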

Uncommitted Data
 Occurs when two transactions, T1 and T2, are executed concurrently and the first transaction (T1) is
rolled back after the second transaction (T2) has already accessed the uncommitted data,
thus violating the isolation property of transactions. The transactions are:
Transaction              Computation
T1: Purchase 100 units   PROD_QOH = PROD_QOH + 100 (Rollback)
T2: Sell 30 units        PROD_QOH = PROD_QOH - 30

Table 4 Correct Execution of Two Transactions

Note: the serial execution of these transactions yields the correct answer.

Table 5 An Uncommitted Data Problem

Note: the uncommitted data problem can arise when the ROLLBACK is completed after T2 has
begun its execution.

Inconsistent Retrievals
 Occur when a transaction calculates some summary (aggregate) functions over a set of data while
other transactions are updating the data.
 The transaction might read some data before they are changed and other data after they are
changed, thereby yielding inconsistent results.
1. T1 calculates the total quantity on hand of the products stored in the PRODUCT table.
2. T2 updates PROD_QOH for two of the PRODUCT table’s products.
Table 6 Retrieval During Update

Note: T1 calculates the total PROD_QOH, while T2 represents the correction of a typing error: the user
added 30 units to product 345TYX’s PROD_QOH, but meant to add the 30 units to product
125TYZ’s PROD_QOH. To correct the problem, the user subtracts 30 from product
345TYX’s PROD_QOH and adds 30 to product 125TYZ’s PROD_QOH.

Table 7 Transaction Results: Data Entry Correction

Note: The initial and final PROD_QOH values while T2 makes the correction – same results but
different transaction process.

Table 8 Transaction Result: Data Entry Correction


 The transaction table in Table 8 demonstrates that inconsistent retrievals are possible during
the transaction execution, making the result of T1’s execution incorrect.
 Unless the DBMS exercises concurrency control, a multi-user database environment can
create chaos within the information system.

The Scheduler – Schedule, Serializability, Recovery, Isolation

 Previous examples executed the operations within a transaction in an arbitrary order:
 As long as two transactions, T1 and T2, access unrelated data, there is no conflict, and the
order of execution is irrelevant to the final outcome.
 If the transactions operate on related (or the same) data, conflict is possible among the
transaction components, and the selection of one operational order over another may have
some undesirable consequences.
 The scheduler establishes the order in which the operations of concurrent transactions are executed
 Interleaves execution of database operations to ensure serializability
 Bases actions on concurrency control algorithms
 Locking
 Time stamping
 Ensures efficient use of computer’s CPU
 First-come-first-served (FCFS) basis – used for all transactions if there is no other way to schedule
the execution of transactions.
 Within a multi-user DBMS environment, FCFS scheduling tends to yield unacceptable
response times.
 READ and/or WRITE actions on the same data can produce conflicts.

Table 9 Read/Write Conflict Scenarios: Conflicting Database Operations Matrix

Note: the table below shows the possible conflict scenarios if two transactions, T1 and T2, are
executed concurrently over the same data.

 Schedules – sequences that indicate the chronological order in which instructions of
concurrent transactions are executed
 a schedule for a set of transactions must consist of all instructions of those transactions
 must preserve the order in which the instructions appear in each individual transaction.
 Example of schedules (refer right figures)
 Schedule 1 (right figure): Let T1 transfer $50 from A to B, and T2 transfer 10% of the
balance from A to B. The following is a serial schedule, in which T1 is followed by T2.
 Schedule 2 (right figure): Let T1 and T2 be the transactions defined previously. The
following schedule is not a serial schedule, but it is equivalent to Schedule 1.
 Schedule 3 (lower right figure): The following concurrent schedule does not preserve the
value of the sum A + B.
 Serializability – A (possibly concurrent) schedule is serializable if it is equivalent to a serial
schedule. Different forms of schedule equivalence give rise to the notions of:
1. conflict serializability
2. view serializability
 Conflict Serializability: Instructions Ii and Ij of transactions Ti and Tj respectively, conflict
if and only if there exists some item Q accessed by both Ii and Ij, and at least one of these
instructions wrote Q.
1. Ii = read(Q), Ij = read(Q). Ii and Ij don’t conflict.
2. Ii = read(Q), Ij = write(Q). They conflict.
3. Ii = write(Q), Ij = read(Q). They conflict.
4. Ii = write(Q), Ij = write(Q). They conflict.
 If a schedule S can be transformed into a schedule S’ by a series of swaps of non-conflicting
instructions, we say that S and S’ are conflict equivalent.
 We say that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule.
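
One common way to test conflict serializability in practice is the precedence-graph test, which these notes do not describe: draw an edge Ti -> Tj whenever an instruction of Ti conflicts with a later instruction of Tj, and check that the graph has no cycle. The sketch below is a minimal Python illustration under that assumption; the schedule format (transaction, action, item) and the function names are hypothetical.

```python
def conflicts(op1, op2):
    """Two instructions conflict if they belong to different transactions,
    access the same item, and at least one of them is a write."""
    trx1, action1, item1 = op1
    trx2, action2, item2 = op2
    return trx1 != trx2 and item1 == item2 and "write" in (action1, action2)


def precedence_graph(schedule):
    """Edge (Ti, Tj) means a conflicting instruction of Ti precedes one of Tj."""
    edges = set()
    for i, earlier in enumerate(schedule):
        for later in schedule[i + 1:]:
            if conflicts(earlier, later):
                edges.add((earlier[0], later[0]))
    return edges


def is_conflict_serializable(schedule):
    edges = precedence_graph(schedule)
    remaining = {op[0] for op in schedule}
    # Repeatedly remove transactions that no remaining transaction must precede.
    while remaining:
        free = [t for t in remaining
                if not any(dst == t and src in remaining for src, dst in edges)]
        if not free:
            return False          # every remaining transaction lies on a cycle
        remaining -= set(free)
    return True


# Interleaved, but all conflicts point from T1 to T2.
SCHEDULE_OK = [
    ("T1", "read", "A"), ("T1", "write", "A"),
    ("T2", "read", "A"), ("T2", "write", "A"),
    ("T1", "read", "B"), ("T1", "write", "B"),
    ("T2", "read", "B"), ("T2", "write", "B"),
]

# T1 reads A before T2 writes it, but writes A after T2: a cycle.
SCHEDULE_BAD = [
    ("T1", "read", "A"), ("T2", "write", "A"), ("T1", "write", "A"),
]

print(is_conflict_serializable(SCHEDULE_OK), is_conflict_serializable(SCHEDULE_BAD))
```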
 View Serializability: Let S and S’ be two schedules with the same set of transactions. S
and S’ are view equivalent if the following three conditions are met:
1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then
transaction Ti must, in schedule S’, also read the initial value of Q.
2. For each data item Q, if transaction Ti executes read(Q) in schedule S, and that value
was produced by transaction Tj (if any), then transaction Ti must in schedule S’ also
read the value of Q that was produced by transaction Tj.
3. For each data item Q, the transaction (if any) that performs the final write(Q) operation in
schedule S must perform the final write(Q) operation in schedule S’.
 As can be seen, view equivalence is also based purely on reads and writes.
 A schedule S is view serializable if it is view equivalent to a serial schedule.

 Every conflict serializable schedule is also view serializable.
 Schedule in right figure – a schedule which is view-serializable but not conflict serializable.
 Every view serializable schedule that is not conflict serializable has blind writes.
 Other Notions of Serializability
 Schedule in right figure given below produces same
outcome as the serial schedule < T1, T5 >, yet is not
conflict equivalent or view equivalent to it.
 Determining such equivalence requires analysis of
operations other than read and write.
 Recoverability – Need to address the effect of transaction failures on concurrently running
transactions.
 Recoverable schedule – if a transaction Tj reads a data item previously written by a
transaction Ti, the commit operation of Ti appears before the commit operation of Tj.
 The schedule in right figure is not recoverable if T9 commits immediately after the read.
 If T8 should abort, T9 would have read (and possibly shown to the user) an inconsistent
database state. Hence the database must ensure that schedules are recoverable.
 Cascading rollback – a single transaction failure leads to a series of transaction rollbacks.
Consider the following schedule where none of the transactions has yet committed (so the
schedule is recoverable).
 If T10 fails, T11 and T12 must also be rolled back.
 Can lead to the undoing of a significant amount of work
 Cascadeless schedules — cascading rollbacks cannot occur; for each pair of transactions
Ti and Tj such that Tj reads a data item previously written by Ti, the commit operation of
Ti appears before the read operation of Tj.
 Every cascadeless schedule is also recoverable

 It is desirable to restrict the schedules to those that are cascadeless

 Implementation of Isolation
 Schedules must be conflict or view serializable, and recoverable, for the sake of database
consistency, and preferably cascadeless.
 A policy in which only one transaction can execute at a time generates serial schedules,
but provides a poor degree of concurrency.
 Concurrency-control schemes trade off the amount of concurrency they allow
against the amount of overhead that they incur.
 Some schemes allow only conflict-serializable schedules to be generated, while others
allow view-serializable schedules that are not conflict-serializable.
Concurrency Control with Locking Methods
- A lock guarantees the current transaction exclusive use of a data item, i.e., transaction T2 does not
have access to a data item that is currently being used by transaction T1.
- A transaction acquires a lock prior to data access.
- The lock is released when the transaction is completed.
- DBMS automatically initiates and enforces locking procedures.
- All lock information is managed by lock manager.

Lock Granularity
 Lock granularity indicates the level of lock use: database, table, page, row, or field (attribute).

Database Level
 The entire database is locked.
 Transaction T2 is prevented from using any tables in the database while T1 is being executed.
 Good for batch processes, but unsuitable for online multi-user DBMSs.
 Refer Figure 2, transactions T1 and T2 cannot access the same database concurrently, even if
they use different tables. (The access is very slow!)

Table Level
 The entire table is locked. If a transaction requires access to several tables, each table may
be locked.

 Transaction T2 is prevented from using any row in the table while T1 is being executed.
 Two transactions can access the same database as long as they access different tables.
 It causes traffic jams when many transactions are waiting to access the same table.
 Table-level locks are not suitable for multi-user DBMSs.
 Refer Figure 3, transactions T1 and T2 cannot access the same table even if they try to use
different rows; T2 must wait until T1 unlocks the table.

Page Level
 The DBMS will lock an entire disk page (or page), which is the equivalent of a disk block, that is, a
(referenced) section of a disk.
 A page has a fixed size and a table can span several pages while a page can contain several
rows of one or more tables.
 Page-level lock is currently the most frequently used multi-user DBMS locking method.
 Figure 4 shows that T1 and T2 access the same table while locking different disk pages.
 If T2 needs a row located on a page that T1 has locked, T2 must wait until T1 unlocks the page.

Row Level
 Less restrictive than the locks discussed above, it allows concurrent transactions to
access different rows of the same table even if the rows are located on the same page.
 It improves the availability of data, but requires high overhead cost for management.
 Refer Figure 5 for row-level lock.

Field Level
 It allows concurrent transactions to access the same row, as long as they require the use of
different fields (attributes) within a row.
 The most flexible multi-user data access, but it requires an extremely high level of computer overhead.

Lock Types
 The DBMS may use different lock types: binary or shared/exclusive locks.
 A locking protocol is a set of rules followed by all transactions while requesting and
releasing locks. Locking protocols restrict the set of possible schedules.

Binary Locks
 Two states: locked (1) or unlocked (0).
 Locked objects are unavailable to other transactions.
 Unlocked objects are open to any transaction.
 Transaction unlocks object when complete.
 Every transaction requires a lock and unlock operation for each data item that is accessed.

Table 10 Example of Binary Lock Table

Note: the lock and unlock features eliminate the lost update problem encountered in Table 3.
However, binary locks are now considered too restrictive to yield optimal concurrency conditions.

Shared/Exclusive Locks
 Shared (S Mode)
 Exists when concurrent transactions are granted READ access
 Produces no conflict for read-only transactions
 Issued when a transaction wants to read and no exclusive lock is held on the item
 Exclusive (X Mode)
 Exists when access is reserved for the transaction that locked the object
 Used when potential for conflict exists (also refer Table 9)
 Issued when a transaction wants to update unlocked data
 Lock-compatibility matrix
 A transaction may be granted a lock on an item if the requested lock is compatible with
locks already held on the item by other transactions.
 Any number of transactions can hold shared locks on an item, but if any transaction holds
an exclusive lock on the item, no other transaction may hold any lock on the item.
 If a lock cannot be granted, the requesting transaction is made to wait until all incompatible
locks held by other transactions have been released. The lock is then granted.
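
The lock-compatibility rule above fits in a few lines of Python. This is only a sketch: the dictionary layout and the function name can_grant are assumptions made for illustration.

```python
# Lock-compatibility matrix: (mode already held, mode requested) -> compatible?
COMPATIBLE = {
    ("S", "S"): True,
    ("S", "X"): False,
    ("X", "S"): False,
    ("X", "X"): False,
}


def can_grant(requested_mode, held_modes):
    """A request is granted only if it is compatible with every lock that
    other transactions already hold on the item."""
    return all(COMPATIBLE[(held, requested_mode)] for held in held_modes)


print(can_grant("S", ["S", "S"]))   # shared locks can coexist
print(can_grant("X", ["S"]))        # an exclusive request must wait
```

A request on an item with no locks is always granted, because there is nothing for it to be incompatible with.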
 Reasons for the increased overhead of the lock manager
 The type of lock held must be known before a lock can be granted
 Three lock operations exist: READ_LOCK (to check the type of lock), WRITE_LOCK
(to issue the lock), and UNLOCK (to release the lock).
 The scheme has been enhanced to allow a lock upgrade (from shared to exclusive) and a
lock downgrade (from exclusive to shared).
 Problems with Locking
 Transaction schedule may not be serializable
 Managed through two-phase locking
 Schedule may create deadlocks
 Managed by using deadlock detection and prevention techniques

Two-Phase Locking
- Two-phase locking defines how transactions acquire and relinquish (or revoke) locks.
1. Growing phase – acquires all the required locks without unlocking any data. Once all
locks have been acquired, the transaction is in its locked point.
2. Shrinking phase – releases all locks and cannot obtain any new lock.
 Governing rules
 Two transactions cannot have conflicting locks
 No unlock operation can precede a lock operation in the same transaction
 No data are affected until all locks are obtained
 In the example for two-phase locking protocol (Figure 6), the transaction acquires all the
locks it needs (two locks are required) until it reaches its locked point.
 When the locked point is reached, the data are modified to conform to the transaction
requirements.
 The transaction is completed when it releases all of the locks it acquired in the first phase.
 Updates for two-phase locking protocols:
 Two-phase locking does not ensure freedom from deadlocks.
 Cascading roll-back is possible under two-phase locking. To avoid this, follow a
modified protocol called strict two-phase locking. Here a transaction must hold all its
exclusive locks till it commits/aborts.
 Rigorous two-phase locking is even stricter: here all locks are held till commit/abort. In
this protocol transactions can be serialized in the order in which they commit.
 There can be conflict serializable schedules that cannot be obtained if two-phase locking
is used.
 However, in the absence of extra information (e.g., ordering of access to data), two-
phase locking is needed for conflict serializability in the following sense:
Given a transaction Ti that does not follow two-phase locking, we can find a transaction
Tj that uses two-phase locking, and a schedule for Ti and Tj that is not conflict
serializable.
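
Whether a single transaction obeys the two-phase rule (no lock is acquired after the first unlock) can be checked mechanically. The sketch below assumes a hypothetical representation of a transaction as a list of (action, item) pairs; it checks only the growing/shrinking discipline, not lock conflicts between transactions.

```python
def follows_two_phase_locking(operations):
    """Return True if no lock is acquired after the first unlock."""
    shrinking = False
    for action, _item in operations:
        if action == "unlock":
            shrinking = True              # the shrinking phase has begun
        elif action == "lock" and shrinking:
            return False                  # a lock in the shrinking phase breaks the rule
    return True


# Acquires both locks, reaches its locked point, then releases them.
TWO_PHASE = [
    ("lock", "X"), ("lock", "Y"),
    ("write", "X"), ("write", "Y"),
    ("unlock", "X"), ("unlock", "Y"),
]

# Releases X before locking Y, so it is not two-phase.
NOT_TWO_PHASE = [
    ("lock", "X"), ("write", "X"), ("unlock", "X"),
    ("lock", "Y"), ("write", "Y"), ("unlock", "Y"),
]

print(follows_two_phase_locking(TWO_PHASE), follows_two_phase_locking(NOT_TWO_PHASE))
```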

Deadlocks
- A deadlock occurs when two transactions wait for each other to unlock data. For example:
T1 = access data items X and Y
T2 = access data items Y and X
 Deadly embrace – if T1 has not unlocked data item Y, T2 cannot begin; if T2 has not unlocked
data item X, T1 cannot continue. (Refer Table 11.)
 Starvation is also possible if concurrency control manager is badly designed.
 For example, a transaction may be waiting for an X-lock (exclusive mode) on an item,
while a sequence of other transactions request and are granted an S-lock (shared mode)
on the same item.
 The same transaction is repeatedly rolled back due to deadlocks.
 Control techniques
 Deadlock prevention – a transaction requesting a new lock is aborted if there is the
possibility that a deadlock can occur.
 If the transaction is aborted, all the changes made by this transaction are rolled back,
and all locks obtained by the transaction are released.
 It works because it avoids the conditions that lead to deadlocking.
 Deadlock detection – the DBMS periodically tests the database for deadlocks.
 If a deadlock is found, one of the transactions (the “victim”) is aborted (rolled back
and restarted), and the other transaction continues.
 Deadlock avoidance – the transaction must obtain all the locks it needs before it can be
executed.
 The technique avoids rollback of conflicting transactions by requiring that locks be
obtained in succession.
 The serial lock assignment required in deadlock avoidance increases transaction response
times.
 Control Choices
 If the probability of deadlocks is low, deadlock detection is recommended.
 If the probability of deadlocks is high, deadlock prevention is recommended.
 If response time is not high on the system priority list, deadlock avoidance might be
employed.
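
Deadlock detection is often implemented with a wait-for graph, a structure these notes do not introduce: an edge Ti -> Tj means that Ti is waiting for a lock held by Tj, and a cycle in the graph means a deadlock. The sketch below is a minimal Python illustration under that assumption; the function name and the dictionary format are hypothetical.

```python
def find_deadlock(waits_for):
    """Return a list of transactions forming a cycle, or None if there is none.

    `waits_for` maps each transaction to the set of transactions it waits for.
    """
    visiting, done = set(), set()

    def visit(trx, path):
        if trx in visiting:
            return path[path.index(trx):]     # the cycle closes here
        if trx in done:
            return None
        visiting.add(trx)
        path.append(trx)
        for other in sorted(waits_for.get(trx, ())):
            cycle = visit(other, path)
            if cycle:
                return cycle
        path.pop()
        visiting.discard(trx)
        done.add(trx)
        return None

    for trx in sorted(waits_for):
        cycle = visit(trx, [])
        if cycle:
            return cycle
    return None


# T1 holds X and waits for Y; T2 holds Y and waits for X (the deadly embrace).
DEADLY_EMBRACE = {"T1": {"T2"}, "T2": {"T1"}}
NO_DEADLOCK = {"T1": {"T2"}, "T2": set()}

print(find_deadlock(DEADLY_EMBRACE))   # ['T1', 'T2']
print(find_deadlock(NO_DEADLOCK))      # None
```

Once a cycle is found, the DBMS would pick one of its members as the victim, roll it back and restart it; how the victim is chosen is a policy decision that this sketch leaves out.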

Implementation of Locking
 A Lock manager can be implemented as a separate process to which transactions send lock
and unlock requests
 The lock manager replies to a lock request by sending a lock grant message (or a message
asking the transaction to roll back, in case of a deadlock)
 The requesting transaction waits until its request is answered
 The lock manager maintains a data structure called a lock table to record granted locks and
pending requests
 The lock table is usually implemented as an in-memory hash table indexed on the name of
the data item being locked
 Lock Table
 Black rectangles indicate granted locks, white ones indicate waiting requests
 Lock table also records the type of lock granted or requested
 New request is added to the end of the queue of requests for the data item, and granted if it
is compatible with all earlier locks
 Unlock requests result in the request being deleted, and later requests are checked to see if
they can now be granted
 If a transaction aborts, all waiting or granted requests of the transaction are deleted
 The lock manager may keep a list of locks held by each transaction, to implement this
efficiently

Concurrency Control with Time Stamping Methods
 Assigns a global, unique time stamp to each transaction
 Produces an explicit order in which transactions are submitted to the DBMS
 Properties
 Uniqueness: ensures that no equal time stamp values can exist.
 Monotonicity: ensures that time stamp values always increase.
 DBMS executes conflicting operations in time stamp order to ensure serializability of the
transactions.
 If two transactions conflict, one often is stopped, rolled back, and assigned a new time
stamp value.
 Each value stored in the database requires two additional time stamp fields
 One for the last time the field was read
 One for the last update
 Time stamping tends to demand a lot of system resources because there is a possibility that
many transactions may have to be stopped, rescheduled, and re-stamped.

Timestamp-Based Protocols
 Each transaction is issued a timestamp when it enters the system. If an old transaction Ti has
time-stamp TS(Ti), a new transaction Tj is assigned time-stamp TS(Tj) such that TS(Ti) <
TS(Tj).
 The protocol manages concurrent execution such that the time-stamps determine the
serializability order.
 In order to assure such behavior, the protocol maintains for each data item Q two timestamp
values:
 W-timestamp(Q) is the largest time-stamp of any transaction that executed write(Q)
successfully.
 R-timestamp(Q) is the largest time-stamp of any transaction that executed read(Q)
successfully.
 The timestamp ordering protocol ensures that any conflicting read and write operations are
executed in timestamp order.

 Suppose a transaction Ti issues a read(Q)
1. If TS(Ti) < W-timestamp(Q), then Ti needs to read a value of Q that was already
overwritten. Hence, the read operation is rejected, and Ti is rolled back.
2. If TS(Ti) ≥ W-timestamp(Q), then the read operation is executed, and R-timestamp(Q)
is set to the maximum of R-timestamp(Q) and TS(Ti).
 Suppose that transaction Ti issues write(Q).
1. If TS(Ti) < R-timestamp(Q), then the value of Q that Ti is producing was needed
previously, and the system assumed that that value would never be produced. Hence, the
write operation is rejected, and Ti is rolled back.
2. If TS(Ti) < W-timestamp(Q), then Ti is attempting to write an obsolete value of Q.
Hence, this write operation is rejected, and Ti is rolled back.
3. Otherwise, the write operation is executed, and W-timestamp(Q) is set to TS(Ti).
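
The read and write rules above translate almost line for line into code. The sketch below is a minimal Python illustration; the class and function names are hypothetical, timestamps are plain integers, and rolling a transaction back is represented simply by raising an exception.

```python
class RollbackRequired(Exception):
    """Raised when the protocol rejects an operation and Ti must be rolled back."""


class DataItem:
    def __init__(self, value):
        self.value = value
        self.r_timestamp = 0    # largest timestamp of any successful read(Q)
        self.w_timestamp = 0    # largest timestamp of any successful write(Q)


def read(ts, item):
    if ts < item.w_timestamp:
        # The value Ti needs has already been overwritten.
        raise RollbackRequired("read rejected")
    item.r_timestamp = max(item.r_timestamp, ts)
    return item.value


def write(ts, item, value):
    if ts < item.r_timestamp:
        # A younger transaction has already read the old value.
        raise RollbackRequired("write rejected: value was needed previously")
    if ts < item.w_timestamp:
        # Ti is attempting to write an obsolete value.
        raise RollbackRequired("write rejected: obsolete value")
    item.value = value
    item.w_timestamp = ts


q = DataItem(100)
write(2, q, 150)              # accepted, W-timestamp(Q) becomes 2
value_seen = read(3, q)       # accepted, R-timestamp(Q) becomes 3
try:
    write(2, q, 175)          # rejected: TS = 2 < R-timestamp(Q) = 3
    late_write_rejected = False
except RollbackRequired:
    late_write_rejected = True
```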

Concurrency Control with Optimistic Methods

 A validation-based protocol that assumes most database operations do not conflict.
 No requirement on locking or time stamping techniques.
 A transaction is executed without restrictions until it is committed, in the hope that all will
go well during validation.
 Two or three Phases:
 Read (and Execution) Phase – the transaction reads the database, executes the needed
computations, and makes the updates to a private copy of the database values.
 Validation Phase – the transaction is validated to ensure that the changes made will not
affect the integrity and consistency of the database.
 Write Phase – the changes are permanently applied to the database.
 The optimistic approach is acceptable for mostly read or query database systems that require
very few update transactions.
 Each transaction Ti has 3 timestamps
 Start(Ti) : the time when Ti started its execution
 Validation(Ti): the time when Ti entered its validation phase
 Finish(Ti): the time when Ti finished its write phase
 Serializability order is determined by the timestamp given at validation time, to increase
concurrency. Thus TS(Ti) is given the value of Validation(Ti).
 This protocol is useful and gives a greater degree of concurrency if the probability of conflicts is
low. That is because the serializability order is not pre-decided and relatively few
transactions will have to be rolled back.
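
These notes do not spell out the validation test itself. One commonly taught version, assumed here, lets Ti pass validation if, for every transaction Tj validated earlier, either Tj finished before Ti started, or Tj finished before Ti entered validation and nothing Tj wrote was read by Ti. The Python sketch below illustrates that assumed rule; the class, field and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Trx:
    name: str
    start: int                      # Start(Ti)
    validation: int                 # Validation(Ti)
    finish: Optional[int] = None    # Finish(Ti); None while still in progress
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)


def validate(ti, earlier):
    """Check Ti against every Tj with Validation(Tj) < Validation(Ti)."""
    for tj in earlier:
        if tj.finish is not None and tj.finish < ti.start:
            continue    # Tj completed before Ti began: no overlap at all
        if (tj.finish is not None
                and ti.start < tj.finish < ti.validation
                and not (tj.write_set & ti.read_set)):
            continue    # Tj's writes do not touch anything Ti read
        return False    # validation fails, Ti must be rolled back
    return True


t1 = Trx("T1", start=1, validation=3, finish=4, read_set={"A"}, write_set={"A"})
t2 = Trx("T2", start=2, validation=5, read_set={"B"}, write_set={"B"})
t3 = Trx("T3", start=2, validation=5, read_set={"A"}, write_set={"C"})
t4 = Trx("T4", start=6, validation=7, read_set={"A"}, write_set={"A"})

print(validate(t2, [t1]), validate(t3, [t1]), validate(t4, [t1]))
```

Here T2 overlaps T1 but reads nothing that T1 wrote, T3 read item A while T1 was still writing it, and T4 started after T1 had finished.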

Database Recovery Management

 Restores a database from a given state, usually inconsistent, to a previous
consistent state.
 Based on the atomic transaction property: all portions of the transaction must be treated as
a single logical unit of work, in which all operations must be applied and completed to
produce a consistent database.
 Levels of backup
 Full backup – dump of the database.
 Differential backup – only the last modifications done to the database are copied.
 Transaction log backup – only the transaction log operations that are not reflected in a previous
backup copy of the database are backed up.
 The database backup is stored in a secure place, usually in a different building, and protected
against dangers such as fire, theft, flood, and other potential calamities.
 Causes of Database Failure
 Software – may be traceable to the operating system, the DBMS software, application
programs, or viruses.
 Hardware – includes memory chip errors, disk crashes, bad disk sectors, disk full errors.
 Programming exemption – application programs or end users may roll back transactions
when certain conditions are defined.
 Transaction – the system detects deadlocks and aborts one of the transactions.
 External – a system suffers complete destruction due to fire, earthquake, flood, etc.

Transaction Recovery
 Four important concepts that affect the recovery process –

 Write-ahead-log protocol – ensures that transaction logs are always written before any
database data are actually updated.
 Redundant transaction logs – ensure that a disk physical failure will not impair the
DBMS ability to recover data.
 Database buffers – create temporary storage area in primary memory used to speed up
disk operations and improve processing time.
 Database checkpoint – an operation in which the DBMS writes all of its updated
buffers to disk; the checkpoint is also registered in the transaction log.
 Transaction recovery procedures generally make use of deferred-write and write-through
techniques.
 Deferred-write (or Deferred-update)
 Changes are written to the transaction log, not physical database.
 Database updated after transaction reaches commit point.
 Steps:
1. Identify the last checkpoint in the transaction log. This is the last time transaction
data was physically saved to disk.
2. For a transaction that started and committed before the last checkpoint, nothing
needs to be done, because the data are already saved.
3. For a transaction that performed a commit operation after the last checkpoint, the
DBMS uses the transaction log records to redo the transaction and to update the
database, using “after” values in the transaction log. The changes are made in
ascending order, from the oldest to the newest.
4. For any transaction with a ROLLBACK operation after the last checkpoint or that
was left active (with neither a COMMIT nor a ROLLBACK) before the failure
occurred, nothing needs to be done because the database was never updated.
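
The deferred-write steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log record layout `(txn, action, item, after_value)`, the function name `recover_deferred`, and the sample data are all hypothetical and do not come from any particular DBMS.

```python
def recover_deferred(db, log):
    """Deferred-write recovery: redo only transactions committed after the last checkpoint."""
    # Step 1: find the position of the last checkpoint record (-1 if there is none).
    last_cp = max((i for i, rec in enumerate(log) if rec[1] == "checkpoint"), default=-1)
    # Step 3: transactions whose COMMIT appears after the last checkpoint must be redone.
    committed_after_cp = {rec[0] for rec in log[last_cp + 1:] if rec[1] == "commit"}
    # Redo in ascending order (oldest to newest), using the "after" values.
    for txn, action, item, after in log:
        if action == "write" and txn in committed_after_cp:
            db[item] = after
    # Steps 2 and 4: everything else is left alone, because the database
    # was either already saved or never updated.
    return db


deferred_log = [
    ("T1", "write", "A", 10),
    ("T1", "commit", None, None),
    (None, "checkpoint", None, None),
    ("T2", "write", "B", 20),
    ("T2", "commit", None, None),
    ("T3", "write", "C", 30),      # T3 was still active at the failure
]
recovered = recover_deferred({"A": 10, "B": 0, "C": 0}, deferred_log)
print(recovered)
```

Only T2 is redone here: T1 committed before the checkpoint and T3 never committed.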
 Write-through (or immediate update)
 The database is immediately updated by transaction operations during the transaction's execution,
 even before the transaction reaches its commit point.
 The transaction log is also updated.
 If the transaction fails, the DBMS uses the log information to ROLLBACK the database.
 Steps:


1. Identify the last checkpoint in the transaction log. This is the last time transaction
data was physically saved to disk.
2. For a transaction that started and committed before the last checkpoint, nothing
needs to be done, because the data are already saved.
3. For a transaction that committed after the last checkpoint, the DBMS redoes the
transaction, using “after” values in the transaction log. Changes are applied in
ascending order, from the oldest to the newest.
4. For any transaction with a ROLLBACK operation after the last checkpoint or that
was left active (with neither a COMMIT nor a ROLLBACK) before the failure
occurred, the DBMS uses the transaction log records to ROLLBACK or undo the
operations, using the “before” values in the transaction log. Changes are applied
in reverse order, from the newest to the oldest.
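
A rough Python sketch of the write-through steps follows. It assumes a hypothetical log record layout `(txn, action, item, before_value, after_value)` and follows the order of the steps as listed (redo first, then undo); the function name and data are illustrative only.

```python
def recover_immediate(db, log):
    """Write-through recovery: redo committed work, undo uncommitted work."""
    # Step 1: find the last checkpoint record (-1 if there is none).
    last_cp = max((i for i, rec in enumerate(log) if rec[1] == "checkpoint"), default=-1)
    # Step 2: transactions that ended before the checkpoint need no action.
    ended_before_cp = {rec[0] for rec in log[:last_cp + 1]
                       if rec[1] in ("commit", "rollback")}
    # Step 3: transactions committed after the checkpoint are redone.
    to_redo = {rec[0] for rec in log[last_cp + 1:] if rec[1] == "commit"}
    # Step 4: transactions rolled back after the checkpoint or left active are undone.
    writers = {rec[0] for rec in log if rec[1] == "write"}
    to_undo = writers - ended_before_cp - to_redo

    for txn, action, item, before, after in log:             # oldest to newest
        if action == "write" and txn in to_redo:
            db[item] = after
    for txn, action, item, before, after in reversed(log):   # newest to oldest
        if action == "write" and txn in to_undo:
            db[item] = before
    return db


immediate_log = [
    ("T1", "write", "A", 5, 10),
    ("T1", "commit", None, None, None),
    (None, "checkpoint", None, None, None),
    ("T2", "write", "B", 0, 20),
    ("T2", "commit", None, None, None),
    ("T3", "write", "C", 7, 30),
    ("T3", "write", "C", 30, 40),   # T3 was still active at the failure
]
restored = recover_immediate({"A": 10, "B": 0, "C": 40}, immediate_log)
print(restored)
```

T3's two writes are undone newest first, so item C ends up with its original "before" value.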
Important Questions:

1. What is atomicity property

 Atomicity means that a transaction is an atomic unit of processing; it is either performed
in its entirety or not performed at all.
 Although a transaction is conceptually atomic, a transaction would usually consist of a
number of steps. It is necessary to make sure that other transactions do not see partial
results of a transaction and therefore either all actions of a transaction are completed or
the transaction has no effect on the database. Therefore a transaction is either completed
successfully or rolled back. This is sometimes called all-or-nothing.
2. Define serializability of transactions
A given set of interleaved transactions is said to be serializable if and only if
it produces the same results as some serial execution of the same transactions.

3. What is functional dependency

A functional dependency is a property of the semantics of the attributes in a relation. The
semantics indicate how attributes relate to one another, and specify the functional
dependencies between attributes. When a functional dependency is present, the dependency
is specified as a constraint between the attributes.


Consider a relation with attributes A and B, where attribute B is functionally dependent on
attribute A (written A → B). If we know the value of A and we examine the relation that holds this
dependency, we will find only one value of B in all of the tuples that have a given value of A,
at any moment in time. Note however, that for a given value of B there may be several
different values of A.

In the dependency A → B, A is the determinant of B and B is the consequent of A.

The determinant of a functional dependency is the attribute or group of attributes on the left-
hand side of the arrow in the functional dependency. The consequent of a fd is the attribute or
group of attributes on the right-hand side of the arrow.
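
As a small illustration, the check below tests whether a functional dependency holds in the current tuples of a relation. The function name `holds_fd` and the sample relation are assumptions made for this sketch. Note that a dependency is a property of the semantics, so a check on one instance can only refute it, not prove it.

```python
def holds_fd(rows, determinant, consequent):
    """Return True if every determinant value maps to exactly one consequent value."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in determinant)
        value = tuple(row[a] for a in consequent)
        # setdefault stores the first value seen for this key and returns it afterwards.
        if seen.setdefault(key, value) != value:
            return False
    return True


relation = [
    {"A": 1, "B": "x"},
    {"A": 2, "B": "x"},
    {"A": 1, "B": "x"},
]
a_determines_b = holds_fd(relation, ["A"], ["B"])
b_determines_a = holds_fd(relation, ["B"], ["A"])
print(a_determines_b, b_determines_a)
```

Here A → B holds, but B → A does not, because the value "x" of B appears with two different values of A.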

4. What is Normalization
 The process of decomposing unsatisfactory "bad" relations by breaking up their
attributes into smaller relations.
 Normalization is a process of analyzing relation schemas so that the following
can be achieved
1. Minimizing redundancy
2. Minimizing insertion, updating, deletion anomalies
5. What is revoke command
Revoke is a data control language (DCL) command, grouped with DDL in some systems, which is used
to withdraw privileges that were previously granted (for example by the DBA) using the Grant command.

6. What is Transaction
Def 1: A logical unit of database processing that includes one or more access operations (read:
retrieval; write: insert, update, or delete).

Def 2: A transaction is an executing program that forms a logical unit of database processing and
involves one or more database access operations (read: retrieval; write: insert, update, or delete).

 Transaction boundaries:


o Begin and End transaction.

 An application program may contain several transactions separated by the Begin and End
transaction boundaries.
 Basic operations are read and write
o read_item(X): Reads a database item named X into a program variable. To
simplify our notation, we assume that the program variable is also named X.
o write_item(X): Writes the value of program variable X into the database item
named X.
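
The two basic operations can be mimicked with a dictionary standing in for the database. This is a minimal sketch: the `transfer` transaction and the items X and Y with their starting values are hypothetical, and no atomicity or concurrency control is implemented.

```python
database = {"X": 100, "Y": 50}   # stand-in for the stored database items


def read_item(name):
    """Read database item `name` into a program variable."""
    return database[name]


def write_item(name, value):
    """Write the value of a program variable into database item `name`."""
    database[name] = value


def transfer(amount):
    # Begin transaction
    x = read_item("X")
    x = x - amount
    write_item("X", x)
    y = read_item("Y")
    y = y + amount
    write_item("Y", y)
    # End transaction


transfer(30)
print(database)
```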

7. What is Trigger
Triggers are simply stored procedures that are run automatically by the database whenever
some event happens.
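
As an example, the sketch below uses Python's standard `sqlite3` module to create a trigger that fires after a balance update. The table names `account` and `audit_log` and the trigger name `log_balance_change` are made up for illustration; trigger syntax differs somewhat between database products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER);
CREATE TABLE audit_log (account_id INTEGER, old_balance INTEGER, new_balance INTEGER);

-- The event is an UPDATE of account.balance; the action records old and new values.
CREATE TRIGGER log_balance_change
AFTER UPDATE OF balance ON account
BEGIN
    INSERT INTO audit_log VALUES (OLD.id, OLD.balance, NEW.balance);
END;

INSERT INTO account VALUES (1, 100);
""")

# The application only issues the UPDATE; the database runs the trigger by itself.
conn.execute("UPDATE account SET balance = 70 WHERE id = 1")
audit_rows = conn.execute("SELECT * FROM audit_log").fetchall()
print(audit_rows)
```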

8. What is the use of serializability

 Achieving concurrency by executing a number of transactions at a time
 Fast response to the user with correct results
 Efficient utilization of resources
9. What are transaction primitives
Transaction boundaries are nothing but transaction primitives. They are

Begin Transaction and End Transaction.

10. What is an Assertion

Assertion is nothing but a name given to a set of user defined constraints.

11. What are uses of Transaction

 It’s all about fast query response time and correctness
 A DBMS is a multi-user system
o Many different requests
o Some against same data items
 Figure out how to interleave requests to shorten response time while guaranteeing correct results
o How does DBMS know which actions belong together?
 Solution: Group database operations that must be performed together into transactions
o Either execute all operations or none
What are Anomalies

 Refers to a deviation from the common rule(s), type(s), arrangement(s), or



 The general anomalies are insertion, update, and deletion anomalies.

Define view serializability

View equivalence:

 Two schedules are said to be view equivalent if the following three conditions hold:
1. The same set of transactions participates in S and S’, and S and S’ include the
same operations of those transactions.
2. For any operation Ri(X) of Ti in S, if the value of X read by the operation has
been written by an operation Wj(X) of Tj (or if it is the original value of X before
the schedule started), the same condition must hold for the value of X read by
operation Ri(X) of Ti in S’.
3. If the operation Wk(Y) of Tk is the last operation to write item Y in S, then
Wk(Y) of Tk must also be the last operation to write item Y in S’.

View serializability:

 Definition of serializability based on view equivalence.

 A schedule is view serializable if it is view equivalent to a serial schedule.
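
The three conditions can be checked mechanically for small schedules. In the sketch below a schedule is assumed to be a list of `(transaction, operation, item)` tuples with operation `"R"` or `"W"`; this representation and the function names are assumptions of the example. The serializability test simply tries every serial order, which is only practical for a handful of transactions.

```python
from collections import Counter
from itertools import permutations


def view_profile(schedule):
    """Return (reads-from relation, final writer of each item) for a schedule."""
    read_no, write_no = Counter(), Counter()
    last_write = {}    # item -> (txn, n): the most recent write so far
    reads_from = {}    # (txn, item, n) -> write that was read, None = original value
    for txn, op, item in schedule:
        if op == "R":
            read_no[txn, item] += 1
            reads_from[txn, item, read_no[txn, item]] = last_write.get(item)
        else:
            write_no[txn, item] += 1
            last_write[item] = (txn, write_no[txn, item])
    return reads_from, last_write


def view_equivalent(s1, s2):
    same_operations = Counter(s1) == Counter(s2)                       # condition 1
    return same_operations and view_profile(s1) == view_profile(s2)   # conditions 2 and 3


def view_serializable(schedule):
    txns = sorted({txn for txn, _, _ in schedule})
    for order in permutations(txns):
        serial = [step for t in order for step in schedule if step[0] == t]
        if view_equivalent(schedule, serial):
            return True
    return False


blind_writes = [("T1", "R", "X"), ("T2", "W", "X"), ("T1", "W", "X"), ("T3", "W", "X")]
lost_update = [("T1", "R", "X"), ("T2", "R", "X"), ("T1", "W", "X"), ("T2", "W", "X")]
print(view_serializable(blind_writes), view_serializable(lost_update))
```

The first schedule is view equivalent to the serial order T1, T2, T3; the second (a lost update) is not view equivalent to any serial order.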

What are transaction properties

 Atomicity: A transaction is an atomic unit of processing; it is either performed in its
entirety or not performed at all.
 Consistency preservation: A correct execution of the transaction must take the
database from one consistent state to another.
 Isolation: A transaction should not make its updates visible to other transactions until
it is committed; this property, when enforced strictly, solves the temporary update
problem and makes cascading rollbacks of transactions unnecessary .
 Durability or permanency: Once a transaction changes the database and the changes
are committed, these changes must never be lost because of subsequent failure.

Define Latches


Locks held for a short duration are called latches. Latches do not follow the usual concurrency
control protocol; rather, they are used to guarantee the physical integrity of a page while that page
is being written from the buffer to disk.

Define Exclusive lock

An exclusive lock specifies that no transaction other than the one holding the lock can access
the data item. A write lock is generally called an exclusive lock.
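
A lock manager typically decides whether a request can be granted from a compatibility table. The sketch below assumes the usual pairing of an exclusive (X) lock with a shared (S) read lock; the shared mode, the table, and the function name `can_grant` are assumptions added for illustration.

```python
# (mode already held by another transaction, mode requested) -> compatible?
COMPATIBLE = {
    ("S", "S"): True,
    ("S", "X"): False,
    ("X", "S"): False,
    ("X", "X"): False,
}


def can_grant(requested, held_by_others):
    """A request is granted only if it is compatible with every lock held by others."""
    return all(COMPATIBLE[(held, requested)] for held in held_by_others)


print(can_grant("X", []), can_grant("S", ["S"]), can_grant("X", ["S"]), can_grant("S", ["X"]))
```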

Define Certify lock

Certify lock is a lock used in multi version concurrency technique to certify that the new
version created during write operation is going to be stored permanently in database.

Define Granularity
Granularity means the size of the data item, which may be one of the following:

1. A database record
2. A field value of a database record
3. A disk block
4. A whole file
5. The whole database
Important Questions:

1) What is recovery and atomicity

2) What is recovery with concurrent transaction
3) Write a note on system crash
4) Write a note on how to overcome the problems of non-volatile storage devices


Object Oriented Data Base Development: Introduction, Object Definition language, creating
object instances, Object query language.


Definition of OODBMS:
An object-oriented database management system (OODBMS) is a database management system
(DBMS) that supports the modelling and creation of data as objects. This includes some kind of
support for classes of objects and the inheritance of class properties and methods by subclasses
and their objects.

Another definition is provided by “The Object-Oriented Database System Manifesto”, in which
Malcolm Atkinson and others define an OODBMS as follows:

An object-oriented database system must satisfy two criteria: it should be a DBMS, and it should
be an object-oriented system, i.e., to the extent possible, it should be consistent with the current
crop of object-oriented programming languages. The first criterion translates into five features:
persistence, secondary storage management, concurrency, recovery and an ad hoc query facility.
The second one translates into eight features: complex objects, object identity, encapsulation,
types or classes, inheritance, overriding combined with late binding, extensibility and
computational completeness.

Comparison to Relational Databases:

Relational databases store data in tables that are two dimensional. The tables have rows and
columns. Relational database tables are "normalized" so data is not repeated more often than
necessary. All columns in a table depend on a primary key (a value that is unique for each row) to
identify the row. Once the specific row is identified, data from one or more columns of
that row may be obtained or changed. Breaking complex information out into simple data
takes time and is labor intensive. Code must be written to accomplish this task.

When to use OODBMS:

Object oriented databases should be used when there is complex data and/or complex data
relationships. This includes a many to many object relationship. Object databases should not be


used when there would be few join tables and there are large volumes of simple transactional data.

Object oriented databases work well with:

1. CAS Applications (CASE: computer-aided software engineering, CAD: computer-aided
design, CAM: computer-aided manufacturing)

2. Multimedia Applications

3. Object projects that change over time

4. Complex Data Relationships

Advantages and Disadvantages of OODBMS:

Advantages of OODBMS over RDBMS:

1. Objects don't require assembly and disassembly saving coding time and execution time to
assemble or disassemble objects.

2. Complex (Inter-) Relationships

3. Complex Data

4. Reduced paging

5. Easier navigation

6. Better concurrency control - A hierarchy of objects may be locked.

Disadvantages of OODBMS over RDBMS:

1. Lower efficiency when data is simple and relationships are simple.

2. Relational tables are simpler.

3. OODBMS is based on object model – which lacks the solid theoretical foundation of the
relational model on which the RDBMS is built.

4. OODBMSs do not provide a standard ad hoc query language, as relational systems do.

5. The lack of compatibility between different OODBMSs makes switching from one piece
of software to another very difficult. With RDBMSs, different products are very similar,
and switching from one to another is relatively easy.


6. Late binding may slow access speed.

7. More user tools exist for RDBMS.

8. Standards for RDBMS are more stable.

Mandatory Features of an OODBMS:

An OODBMS is the result of combining OO features, such as class inheritance, encapsulation,

and polymorphism, with database features such as data integrity, security, persistence,
transaction management, concurrency control, backup, recovery, data manipulation, and system
tuning [14].

Fig : Object-Oriented Database Management Systems

The OODBMS features, which include 13 mandatory features and optional characteristics of
OODBMS, are defined as follows:

Rule 1: The system must support complex objects.

It must be possible to construct complex objects from existing objects. Simplest objects are
objects such as integers, characters, byte strings of any length, Booleans and floats (one might
add other atomic types).

Examples include sets, lists, and tuples that allow the user to define aggregation of objects as


Sets are critical because they are a natural way of representing collections from the real world.
Tuples are critical because they are a natural way of representing properties of an entity.

Lists or arrays are important because they capture order, which occurs in the real world, and they
also arise in many scientific applications, where people need matrices or time series data.

Rule 2: Object identity must be supported.

The OID must be independent of the object’s state. This feature allows the system to compare
objects at two different levels: comparing the OID (identical objects) and comparing the object’s
value (equal objects).

An object has an existence which is independent of its value. Thus two notions of object
equivalence exist: two objects can be identical (they are the same object) or they can be equal
(they have the same value). This has two implications: one is object sharing and the other one is
object updates.

Object sharing: in an identity-based model, two objects can share a component.

Object updates: Object identity is also a powerful data manipulation primitive that can be the
basis of set, tuple and recursive complex object manipulation.
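
Python's `is` and `==` operators give a convenient, if loose, analogy for the two notions of equivalence: `is` compares identity and `==` compares value. The classes `Address` and `Person` below are made up for this sketch, and a Python object reference is not a persistent OID.

```python
class Address:
    def __init__(self, city):
        self.city = city

    def __eq__(self, other):
        # Value equality: two addresses are equal if they have the same state.
        return isinstance(other, Address) and self.city == other.city


class Person:
    def __init__(self, name, address):
        self.name = name
        self.address = address   # reference to a (possibly shared) component object


a = Address("Delhi")
b = Address("Delhi")
equal_objects = (a == b)       # same value
identical_objects = (a is b)   # same object?

# Object sharing: both persons reference the very same Address object.
p1 = Person("Asha", a)
p2 = Person("Ravi", a)

# Object update: a change made through p1 is visible through p2.
p1.address.city = "Pune"
print(equal_objects, identical_objects, p2.address.city)
```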

Rule 3: Objects must be encapsulated.

Objects have a public interface, but a private implementation of data and methods. The
encapsulation feature ensures that only the public aspect of the object is seen, while the
implementation details are hidden. This meets the need for modularity, which is necessary to structure
complex applications designed and implemented by a team of programmers.

Encapsulation provides a form of "logical data independence": we can change the
implementation of a type without changing any of the programs using that type. Thus, the
application programs are protected from implementation changes in the lower layers of the system.
Rule 4: The systems must support types or classes.

This rule allows the designer to choose whether the system supports types or classes. Types are
used mainly at compile time to check type errors in attribute value assignments.


A type corresponds to the notion of an abstract data type. It has two parts: the interface and the
implementation (or implementations). The interface part is visible to the users of the type; the
implementation of the object is seen only by the type designer.

The notion of a class is similar to that of a type, but it is more of a run-time notion. It contains two aspects:
an object factory and an object warehouse. The object factory can be used to create new objects,
by performing the operation new on the class, or by cloning some prototype object representative
of the class. The object warehouse means that attached to the class is its extension, i.e., the set of
objects that are instances of the class.

Rule 5: The system must support inheritance

An object must inherit the properties of its superclasses in the class hierarchy. This ensures code
reusability.

Rule 6: The system must avoid premature binding.

This feature allows us to use the same method’s name in different classes. The OO system
decides which implementation to access at run time, on the basis of the class to which the object
belongs. Also known as late binding or dynamic binding.

Rule 7: The system must be computationally complete.

The basic notions of programming languages are augmented by features common to the database
data manipulation language (DML), thereby allowing us to express any type of operation in the
language.

Rule 8: The system must be extensible.

The final OO feature concerns the ability to define new types. There is no management distinction
between user-defined types and system-defined types.

Rule 9: The system must be able to remember data locations.

A conventional DBMS stores data permanently on disk; that is, the DBMS provides data
persistence. OO systems usually keep the entire object space in memory; once the system is shut
down, the entire object space is lost.

Rule 10: The system must be able to manage very large databases.


Typical OO systems limit the object space to the amount of primary memory available. For
example: Smalltalk cannot handle objects larger than 64K. Therefore, a critical OODBMS
feature is to optimize the management of secondary storage devices by using buffers, indexes,
data clustering, and access path selection techniques.

Rule 11: The system must support concurrent users.

Conventional DBMSs are especially capable in this area. OODBMS must support the same level
of concurrency as conventional systems.

Rule 12: The system must be able to recover from hardware and software failures.

OODBMS must offer the same level of protection from hardware and software failures that the
traditional DBMS provides. It must provide support for automated backup and recovery tools.

Rule 13: Data query must be simple.

Relational DBMSs have provided a standard database query method through SQL (Structured
Query Language). OODBMSs provide an object query language (OQL) with similar capability.

Object Oriented Database Concepts:
Basic Object-Oriented DB concepts are:
1. Objects
2. Object Identity
3. Attributes
4. Object State
5. Messages and Methods
6. Encapsulation
7. Classes
8. Inheritance
9. Method Overloading
10. Polymorphism
11. Object Classification


An Object is an abstract representation of a real-world entity that has a unique identity,
embedded properties, and the ability to interact with other objects.

Note: The difference between an object and entity is that entity has data components and
relationships but lacks manipulative ability.

Fig: Objects

Object Identity:

An Object’s identity is represented by an object ID (OID), which is unique. It is assigned by the
system at the moment of creation and cannot be changed. It can be deleted only if the object is
deleted, and that OID can never be reused.

Attributes (Instance Variables):

Objects are described by their attributes, known as instance variables. Each attribute has a unique
name and a data type associated with it. It has a domain, which logically groups and describes
the set of all possible values that an attribute can have.

For Ex: GPA domain, “any positive number between 0.00 and 4.00, with only two decimal places”.

An object’s attributes can be single-valued or multivalued, the same as in the E-R model. Object
attributes may reference one or more other objects.

For Ex: the attribute MAJOR refers to a Department object, and the attribute COURSE_TAKEN
refers to a list of course objects.

Object State:


Object State is a set of values that the object’s attributes have at a given time. It can be changed
by changing the values of object’s attributes. To change the object’s attribute values, send a
message to the object which will invoke a method.

Messages and Methods:

Methods are the code that performs a specific operation on an object’s data. Methods protect data
from direct and unauthorized access by other objects, and they represent the object’s behaviour. They
are used to change the object’s attribute values or to return the value of selected object attributes.

Fig : Depiction of an OBJECT

Every method is identified by a name and has a body. The body is composed of instructions
written in programming language to represent a real-world action.

Method Avegpa                 (Method’s name)
    xgpa = 0
    [Compute student GPA by using the object’s attributes SEMESTER_GPA and OVERALL_GPA]
    Return (xgpa)             (Returns the average GPA)

A method is invoked by sending a message to the object. A message is sent by specifying a
receiver object, the name of the method, and any required parameters.
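
A possible Python rendering of the method and of sending a message is shown below. The source leaves out how the average is computed, so the plain mean of the two attributes is purely an assumption of this sketch, as are the class name `Student` and the sample values.

```python
class Student:
    def __init__(self, semester_gpa, overall_gpa):
        # Attributes (instance variables) of the object.
        self._semester_gpa = semester_gpa
        self._overall_gpa = overall_gpa

    def avegpa(self):
        """Method body: instructions that operate on the object's own data."""
        xgpa = 0
        # Assumption: the computation is not given in the text; a plain mean is used here.
        xgpa = (self._semester_gpa + self._overall_gpa) / 2
        return xgpa


student = Student(3.0, 3.5)

# Sending a message: receiver object, method name, and (here no) parameters.
direct_result = student.avegpa()
# The same message with the method name supplied as data at run time.
named_result = getattr(student, "avegpa")()
print(direct_result, named_result)
```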

The ability to hide the object’s internal details (attributes and methods) from the message sender is
known as encapsulation.



A class is a blueprint to create objects, which includes shared structure (attributes) and behavior
(methods) that all similar objects created share. It contains the description of the data structure
and the method implementation details for objects in that class.

An object must belong to only one class as an instance of that class (instance-of relationship). A
class is similar to an abstract data type. A class may also be primitive (no attributes), e.g. integer,
string, Boolean. Each object in a class is known as a class instance or object instance. It
encapsulates state through data placeholders called member variables and behavior through
reusable code called methods.

Fig : Class Illustrations

Fig : Representation of Class STUDENT

OO concepts (Class, Object, Method, State, Identity, Protocol, and Messages) together are shown
in a pictorial diagram:




Inheritance derives a new class (subclass) from an existing class (superclass). The subclass inherits
all the attributes and methods of the existing class and may have additional attributes and
methods. An important benefit of inheritance in OO systems is the notion of substitutability.

Inheritance is a way to form new classes, known as derived classes, that take over (or inherit) the
attributes and behavior of pre-existing classes, called base classes (or ancestor classes).

Two variants of inheritance exist:

1. Single

2. Multiple

Single Inheritance:

Exists when a class has only one immediate (parent) superclass above it. When a message is sent to an
object instance, the entire class hierarchy is searched for a matching method, using the following
sequence:

1. Scan the class to which the object belongs.

2. If the method is not found, scan the superclass.

3. Repeat the scanning process until one of the following occurs:

   a. The method is found.

   b. The top of the class hierarchy is reached without finding the method.
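
The search sequence can be imitated by hand in Python, which normally performs this lookup automatically. The helper `find_method` and the classes `Person` and `Student` are hypothetical names for this sketch, which assumes single inheritance only.

```python
def find_method(obj, name):
    """Walk up a single-inheritance hierarchy looking for a method called `name`."""
    cls = type(obj)                      # start with the class the object belongs to
    while cls is not None:
        if name in vars(cls):            # the method is defined in this class
            return vars(cls)[name]
        bases = cls.__bases__            # otherwise move to the superclass
        cls = bases[0] if bases else None
    return None                          # top of the hierarchy reached, nothing found


class Person:
    def describe(self):
        return "a person"


class Student(Person):
    def enroll(self):
        return "enrolled"


st = Student()
found_in_superclass = find_method(st, "describe")
not_found = find_method(st, "graduate")
print(found_in_superclass(st), not_found)
```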


Fig : Single Inheritance

Multiple Inheritance:

Exists, when a class can have more than one immediate (parent) superclass above it. A class can
inherit behaviours and features from more than one superclass, whereas in single inheritance, a
class may inherit from at most one superclass.

Ex: The Motorcycle subclass inherits characteristics from both the Motor Vehicle and Bicycle
superclasses. From the Motor Vehicle superclass, the Motorcycle subclass inherits characteristics
such as fuel requirements, engine pistons, and horsepower, and behavior such as start motor, fill
gas, and depress clutch.

From the Bicycle superclass, the Motorcycle subclass inherits characteristics such as two wheels
and handlebars, and behavior such as straddle the seat and move the handlebar to turn.
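
The motorcycle example maps directly onto Python, which supports multiple inheritance. The attribute values and method bodies below are placeholders chosen for this sketch.

```python
class MotorVehicle:
    horsepower = 50                      # characteristic (placeholder value)

    def start_motor(self):               # behavior
        return "motor started"


class Bicycle:
    wheels = 2                           # characteristic

    def turn(self):                      # behavior
        return "handlebar moved"


class Motorcycle(MotorVehicle, Bicycle):
    """Inherits characteristics and behavior from both superclasses."""


bike = Motorcycle()
lookup_order = [cls.__name__ for cls in Motorcycle.__mro__]
print(bike.start_motor(), bike.turn(), bike.wheels, lookup_order)
```

When both superclasses define the same name, Python resolves it in the order given by `Motorcycle.__mro__`; other object systems may resolve such conflicts differently.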

Fig : Multiple Inheritance


Encapsulation is the ability of an object to be a container for related properties (i.e. data
variables) and methods (i.e. functions). Data hiding is the ability of objects to protect variables
from external access. Variables marked as private can only be seen or modified through the use
of public accessor and mutator methods.

Encapsulation is used to implement abstraction. We combine the data and the methods that operate
on the data and put them in a single unit. Every module of the system can then change independently,
with no impact on the other modules.


For Ex: Think of a person driving a car. He doesn’t need to know how the engine works or how the
gear changes work to be able to drive the car (Encapsulation). Instead, he needs to know things such as
how much to turn the steering wheel, etc. (Abstraction).

Fig : Graphical representation of Encapsulation for car class

Method Overriding:

The ability of a subclass to override a method in its superclass allows a class to inherit from a
superclass whose behavior is "close enough" and then override methods as needed. A subclass
cannot override methods that are declared final in the superclass (by definition, final methods
cannot be overridden). A subclass must override methods that are declared abstract in the
superclass, or the subclass itself must be abstract. Overriding is the process of defining a function in
the child class with the same name as a function in the parent class. The child class method hides the
parent class method. Method Overriding is used to provide different implementations of a function so
that a more specific behavior can be realized.

Fig : Employee Class Hierarchy Method Override


For Ex: A Bonus method is defined as shown above to compute a Christmas bonus for all
employees. Bonus computation depends on the type of the employee. In this case, with the
exception of pilots, an employee receives a Christmas bonus equal to 5 percent of his salary.
Pilots receive a Christmas bonus based on accumulated flight pay rather than on annual salary. By
defining the Bonus method in the Pilot subclass, we override the Employee Bonus method
for all objects that belong to the Pilot subclass.
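
A short Python sketch of this override follows. The 5 percent rule for ordinary employees comes from the text; the rate applied to a pilot's flight pay is not given, so the 5 percent used for pilots is an assumption, as are the sample figures.

```python
class Employee:
    def __init__(self, salary):
        self.salary = salary

    def bonus(self):
        # Christmas bonus: 5 percent of the annual salary.
        return self.salary * 5 / 100


class Pilot(Employee):
    def __init__(self, salary, flight_pay):
        super().__init__(salary)
        self.flight_pay = flight_pay

    def bonus(self):
        # Overrides Employee.bonus: based on accumulated flight pay.
        # Assumption: the rate is not stated in the text; 5 percent is used here.
        return self.flight_pay * 5 / 100


staff = [Employee(40000), Pilot(60000, 10000)]
# The same message is sent to each object; the object's class selects the method.
bonuses = [member.bonus() for member in staff]
print(bonuses)
```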


Polymorphism is the capability of an action or method to do different things based on the object
that it is acting upon. Overloading and Overriding are two types of polymorphism.

We may use the same name for a method defined in different classes in the class hierarchy. The
user may send the same message to different objects that belong to different classes and yet
generate the correct response.

Fig : Employee Class Hierarchy Polymorphism

For Ex: Let’s consider Expanded Employee Class hierarchy:

The Pilot monthPay method definition overrides and expands the Employee monthPay method
defined in the Employee superclass.

The monthPay method that was defined in the Employee superclass is reused by the Pilot and
Mechanic subclasses.



Characteristics of an Object-Oriented Data Model:

OODBMS should support complex objects representation. It should be extensible; i.e. it should
be capable of defining new data types and operations to be carried out on them. It should support
encapsulation; i.e. data representation and method’s implementation must be hidden from
external entities. It should exhibit inheritance; i.e. an object must be able to inherit the properties
(data and methods) of other objects.

Drawbacks of Relational Data Model:

The first deficiency of the RDBMS lies in the SQL-92 relational language, which is limited. It supports
a restricted set of built-in data types that accommodate only numbers and strings, whereas many
database applications deal with complex objects such as geographic points, text, and
digital data. The problem is how this data is used. The second deficiency is that the RDBMS suffers
from certain structural shortcomings: relational tables are flat and do not provide good support
for nested structures, such as sets and arrays. The third deficiency is that the RDBMS did not take
advantage of object-oriented approaches, which have gained widespread acceptance. OO
techniques reduce costs and improve information system quality by adopting an object-centric
view of software development.

Drawbacks of Object Oriented Data Model:

The first deficiency of the OODBMS is that vendors rediscovered the difficulties of tying database
design too closely to application design. The second is that they relearned that
declarative languages such as SQL-92 bring such tremendous productivity gains that
organizations will pay for the additional computational resources they require. The third is that
they rediscovered the fact that the lack of a standard data model leads to design errors
and inconsistencies.

The main drawback of OODBMSs has been poor performance. Unlike in RDBMSs, query
optimization for OODBMSs is highly complex. OODBMSs also suffer from problems of
scalability, and are unable to support large-scale systems.

Thus, ORDBMS emerged as a way to enhance the capabilities of RDBMS with some of the
features that appeared in OODBMS.


TABLE : A Comparison of Database Management Systems

Object Oriented Database

An object-oriented database consists of classes as the schema and objects as the data. Each object has
an OID as a unique identifier, and its data operations are encapsulated. A subclass object and a
superclass object are physically one object seen in different views, such that a subclass inherits the
data and operations of its superclass. Objects are associated with each other through stored OIDs in
bidirectional references, that is, an association and an inverse association. A stored OID is a reference
that uses an OID generated by the system. Polymorphism means overloading in an object-oriented
database, such that the same function can produce different output depending on the input values
supplied at run time.

In mapping a relational schema into an object-oriented schema, each relation is mapped into a
class, and each foreign key is mapped into an association attribute whose data structure is a
stored OID. Superclass and subclass relations are also mapped into a superclass and subclasses
in the object-oriented database, with each subclass inheriting the data and operations of its superclass.
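
The mapping of a foreign key into a stored OID can be sketched as follows. Everything named here is hypothetical: the relations `dept_rows` and `emp_rows`, the classes `Department` and `Staff`, and a simple counter standing in for system-generated OIDs.

```python
from dataclasses import dataclass, field
from itertools import count

_next_oid = count(1)   # stand-in for system-generated OIDs


@dataclass
class Department:
    name: str
    oid: int = field(default_factory=lambda: next(_next_oid))
    employees: list = field(default_factory=list)   # inverse association: set of stored OIDs


@dataclass
class Staff:
    name: str
    works_in: int                                    # association attribute: a stored OID
    oid: int = field(default_factory=lambda: next(_next_oid))


# Relational source: department(dept_no, name), employee(emp_no, name, dept_no as foreign key)
dept_rows = [(10, "CSE"), (20, "ECE")]
emp_rows = [(1, "Asha", 10), (2, "Ravi", 10), (3, "Meena", 20)]

objects = {}          # oid -> object, stands in for the object store
dept_oid_by_key = {}  # relational key -> OID, needed only during the mapping
for dept_no, name in dept_rows:
    dept = Department(name)
    objects[dept.oid] = dept
    dept_oid_by_key[dept_no] = dept.oid

for emp_no, name, dept_no in emp_rows:
    emp = Staff(name, works_in=dept_oid_by_key[dept_no])   # foreign key becomes a stored OID
    objects[emp.oid] = emp
    objects[emp.works_in].employees.append(emp.oid)        # maintain the inverse association

cse = objects[dept_oid_by_key[10]]
print(cse.name, [objects[oid].name for oid in cse.employees])
```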


An extended relational database is an object-relational database (ORDB). Fixed data types make a
relational database (RDB) inflexible.

An object-oriented database (OODB) allows user-defined data types. Object-oriented
programming is programming with object-oriented functions, but it does not necessarily use an OODB.

A method is an application program with a set of operations accessing a class inside object-
oriented database. In other words, a method describes object operations limited to a class only.

An object-oriented schema defines an object class definition including { Name }



A complex data type is an object inside another object. A primitive data type is a data type that
cannot be decomposed further.

An OID is an object identity which is system generated with a unique address. A Stored OID is
an OID stored in another object used as a pointer for reference.

Inheritance means the reuse (inheritance) of superclass data and operations in a subclass. The
benefits are to eliminate redundancy in data storage and to provide different logical views
(superclass and subclass) of the same object. The same object appears in the superclass view and
the subclass view, but is stored as ONE object inside the OODB.

A class is a set of objects grouped together; it is the data structure of an OODB schema. An
object is a data instance that includes an object title, attributes and methods.

A superclass is a class that includes subclass(es). A subclass is a class that is inside a superclass
and can inherit the data and methods of the superclass.

Objects refer to each other through bi-directional pointers (pointers and inverse pointers). An
object can refer to itself through a recursive pointer.

A data model is a DDL plus a DML of a database. A database system is database storage plus the
basic database functions of transaction processing, recovery, concurrency control and security.

A classification is a schema class. An instantiation is a data instance of a class. An Association
Attribute is an attribute with a stored OID referring to another object.

A Set in an Association Attribute data type means referring to a set of multiple occurrences of
other objects by using a set of stored OID in one-to-many association between two objects.

A Set value is a set of multiple values, which is a valid data type in OODB.
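
The definitions of OID, Stored OID, Association Attribute and Set can be illustrated with a small sketch. It is only a toy model in Python, not how any particular OODB product stores objects: the class names (PersistentObject, Department, Employee), the counter that hands out OIDs and the dictionary used as an object store are all assumptions made for the example.

```python
import itertools

# Assumption: a simple counter stands in for the system's OID generator.
_oid_counter = itertools.count(1)


class PersistentObject:
    """Every object receives a unique, system-generated OID."""

    def __init__(self):
        self.oid = next(_oid_counter)


class Department(PersistentObject):
    def __init__(self, name):
        super().__init__()
        self.name = name
        self.employee_oids = set()  # Set of stored OIDs (one-to-many side)


class Employee(PersistentObject):
    def __init__(self, name):
        super().__init__()
        self.name = name
        self.department_oid = None  # stored OID (inverse association)
        self.manager_oid = None     # recursive reference to another Employee


def assign(employee, department):
    """Keep the association and its inverse consistent with each other."""
    employee.department_oid = department.oid
    department.employee_oids.add(employee.oid)


store = {}  # hypothetical object store: OID -> object
sales = Department("Sales")
ann, bob = Employee("Ann"), Employee("Bob")
for obj in (sales, ann, bob):
    store[obj.oid] = obj

assign(ann, sales)
assign(bob, sales)
bob.manager_oid = ann.oid  # an Employee referring to another Employee
```

Following a stored OID is then a lookup in the store, for example store[bob.department_oid] for the department and store[bob.manager_oid] for the manager.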


Polymorphism is overloading: a call to the same function name gives different results
depending on the run-time parameters.
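
Inheritance and polymorphism can be sketched in a few lines of Python. This is an illustration under assumptions, not an OODB schema: the classes Person and Student and the method describe are hypothetical, and polymorphism is shown here through method overriding, where the result of the same call depends on the object supplied at run time.

```python
class Person:
    """Superclass: holds the data shared by all persons."""

    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"{self.name} (person)"


class Student(Person):
    """Subclass: inherits name from Person and adds its own data."""

    def __init__(self, name, programme):
        super().__init__(name)  # reuse the superclass data, no second copy
        self.programme = programme

    def describe(self):
        # Same function name, different result for a Student object.
        return f"{self.name} (student, {self.programme})"


p = Person("Asha")
s = Student("Ravi", "M.Tech")

# One object, two views: s is a Student and also a Person.
views = (isinstance(s, Student), isinstance(s, Person))
descriptions = [obj.describe() for obj in (p, s)]
```

The single object s appears in both the superclass view and the subclass view, which mirrors the statement that the object is stored only once.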

The general rules of mapping an EER model into an OODB schema are:

Step 1 Map each entity into a class of OODB.

Step 2 Map disjoint generalization into Inherit and Method of OODB

Step 3 Map isa into inherit operation in subclass of OODB.

Step 4 Map categorization into multiple inheritance of subclass of OODB. One inheritance
overrides another inheritance if there is a conflict.

Step 5 Map cardinality into bi-directional Association Attribute of OODB schema.

Tutorial question:
An Extended Entity Relationship model has been designed for the database. Show the object-
oriented database schema for the implementation of the EER model design. (Classes (50%),

[Figure: EER diagram for the tutorial question. Labels recoverable from the figure: Boat_Person,
Birth_date, Name, Engineer, Accountant, Doctor, Professional, Refugee, Non-refugee, Status,
Detain, Detain_Date, Detention_ (truncated), Country, Country_name, Resettle refugee,
Waiting_refugee, Stay, Stay_date, Reside, Reside_date, Depature_center_name, Open_center,
Open_center_name; generalization markers c, d and isa; cardinalities 1 and n.]

Introduction to Object Oriented Databases


This chapter introduces the basic concepts of object oriented databases. Its purpose is to help
you decide whether you should investigate such products further, and to understand how they
might work.

The Main Features

Object Oriented Databases generally provide persistent storage for objects. In addition, they
may provide one or more of the following: a query language; indexing; transaction support with
rollback and commit; the possibility of distributing objects transparently over many servers.
These features are described in the following sections. Some database vendors may charge
separately for some of these features.

Some Object Oriented Databases also come with extra tools such as visual schema designers,
Integrated Development Environments and debuggers.


Unlike a relational database, which usually works with SQL, an object oriented database works
in the context of a regular programming language such as C++, C or Java. Furthermore, an
object-oriented database may be host-specific, or it may be able to read the same database from
multiple hosts, or even from multiple kinds of host, such as a SPARC server under Solaris 8 and
a PC under Linux. Some object oriented database servers can support heterogeneous clients, so
that the SPARC system and a PC and a Macintosh (for example) might all be accessing the same
database.


With a relational database, you store information explicitly in tables, and then get it back again
later with queries. Although you can use an object oriented database in that way, it's not the only
way.

Consider a computer aided design application in which the user can save and load complex
engineering drawings into memory. With a file-based system, loading a drawing might involve
reading a large external file into memory and creating tens of thousands of objects before the
user can start working. With a relational database the software would run database queries to
create those same objects.

With an object oriented database, the software calls a database function to load the illustration,
but objects are not created in memory until they are needed: instead, they are stored in the
database, and only references are loaded into memory.

When an object is changed, the database silently writes the changes to the database, keeping the
in-database version up to date at all times. When the user presses "save", all the application does
is to commit the current transaction; since the database is already up to date, this is generally
very fast. The code no longer needs to be able to read or write the proprietary save file format,
and may well also run faster.
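
The idea of loading only references and creating objects when they are needed can be sketched as follows. This is a minimal illustration, not the interface of any real object database: ToyStore, ObjectRef, open_drawing and the loads counter are hypothetical names introduced for the example.

```python
class ObjectRef:
    """A reference that fetches its target from the store on first use."""

    def __init__(self, store, oid):
        self._store = store
        self._oid = oid
        self._target = None

    def get(self):
        if self._target is None:          # not in memory yet
            self._target = self._store.load(self._oid)
        return self._target


class ToyStore:
    """Hypothetical stand-in for the database holding the drawing."""

    def __init__(self, records):
        self._records = records
        self.loads = 0                    # counts real object loads

    def load(self, oid):
        self.loads += 1
        return dict(self._records[oid])

    def open_drawing(self):
        # Opening the drawing hands out references only.
        return [ObjectRef(self, oid) for oid in self._records]


store = ToyStore({1: {"shape": "line"}, 2: {"shape": "arc"}, 3: {"shape": "circle"}})
refs = store.open_drawing()
loads_after_open = store.loads            # nothing loaded yet
first_use = refs[1].get()                 # loads object 2 on demand
second_use = refs[1].get()                # already in memory, no new load
```

Opening the drawing costs no object loads; only the objects the user actually touches are brought into memory.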

Relational Database Management System

The most widely used DBMS is the Relational Database Management System (RDBMS). This
system is based on a table structure that stores and manages data. A table is a predefined
category of data made up of rows and columns. The columns are the fields that define the
category of data, and each row holds one complete record. Each table has a key field (the
primary key) whose value uniquely identifies each row in the table. Key fields are also used to
create relationships between tables: a field in one table (a foreign key) refers to the key field of
another table, which connects the data. This type of organization allows data to be stored in
smaller units and then connected through association. Business rules are applied to the tables
and fields to ensure the data is accessed and used properly. SQL (Structured Query Language) is
the language used to interact with the tables and to use the data in ways that are meaningful to
the business.
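
The role of key fields can be tried out with Python's built-in sqlite3 module. The sketch below is illustrative only: the tables dept and emp and their columns are assumptions made for the example, and SQLite is used simply because it needs no server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request

conn.executescript("""
CREATE TABLE dept (
    dept_id INTEGER PRIMARY KEY,          -- key field: identifies each row
    name    TEXT NOT NULL
);
CREATE TABLE emp (
    emp_id  INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    dept_id INTEGER NOT NULL REFERENCES dept(dept_id)  -- relationship to dept
);
INSERT INTO dept VALUES (10, 'Sales'), (20, 'Manufacturing');
INSERT INTO emp  VALUES (1, 'Ann', 10), (2, 'Bob', 20), (3, 'Cy', 10);
""")

# The key fields connect the two tables in a join.
rows = conn.execute("""
    SELECT e.name, d.name
    FROM emp e JOIN dept d ON e.dept_id = d.dept_id
    ORDER BY e.emp_id
""").fetchall()

# A business rule expressed as a constraint: an employee cannot
# belong to a department that does not exist.
try:
    conn.execute("INSERT INTO emp VALUES (4, 'Dee', 99)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
```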

The Object Oriented Database Management System (OODBMS) does not have as high a usage
rate. This type of DBMS provides high performance for companies with large amounts of highly
complex data. OODBMSs incorporate object oriented technology, in which data is seen as an
object. Data is defined as objects and classes (collections of like objects). The data objects use
the concept of inheritance, where the lower classes inherit the data definitions and methods from
the upper classes. The class defines only the data it is associated with. This helps to determine
how the classes of objects relate to each other. Data is accessed in a transparent manner through
intersections of persistent objects.

So why would one choose RDBMS or OODBMS? There is no real right or wrong answer to this
question. The choice made is based on the data to be stored/managed, the type of database
needed and the technology preferences of the company providing the service or company who is
receiving the service. Often the choice is made based on the skill set available and the DBMS
that is already available.

Benefits and Drawback of Each System

Regardless of the preference, each DBMS has its benefits and drawbacks. OODBMSs are
documented as being easy to maintain, as classes and objects can be developed and updated
separately from the system. Performance is also high with OODBMSs, as one can store complex
datasets in their entirety and therefore process data more quickly. Due to the class structure, the
data, as well as the work, can be more easily distributed across networks. A query language is
not strictly necessary, since data can be reached by transparently accessing the objects. No keys
are needed to identify the datasets or to create connections between them, because objects are
identified by their OIDs. Many developers find the programming time to be reduced with an
OODBMS since objects inherit the characteristics of the classes. The use of classes also helps to
ensure the integrity of the data. In addition, a class is reusable for the existing database and
other databases, so that it can be distributed more easily across networks.

On the other hand, Relational Database Management Systems (RDBMSs) are much easier to
learn and create. Many of the available systems have a GUI that makes the technology available
to people who are not highly technical. Since the database is not dependent on a complex
schema, increasing the capability and size is relatively easy. Ad-hoc queries can also be added
using Structured Query Language (SQL) once the production database has been completed. In
addition, the data can be used independently, as the tables are set up as separate entities rather
than grouped in classes.

Both systems have their drawbacks as well. OODBMSs can be somewhat complex and difficult
to learn due to the object oriented technology. When a change needs to be made to the database,
the entire schema must be updated. Queries are dependent upon the system and therefore must
be predetermined in the planning stages. Adding queries to the database after the fact is a
difficult task.

While RDBMSs are easier to use, they are limited to simple data types and therefore do not
support more complex types such as multimedia. In addition, if the data that needs to be
processed is complicated and extensive, performance may suffer. While there are lots of
solutions within this family of database systems, they may not be robust enough to handle larger
scale projects.


Both types of database technologies provide a solution for the right type of project. The
choice to use one vs. the other depends on the type of project, skills of the development group
and the technology available for the company who is looking for a DBMS.

Distributed Databases: Basic concepts, options for distributing a database, distributed DBMS.

Distributed Database Architecture

A distributed database system allows applications to access data from local and remote
databases. In a homogenous distributed database system, each database is an Oracle Database.
In a heterogeneous distributed database system, at least one of the databases is not an Oracle
Database. Distributed databases use client/server architecture to process information requests.

It contains the following database systems:

 Homogenous Distributed Database Systems

 Heterogeneous Distributed Database Systems

 Client/Server Database Architecture

Homogenous Distributed Database Systems


A homogenous distributed database system is a network of two or more Oracle Databases that
reside on one or more machines. The figure below illustrates a distributed system that connects three
databases: hq, mfg, and sales. An application can simultaneously access or modify the data in
several databases in a single distributed environment. For example, a single query from a
Manufacturing client on local database mfg can retrieve joined data from the products table on
the local database and the dept table on the remote hq database.

For a client application, the location and platform of the databases are transparent. You can also
create synonyms for remote objects in the distributed system so that users can access them with
the same syntax as local objects. For example, if you are connected to database mfg but want to
access data on database hq, creating a synonym on mfg for the remote dept table enables you to
issue this query:


Homogeneous Distributed Database


An Oracle Database distributed database system can incorporate Oracle Databases of different
versions. All supported releases of Oracle Database can participate in a distributed database
system. Nevertheless, the applications that work with the distributed database must understand
the functionality that is available at each node in the system. A distributed database application
cannot expect an Oracle7 database to understand the SQL extensions that are only available with
Oracle Database.

Distributed Databases versus Distributed Processing

The terms distributed database and distributed processing are closely related, yet have
distinct meanings. Their definitions are as follows:

 Distributed database

A set of databases in a distributed system that can appear to applications as a single data
source.

 Distributed processing

The operations that occur when an application distributes its tasks among different
computers in a network. For example, a database application typically distributes front-
end presentation tasks to client computers and allows a back-end database server to
manage shared access to a database. Consequently, a distributed database application
processing system is more commonly referred to as a client/server database application
system.

Distributed Databases versus Replicated Databases

The terms distributed database system and database replication are related, yet distinct. In
a pure (that is, not replicated) distributed database, the system manages a single copy of all data
and supporting database objects. Typically, distributed database applications use distributed
transactions to access both local and remote data and modify the global database in real-time.

The term replication refers to the operation of copying and maintaining database objects in
multiple databases belonging to a distributed system. While replication relies on distributed
database technology, database replication offers applications benefits that are not possible within
a pure distributed database environment.


Most commonly, replication is used to improve local database performance and protect the
availability of applications because alternate data access options exist. For example, an
application may normally access a local database rather than a remote server to minimize
network traffic and achieve maximum performance. Furthermore, the application can continue to
function if the local server experiences a failure, but other servers with replicated data remain
accessible.

Heterogeneous Distributed Database Systems

In a heterogeneous distributed database system, at least one of the databases is a non-Oracle
Database system. To the application, the heterogeneous distributed database system appears as a
single, local, Oracle Database. The local Oracle Database server hides the distribution and
heterogeneity of the data.

The Oracle Database server accesses the non-Oracle Database system using Oracle
Heterogeneous Services in conjunction with an agent. If you access the non-Oracle Database
data store using an Oracle Transparent Gateway, then the agent is a system-specific application.
For example, if you include a Sybase database in an Oracle Database distributed system, then
you need to obtain a Sybase-specific transparent gateway so that the Oracle Database in the
system can communicate with it.

Alternatively, you can use generic connectivity to access non-Oracle Database data stores so
long as the non-Oracle Database system supports the ODBC or OLE DB protocols.

Heterogeneous Services

Heterogeneous Services (HS) is an integrated component within the Oracle Database server and
the enabling technology for the current suite of Oracle Transparent Gateway products. HS
provides the common architecture and administration mechanisms for Oracle Database gateway
products and other heterogeneous access facilities. Also, it provides upwardly compatible
functionality for users of most of the earlier Oracle Transparent Gateway releases.

Transparent Gateway Agents

For each non-Oracle Database system that you access, Heterogeneous Services can use a
transparent gateway agent to interface with the specified non-Oracle Database system. The agent
is specific to the non-Oracle Database system, so each type of system requires a different agent.


The transparent gateway agent facilitates communication between Oracle Database and non-
Oracle Database systems and uses the Heterogeneous Services component in the Oracle
Database server. The agent executes SQL and transactional requests at the non-Oracle Database
system on behalf of the Oracle Database server.

Generic Connectivity

Generic connectivity enables you to connect to non-Oracle Database data stores by using either a
Heterogeneous Services ODBC agent or a Heterogeneous Services OLE DB agent. Both are
included with your Oracle product as a standard feature. Any data source compatible with the
ODBC or OLE DB standards can be accessed using a generic connectivity agent.

The advantage to generic connectivity is that it may not be required for you to purchase and
configure a separate system-specific agent. You use an ODBC or OLE DB driver that can
interface with the agent. However, some data access features are only available with transparent
gateway agents.

Client/Server Database Architecture

A database server is the Oracle software managing a database, and a client is an application that
requests information from a server. Each computer in a network is a node that can host one or
more databases. Each node in a distributed database system can act as a client, a server, or both,
depending on the situation.

An Oracle Database Distributed Database System


A client can connect directly or indirectly to a database server. A direct connection occurs when
a client connects to a server and accesses information from a database contained on that server.
For example, if you connect to the hq database and access the dept table on this database as
in the figure above, you can issue the following:


This query is direct because you are not accessing an object on a remote database.

In contrast, an indirect connection occurs when a client connects to a server and then accesses
information contained in a database on a different server. For example, if you connect to
the hq database but access the emp table on the remote sales database as in the figure above, you
can issue the following:

SELECT * FROM emp@sales;
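
The difference between the two kinds of access can be tried out locally with Python's sqlite3 module. This is only a rough analogy and rests on an assumption: SQLite has no database links, so its ATTACH DATABASE statement stands in for the link, and a table in the second database is written as sales.emp instead of Oracle's emp@sales. The table contents are made up for the example.

```python
import sqlite3

hq = sqlite3.connect(":memory:")  # the database we are connected to
# Assumption: ATTACH plays the role of a link from hq to sales.
# It is issued first, before any transaction has started.
hq.execute("ATTACH DATABASE ':memory:' AS sales")

hq.execute("CREATE TABLE dept (deptno INTEGER PRIMARY KEY, dname TEXT)")
hq.execute("INSERT INTO dept VALUES (10, 'Accounting')")
hq.execute(
    "CREATE TABLE sales.emp (empno INTEGER PRIMARY KEY, ename TEXT, deptno INTEGER)"
)
hq.execute("INSERT INTO sales.emp VALUES (1, 'Jane', 10)")
hq.commit()

# Direct access: the table lives in the database we are connected to.
direct = hq.execute("SELECT dname FROM dept").fetchall()

# Indirect access: the table lives in the other database.
indirect = hq.execute("SELECT ename FROM sales.emp").fetchall()

# One query can join local and remote data, as in the mfg/hq example.
joined = hq.execute(
    "SELECT e.ename, d.dname "
    "FROM sales.emp e JOIN dept d ON e.deptno = d.deptno"
).fetchall()
```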

What Are Database Links?

A database link is a pointer that defines a one-way communication path from an Oracle Database
server to another database server. The link pointer is actually defined as an entry in a data
dictionary table. To access the link, you must be connected to the local database that contains the
data dictionary entry.

A database link connection is one-way in the sense that a client connected to local database A
can use a link stored in database A to access information in remote database B, but users
connected to database B cannot use the same link to access data in database A. If local users on
database B want to access data on database A, then they must define a link that is stored in the
data dictionary of database B.

A database link connection allows local users to access data on a remote database. For this
connection to occur, each database in the distributed system must have a unique global database
name in the network domain. The global database name uniquely identifies a database server in
a distributed system.

Database Link


One principal difference among database links is the way that connections to a remote database
occur. Users access a remote database through the following types of links:

Type of Link         Description

Connected user link  Users connect as themselves, which means that they must have an account
                     on the remote database with the same username as their account on the
                     local database.

Fixed user link      Users connect using the username and password referenced in the link. For
                     example, if Jane uses a fixed user link that connects to the hq database
                     with the username and password scott/tiger, then she connects as scott.
                     Jane has all the privileges in hq granted to scott directly, and all the
                     default roles that scott has been granted in the hq database.

Current user link    A user connects as a global user. A local user can connect as a global
                     user in the context of a stored procedure, without storing the global
                     user's password in a link definition. For example, Jane can access a
                     procedure that Scott wrote, accessing Scott's account and Scott's schema
                     on the hq database. Current user links are an aspect of Oracle Advanced
                     Security.


What Are Shared Database Links?

A shared database link is a link between a local server process and the remote database. The link
is shared because multiple client processes can use the same link simultaneously.

When a local database is connected to a remote database through a database link, either database
can run in dedicated or shared server mode. The following table illustrates the possibilities:

Local Database Mode    Remote Database Mode
Dedicated              Dedicated
Dedicated              Shared server
Shared server          Dedicated
Shared server          Shared server

A shared database link can exist in any of these four configurations. Shared links differ from
standard database links in the following ways:

 Different users accessing the same schema object through a database link can share a
network connection.

 When a user needs to establish a connection to a remote server from a particular server
process, the process can reuse connections already established to the remote server. The
reuse of the connection can occur if the connection was established on the same server
process with the same database link, possibly in a different session. In a non shared
database link, a connection is not shared across multiple sessions.

 When you use a shared database link in a shared server configuration, a network
connection is established directly out of the shared server process in the local server.

Why Use Database Links?

The great advantage of database links is that they allow users to access another user's objects in a
remote database so that they are bounded by the privilege set of the object owner. In other words,
a local user can access a link to a remote database without having to be a user on the remote
database.

Global Database Names in Database Links

To understand how a database link works, you must first understand what a global database
name is. Each database in a distributed database is uniquely identified by its global database
name. The database forms a global database name by prefixing the database network domain,
specified by the DB_DOMAIN initialization parameter at database creation, with the individual
database name, specified by the DB_NAME initialization parameter.
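
As a small worked example, the rule can be written as a function. This is a sketch under assumptions: the function names are hypothetical, and the names mfg, division3, acme_tools and com are the ones used in the example that follows in the text. The individual name and the domain are joined with a dot.

```python
def domain_from_path(nodes_towards_root):
    """Build a network domain from tree nodes listed leaf-side first."""
    return ".".join(nodes_towards_root)


def global_database_name(db_name, db_domain):
    """Individual database name followed by the network domain."""
    return f"{db_name}.{db_domain}"


# Illustrative values; in a real system they come from DB_NAME and DB_DOMAIN.
domain = domain_from_path(["division3", "acme_tools", "com"])
name = global_database_name("mfg", domain)
```

Here name evaluates to mfg.division3.acme_tools.com. Two databases with the same individual name still receive different global names as long as their domains differ.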

Hierarchical Arrangement of Networked Databases



The name of a database is formed by starting at the leaf of the tree and following a path to the
root. For example, the mfg database is in division3 of the acme_tools branch of the com domain.
The global database name for mfg is created by concatenating the nodes in the tree as follows:



While several databases can share an individual name, each database must have a unique global
database name. For example, the network domain and each contain a sales database. The
global database naming system distinguishes the sales database in the Americas division from
the sales database in the Europe division as follows:



Names for Database Links

Typically, a database link has the same name as the global database name of the remote database
that it references. For example, if the global database name of a database is,
then the database link is also called

Connected User Database Links

Connected user links have no connect string associated with them. The advantage of a
connected user link is that a user referencing the link connects to the remote database as the same
user. Furthermore, because no connect string is associated with the link, no password is stored in
clear text in the data dictionary.

Connected user links have some disadvantages. Because these links require users to have
accounts and privileges on the remote databases to which they are attempting to connect, they
require more privilege administration for administrators. Also, giving users more privileges than
they need violates the fundamental security concept of least privilege: users should only be given
the privileges they need to perform their jobs.

Fixed User Database Links

A benefit of a fixed user link is that it connects a user in a primary database to a remote database
with the security context of the user specified in the connect string. For example, local
user Joe can create a public database link in Joe’s schema that specifies the fixed user Scott with
password tiger. If Jane uses the fixed user link in a query, then Jane is the user on the local
database, but she connects to the remote database as scott/tiger.


Data warehousing: Introduction Basic concepts, data warehouse architecture, data
characteristics, reconciled data layer data transformations, derived data layer user interface.

A typical architecture of a data warehouse is shown below:

Fig : Typical architecture of a data warehouse

Each component and the tasks it performs are explained below:


The data in a data warehouse comes from the operational systems of the organization as well as
from other external sources. These are collectively referred to as source systems. The data
extracted from source systems is stored in an area called the data staging area, where the data is
cleaned, transformed, combined and deduplicated to prepare it for use in the data warehouse.
The data staging area is generally a collection of machines where simple activities like sorting
and sequential processing take place. The data staging area does not provide any query or
presentation services. As soon as a system provides query or presentation services, it is
categorized as a presentation server. A presentation server is the target machine on which the
data loaded from the data staging area is organized and stored for direct querying by end users,
report writers and other applications. The three different kinds of systems that are required for a
data warehouse are:

1. Source Systems
2. Data Staging Area
3. Presentation servers

The data travels from source systems to presentation servers via the data staging area. The entire
process is popularly known as ETL (extract, transform, and load) or ETT (extract, transform, and
transfer). Oracle’s ETL tool is called Oracle Warehouse Builder (OWB) and MS SQL Server’s
ETL tool is called Data Transformation Services (DTS).
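
The flow from source systems through the staging area to a presentation server can be sketched in a few lines of Python. This is a toy illustration, not how OWB or DTS work: the source rows, the table customer_dim and the functions transform and load are all assumptions made for the example.

```python
import sqlite3

# Extract: hypothetical rows pulled from the source systems.
source_rows = [
    {"cust_id": "1", "name": " Ann ", "city": "delhi"},
    {"cust_id": "2", "name": "Bob", "city": "Mumbai"},
    {"cust_id": "1", "name": "Ann", "city": "Delhi"},  # duplicate of customer 1
    {"cust_id": "", "name": "??", "city": ""},         # unusable record
]


def transform(rows):
    """Staging area work: clean, standardise and deduplicate."""
    staged = {}
    for row in rows:
        if not row["cust_id"].strip():
            continue  # cleaning: drop rows without a key
        cust_id = int(row["cust_id"])
        # Keyed by cust_id, so a later duplicate replaces the earlier row.
        staged[cust_id] = (cust_id, row["name"].strip(), row["city"].strip().title())
    return sorted(staged.values())


def load(conn, rows):
    """Load the prepared rows into the presentation server."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_dim "
        "(cust_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
    )
    conn.executemany("INSERT INTO customer_dim VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
staged = transform(source_rows)
load(warehouse, staged)
loaded = warehouse.execute("SELECT * FROM customer_dim ORDER BY cust_id").fetchall()
```

Only the load step touches the queryable database; the staging step works on plain in-memory data, in line with the rule that the staging area offers no query services.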

The data for the data warehouse is supplied from the following sources:

(i) The data from the mainframe systems in the traditional network and hierarchical
(ii) Data can also come from the relational DBMS like Oracle, Informix.
(iii) In addition to these internal data, operational data also includes external data
obtained from commercial databases and databases associated with supplier and

The load manager performs all the operations associated with extraction and loading data into the
data warehouse. These operations include simple transformations of the data to prepare the data
for entry into the warehouse. The size and complexity of this component will vary between data
warehouses and may be constructed using a combination of vendor data loading tools and
custom built programs.



The warehouse manager performs all the operations associated with the management of data in
the warehouse. This component is built using vendor data management tools and custom built
programs. The operations performed by warehouse manager include:
(i) Analysis of data to ensure consistency
(ii) Transformation and merging of the source data from temporary storage into data
warehouse tables
(iii) Creation of indexes and views on the base tables
(iv) Denormalization
(v) Generation of aggregations
(vi) Backing up and archiving of data
In certain situations, the warehouse manager also generates query profiles to determine which
indexes and aggregations are appropriate.
Data Warehousing

Table : Comparison between OLTP and data warehouse databases

OLTP                                     Data warehouse
Transaction oriented                     Business process oriented
Thousands of users                       Few users (typically under 100)
Generally small (MB up to several GB)    Large (hundreds of GB up to several TB)
Current data                             Historical data
Normalized data                          Denormalized data
(many tables, few columns per table)     (few tables, many columns per table)
Continuous updates                       Batch updates*
Simple to complex queries                Usually very complex queries


[Figure: three operational applications, each with a feeder to a source database (DB1, DB2,
DB3); an Extract step from each database into the Staging Area; Report Generators, OLAP,
Ad Hoc Query Tools and Data Mining on the access side.]

Fig : Basic data warehouse architecture

Core Requirements for Data Warehousing

1. DWs are organized around subject areas.

2. DWs should have some integration capability.

3. The data is considered to be nonvolatile and should be mass loaded.

4. Data tends to exist at multiple levels of granularity. Most important, the data tends to be of a
historical nature, with potentially high time variance.

5. The DW should be flexible enough to meet changing requirements rapidly.

6. The DW should have a capability for rewriting history, that is, allowing for “what-if” analysis.

7. A usable DW user interface should be selected.

8. Data should be either centralized or distributed physically.

Logical Design

[Figure: star schema. Labels recoverable from the figure: a central Fact Table with «fk» CustID,
«fk» ShipDateID, «fk» BindID and «dd» JobID; dimension keys «pk» CustId, «pk» ShipDateID
(Ship Calendar) and «pk» BindId (Bind Style); attribute labels CustType, City, State Province,
Country, Ship Date and Ship Month; one-to-many (1 to *) links from the dimensions to the fact
table.]

Fig : Example star schema for a data warehouse
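
A star schema of this shape can be written down and queried with Python's sqlite3 module. The sketch below is a guess at a reasonable layout based on the labels in the figure, not a reproduction of it: the table names, the quantity measure and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one primary key each, descriptive attributes.
CREATE TABLE customer (
    cust_id INTEGER PRIMARY KEY, cust_type TEXT, city TEXT,
    state_province TEXT, country TEXT
);
CREATE TABLE ship_calendar (
    ship_date_id INTEGER PRIMARY KEY, ship_date TEXT, ship_month TEXT
);
CREATE TABLE bind_style (
    bind_id INTEGER PRIMARY KEY, bind_desc TEXT
);
-- Fact table: foreign keys to every dimension plus the measures.
CREATE TABLE job_fact (
    job_id       INTEGER,   -- degenerate dimension (dd JobID in the figure)
    cust_id      INTEGER REFERENCES customer(cust_id),
    ship_date_id INTEGER REFERENCES ship_calendar(ship_date_id),
    bind_id      INTEGER REFERENCES bind_style(bind_id),
    quantity     INTEGER    -- hypothetical measure, not in the figure
);
INSERT INTO customer VALUES
    (1, 'Retail', 'Delhi', 'Delhi', 'India'),
    (2, 'Wholesale', 'Austin', 'Texas', 'USA');
INSERT INTO ship_calendar VALUES (1, '2024-01-15', '2024-01'), (2, '2024-02-03', '2024-02');
INSERT INTO bind_style VALUES (1, 'Hardcover'), (2, 'Paperback');
INSERT INTO job_fact VALUES
    (100, 1, 1, 1, 500), (101, 1, 2, 2, 200),
    (102, 2, 1, 2, 300), (103, 2, 1, 1, 50);
""")

# A typical warehouse query: join the fact table to its dimensions
# and aggregate the measure.
totals = conn.execute("""
    SELECT c.country, s.ship_month, SUM(f.quantity)
    FROM job_fact f
    JOIN customer c      ON f.cust_id = c.cust_id
    JOIN ship_calendar s ON f.ship_date_id = s.ship_date_id
    GROUP BY c.country, s.ship_month
    ORDER BY c.country, s.ship_month
""").fetchall()
```

In a snowflake schema the dimension tables themselves would be normalized further, for example by moving the country or the ship month into tables of their own.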


[Figure: snowflake schema. Labels recoverable from the figure: Fact Table; State Province;
Cust Type; Ship Day of Week, Ship Date, Ship Month, Ship Quarter, Ship Year; Bind Style,
Bind Category.]

Fig: Example snow flake schema for a data warehouse

The query manager performs all operations associated with management of user queries. This
component is usually constructed using vendor end-user access tools, data warehousing
monitoring tools, database facilities and custom built programs. The complexity of a query
manager is determined by facilities provided by the end-user access tools and database.



The detailed data area of the warehouse stores all the detailed data in the database schema. In most
cases the detailed data is not kept online but is aggregated to the next level of detail. However,
detailed data is added regularly to the warehouse to supplement the aggregated data.


The summary area of the data warehouse stores all the predefined lightly and highly summarized
(aggregated) data generated by the warehouse manager. This area is transient, as it is subject to
change on an ongoing basis in response to changing query profiles. The purpose of the summarized
information is to speed up query performance. The summarized data is updated continuously as new
data is loaded into the warehouse.
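
To make the idea of a summary table concrete, here is a minimal Python sketch using the standard sqlite3 module. It derives a monthly summary from detailed rows, which is roughly what a warehouse manager would automate. The table and column names (sales_detail, sales_monthly_summary and so on) are hypothetical and do not come from these notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_detail (sale_date TEXT, product TEXT, amount REAL);
    INSERT INTO sales_detail VALUES
        ('2024-01-05', 'book', 10.0),
        ('2024-01-20', 'book', 15.0),
        ('2024-02-03', 'book', 7.5),
        ('2024-02-10', 'pen', 2.5);

    -- Lightly summarized data: one row per month and product.
    CREATE TABLE sales_monthly_summary AS
        SELECT substr(sale_date, 1, 7) AS month,
               product,
               SUM(amount) AS total_amount,
               COUNT(*) AS sale_count
        FROM sales_detail
        GROUP BY substr(sale_date, 1, 7), product;
""")

summary_rows = conn.execute(
    "SELECT month, product, total_amount, sale_count "
    "FROM sales_monthly_summary ORDER BY month, product"
).fetchall()
```

Queries that only need monthly totals can read the small summary table instead of scanning every detailed row, which is where the speed-up comes from.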


The archive and backup area of the warehouse stores detailed and summarized data for the purpose of
archiving and backup. The data is transferred to storage archives such as magnetic tapes or optical disks.

The data warehouse also stores all the metadata (data about data) definitions used by all
processes in the warehouse. Metadata is used for a variety of purposes, including:
(i) The extraction and loading process – metadata is used to map data sources to a
common view of information within the warehouse.
(ii) The warehouse management process – metadata is used to automate the
production of summary tables.
(iii) The query management process – metadata is used to direct a query to the
most appropriate data source.
The structure of the metadata differs in each process, because the purpose is different.


The principal purpose of a data warehouse is to provide information to business managers for
strategic decision-making. These users interact with the warehouse using end-user access tools.
Examples of end-user access tools include:
(i) Reporting and Query Tools
(ii) Application Development Tools
(iii) Executive Information Systems Tools


(iv) Online Analytical Processing Tools

(v) Data Mining Tools


In this section we discuss the four major processes of the data warehouse. They are
extract (take data from the operational systems and bring it to the data warehouse), transform (put
the data into the internal format and structure of the data warehouse), cleanse (make sure it is of
sufficient quality to be used for decision making) and load (put the cleansed data into the data
warehouse). The four processes from extraction through loading are often referred to collectively as
Data Staging.

Some of the data elements in the operational database can reasonably be expected to be useful
in decision making, but others are of less value for that purpose. For this reason, it is
necessary to extract the relevant data from the operational database before bringing it into the data
warehouse. Many commercial tools are available to help with the extraction process. Data
Junction is one such commercial product. The user of one of these tools typically has an
easy-to-use windowed interface by which to specify the following:

(i) Which files and tables are to be accessed in the source database?
(ii) Which fields are to be extracted from them? This is often done internally by
SQL Select statement.
(iii) What are those to be called in the resulting database?
(iv) What is the target machine and database format of the output?
(v) On what schedule should the extraction process be repeated?
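
The first three questions above amount to a small extraction specification. The following Python sketch (standard sqlite3 module only) shows one way such a specification could be turned into a SQL SELECT statement and run against a source database. The specification format, the function names and the table names are all hypothetical; commercial tools use their own formats, and scheduling (item v) is left out.

```python
import sqlite3

# Hypothetical specification: which table, which fields, and what the
# fields should be called in the resulting database.
EXTRACT_SPEC = {
    "source_table": "employee",
    "fields": {"EMP_NAME": "employee_name", "DEPT": "department"},
    "target_table": "stg_employee",
}

def build_extract_query(spec):
    """Build the SELECT statement that performs the extraction."""
    cols = ", ".join(f"{src} AS {dst}" for src, dst in spec["fields"].items())
    table = spec["source_table"]
    return f"SELECT {cols} FROM {table}"

def run_extract(source, target, spec):
    """Copy the selected fields from the source into the target database."""
    rows = source.execute(build_extract_query(spec)).fetchall()
    names = list(spec["fields"].values())
    table = spec["target_table"]
    column_list = ", ".join(names)
    placeholders = ", ".join("?" for _ in names)
    target.execute(f"CREATE TABLE {table} ({column_list})")
    target.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    return len(rows)

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE employee (EMP_NAME TEXT, DEPT TEXT, SALARY REAL)")
source.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                   [("Asha", "Sales", 100.0), ("Ravi", "HR", 90.0)])
target = sqlite3.connect(":memory:")

extracted_count = run_extract(source, target, EXTRACT_SPEC)
staged_rows = target.execute(
    "SELECT employee_name, department FROM stg_employee ORDER BY employee_name"
).fetchall()
```

Note that SALARY is deliberately not extracted: only the fields named in the specification reach the staging table.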


Operational databases can be developed according to any set of priorities, and these priorities keep
changing with the requirements. Therefore those who develop a data warehouse based on these databases
are typically faced with inconsistency among their data sources. The transformation process deals
with rectifying any such inconsistency.


One of the most common transformation issues is ‘Attribute Naming Inconsistency’. It is
common for a given data element to be referred to by different data names in different
databases. Employee Name may be EMP_NAME in one database and ENAME in another. Thus
one set of data names is picked and used consistently in the data warehouse. Once all the data
elements have the right names, they must be converted to common formats. The conversion may
encompass the following:

(i) Characters may have to be converted from ASCII to EBCDIC or vice versa.
(ii) Mixed-case text may be converted to all uppercase for consistency.
(iii) Numerical data must be converted into a common format.
(iv) Data formats have to be standardized.
(v) Measurements may have to be converted (for example, Rs to $).
(vi) Coded data (Male/Female, M/F) must be converted into a common format.
All these transformation activities can be automated, and many commercial products are available to
perform the tasks. DataMAPPER from Applied Database Technologies is one such
comprehensive tool.
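
A few of the conversions listed above can be sketched in plain Python. The example below resolves attribute naming inconsistency, converts text to uppercase and maps coded data to a common format. The mapping tables NAME_MAP and GENDER_CODES and the function transform are hypothetical names chosen for this illustration; they do not describe how any commercial tool works.

```python
# Hypothetical mapping from source-specific names to warehouse names.
NAME_MAP = {
    "EMP_NAME": "employee_name",
    "ENAME": "employee_name",
    "SEX": "gender",
    "GENDER": "gender",
}

# Hypothetical common format for coded data: a single letter.
GENDER_CODES = {"M": "M", "MALE": "M", "F": "F", "FEMALE": "F"}

def transform(record):
    """Return a copy of record using warehouse names and common formats."""
    out = {}
    for field, value in record.items():
        name = NAME_MAP.get(field, field.lower())   # consistent data names
        if isinstance(value, str):
            value = value.strip().upper()           # mixed text to uppercase
        if name == "gender":
            value = GENDER_CODES.get(value, value)  # coded data to one format
        out[name] = value
    return out

from_db1 = transform({"EMP_NAME": " Asha Rao ", "SEX": "Female"})
from_db2 = transform({"ENAME": "ravi", "GENDER": "m"})
```

After transformation, records from both source databases share the same field names and value formats, so they can be loaded into one warehouse table.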

Information quality is the key consideration in determining the value of the information. The
developer of the data warehouse is not usually in a position to change the quality of its
underlying historic data, though a data warehousing project can put a spotlight on data quality
issues and lead to improvements for the future. It is, therefore, usually necessary to go through
the data entered into the data warehouse and make it as error free as possible. This process is
known as Data Cleansing.

Data Cleansing must deal with many types of possible errors. These include missing data and
incorrect data at one source, and inconsistent data and conflicting data when two or more sources
are involved. There are several algorithms for cleaning the data, which will be discussed in the
coming lecture notes.
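
Before any cleaning algorithm can be applied, the errors have to be found. The short Python sketch below only detects two of the error types named above: missing data within one source and conflicting data between two sources. It is an illustrative assumption about how such a check could look, not one of the cleansing algorithms referred to in these notes; the function and field names are hypothetical.

```python
def find_quality_issues(source_a, source_b, required_fields):
    """Return (missing, conflicts) for records keyed by an identifier."""
    missing = []
    conflicts = []
    # Missing data: a required field is absent or empty in one source.
    for label, source in (("A", source_a), ("B", source_b)):
        for key, record in source.items():
            for field in required_fields:
                if record.get(field) in (None, ""):
                    missing.append((label, key, field))
    # Conflicting data: both sources have a value, but the values differ.
    for key in sorted(source_a.keys() & source_b.keys()):
        for field in required_fields:
            a = source_a[key].get(field)
            b = source_b[key].get(field)
            if a not in (None, "") and b not in (None, "") and a != b:
                conflicts.append((key, field, a, b))
    return missing, conflicts

source_a = {1: {"name": "ASHA", "city": "PUNE"}, 2: {"name": "RAVI", "city": ""}}
source_b = {1: {"name": "ASHA", "city": "DELHI"}}
missing, conflicts = find_quality_issues(source_a, source_b, ["name", "city"])
```

Deciding which of two conflicting values is correct usually needs business knowledge, so a check like this can only flag the records for review.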

Loading often implies physical movement of the data from the computer(s) storing the source
database(s) to the computer that will store the data warehouse database, assuming it is different.
This takes place immediately after the extraction phase. The most common channel for data
movement is a high-speed communication link. For example, Oracle Warehouse Builder is a tool from
Oracle that provides features for performing ETL tasks on an Oracle data warehouse.

• Application program interface – An interface engine or library of precompiled
subroutines that enable application programs (such as those written in C or Java) to
interact with the database.

• End-user query processor – A program or utility that allows end users to retrieve data and
generate reports without writing application programs.

• Data definition interface – A program or utility that allows a database administrator to
define or modify the content and structure of the database (for example, add new fields or
redefine data types or relationships).

• Data access and control logic – The system software that controls access to the physical
database and maintains various internal data structures (for example, indices and

• Database – The physical data store (or stores) combined with the schema.

• Schema – A store of data that describes various aspects of the “real” data, including data
types, relationships, indices, content restrictions, and access controls.

• Physical data store – The “real” data as stored on a physical storage medium (for
example, a magnetic disk).

2. What is a database schema? What information does it contain?

A database schema is a store of data that describes the content and structure of the physical data
store (sometimes called metadata—data about data). It contains a variety of information about
data types, relationships, indices, content restrictions, and access controls.


3. Why have databases become the preferred method of storing data used by an information system?

Databases are a common point of access, management, and control. They allow data to be
managed as an enterprise-wide resource while providing simultaneous access to many different
users and application programs. They solve many of the problems associated with separately
maintained data stores, including redundancy, inconsistent security, and inconsistent data access

4. List four different types of database models and DBMSs. Which are in common use today?

The four database models are hierarchical, network (CODASYL), relational, and object-oriented.
Hierarchical and network models are technologies of the 1960s and 1970s and are rarely found
today. The relational model was developed in the 1970s and widely deployed in the 1980s and
1990s. It is currently the predominant database model. The object-oriented database model was
first developed in the 1990s and is still being developed today. It was once expected to slowly
replace the relational model, but it has not done so; relational systems, often extended with
object features, remain the ones in common use.

5. With respect to relational databases, briefly define the terms row and field.

Row – The portion of a table containing data that describes one entity, relationship, or object.

Field – The portion of a table (a column) containing data that describes the same fact about all
entities, relationships, or objects in the table.

6. What is a primary key? Are duplicate primary keys allowed? Why or why not?

A primary key is a field or set of fields, the values of which uniquely identify a row of a table.
Because primary keys must uniquely identify a row, duplicate key values aren’t allowed.
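
The uniqueness rule can be observed directly. In this small Python sketch (standard sqlite3 module; the customer table and the helper try_insert are hypothetical), the DBMS rejects a second row with an existing primary key value and accepts a row with a new one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Asha')")

def try_insert(cust_id, name):
    """Return True if the row was stored, False if the DBMS rejected it."""
    try:
        conn.execute("INSERT INTO customer VALUES (?, ?)", (cust_id, name))
        return True
    except sqlite3.IntegrityError:
        # Raised when the primary key value already exists.
        return False

duplicate_accepted = try_insert(1, "Ravi")   # key 1 is already in use
new_key_accepted = try_insert(2, "Ravi")     # key 2 is new
row_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```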

7. What is the difference between a natural key and an invented key? Which type is most
commonly used in business information processing?

A natural key is a naturally occurring attribute of or fact about something represented in a
database (for example, a human fingerprint or the atomic weight of an element). An invented key
is one that is assigned by a system (for example, a social security or credit card number). Most
keys used in business information processing are invented.

8. What is a foreign key? Why are foreign keys used or required in a relational database?
Are duplicate foreign key values allowed? Why or why not?

A foreign key is a field value (or set of values) stored in one table that also exists as a primary
key value in another table. Foreign keys are used to represent relationships among entities that
are represented as tables. Duplicate foreign key values are generally allowed within the same table:
in a one-to-many relationship, several rows on the “many” side hold the same foreign key value
because they all refer to the same row on the “one” side. Duplicates have to be prevented only when
the relationship is one-to-one. The same foreign key value may also exist in different tables, where
it represents different relationships.

9. Describe the steps used to transform an ERD into a relational database schema.

1. Create a table for each entity type.

2. Choose a primary key for each table.

3. Add foreign keys to represent one-to-many relationships.

4. Create new tables to represent many-to-many relationships.

5. Define referential integrity constraints.

6. Evaluate schema quality and make necessary improvements.

7. Choose appropriate data types and value restrictions for each field.

10. How is an entity on an ERD represented in a relational database?

Each entity on an ERD is represented as a separate table.

11. How is a one-to-many relationship on an ERD represented in a relational database?

A one-to-many relationship is represented by adding the primary key field(s) of the table that
represents the entity participating in the “one” side of the relationship to the table that represents
the entity participating in the “many” side of the relationship.


12. How is a many-to-many relationship on an ERD represented in a relational database?

A many-to-many relationship is represented by constructing a new table that contains the
primary key fields of the tables that represent each participating entity.
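
Answers 11 and 12 can be sketched together. The Python fragment below (standard sqlite3 module) represents a one-to-many relationship by placing the key of the “one” side in the table of the “many” side, and a many-to-many relationship by a new table holding both keys. The entities customer, orders and product are hypothetical examples, not taken from an ERD in these notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT);

    -- One-to-many: each order row carries the key of its one customer.
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        cust_id INTEGER REFERENCES customer(cust_id)
    );

    CREATE TABLE product (product_id INTEGER PRIMARY KEY, title TEXT);

    -- Many-to-many: a new table holds the keys of both participants.
    CREATE TABLE order_product (
        order_id INTEGER REFERENCES orders(order_id),
        product_id INTEGER REFERENCES product(product_id),
        PRIMARY KEY (order_id, product_id)
    );

    INSERT INTO customer VALUES (1, 'Asha');
    INSERT INTO orders VALUES (10, 1), (11, 1);
    INSERT INTO product VALUES (100, 'Pen'), (101, 'Book');
    INSERT INTO order_product VALUES (10, 100), (10, 101), (11, 100);
""")

orders_for_customer = conn.execute(
    "SELECT order_id FROM orders WHERE cust_id = 1 ORDER BY order_id"
).fetchall()
titles_in_order_10 = conn.execute("""
    SELECT p.title
    FROM order_product op JOIN product p ON p.product_id = op.product_id
    WHERE op.order_id = 10
    ORDER BY p.title
""").fetchall()
```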

13. What is referential integrity? Describe how it is enforced when a new foreign key value is
created, when a row containing a primary key is deleted, and when a primary key value is
changed.
Referential integrity is a content constraint between the values of a foreign key and the values of
the corresponding primary key in another table. The constraint is that values of the foreign key
field(s) must either exist as values of a primary key or must be NULL. When a new row is added, its
foreign key field(s) must therefore already hold a valid value (or NULL). When a row containing the
primary key is deleted, the rows whose foreign key refers to it must also be deleted (or have the
foreign key set to NULL) for the data to maintain referential integrity. A primary key should never
be changed; but in the event that it is, the value of the foreign key must also be changed.
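
The three cases in this answer can be tried out with the standard sqlite3 module in Python. This is a minimal sketch with hypothetical department and employee tables; it assumes the cascading variant of enforcement (ON DELETE CASCADE, ON UPDATE CASCADE), which is only one of the options a DBMS may offer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee (
        emp_id INTEGER PRIMARY KEY,
        dept_id INTEGER REFERENCES department(dept_id)
            ON DELETE CASCADE ON UPDATE CASCADE
    );
    INSERT INTO department VALUES (1, 'Sales');
    INSERT INTO employee VALUES (10, 1);
    INSERT INTO employee VALUES (11, NULL);   -- NULL foreign key is allowed
""")

def insert_employee(emp_id, dept_id):
    """Return True if the row was stored, False if the DBMS rejected it."""
    try:
        conn.execute("INSERT INTO employee VALUES (?, ?)", (emp_id, dept_id))
        return True
    except sqlite3.IntegrityError:
        return False

# Case 1: a new foreign key value must match an existing primary key.
orphan_accepted = insert_employee(12, 99)

# Case 2: changing a primary key value changes the matching foreign keys.
conn.execute("UPDATE department SET dept_id = 2 WHERE dept_id = 1")
dept_after_update = conn.execute(
    "SELECT dept_id FROM employee WHERE emp_id = 10").fetchone()[0]

# Case 3: deleting the primary key row deletes the rows that refer to it.
conn.execute("DELETE FROM department WHERE dept_id = 2")
remaining_employees = conn.execute(
    "SELECT emp_id FROM employee ORDER BY emp_id").fetchall()
```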

14. What types of data (or fields) should never be stored more than once in a relational
database? What types of data (or fields) usually must be stored more than once in a relational
database?
Non-key fields should never be stored more than once.

If a table represents an entity, the primary key values of each entity represented in the table are
redundantly stored (as foreign keys) for every relationship in which the entity participates.

15. What is relational database normalization? Why is a database schema in third normal
form considered to be of higher quality than an unnormalized database schema?

Relational database normalization is a process that increases schema quality by minimizing data
redundancy. A schema with tables in third normal form has less non-key data redundancy than a
schema with unnormalized tables. Less redundancy makes the schema and database contents
easier to maintain over the long term.


16. Describe the process of relational database normalization. Which normal forms rely on
the definition of functional dependency?

The process of normalization modifies the schema and table definitions by successively applying
higher order rules of table construction. The rules each define a normal form, and the normal
forms are numbered one through three. First normal form eliminates repeating groups that are
embedded in tables.

Second and third normal forms are based on a concept called functional dependency: field B is
functionally dependent on field A if each value of A corresponds to exactly one value of B. Second
normal form ensures that every non-key field in a table is functionally dependent on the whole
primary key, not just on part of it. Third normal form ensures that no non-key field is functionally
dependent on any other non-key field.
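
A functional dependency can be tested against sample rows with a few lines of Python. The sketch below is illustrative only: the function holds and the order-line data are hypothetical, and a check on sample rows can refute a dependency but cannot prove that it holds for all possible data.

```python
def holds(rows, determinant, dependent):
    """True if each determinant value maps to exactly one dependent value."""
    seen = {}
    for row in rows:
        key = tuple(row[f] for f in determinant)
        value = tuple(row[f] for f in dependent)
        # setdefault stores the first value seen for this key.
        if seen.setdefault(key, value) != value:
            return False
    return True

# Hypothetical order-line table with primary key (order_id, product_id).
rows = [
    {"order_id": 1, "product_id": "P1", "product_name": "Pen", "qty": 2},
    {"order_id": 1, "product_id": "P2", "product_name": "Book", "qty": 1},
    {"order_id": 2, "product_id": "P1", "product_name": "Pen", "qty": 5},
]

# product_name depends on product_id alone, which is only part of the key.
partial_dependency = holds(rows, ["product_id"], ["product_name"])
# qty is not determined by product_id alone; it needs the whole key.
qty_needs_whole_key = not holds(rows, ["product_id"], ["qty"])
qty_on_whole_key = holds(rows, ["order_id", "product_id"], ["qty"])
```

Because product_name depends on only part of the primary key, this table is not in second normal form; moving product_id and product_name to a separate product table would remove the redundancy.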

17. Describe the steps used to transform a class diagram into an object database schema.

1. Determine which classes require persistent storage.

2. Define persistent classes within the schema.

3. Represent relationships among persistent classes.

4. Choose appropriate data types and value restrictions.

18. What is the difference between a persistent class and a transient class? Provide at least
one example of each class type.

An object of a transient class exists only for the duration of a program execution (for example,
the user interface of the program). An object of a persistent class (for example, a customer object
in a billing system) retains its identity and data content between program executions.

19. What is an object identifier? Why are object identifiers required in an object database?

An object identifier is a key or storage address that uniquely identifies an object within an object-
oriented database. Object identifiers are needed to represent relationships among objects. A
relationship is represented by embedding the object identifier of a participating object in the
other participating object.


20. How is a class on a class diagram represented in an object database?

A class on a class diagram is represented “as is” in an object database. That is, each object of the
class type is stored in the database along with its data content and methods.

21. How is a one-to-many relationship on a class diagram represented in an object database?

The object identifier of each participating object is embedded in the other participating object.
The object on the “one” side of the relationship might have multiple embedded object identifiers
to represent multiple participants on the “many” side of the relationship.

22. How is a many-to-many relationship without attributes represented in an object database?

The object identifier of each participating object is embedded in the other participating object.
The objects on both sides of the relationship might have multiple embedded object identifiers to
represent multiple participants on the other side of the relationship.
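
The embedding of object identifiers described in answers 21 and 22 can be imitated in plain Python. This is only a sketch: the dictionary object_store stands in for an object database, the identifiers come from a simple counter, and the classes Customer and Order are hypothetical. A real ODBMS assigns and manages identifiers itself.

```python
from dataclasses import dataclass, field
from itertools import count

_next_oid = count(1)   # hypothetical identifier generator
object_store = {}      # hypothetical stand-in for the object database

@dataclass
class Customer:
    name: str
    # The "one" side holds many embedded identifiers.
    order_oids: list = field(default_factory=list)
    oid: int = field(default_factory=lambda: next(_next_oid))

@dataclass
class Order:
    total: float
    # The "many" side holds the identifier of its single customer.
    customer_oid: int = 0
    oid: int = field(default_factory=lambda: next(_next_oid))

def place_order(customer, total):
    """Create an order and embed identifiers in both participants."""
    order = Order(total=total, customer_oid=customer.oid)
    customer.order_oids.append(order.oid)
    object_store[order.oid] = order
    return order

asha = Customer("Asha")
object_store[asha.oid] = asha
first = place_order(asha, 25.0)
second = place_order(asha, 7.5)

# Following an embedded identifier retrieves the related object.
totals = [object_store[oid].total for oid in asha.order_oids]
```

For a many-to-many relationship, both classes would hold a list of identifiers instead of only one of them.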

23. What is an association class? How are association classes used to represent many-to-
many relationships in an object database?

An association class is an “artificial” class that is created to represent a many-to-many
relationship among “real” classes. The association class has data members that represent
attributes of the many-to-many relationship. Each “real” class implements a one-to-many
relationship with the association class.

24. Describe the two ways in which a generalization relationship can be represented in an
object database.

Generalization relationships can be represented directly (for example, using the ODL keyword
extends) or indirectly as a set of one-to-one relationships.

25. Does an object database require key fields or attributes? Why or why not?

Key fields aren’t required because they aren’t needed to represent relationships. However, they
are usually included because they are useful for a number of reasons, including guaranteeing
unique object content and searching or sorting database content.


26. Describe the similarities and differences between an ERD and a class diagram that
models the same underlying reality.

Each entity on an ERD corresponds to one class on a class diagram. The one-to-one, one-to-
many, and many-to-many relationships among those classes are the same as those on the ERD.

27. How are classes and relationships on a class diagram represented in a relational database?

A class is represented as a table.

A one-to-many relationship among classes is represented in the same way as a one-to-many
relationship among entities (see answer #11).

A many-to-many relationship among classes is represented in the same way as a many-to-many
relationship among entities (see answer #12). Note that the table that represents the relationship
serves the same purpose as an association class.

28. What is the difference between a primitive data type and a complex data type?

A primitive data type (for example, integer, real, or character) is directly supported (represented)
by the CPU or a programming language. A complex data type (for example, record, linked list,
or object) contains one or more data elements constructed using the primitive data types as
building blocks.

What are the advantages of having an RDBMS provide complex data types?

Providing complex data types in the RDBMS allows a wider range of data to be represented. It
also minimizes compatibility problems that might result from using different programming
languages or hardware.

29. Does an ODBMS need to provide predefined complex data types? Why or why not?

No. A required complex data type can be defined as a new class.

30. Why might all or part of a database need to be replicated in multiple locations?

Database accesses between distant servers and clients must traverse one or more network links.
This can slow the accesses due to propagation delay or network congestion. Access speed can be
increased by placing a database replica close to clients.


31. Briefly describe the following distributed database architectures: replicated database
servers, partitioned database servers, and federated database servers. What are the comparative
advantages of each?

Replicated database servers – An entire database is replicated on multiple servers, and each
server is located near a group of clients. Best performance and fault tolerance for clients because
all data is available from a “nearby” server.

Partitioned database servers – A database is partitioned so that each partition is a database subset
used by a single group of clients. Each partition is located on a separate server, and each server is
located close to the clients that access it. Better performance and less replication traffic than
replicated servers if similar collocated clients use only a subset of database content.

Federated database servers – Data from multiple servers with different data models and/or
DBMSs is pooled by implementing a separate (federated) server that presents a unified view of
the data stored on all the other servers. The federated server constructs answers to client queries
by forwarding requests to other servers and combining their responses for the client. Simplest
and most manageable way to combine data from disparate DBMSs into a single unified data

32. What additional database management complexities are introduced when database
contents are replicated in multiple locations?

Replicated copies are redundant data stores. Thus, any changes to data content must be
redundantly implemented on each copy. Implementing redundant maintenance of data content
requires all servers to periodically exchange database updates.

Object Relational Databases: Basic concepts enhanced SQL, advantages of object relational

Introduction to Object-Relational DBMSs


Several major software companies, including IBM, Informix, Microsoft, Oracle, and Sybase, have
all released object-relational versions of their products. These companies are promoting a new,
extended version of relational database technology called object-relational database
management systems, also known as ORDBMSs.

 A certain group thinks that future applications can only be implemented with pure object-
oriented systems. Initially these systems looked promising. However, they have been
unable to live up to the expectations. A new technology has evolved in which relational
and object-oriented concepts have been combined or merged. These systems are called
object-relational database systems. The main advantages of ORDBMSs are massive
scalability and support for object-oriented features.

 The need for richer storage mechanisms

• Multimedia applications
• Incorporation of business rules
• Reusability (inheritance)
• Nested complex types
• Relationships
• Options
• Object-oriented databases ?
• Object-relational databases ?


 The main advantages come from reuse and sharing.

 Reuse comes from the ability to extend the database server so that core functionality is
performed centrally, rather than coded in each application.

 An example is a complex type (or extended base type) which is defined within the database,
but is used by many applications. Previously it was required to define this type in every
application that used it, and develop the interface between the software ‘type’ and its
representation in the database. Sharing is a consequence of this reuse.


 From a practical point of view, end-users are happier to make the smaller ‘leap’ from
relational to object-relational, rather than have to deal with a completely different paradigm.


 The ORDBMS is more complex and thus has increased costs.

 Relational purists believe that the simplicity of the original model was its strength.

 Pure object-oriented database engineers are unhappy with the object-relational
terminology which is based on the relational model and not on object-oriented software
engineering concepts.

 An example is “user-defined data types” v “classes”.

 Thus, there is a large semantic gap between the o-o and o-r database worlds.

 ORDBMS engineers are data focused while OODB engineers have models which attempt
to mirror the real-world (data & behaviour).

Third Generation Database System Manifesto

The third-generation DSM was devised by Stonebraker’s group (of proposers) and defines those
principles that ORDBMS designers should follow.

• A third-generation DBMS must have a rich type system.

• Inheritance is a good idea.

• Functions (including database procedures and methods) and encapsulation are a good idea.

• Unique identifiers for tuples should be assigned by the DBMS only if a user-defined
primary key is unavailable.

• Rules (triggers or constraints) will become a major feature in future database systems.
They should not be associated with a specific function or collection.


• All programming access to a database should be through a non-procedural, high-level
access language (such as SQL).

• There should be more than one way to specify collections: one using enumeration of
members, and a second using the query language to specify membership.

• Updateable views are essential.

• Performance indicators have nothing to do with data models.

• Third-generation DBMSs must be accessible from multiple high-level languages.

• Persistent forms of multiple high-level languages are a good idea.

• SQL is ‘intergalactic data-speak’ regardless of its many faults.

• Queries and results should be the lowest level of communication between client and server.


• Rules are valuable in that they protect the integrity of data in a database.

• Relational databases have referential integrity for foreign key management.

• The general form of a rule is “on the occurrence of event x do action y”.

• There are four variations in the proposed standard for ORDBs: update-update, query-update,
update-query, and query-query rules.

Update-Update Rules
• In this case, the event is an update, and the action is an update.

• This is useful in cases where it is necessary to implement an audit, e.g. create a new tuple
in the Audit relation with username, date and description each time a change is made to
the Salary relation.

CREATE RULE Salary_Update AS

ON UPDATE TO Salary

DO

insert into Audit

Values ($username, date, Salary.lname)

• In the above example, the current username, date and the lname of the updated employee
(in Salary) are recorded. Note that if we were only interested in one or some group of
employees we could use a where clause (see next example).
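
The CREATE RULE notation above follows the proposed standard and will not run on most systems as written. As a rough analogue, an update-update rule can be approximated with an ordinary trigger. The Python sketch below uses the standard sqlite3 module; the column names of Salary and Audit are assumptions, and because SQLite has no notion of a session user the username is a hard-coded placeholder ('app_user').

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Salary (StaffID TEXT PRIMARY KEY, lname TEXT, amount REAL);
    CREATE TABLE Audit (username TEXT, changed_on TEXT, lname TEXT);

    -- Event: an update on Salary. Action: an insert into Audit.
    CREATE TRIGGER Salary_Update AFTER UPDATE ON Salary
    BEGIN
        INSERT INTO Audit VALUES ('app_user', date('now'), NEW.lname);
    END;

    INSERT INTO Salary VALUES ('A515', 'Rao', 100.0), ('B200', 'Khan', 90.0);
""")

conn.execute("UPDATE Salary SET amount = 110.0 WHERE StaffID = 'A515'")
audit_rows = conn.execute("SELECT username, lname FROM Audit").fetchall()
```

A WHEN clause on the trigger would restrict the audit to one employee or a group of employees. SQLite triggers fire only on INSERT, UPDATE and DELETE, so a rule whose event is a query cannot be expressed this way.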

Query-Update Rules

• In this case, the event is a query, and the action is an update.

• Similar to the previous example: a user is accessing the Salary relation (for a specific
employee), and the system automatically records it. In this case, only for employee A515.

CREATE RULE Salary_Access AS

ON SELECT TO Salary where salary.StaffID = 'A515'

DO

insert into Audit

Values ($username, date, Salary.lname)

• Many relational database systems cannot implement query-update rules.

Update-Query Rules

• In this case, the event is an update, and the action is a query (which uses the results in a


• Suppose that the deletion of tuples from the Author table is not recommended since new
titles may come into stock.

CREATE RULE Author_Delete_Alert AS

ON DELETE TO Author

DO

ShowMessage “Deleting “”prevents new titles being entered into the database”

• The query in this case is select which is used in the message.

Query-Query Rules
• In this case, both the event and the action are read-only queries.

• An example is where one retrieval operation will require an attribute from some other

• For example, when viewing details for a customer (from the Customer relation), their
credit may be listed as “A2”, where the actual value for “A2” is inside a Credit relation.
(Note we could do the same using a join query)




Select C.value

From Credit C

Where = X.CredRating
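
As the note above says, the same result can be obtained with a join query. The Python sketch below (standard sqlite3 module) shows that alternative. The column name code in Credit is an assumption, since the left-hand side of the comparison is missing in the rule as printed; the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Credit (code TEXT PRIMARY KEY, value TEXT);
    CREATE TABLE Customer (CustID INTEGER PRIMARY KEY, name TEXT,
                           CredRating TEXT);
    INSERT INTO Credit VALUES ('A1', 'Excellent'), ('A2', 'Good');
    INSERT INTO Customer VALUES (1, 'Asha', 'A2'), (2, 'Ravi', 'A1');
""")

# The join replaces each rating code with its value from Credit.
customer_credit = conn.execute("""
    SELECT X.name, C.value
    FROM Customer X JOIN Credit C ON C.code = X.CredRating
    ORDER BY X.CustID
""").fetchall()
```

The difference is that the join has to be written into every query that needs the value, while a query-query rule would apply it automatically whenever the Customer relation is read.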
