
RDBMS

SYLLABUS

Basics of database systems, Traditional file approach, Motivation for database approach, The
evolution of database systems, Database basics, Three views of data, The three level architecture
of DBMS, Relational database systems, Data models, Database languages, Client-server and
multi-tier architectures, Multimedia data, Information integration, Data-definition language
commands, Overview of query processing, Storage and buffer management, Transaction
processing, The query processor. The Entity-Relationship Data Model, Introduction of entity
Relationship model, Elements of the E/R Model, Requirement, Relationship, Entity-Relationship
Diagrams, Multiplicity of Binary E/R Relationships, Design Principles, Avoiding Redundancy,
Simplicity Counts, Extended ER Models

Representing Data Elements: Data Elements and Fields, Representing Relational Database
Elements, Records, Representing Block and Record Addresses, Client-Server Systems, Logical and
Structured Addresses, Record Modifications, Index Structures, Indexes on Sequential Files,
Secondary Indexes, B-Trees, Hash Tables.

The Relational Data Model: Basics of the Relational Model, Relation Instances, Functional
Dependencies, Rules About Functional Dependencies, Design of Relational Database Schemas,
Normalization, First Normal form, Second Normal Form, Third Normal Form, Boyce-Codd Normal
Form, Multi-valued dependency, Fifth Normal Form. Relational Algebra: Basics of Relational
Algebra, Set Operations on Relations, Extended Operators of Relational Algebra, Constraints on
Relations, Modification of the Database, Views, Relational Calculus, Tuple Relational Calculus,
Domain Relational Calculus.

SQL: Use Of SQL, DDL Statements, DML Statements, View Definitions, Constraints and Triggers
Keys and Foreign Keys, Constraints on Attributes and Tuples, Modification of Constraints,
Cursors, Dynamic SQL.

Normal Forms: 1NF, 2NF, 3NF, BCNF, Difference between third normal form and BCNF, Multi-
valued Dependencies And Join Dependencies, 4NF, 5NF, Difference between 4NF and 5NF.

Query Execution: Introduction to Physical-Query-Plan Operators, One-Pass Algorithms for


Database Operations, Nested-Loop Joins, Two-Pass Algorithms Based on Sorting, Two-Pass
Algorithms Based on Hashing, Index-Based Algorithms, Buffer Management, Parallel Algorithms
for Relational Operations, Using Heuristics in Query Optimization, Basic Algorithms for Executing
Query Operations.

The Query Compiler: Parsing, Algebraic Laws for Improving Query Plans, From Parse Trees to
Logical Query Plans, Estimating the Cost of Operations, Introduction to Cost-Based Plan
Selection, Completing the Physical-Query-Plan, Coping With System Failures, Issues and Models
for Resilient Operation, Redo Logging, Undo/Redo Logging, Protecting Against Media Failures

Concurrency Control: Serializability, Conflict-Serializability, Enforcing Serializability by Locks,


Locking Systems With Several Lock Modes, Architecture for a Locking Scheduler Managing
Hierarchies of Database Elements, Concurrency Control by Timestamps, Concurrency Control by
Validation.

More About Transaction Management: Introduction of Transaction management, Serializability


and Recoverability, View Serializability, Resolving Deadlocks, Distributed Databases, Distributed
Commit, Distributed Locking.

Database System Architectures: Centralized And Client-Server Architectures, Server System


Architectures, Parallel Systems, Distributed Systems, Network Types.

Distributed Database: Homogeneous And Heterogeneous Database, Distributed Data Storage,


Distributed Transaction, Commit Protocols and Concurrency Control In Distributed Databases,
Availability and Heterogeneous.


TABLE OF CONTENTS

UNIT 1
INTRODUCTION TO DATABASE SYSTEMS

1.1 Basics of database systems


1.2 Traditional file oriented approach
1.3 Motivation for database approach
1.4 The evolution of database systems
1.5 Database basics
1.6 Three views of data
1.7 The three level architecture of DBMS
1.8 Relational database systems
1.9 Data models
1.10 Database languages
1.11 Client-server and multi-tier architectures
1.12 Multimedia data
1.13 Information integration
1.14 Data-definition language commands
1.15 Overview of query processing
1.16 Storage and buffer management
1.17 Transaction processing
1.18 The query processor

UNIT 2
THE ENTITY-RELATIONSHIP DATA MODEL

2.1 Introduction of entity Relationship model


2.2 Elements of the E/R Model
2.3 Relationship
2.4 Requirements
2.5 Entity-Relationship Diagrams
2.6 Multiplicity of Binary E/R Relationships
2.7 Design Principles
2.8 Avoiding Redundancy
2.9 Simplicity Counts
2.10 Extended ER Models

UNIT 3
REPRESENTING DATA ELEMENTS

3.1 Data Elements and Fields


3.2 Representing Relational Database Elements
3.3 Records
3.4 Representing Block and Record Addresses
3.5 Client-Server Systems
3.6 Logical and Structured Addresses
3.7 Record Modifications
3.8 Index Structures
3.9 Indexes on Sequential Files
3.10 Secondary Indexes
3.11 B-Trees
3.12 Hash Tables


UNIT 4
THE RELATIONAL DATA MODEL

4.1 Basics of the Relational Model


4.2 Relation Instances
4.3 Functional Dependencies
4.4 Rules About Functional Dependencies
4.5 Design of Relational Database Schemas
4.6 Normalization
4.6.1 First Normal Form
4.6.2 Second Normal Form
4.6.3 Third Normal Form
4.6.4 Boyce-Codd Normal Form
4.6.5 Multi-valued dependency
4.6.6 Fifth Normal Form

UNIT 5
RELATIONAL ALGEBRA

5.1 Basics of Relational Algebra


5.2 Set Operations on Relations
5.3 Extended Operators of Relational Algebra
5.4 Constraints on Relations
5.5 Modification of the Database
5.6 Views
5.7 Relational Calculus
5.7.1 Tuple Relational Calculus
5.7.2 Domain Relational Calculus

UNIT 6
SQL

6.1 Introduction and Usage Of SQL


6.2 DDL Statements
6.3 DML Statements
6.4 View Definitions
6.5 Constraints and Triggers
6.6 Keys and Foreign Keys
6.7 Constraints on Attributes and Tuples
6.8 Modification of Constraints
6.9 Cursors
6.10 Dynamic SQL

UNIT 7
NORMAL FORMS

7.1 First Normal Form


7.2 Second Normal Form
7.3 Third Normal Form
7.4 BCNF
7.5 Fourth Normal Form
7.6 Fifth Normal Form
7.7 Difference between 4NF and 5NF

UNIT 8
QUERY EXECUTION

8.1 Introduction to Physical-Query-Plan Operators
8.2 One-Pass Algorithms for Database Operations
8.3 Nested-Loop Joins
8.4 Two-Pass Algorithms Based on Sorting
8.5 Two-Pass Algorithms Based on Hashing
8.6 Index-Based Algorithms
8.7 Buffer Management
8.8 Parallel Algorithms for Relational Operations
8.9 Using Heuristics in Query Optimization
8.10 Basic Algorithms for Executing Query Operations

UNIT 9
THE QUERY COMPILER

9.1 Parsing
9.2 Algebraic Laws for Improving Query Plans
9.3 From Parse Trees to Logical Query Plans
9.4 Estimating the Cost of Operations
9.5 Introduction to Cost-Based Plan Selection
9.6 Completing the Physical-Query-Plan
9.7 Coping With System Failures
9.8 Issues and Models for Resilient Operation
9.9 Redo Logging
9.10 Undo/Redo Logging
9.11 Protecting Against Media Failures

UNIT 10
CONCURRENCY CONTROL

10.1 Serial and Serializable Schedules
10.2 Conflict-Serializability
10.3 Enforcing Serializability by Locks
10.4 Locking Systems With Several Lock Modes
10.5 Architecture for a Locking Scheduler
10.6 Managing Hierarchies of Database Elements
10.7 Concurrency Control by Timestamps
10.8 Concurrency Control by Validation

UNIT 11
MORE ABOUT TRANSACTION MANAGEMENT

11.1 Introduction of Transaction Management
11.2 Serializability and Recoverability
11.3 View Serializability
11.4 Resolving Deadlocks
11.5 Distributed Databases
11.6 Distributed Commit
11.7 Distributed Locking

UNIT 12
DATABASE SYSTEM ARCHITECTURES

12.1 Centralized And Client-Server Architectures
12.2 Server System Architectures
12.3 Parallel Systems
12.4 Distributed Systems
12.5 Network Types

UNIT 13
DISTRIBUTED DATABASE

13.1 Homogeneous And Heterogeneous Database
13.2 Distributed Data Storage
13.3 Distributed Transaction
13.4 Commit Protocols
13.5 Concurrency Control In Distributed Databases
13.6 Availability

UNIT 1

INTRODUCTION TO DATABASE SYSTEMS

1.1 Basics of database systems


1.2 Traditional file oriented approach
1.3 Motivation for database approach
1.4 The evolution of database systems
1.5 Database basics
1.6 Three views of data
1.7 The three level architecture of DBMS
1.8 Relational database systems
1.9 Data models
1.10 Database languages
1.11 Client-server and multi-tier architectures
1.12 Multimedia data
1.13 Information integration
1.14 Data-definition language commands
1.15 Overview of query processing
1.16 Storage and buffer management
1.17 Transaction processing
1.18 The query processor

1.1 Introduction to Database Management Systems

Data manipulation and information processing have become major tasks of every organization, small or big, whether it is an educational institution, a government concern, a scientific establishment, a commercial enterprise or any other. Data is the plural of the Latin word datum, which means a raw fact or figure, such as a number, event, letter or transaction, from which, on its own, we cannot reach any conclusion. Data becomes useful only after processing: 78 by itself is simply a number (data), but "Physics: 78" is information, because it tells us that somebody obtained distinction marks in Physics. Information is processed data, and the user can take decisions based on information.

Data → Processing → Information

An organization can be viewed as a mechanism for processing information, and the traditional management of an organization can likewise be viewed in terms of information and process. The manager may be considered a planning and decision centre. Established routes of information flow are used to determine the effectiveness of the organization in achieving its objectives. Thus, information is often described as the key to success in business.

In essence a database is nothing more than a collection of information that exists over a long
period of time, often many years. In common parlance, the term database refers to a collection of
data that is managed by a DBMS.

The DBMS is expected to:


1. Allow users to create new databases and specify their schema (logical structure of the data),
using a specialized language called a data-definition language.

2. Give users the ability to query the data (a "query" is database lingo for a question about the data) and modify the data, using an appropriate language, often called a query language or data-manipulation language.
3. Support the storage of very large amounts of data (many gigabytes or more) over a long period of time, keeping it secure from accident or unauthorized use and allowing efficient access to the data for queries and database modifications.
4. Control access to data from many users at once, without allowing the actions of one user to affect other users and without allowing simultaneous accesses to corrupt the data accidentally.

A database is a collection of related data or operational data extracted from a firm or organization. For example, consider the names, telephone numbers and addresses of the people you know. You may have recorded this data in an indexed address book, or you may have stored it on a diskette using a personal computer and software such as Microsoft Access (part of MS Office), ORACLE or SQL Server.

A Database Management System (DBMS) is a computer program you can use to store, organize
and access a collection of interrelated data. The collection of data is usually referred to as the
database. The primary goal of a DBMS is to provide a convenient and effective way to store and
retrieve data from the database. There are several types of data models (a data model is used to
describe the structure of a database) and Empress is a Relational Database Management System
(RDBMS) with Object Oriented extensions. Empress is capable of managing data in multiple
databases. The data stored in each database is organized as tables with rows and columns. In
relational database terminology, these tables are referred to as relations, rows are referred to as
records, and columns are referred to as attributes.

A database management system, therefore, is a combination of hardware and software that can be used to set up and monitor a database, and that can manage the updating and retrieval of the data stored in it.

Most database management systems have the following facilities:


a) Creating a file, adding data, and modifying data; creating, adding and deleting entire files.
b) Retrieving data collectively or selectively.
c) Sorting or indexing data at the user's discretion and direction.
d) Producing various reports from the system.
e) Performing mathematical functions, so that the data stored in the database can be manipulated with these functions.
f) Maintaining data integrity and controlling database use.

Fig: Queries and application programs (e.g., COBOL/PL) access the database through the DBMS and the operating system.

The DBMS responds to a query by invoking the appropriate subsystems, each of which performs its special function to interpret the query, locate the desired data in the database and present it in the desired order.

As already mentioned, a database consists of a group of related files of different record types, and
the database allows users to access data anywhere in the database without the knowledge of how
data are actually organized on the storage device.

1.2 TRADITIONAL FILE ORIENTED APPROACH

The traditional file-oriented approach to information processing gives each application a separate master file and its own set of personal files. An organization also needs a flow of information across these applications, and this requires sharing of data, which is significantly lacking in the traditional approach. One major limitation of such a file-based approach is that the programs become dependent on the files and the files become dependent upon the programs.

Disadvantages
• Data Redundancy: The same piece of information may be stored in two or more files. For example, the particulars of an individual who may be a customer or client may be stored in two or more files. Some of this information may change, such as the address or the payment made. It is therefore quite possible that while the address in the master file for one application has been updated, the address in the master file for another application has not been. It may not even be easy to find out in how many files a repeating item such as the name occurs.

• Program/Data Dependency: In the traditional approach if a data field is to be added to a


master file, all such programs that access the master file would have to be changed to
allow for this new field which would have been added to the master record.
• Lack of Flexibility: In view of the strong coupling between the program and the data, most information retrieval possibilities would be limited to well-anticipated and pre-determined requests for data; the system would normally be capable of producing only the scheduled reports and queries which it has been programmed to create.

1.3 MOTIVATION FOR DATABASE APPROACH

Having pointed out some difficulties that arise in a straightforward file-oriented approach towards information system development, it must also be recognised that the database approach is not automatically justified. The work in the organization may not require significant sharing of data or complex access. In other words, the data and the way they are used in the functioning of the organization may not be appropriate to database processing. Apart from needing a more powerful hardware platform, the software for database management systems is also quite expensive. This means that a significant extra cost has to be incurred by an organization if it wants to adopt this approach.

The advantage gained by the possibility of sharing data with others also carries with it the risk of unauthorized access to data. This may range from violation of office procedures, to violation of privacy rights over information, to outright theft. Organizations therefore have to be ready to cope with additional managerial problems.
A database management system is complex, and it could lead to a more inefficient system than the equivalent file-based one.

The use of the database, and the possibility of its being shared, will therefore affect many departments within the organization. If the integrity of the data is not maintained, it is possible that one piece of data could be used by many programs in different applications by different users without their being aware of it. The impact of this may therefore be very widespread. Since data can be input from a variety of sources, control over the quality of data becomes very difficult to implement.
However, for most large organizations the difficulties in moving over to a database approach are still worth getting over, in view of the advantages that are gained, namely avoidance of data duplication, sharing of data by different programs, greater flexibility and data independence.

1.4 The Evolution of Database Systems


Early days: database applications built on top of file systems. Drawbacks of using file systems to
store data:
• Data redundancy and inconsistency
• Difficulty in accessing data
• Atomicity of updates

• Concurrency control
• Security
• Data isolation — multiple files and formats
• Integrity problems


1.5 DATABASE BASICS

Since the DBMS of an organization will in some sense reflect the nature of activities in the organization, some familiarity with the basic concepts, principles and terms used in the field is important.
• Data-items: The term data item is the word for what has traditionally been called the field in data processing, and is the smallest unit of data that has meaning to its users. The phrase data element or elementary item is also sometimes used. Although the data item may be treated as a molecule of the database, data items are grouped together to form aggregates described by various names. For example, the data record is used to refer to a group of data items, and a program usually reads or writes whole records. The data items could occasionally be further broken down into what may be called an atomic level for processing purposes.
• Entities and Attributes: The real world consists of things that may be tangible objects, such as an employee, a component in an inventory or a space, or intangible, such as an event, a job description, identification numbers or an abstract construct. All such items about which relevant information is stored in the database are called entities. The qualities of the entity that we store as information are called its attributes. An attribute may be expressed as a number or as text. It may even be a scanned picture, a sound sequence or a moving picture, which is now possible in some visual and multimedia databases.
Data processing normally concerns itself with a collection of similar entities and records information about the same attributes of each of them. In the traditional approach, a programmer usually maintains a record about each entity, and a data item in each record relates to each attribute. Similar records are grouped into files, and such a 2-dimensional array is sometimes referred to as a flat file.

• Logical and Physical Data: One of the key features of the database approach is to bring about a distinction between the logical and the physical structures of the data. The term logical structure refers to the way the programmers see the data, and the physical structure refers to the way data are actually recorded on the storage medium. Even in the early days of records stored on tape, the inter-record gap required that many logical records be grouped into one physical record to save storage space on tape. It was the software which separated them when they were used in an application program, and combined them again before writing them back to tape. In today's systems the complexities are even greater, and, as will be seen when distributed databases are discussed, some records may physically be located at significantly remote places.
• Schema and Subschema: Having seen that the database approach focuses on the logical organization of data and decouples it from the physical representation, it is useful to have a term to describe the logical database description. A schema is a logical database description and is drawn as a chart of the types of data that are used. It gives the names of the entities and attributes, and specifies the relationships between them. It is a framework into which the values of the data items can be fitted. Like an information display system, such as that giving arrival and departure times at airports and railway stations, the schema will remain the same though the values displayed in the system will change from time to time. The relationships specified between the different entities occurring in the schema may be one-to-one, one-to-many, many-to-many, or conditional.
The term schema is used to mean an overall chart of all the data-item types and record types stored in a database. The term subschema refers to the same kind of view, but restricted to the data-item types and record types which a particular user uses in a particular application. Therefore, many different subschemas can be derived from one schema.
• Data Dictionary: It holds detailed information about the different structures and data types: the details of the logical structures that are mapped onto the different storage structures, details of the relationships between data items, details of all users' privileges and access rights, and performance and resource details.

1.6 THREE VIEWS OF DATA

DBMS is a collection of interrelated files and a set of programs that allow several users to access and modify these files. A major purpose of a database system is to provide users with an abstract view of the data. However, in order for the system to be usable, data must be retrieved efficiently. The concern for efficiency leads to the design of complex data structures for the representation of data in the database. By defining levels of abstraction at which the database may be viewed, we obtain the external (logical) view, the conceptual view and the internal (physical) view.
• External view: This is the highest level of abstraction, as seen by a user. This level of abstraction describes only a part of the entire database.
• Conceptual view: This is the next level of abstraction, and is the sum total of the users' views of the Data Base Management System. At this level we consider what data are actually stored in the database. The conceptual level contains information about the entire database in terms of a small number of relatively simple structures.
• Internal level: The lowest level of abstraction, at which one describes how the data are physically stored. The interrelationship of the three levels of abstraction is illustrated in figure 2.

Fig: The three views of data

To illustrate the distinction among the different views of data, they can be compared with the concept of data types in programming languages. Most high-level programming languages such as C, VC++, etc. support the notion of a record or structure type. For example, in the C language we declare a structure (record) as follows:
struct Emp {
    char name[30];      /* employee name */
    char address[100];  /* employee address */
};
This defines a new record called Emp with two fields. Each field has a name and data type
associated with it.
In an insurance organization, we may have several such record types, including among others:
- Customer, with fields for name and salary
- Premium, with the amount paid and the amount due on a given date
- Insurance agent, with name and salary plus commission
At the internal level, a customer, premium account or employee (insurance agent) can be described as a sequence of consecutive bytes. At the conceptual level each such record is described by a type definition, as illustrated above; the interrelation among these record types is also defined, as are the rights or privileges assigned to individual customers or end-users. Finally, at the external level, we define several views of the database. For example, for preparing insurance cheques only the information in Customer_details is required; one does not need to access information about customer accounts. Similarly, tellers can access only account information; they cannot access information concerning the premium paid or the amount received.
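As a rough SQL sketch of this idea (the table, view and column names here are illustrative, not taken from an actual insurance schema), the conceptual level can be declared as a table, and a teller's external view as a view that hides the premium and salary information:

-- Conceptual level: the full Customer_details record type.
CREATE TABLE customer_details (
    customer_name VARCHAR(30),
    address       VARCHAR(100),
    salary        DECIMAL(10,2),
    premium_paid  DECIMAL(10,2),
    due_amount    DECIMAL(10,2),
    due_date      DATE
);

-- External level: tellers see only the fields they need; premium and
-- salary details remain hidden behind the view.
CREATE VIEW teller_view AS
SELECT customer_name, address, due_amount, due_date
FROM customer_details;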

1.7 The Three Level Architecture Of DBMS

A database management system that provides these three levels of data is said to follow a three-level architecture, as shown in the figure below. These three levels are the external level, the conceptual level, and the internal level.

Fig: The three-level architecture for a DBMS

A schema describes the view at each of these levels. A schema as mentioned earlier is an outline
or a plan that describes the records and relationships existing in the view. The schema also
describes the way in which entities at one level of abstraction can be mapped to the next level.
The overall design of the database is called the database schema. A database schema includes such information as:
· Characteristics of data items such as entities and attributes
· Format for storage representation
· Integrity parameters such as physical authorization and backup policies
· Logical structure and relationships among those data items

The concept of a database schema corresponds to the programming-language notion of a type definition. A variable of a given type has a particular value at a given instant in time. The concept
of the value of a variable in programming languages corresponds to the concept of an instance of
a database schema.

Since each view is defined by a schema, there exist several schemas in the database, and these schemas are partitioned following the three levels of data abstraction or views. At the lowest level we have the physical schema; at the intermediate level we have the conceptual schema; and at the highest level we have a subschema. In general, a database system supports one physical schema, one conceptual schema, and several subschemas.

1.7.1 External Level or Subschema


The external level is at the highest level of database abstraction where only those portions of the
database of concern to a user or application program are included. Any number of user views may
exist for a given global or conceptual view.
Each external view is described by means of a schema called an external schema or subschema. The external schema consists of the definition of the logical records and the relationships in the
external view. The external schema also contains the method of deriving the objects in the
external view from the objects in the conceptual view. The objects include entities, attributes, and
relationships.

1.7.2 Conceptual Level or Conceptual Schema


One conceptual view represents the entire database. The conceptual schema defines this
conceptual view. It describes all the records and relationships included in the conceptual view
and, therefore, in the database. There is only one conceptual schema per database. This schema
also contains the method of deriving the objects in the conceptual view from the objects in the
internal view.
The description of data at this level is in a format independent of its physical representation. It
also includes features that specify the checks to retain data consistency and integrity.

1.7.3 Internal Level or Physical Schema


It indicates how the data will be stored and describes the data structures and access methods to be used by the database. The internal schema expresses the internal view: it contains the definition of the stored records, the method of representing the data fields, and the access aids used.

1.7.4 Mapping between different Levels


Two mappings are required in a database system with three different views. A mapping between
the external and conceptual level gives the correspondence among the records and the
relationships of the external and conceptual levels.
a) EXTERNAL to CONCEPTUAL: determines how the conceptual record is viewed by the user.
b) INTERNAL to CONCEPTUAL: establishes the correspondence between the conceptual and internal levels; it represents how the conceptual record is represented in storage.

An internal record is a record at the internal level, not necessarily a stored record on a physical
storage device. The internal record of figure 3 may be split up into two or more physical records.
The physical database is the data that is stored on secondary storage devices. It is made up of
records with certain data structures and organized in files. Consequently, there is an additional
mapping from the internal record to one or more stored records on secondary storage devices.

1.8 Relational Database Systems

The relational model, invented by IBM researcher E. F. (Ted) Codd in 1970, wasn't turned into a commercial product until almost 1980. Since then, database systems based on the relational model, called relational database management systems or RDBMS, have come to dominate the database software market. Today few people know about any other kind of database management system.
Few RDBMS implement the relational model completely. Although commercial RDBMS have a lot in common, each system has quirks and non-standard extensions. You must understand relational theory to correctly design a database; just learning a particular RDBMS won't get you all the way there.
A good RDBMS and a well-designed relational database give you some important benefits:
• Data integrity and consistency maintained and/or enforced by the RDBMS.
• Redundant data eliminated or kept to a practical minimum.
• Data retrieved by unique keys.
• Relationships expressed through matching keys.
• Physical organization of data managed by RDBMS.
• Optimization of storage and database operation execution times.
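As a small illustration of how unique keys and matching keys are declared (a hypothetical sketch; the table and column names are invented for this example), the keys and the relationship are described to the RDBMS, which then enforces them:

-- Each customer is identified by a unique key.
CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_name VARCHAR(30) NOT NULL
);

-- Each account references its owner through a matching key; the RDBMS
-- maintains the integrity of both the key and the CHECK constraint.
CREATE TABLE account (
    account_number CHAR(5) PRIMARY KEY,
    balance        DECIMAL(10,2) CHECK (balance >= 0),
    customer_id    INTEGER REFERENCES customer(customer_id)
);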

1.9 Data Models

Collections of conceptual tools for describing data, data relationships, data semantics and
consistency constraints. The various data models that have been proposed fall into three different
groups. Object based logical models, record-based logical models and physical models.
Object-Based Logical Models: They are used in describing data at the logical and view levels. They
are characterized by the fact that they provide fairly flexible structuring capabilities and allow
data constraints to be specified explicitly. There are many different models and more are likely to
come. Several of the more widely known ones are:

• The E-R model

• The object-oriented model

• The semantic data model

• The functional data model


Data Modeling Using E-R Model

The (E-R) data model is based on a perception of the real world that consists of a collection of basic objects, called entities, and of relationships among these objects.

The overall logical structure of a database can be expressed graphically by an E-R diagram, which is built up from the following components:

• Rectangles, which represent entity sets

• Ellipses, which represent attributes

• Diamonds, which represent relationships among entity sets

• Lines, which link attributes to entity sets and entity sets to relationships.
For example, suppose we have two entities, customer and account. These two entities can be modeled as follows:

Fig: A sample E-R diagram. The entity set Customer (attributes customer-name, customer-city) and the entity set Account (attributes account-number, balance) are linked by the Deposit relationship.

The Object-Oriented Model
Like the E-R model, the object-oriented model is based on a collection of objects. An object contains values stored in instance variables within the object. An object also contains bodies of code that operate on the object. These bodies of code are called methods.
Classes: a class is a collection of objects that contain the same types of values and the same methods. For example, account-number and balance are instance variables, and pay-interest is a method that uses these two variables to add interest to the balance.
Semantic Models
These include the extended relational, the semantic network and the functional models. They are characterized by their provision of richer facilities for capturing the meaning of data objects and hence of maintaining database integrity. Systems based on these models existed only in prototype form at the time of writing and will begin to filter through over the next decade.
Record-Based Logical Models
Record based logical models are used in describing data at the logical and view levels. In contrast
to object-based data models, they are used both to specify the overall logical structures of the
database, and to provide a higher-level description of the implementation.
Record-based models are so named because the database is structured in fixed-format records of
several types. Each record type defines a fixed number of fields, or attributes, and each field is
usually of a fixed length.
The three most widely accepted record-based data models are the relational, network, and
hierarchical models.
Relational Model
The relational model uses a collection of tables to represent both data and the relationships
among those data. Each table has multiple columns, and each column has a unique name as
follows:

CUSTOMER (table name)

Customer-name   Customer-street   Customer-city   Account-number
Johnson         Alma              Palo Alto       A-101
Smith           North             Rye             A-215
Hayes           Main              Harrison        A-102
Turner          Putnam            Stanford        A-305
Johnson         Alma              Palo Alto       A-201
Jones           Main              Harrison        A-217
Lindsay         Park              Pittsfield      A-222
Smith           North             Rye             A-201

A sample relational database
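A minimal SQL sketch of how this sample relation might be declared and queried (the data types are assumptions; only the data shown above is used):

CREATE TABLE customer (
    customer_name   VARCHAR(20),
    customer_street VARCHAR(20),
    customer_city   VARCHAR(20),
    account_number  CHAR(5)
);

INSERT INTO customer VALUES ('Johnson', 'Alma',   'Palo Alto',  'A-101');
INSERT INTO customer VALUES ('Smith',   'North',  'Rye',        'A-215');
INSERT INTO customer VALUES ('Hayes',   'Main',   'Harrison',   'A-102');
INSERT INTO customer VALUES ('Turner',  'Putnam', 'Stanford',   'A-305');
INSERT INTO customer VALUES ('Johnson', 'Alma',   'Palo Alto',  'A-201');
INSERT INTO customer VALUES ('Jones',   'Main',   'Harrison',   'A-217');
INSERT INTO customer VALUES ('Lindsay', 'Park',   'Pittsfield', 'A-222');
INSERT INTO customer VALUES ('Smith',   'North',  'Rye',        'A-201');

-- Which accounts belong to customers living in Harrison?
SELECT customer_name, account_number
FROM customer
WHERE customer_city = 'Harrison';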
Network Model
Data in the network model are represented by collections of records, and relationships among data are represented by links, which can be viewed as pointers. The records in the database are organised as collections of arbitrary graphs. Such a database is shown in the figure below.

Fig: A sample network database. Customer records (Johnson, Smith, Hayes, Turner, Jones, Lindsay) are linked by pointers to account records holding an account number and a balance (e.g. A-101/500, A-215/700, A-102/400, A-305/350, A-201/900, A-217/750, A-222/700).

Hierarchical Model
The hierarchical model is similar to the network model in the sense that data and relationships among data are represented by records and links, respectively. It differs from the network model in that records are organised as collections of trees rather than arbitrary graphs.
Fig: A sample hierarchical database. Customer records (Johnson, Smith, Hayes, Turner, Jones, Lindsay) form the parent nodes, and each customer's account records (account number and balance) hang beneath it, so the whole database is organised as a collection of trees.
Physical Data Models
Physical data models are used to describe data at the lowest level. In contrast to logical data
models, there are few physical data models in use. Two of the widely known ones are the unifying
model and the frame-memory model.

1.10 Database Languages

A database system provides two different types of languages: one to specify the database schema,
and the other to express database queries and updates.
Data Definition Language (DDL)
A database schema is specified by a set of definitions expressed in a special language called a data-definition language (DDL). The result of compiling DDL statements is a set of tables that is stored in a special file called the data dictionary, or data directory.
A data dictionary is a file that contains metadata, that is, data about data. This file is consulted before actual data are read or modified in the database system.
The storage structure and access methods used by the database system are specified by a set of definitions in a special type of DDL called a data storage and definition language. The result of compiling these definitions is a set of instructions that specify the implementation details of the database schema; these details are usually hidden from the users.
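As a brief hedged sketch (the table is invented, and the catalog view shown is the standard INFORMATION_SCHEMA, which many SQL systems use to expose their data dictionary), compiling a DDL statement adds entries to the data dictionary, which can itself be queried:

-- DDL statement: its compilation records the table's structure as metadata.
CREATE TABLE premium (
    policy_no   INTEGER,
    amount_paid DECIMAL(10,2),
    paid_on     DATE
);

-- Inspecting the resulting metadata through the catalog.
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_name = 'premium';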
Data Manipulation Language
By data manipulation, we mean

• The retrieval of information stored in the database.

• The insertion of new information into the database.

• The deletion of information stored in the database.


A data-manipulation language (DML) is a language that enables users to access or manipulate data as organised by the appropriate data model.
There are basically two types:
Procedural DML
It requires a user to specify what data are needed and how to get those data.
Nonprocedural DML
It requires a user to specify what data are needed without specifying how to get those data.
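SQL itself is a nonprocedural DML. In the short sketch below (the customer table is hypothetical), the statement says what data are wanted, and the DBMS decides how to retrieve them, for instance whether to scan the table or use an index:

-- A table to query against (illustrative only).
CREATE TABLE customer (customer_name VARCHAR(30), customer_city VARCHAR(30));

-- Declarative request: no access path or loop is spelled out by the user.
SELECT customer_name
FROM customer
WHERE customer_city = 'Harrison';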

Normalization
If you work with relational databases, you probably know about normalization. The process of normalization
transforms data into forms that conform to the relational model. Normalized data enables the
RDBMS to enforce integrity rules, guarantee consistency, and optimize database access. Learning
how to normalize data takes significant time and practice. Data modelers spend a lot of time
understanding the meaning of data so they can properly normalize it, but programmers frequently
downplay normalization, or dismiss it outright as an academic problem. Most databases come
from power users and programmers, not data modelers, and most databases suffer from un-
normalized data, redundancy, integrity and performance problems. Un-normalized databases
usually need a lot of application code to protect the database from corruption.
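A small hedged sketch of the difference (invented tables, not taken from the text): an un-normalized design packs a repeating group into one column, while the normalized design stores one atomic value per column so the RDBMS can enforce rules and optimize access:

-- Un-normalized: a repeating group of phone numbers crammed into one field;
-- hard to query, index or constrain.
CREATE TABLE contact_flat (
    customer_name VARCHAR(30),
    phone_numbers VARCHAR(200)   -- e.g. '555-1234, 555-9876'
);

-- Normalized (first normal form): one row per phone number, one atomic
-- value per column.
CREATE TABLE contact (
    customer_name VARCHAR(30),
    phone_number  VARCHAR(20)
);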

1.11 Client-Server and Multi-Tier Architectures

Client/Server Technology
Client/server technology is the computer architecture used in almost all automated library
systems now being offered to libraries. The simple definition is:
Client/server is a computer architecture that divides functions into client (requestor) and server
(provider) subsystems, with standard communication methods (such as TCP/IP and z39.50) to
facilitate the sharing of information between them. Among the characteristics of a client/server
architecture are the following:

• The client and server can be distinguished from one another by the differences in tasks they
perform
• The client and server usually operate on different computer platforms
• Either the client or server may be upgraded without affecting the other. Clients may connect
to one or more servers; servers may connect to multiple clients concurrently.
• Clients always initiate the dialogue by requesting a service.

Client/server is most easily differentiated from hierarchical processing, which uses a host and
slave, by the way a PC functions within a system. In client/server the PC-based client
communicates with the server as a computer; in hierarchical processing the PC emulates a
"dumb" terminal to communicate with the host. In client/server the client controls part of the
activity, but in hierarchical processing the host controls all activity. A client PC almost always
does the following in a client/server environment: screen handling, menu or command
interpretation, data entry, help processing, and error recovery.
The dividing line between the client and a server can be anywhere along a broad continuum: at
one end only the user interface has been moved onto the client; at the other, almost all
applications have been moved onto the client and the database may be distributed. There are at
least five points along the continuum:

Distributed presentation:
The presentation is handled partly by the server and partly by the client.

Remote presentation:
The presentation is controlled and handled entirely by the client.

Distributed logic:
The application logic is handled partly by the server and partly by the client.

Remote data management:


Database management is controlled and handled entirely by the server.

Distributed database:
Database management is handled partly by the server and partly by the client. There are,
therefore, two major applications for client/server in a library environment:
1) as the architecture for an automated library system, and
2) as an approach to linking heterogeneous systems.
In the first application, a vendor designs a system using client/server architecture to facilitate use
of that system to access multiple servers, to facilitate bringing together multiple product lines,
and/or to improve productivity. In the second application, a vendor designs a client to facilitate
transparent access to systems of other vendors, and a server to facilitate transparent access to its
system from others. While the underlying principles are the same, the vendor has considerable
latitude in the design of its own client/server system, but must strictly conform to standards
when using client/server to link its system with those of other libraries.
While it has been possible to access a wide variety of electronic resources through an automated
library system for a number of years, client/server technology has made it possible to tailor the
user interface to provide a personalized interface which meets the needs of any particular user
based on an analysis of tasks performed or on an individual's expressed preferences. An example
of this tailoring is the recent introduction of portals, common user interfaces to a wide variety of
electronic resources. [See the Tech Note on Portal Technology]. The portal can be
tailored to groups of staff or patrons, or to each individual.
Vendors with multiple product lines can build a single client to work with any of their server
products. This substantially reduces development costs. Client/server can also improve
productivity. Many vendors are now offering different clients for technical services, circulation,
and patron access catalog applications.
A GUI (graphical user interface) -- a presentation of information to the user using icons and other
graphics -- is sometimes called client/server, but unless information moves from the server to the
client in machine-readable (raw) form, and the client does the formatting to make it human-
readable, it is not true client/server. Further, there is nothing in the client/server architecture
that requires a GUI. Nevertheless, most vendors of automated library systems use GUI for staff
applications. The GUIs are proprietary to each vendor. Web browsers are preferred for patron
applications because they are more likely to be familiar to patrons than a proprietary GUI.
An important computer industry development which has facilitated client/server architecture is
referred to as "open systems", a concept which features standardized connectivity so that
components from several vendors may be combined. The trend to open systems began in the
1970s as a reaction against proprietary systems that required that all hardware and system
software come from a single source, and gained momentum in the 1980s, as networking became
common. While various parts of an organization might not hesitate to purchase proprietary
systems to meet their own needs, the desire to provide access from other parts of the organization,
or to exchange information, would be an incentive to select an open system. For client/server,
open systems are essential.
Most client/server systems offered by automated library system vendors use an open operating
system such as UNIX or one of its variations, or Windows NT or 2000 server. UNIX is the most
popular operating system for servers because of the large range of platform sizes available, but
Windows NT or 2000 server has been growing in popularity, especially for systems supporting
fewer than 100 concurrent users.
The most popular client operating systems are Windows 95/98/Me/2000 and Linux. By
supporting multiple client operating systems, a vendor of an automated library system makes it
possible for the client to conform to a staff member’s or patron's accustomed operating system
environment.
Almost all client/server systems use a relational database management system (RDBMS) for
handling the storage and retrieval of records in the database using a series of tables of values.
There is a common misconception that client/server is synonymous with networked SQL
(Structured Query Language) databases. SQL, a popular industry-standard data definition and
access language for relational databases, is only one approach -- albeit the one selected by almost
all automated library system vendors. While one can reasonably expect the use of an RDBMS and
SQL, the absence of either does not mean that a system is not client server.
A network computer is a PC without a hard disk drive. They have been little used in
libraries because most libraries use their PCs for a variety of applications in addition to accessing
the automated library system. Even when there are applications that lend themselves for use on
thin clients, most libraries have preferred to use older PCs that are no longer suitable for
applications that require robust machines. They use a two-tier PC strategy that involves the
purchase and deployment of new PCs for applications that require robust machines and
redeployment of the replaced machines for applications that they can support. For example, new
PCs are used for most staff applications and patron access to the Internet: older PCs are used as
“express catalogs,” devices that have a Web browser, but are limited to accessing the library’s
patron access catalog. The two-tier PC strategy can extend the life of a PC by as much as three
years.
Network computers are most widely used in large organizations that have to support thousands of
users. Almost all applications, including word processing, spreadsheets, and other office
applications, are loaded on a server. It is then not necessary to load new product releases on each
machine; only the applications on the server have to be updated. Most libraries do not have
enough users to realize significant savings by taking this approach. For libraries that have
hundreds of PCs, remote PC management is an alternative to thin clients; for libraries that have
fewer than 100 PCs, it is possible to support each individually.
A PDA is a handheld device that combines computing, telephone, fax, and networking features.
While originally pen-based (i.e., using a stylus), many models now come with a small keyboard.
The Palm Pilot is an example of a PDA. A number of libraries now encourage their users to access
the patron access catalog with a PDA. It is this application which holds the most promise for the
use of thin clients in libraries. As the bandwidth available for wireless applications increases and
the costs of PDAs drop, the use of PDAs for access to databases is expected to increase
dramatically.
The key to thin client technology is Java, a general purpose programming language with a number
of features that make it well suited for use on the Web. Small Java applications are called Java
applets and can be downloaded from a Web server and run on a device which includes a Java-
compatible Web browser such as Netscape Navigator or Microsoft Internet Explorer. This means
that the thin client does not need to be loaded with applications software.
A thin client can also be GUI-based. In that case, the client handles only data presentation and
deals with the user interaction; all applications and database management is found on the server.
Vendors of automated library systems favor proprietary GUI-based clients for staff because that
makes it possible to exploit the features of their systems.

1.12 Multimedia Data


The weblicon PIM has been built on a multi-tier architecture to support high scalability through load-balancing and high availability through redundancy. The HTTP Client tier communicates with the HTTP Server tier, which routes incoming requests to the Application Server tier. The Application Server tier connects to the Database Server tier, consisting of RDBMS, LDAP and IMAP servers. All tiers can be scaled individually by building clusters of servers for load-balancing and redundancy.
A related concern with multimedia data is embedding information into the data itself (watermarking) in a way that is imperceptible to the user yet hard to remove.

For model-based compression and animation of face models as defined in MPEG-4, a watermark
can be embedded into the transmitted facial animation parameters. The watermark can be
retrieved from the animation parameters or from video sequences rendered with the watermarked
animation parameters, even after subsequent low-quality video compression of the rendered
sequence. Entering the derived or generated data into the database and associating it with the
original information gives other users automatic access to the conclusions, thoughts, and
annotations of previous users of the information. This ability to modify, adjust, enhance, or add to
the global set of information and then share that information with others is a powerful and
important service. This type of service requires cooperation between the multimedia data
manipulation tools described above and the information repositories scattered across the network.
Generated or extracted information must be deposited and linked with existing information so
that future users will not only benefit from the original information but also from the careful
analysis and insight of previous users.


1.13 Information Integration

Information integration is an important feature of Oracle9i Database. Oracle provides many


features that allow customers to synchronously and asynchronously integrate their data,
including Oracle Streams, Oracle Advanced Queuing, replication, and distributed SQL.
Oracle also provides gateways to non-Oracle database servers, providing seamless interoperation
in heterogeneous environments.

Asynchronous Integration with Oracle Streams


Oracle Streams enables the propagation and management of data, transactions and events in a
data stream either within a database, or from one database to another. The stream routes
published information to subscribed destinations. The result is a new feature that provides
greater functionality and flexibility than traditional solutions for capturing and managing events,
and sharing the events with other databases and applications. Oracle Streams provides both
message queuing and replication functionality within the Oracle database. To simplify
deployment of these specialized information integration solutions, Oracle provides customized
APIs and tools for message queuing, replication, and data protection solutions. These include:
• Advanced Queuing for message queuing
• Replication with Oracle Streams

Asynchronous Integration with Oracle Advanced Replication


Oracle9i also includes Advanced Replication, a powerful replication feature first introduced in Oracle7. Advanced Replication is useful for replicating objects between Oracle9i Database and older versions of the Oracle Server (which do not support Oracle Streams-based replication). Advanced Replication also provides Oracle Materialized Views, a replication solution targeted at mass deployments and disconnected replicas.
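A hedged sketch of the materialized-view style of replication mentioned above (the object names and the hq_db database link are hypothetical, and fast refresh additionally requires a materialized view log on the master table):

CREATE MATERIALIZED VIEW customer_mv
  REFRESH FAST ON DEMAND
  AS SELECT customer_id, customer_name, customer_city
     FROM customer@hq_db;   -- remote master table reached over a database link

-- Pull accumulated changes from the master when the replica reconnects.
BEGIN
  DBMS_MVIEW.REFRESH('CUSTOMER_MV');
END;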

Synchronous Integration with Distributed SQL


Oracle Distributed SQL enables synchronous access from one Oracle database, to other Oracle or
non-Oracle databases. Distributed SQL supports location independence, making remote objects
appear local, facilitating movement of objects between databases without application code

changes. Distributed SQL supports both queries and DML operations, and can intelligently
optimize execution plans to access data in the most efficient manner.
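As an illustrative sketch of this (the link name, connection details and tables are hypothetical), a database link makes a remote table addressable as if it were local, and a single query can then span both databases:

CREATE DATABASE LINK sales_link
  CONNECT TO report_user IDENTIFIED BY report_pw
  USING 'sales_db';                      -- net service name of the remote database

-- A distributed query: the orders table lives in the remote database, yet
-- it is joined with a local table without any application code changes.
SELECT c.customer_name, o.order_total
FROM   customer c,
       orders@sales_link o
WHERE  c.customer_id = o.customer_id;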

1.14 Data-Definition Language Commands

Empress provides a set of Structured Query Language (SQL) commands that allows users to
request information from the database. The Empress SQL language offers a basic level at which data management commands are typed in without any prompting; the full range of data management commands is available at this level. The commands themselves fall into three groups:
1. Data Definition Language commands are concerned with the structure of the database and its tables.
2. Data Manipulation Language commands are concerned with the maintenance and retrieval of the data stored in the database tables.
3. Data Control Language commands provide facilities for assuring the integrity and security of the data.

Data Definition Language Commands


The overall structure of the database is defined by the Data Definition Language (DDL)
commands. The result of compiling DDL commands is a set of tables called the data dictionary
which contains information about the data in the database. Detailed descriptions of the Empress
DDL commands are provided in the Empress SQL: Reference manual. A summary of the available
DDL commands follows:
ALTER TABLE: Changes the structure of an existing table without having to dump and re-load its records. This also includes enabling/disabling triggers, defining the table type for replication, enabling and disabling replication relations, and setting the checksum for a table.
CREATE COMMENT: Attaches a comment to a table or attribute.
CREATE INDEX: Sets up a search-aiding mechanism for an attribute.
CREATE MODULE: Creates the definition of a persistent stored module in the data dictionary.
CREATE RANGE CHECK: Sets up data validation checks on an attribute.
CREATE REFERENTIAL: Sets up data referential constraints on attributes.
CREATE REPLICATION MASTER: Assigns replication master entries to a replication table.
CREATE REPLICATION REPLICATE: Assigns replication replicate entries to a replication table.
CREATE REPLICATE TABLE: Creates a replicate table from a replication master table.
CREATE TABLE: Creates a new table or replicate table, including its name and the name and data type of each of its attributes.
CREATE TRIGGER: Sets up trigger events in the data dictionary.
CREATE VIEW: Creates a logical table from parts of one or more tables.
DISPLAY DATABASE: Shows the tables in the database.
DISPLAY GRANT PRIVILEGE: Shows privilege grant options for a table.
DISPLAY MODULE: Shows a persistent stored module definition.
DISPLAY PRIVILEGE: Shows access privileges for a table.
DISPLAY TABLE: Shows the structure of a table.
DROP COMMENT: Removes a comment on a table or attribute.
DROP INDEX: Removes an index on an attribute.
DROP MODULE: Removes a persistent stored module definition from the data dictionary.
DROP RANGE CHECK: Removes data validation checks on an attribute.
DROP REFERENTIAL: Removes data referential constraints from attributes.
DROP REPLICATION MASTER: Removes a replication master entry from a replication table.
DROP REPLICATION REPLICATE: Removes a replication replicate entry from a replication table.
DROP TABLE: Removes an existing table.
DROP TRIGGER: Removes a trigger event from the data dictionary.
DROP VIEW: Removes a logical table.
GRANT PRIVILEGE: Changes access privileges for tables or attributes.
LOCK LEVEL: Sets the level of locking on a table.
RENAME: Changes the name of a table or attribute.
REVOKE PRIVILEGE: Removes table or attribute access privileges.
UPDATE MODULE: Links a persistent stored module definition with the module shared library.
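To give a feel for how a few of these commands fit together, here is a short sketch in generic SQL DDL (the exact Empress syntax may differ slightly, and the table, index and view names are invented):

-- Define a table and its attributes.
CREATE TABLE policy (
    policy_no   INTEGER,
    holder_name CHAR(30),
    premium     DECIMAL(10,2)
);

-- Add a search-aiding mechanism for an attribute.
CREATE INDEX policy_no_idx ON policy (policy_no);

-- Change the structure of the existing table.
ALTER TABLE policy ADD agent_name CHAR(30);

-- Define a logical table built from part of the base table.
CREATE VIEW high_value_policy AS
SELECT policy_no, holder_name
FROM policy
WHERE premium > 10000;

-- Remove the objects when they are no longer needed.
DROP VIEW high_value_policy;
DROP INDEX policy_no_idx;
DROP TABLE policy;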

1.15 Overview of Query Processing

The techniques used by a DBMS to process, optimize, and execute high-level queries. A query
expressed in a high-level query language such as SQL must first be scanned, parsed, and
validated. The scanner identifies the language tokens, such as SQL keywords, attribute names and relation names, in the text of the query, whereas the parser checks the query syntax to
determine whether it is formulated according to the syntax rules (rules of grammar) of the query
language. The query must also be validated, by checking that all attribute and relation names are
valid and semantically meaningful names in the schema of the particular database being queried.
An internal representation of the query is then created, usually as a tree data structure called
query tree. It is also possible to represent the query using a graph data structure called a query
graph. The DBMS must then devise an execution strategy for retrieving the result of the query
from the database files. A query typically has many possible execution strategies, and the process
of choosing a suitable one for processing a query is known as query optimization.

Query Processing Strategies


The steps involved in processing a query are illustrated in Figure 5.2. The basic steps are:
1. Parsing and translation
2. Optimization
3. Evaluation
Before query processing can begin, the system must translate the query into a usable form. A
language such as SQL is suitable for human use, but is ill suited to be the system’s internal
representation of a query. A more useful internal representation is one based on the extended
relational algebra.
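For example, the following SQL query (the relation and attribute names are assumed purely for illustration) would be translated into an equivalent expression of the extended relational algebra:

SELECT Name
FROM   EMPLOYEE
WHERE  Salary > 50000;

-- one possible internal form of the same query:
--   PROJECT [Name] ( SELECT [Salary > 50000] ( EMPLOYEE ) )
-- written algebraically as  π Name ( σ Salary > 50000 ( EMPLOYEE ) )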

Steps in Query Processing

Thus, the first action the system must take in query processing is to translate a given query into
its internal form. This translation process is similar to the work performed by the parser of a
compiler. In generating the internal form of the query, the parser checks the syntax of the user's
query, verifies that the relation names appearing in the query are names of relations in the
database, and so on.
In the network and hierarchical data models, query optimization is left, for the most part, to the
application programmer. That choice is made because the data-manipulation-language statements of these two models are usually embedded in
a host programming language, and it is not easy to transform a network or hierarchical query into
an equivalent one without knowledge of the entire application program. In contrast, relational-
query languages are either declarative or algebraic. Declarative languages permit users to specify
what a query should generate without saying how the system should do the generating. Algebraic
languages allow for algebraic transformation of users' queries. Based on the query specification, it
is relatively easy for an optimizer to generate a variety of equivalent plans for a query, and to
choose the least expensive one.
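As a small illustration (the relation names are assumed, and Dept is taken to be an attribute of DEPARTMENT only), an optimizer may recognize that the two algebraic expressions

    σ Dept = 'Sales' ( EMPLOYEE ⋈ DEPARTMENT )
    EMPLOYEE ⋈ ( σ Dept = 'Sales' ( DEPARTMENT ) )

are equivalent, and prefer the second because the selection is applied before the join, so the intermediate result is smaller and the plan is normally cheaper to execute.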

1.16 Storage and Buffer Management

Despite its many virtues, the relational data model is a poor fit for many types of data now
common across the enterprise. In fact, object databases owe much of their existence to the
inherent limitations of the relational model as reflected in the SQL2 standard. In recent years, a
growing chorus of demands has arisen from application developers seeking more flexibility and
functionality in the data model, as well as from system administrators asking for a common
database technology managed by a common set of administrative tools. As a result, vendors and
the SQL3 standard committees are now extending the relational model to include object
capabilities.
Object/relational (O/R) database products are still quite new, and the production databases to
which they have been applied are usually modest in size--50GB or less. As O/R technology
becomes more pervasive and memory and storage costs continue to fall, however, databases
incorporating this new technology should grow to a size comparable to that of pure relational
databases. Indeed, growth in the new technology is likely if for no other reason than that much of
this new data is inherently larger than the record/field type data of traditional relational
applications.
However, while limits to the growth of individual pure relational databases have been imposed as
much by hardware evolution as by software, the limits for O/R databases will arise primarily from
software. In this article, I'll explore the implications of the architecture approaches chosen by the
principal O/R database product designers--IBM, Informix, NCR, Oracle, Sybase, and Computer
Associates--for scalability of complex queries against very large collections of O/R data. The
powerful new data type extension mechanisms in these products limit the ability of vendors to
assume as much of the burden of VLDB complexity as they did for pure relational systems.
Instead, these mechanisms impose important additional responsibilities on the designers of new
types and methods, as well as on application designers and DBAs; these responsibilities become
more crucial and complex as the size of the databases and the complexity of the queries grow.
Finally, I'll explain how parallel execution is the key to the cost effectiveness--or even in some
cases to the feasibility--of applications that exploit the new types and methods, just as with pure
relational data. In contrast to the pure relational approach, however, achieving O/R parallelism is
much more difficult.
The term "VLDB" is overused; size is but one descriptive parameter of a database, and generally
not the most important one for the issues I'll raise here. Most very large OLTP relational
databases, for example, involve virtually none of these issues because high-volume OLTP queries
are almost always short and touch few data and metadata items. In addition, these OLTP
databases are frequently created and administered as collections of semi-independent smaller
databases, partitioned by key value. In contrast, the VLDB issues discussed here arise in very
large databases accessed by queries that individually touch large amounts of data and metadata
and involve join operations, aggregations, and other operations touching large amounts of data
and metadata. Such databases and applications today are usually found in data warehousing and
data mining applications, among other places.
The VLDB environments with these attributes are characterized by many I/O operations
within a single query involving multiple complex SQL operators and frequently generating large
intermediate result sets. Individual queries regularly cross any possible key partition boundary
and involve data items widely dispersed throughout the database. For these reasons, such a
database must normally be administered globally as a single entity. As a "stake in the ground," I'll
focus on databases that are at least 250GB in size, are commonly accessed by complex queries,
require online administration for reorganization, backup and recovery, and are regularly subject
to bulk operations (such as insert, delete, and update) in single work-unit volumes of 25GB or
more. For these databases, an MPP system, or possibly a very large SMP cluster, is required.
However, many of the issues we raise will apply to smaller databases as well.

Parallel Execution of Individual Queries


In a pure relational database, success in such an environment of 250GB or more may require the
highest degree of parallelism possible in execution of each individual query. The first key to
parallel success is the structure of the SQL query language itself. In most queries, SQL acts as a
set operator language that imposes few constraints on the order of execution; that is why
relational database optimizers have so much flexibility in choosing parallel execution plans. Thus,
SQL as an application language is extremely hospitable to parallel execution.
Given this characterization, one must devise execution strategies that are optimized for
parallel execution of an individual query, rather than merely providing for parallel execution of
strategies optimized for serial execution. Note that individual DDL and DML operations--as well as
other utilities such as backup and recovery--must execute in parallel. As indicated earlier, within
the context of SQL2, RDBMS vendors did most of the work necessary to provide the required
parallelism. Furthermore, it was possible to do so in ways that were largely transparent to the
application developer and the DBA. They were greatly aided not only by the structure of SQL, but
also by the fact that the core language and data types were mostly fixed during the '80s and early
'90s, at least insofar as they affect parallel execution. Even so, it takes a vendor at least two or
three years to evolve a mature relational product that fully supports parallelism within a query for
VLDBs.
It’s worth noting that the complexity of adding parallelism for VLDB has two sources. The first is
replacing serial algorithms with parallel ones. The second source is subtler: There is a qualitative
change of mindset that must take place if one is to design competitive software for these very large
collections of data. For example, the first reaction of a designer building a database load program
is to make it use the INSERT function. This technique works for small- and medium-sized
databases but not for giant ones; it is simply too slow. Instead, one must design for multiple
streams of data to be loaded in parallel, capitalizing on the fact that multiple CPUs will be
available on the MPP or SMP cluster. If the data is coming from a single serial source, such as a
tape drive, the initial load task should focus only on getting each data item into memory as
quickly as possible. The power of the other CPUs should then be invoked in parallel through tasks
to get each datum to the right place, to index it properly, and so on.
These problems become more complex as O/R databases enter the mainstream; new and
potentially exotic data types must be accommodated, and new methods must be written and
optimized for parallel execution. Complex relationships far beyond the simple row and column
structure of the relational model can and will exist among instances of the new data types and
even among instances of current SQL data types--in a parts-explosion database, for example, or
when a "purchase order" object view is superimposed on an existing on-order database. New data
access methods to support the new operations and data types will also be required and must also
be optimized for parallel execution. Finally, expect giant ratios of column sizes in O/R tables with
the largest column entries being hundreds to thousands of times larger than the smallest
columns within a table. This change will create a need for new approaches to storage and buffer
management in servers as well as clients.

1.17 Transaction Processing

Is the following schedule of transactions T1, T2, and T3 serial? Is it serializable? Why or why not?


T1 T2 T3
-------------------------------------------------
1 start
2 read Z
3 start
4 read Y
5 Y=Y+1
6 write Y
7 start
8 read Y
9 read X
10 Z = Z + 1
11 X = X + 1
12 checkpoint
13 write X
14 write Z
15 Y=Y+1
16 write Y
17 read X
18 read Y
19 Y=Y+1
20 write Y
21 commit
22 X=X+1
23 write X
24 commit
25 read W
26 W=W+1
27 write W
28 commit

For the above sequence of transactions, show the log file entries, assuming initially Z = 1, Y = 2, X
= 1, and W = 3. What happens to the transactions at the checkpoint?
1) The following transaction increases the total sales (A) and number of sales (B) in a store
inventory database. Write the values of A and B for each statement, assuming A is initially 100
and B is initially 2. The correctness criterion is that the average sale is $50. For each statement,
determine whether the database is in a correct state if the transaction dies during that statement.
What should be done to repair the damage if the database is left in an incorrect state?
Transaction Start
1) Read A
2) A ← A + 50
3) Read B
4) B ← B + 1
5) Write B
6) Write A
7) Transaction ends
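As a sketch, the same transaction could be written in SQL as follows. The single-row table STORE_TOTALS holding columns A and B is an assumption made only for this example, and the transaction-control syntax varies between products (some use START TRANSACTION).

BEGIN TRANSACTION;
UPDATE STORE_TOTALS SET A = A + 50;   -- statements 1, 2, and 6: read and rewrite total sales
UPDATE STORE_TOTALS SET B = B + 1;    -- statements 3, 4, and 5: read and rewrite number of sales
COMMIT;

If the transaction dies after the first UPDATE has taken effect but before COMMIT, A = 150 and B = 2, the average sale is no longer $50, and the recovery manager must undo the partial change to restore a correct state.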

2) Which protocols potentially cause cascading rollbacks and why?

1.18 The Query Processor


The query optimizer module has the task of producing an execution plan, and the code generator
generates the code to execute the plan. The runtime database processor has the task of running
the query code, whether in compiled or interpreted mode, to produce the query result. If a runtime
error results, an error message is generated by the runtime database processor.

Figure: Typical Steps of a Query processor


The term optimization is actually a misnomer because in some cases the chosen execution plan is
not the optimal (best) strategy—it is just a reasonably efficient strategy for executing the query.
Finding the optimal strategy is usually too time-consuming except for the simplest of queries and
may require information on how the files are implemented and even on the contents of these files,
information that may not be fully available in the DBMS catalog. Hence, planning of an execution
strategy may be a more accurate description than query optimization.
For lower-level navigational database languages in legacy systems, such as the network DML or
the hierarchical HDML, the programmer must choose the query execution strategy while writing a
database program. If a DBMS provides only a navigational language, there is limited need or
opportunity for extensive query optimization by the DBMS; instead, the programmer is given the
capability to choose the "optimal" execution strategy. On the other hand, a high-level query
language, such as SQL for relational DBMSs (RDBMSs) or OQL for object DBMSs (ODBMSs), is
more declarative in nature because it specifies what the intended results of the query are rather
than the details of how the result should be obtained. Query optimization is thus necessary for
queries that are specified in a high-level query language.

UNIT 2

THE ENTITY-RELATIONSHIP DATA MODEL

2.1 Introduction of entity Relationship model


2.2 Elements of the E/R Model
2.3 Requirement
2.4 Relationship
2.5 Entity-Relationship Diagrams
2.6 Multiplicity of Binary E/R Relationships
2.7 Design Principles
2.8 Avoiding Redundancy
2.9 Simplicity Counts
2.10 Extended ER Models

2.1 Introduction of entity Relationship model

Object-based logical models are used in describing data at conceptual and external schemas.
They provide fairly flexible structuring capabilities and allow data constraints to be specified
explicitly. Some of the object-based models are:
1) The entity-relationship model
2) The object-oriented model
3) The semantic model
4) The functional data model

The entity-relationship model and the object-oriented model act as representatives of the class of
object-based logical models.
The Entity-Relationship (ER) model was originally proposed by Peter Chen in 1976 [Chen76] as a
way to unify the network and relational database views. Simply stated, the ER model is a
conceptual data model that views the real world as entities and relationships. A basic component
of the model is the Entity-Relationship diagram, which is used to visually represent data objects.
Since Chen wrote his paper the model has been extended, and today it is commonly used for
database design. For the database designer, the utility of the ER model is:
(A) It maps well to the relational model. The constructs used in the ER model can easily be
transformed into relational tables, as shown in the sketch below.
(B) It is simple and easy to understand with a minimum of training. Therefore, the model can be
used by the database designer to communicate the design to the end user.
(C) In addition, the model can be used as a design plan by the database developer to implement a
data model in specific database management software.
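For instance, an EMPLOYEE entity with the attributes EmpNo, Name, and Salary (the attribute names are assumed only for this sketch) maps directly to a relational table:

CREATE TABLE EMPLOYEE (
    EmpNo  INTEGER      PRIMARY KEY,
    Name   VARCHAR(30)  NOT NULL,
    Salary DECIMAL(9,2)
);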


2.2 Elements of the E/R Model

The purpose of logical data modeling is to discover, analyze, carefully define, standardize, and
normalize the data elements required by the business into the entities established in the
conceptual data model. Data elements are the logical facts the business must store in order to
know and remember what it must in order to conduct its business. The logical data modeler must
not only be concerned with correctly interpreting the business’ information requirements, forming
data elements which will satisfy those information requirements, but also with making data
elements which are sharable across organizational/process boundaries in the business, and
eliminating overlaps and conflicts in those data elements.

This section covers the techniques used to discover, analyze, standardize, and normalize data elements into an E/R
model in a rigorous, methodical manner. It also presents practical aids in this pursuit, such as
data element formation and normalization rules, data element naming standards, a standard data
element definitional pro-forma (template), and metadata structures that allow the data elements to
remain dictionary/repository independent. An extensive workshop (a continuation of the one used
in the Conceptual Data Modeling seminar) exercises the skills of data element discovery, formation,
standardization, definition, and normalization into the E/R model.
The ER model views the real world as a construct of entities and associations between entities.

Entities
Entities are the principal data objects about which information is to be collected. Entities are
usually recognizable concepts, either concrete or abstract, such as persons, places, things, or
events, which have relevance to the database. Some specific examples of entities are EMPLOYEES,
PROJECTS, INVOICES. An entity is analogous to a table in the relational model.
Entities are classified as independent or dependent (in some methodologies, the terms used are
strong and weak, respectively). An independent entity is one that does not rely on another for
identification. A dependent entity is one that relies on another for identification.
An entity occurrence (also called an instance) is an individual occurrence of an entity. An
occurrence is analogous to a row in the relational table.

Special Entity Types


Associative entities (also known as intersection entities) are entities used to associate two or more
entities in order to reconcile a many-to-many relationship.
Subtype entities are used in generalization hierarchies to represent a subset of instances of their
parent entity, called the supertype, but which have attributes or relationships that apply only to
the subset. Associative entities and generalization hierarchies are discussed in more detail below.


2.3 Relationships
A Relationship represents an association between two or more entities. Examples of relationships
would be:
• employees are assigned to projects
• projects have subtasks
• departments manage one or more projects
Relationships are classified in terms of degree, connectivity, cardinality, and existence. These
concepts will be discussed below.

Attributes
Attributes describe the entity with which they are associated. A particular instance of an attribute is
a value. For example, "Jane R. Hathaway" is one value of the attribute Name. The domain of an
attribute is the collection of all possible values an attribute can have. The domain of Name is a
character string.
Attributes can be classified as identifiers or descriptors. Identifiers, more commonly called keys,
uniquely identify an instance of an entity. A descriptor describes a non-unique characteristic of
an entity instance.

Classifying Relationships
Relationships are classified by their degree, connectivity, cardinality, direction, type, and
existence. Not all modeling methodologies use all these classifications.

Degree of a Relationship
The degree of a relationship is the number of entities associated with the relationship. The n-ary
relationship is the general form for degree n. Special cases are the binary and ternary
relationships, where the degree is 2 and 3, respectively.
Binary relationships, associations between two entities, are the most common type in the real
world. A recursive binary relationship occurs when an entity is related to itself. An example might
be "some employees are married to other employees".
A ternary relationship involves three entities and is used when a binary relationship is
inadequate. Many modeling approaches recognize only binary relationships. Ternary or n-ary
relationships are decomposed into two or more binary relationships.

Connectivity and Cardinality


The connectivity of a relationship describes the mapping of associated entity instances in the
relationship. The values of connectivity are "one" or "many". The cardinality of a relationship is the
actual number of related occurrences for each of the two entities. The basic types of connectivity
for relations are: one-to-one, one-to-many, and many-to-many.
A one-to-one (1:1) relationship is when at most one instance of an entity A is associated with one
instance of entity B. For example, employees in the company are each assigned their own office:
for each employee there exists a unique office, and for each office there exists a unique employee.
A one-to-many (1:N) relationship is when, for one instance of entity A, there are zero, one, or many
instances of entity B, but for one instance of entity B, there is only one instance of entity A. An
example of a 1:N relationship: a department has many employees; each employee is assigned to
one department.
A many-to-many (M:N) relationship, sometimes called non-specific, is when, for one instance of
entity A, there are zero, one, or many instances of entity B, and for one instance of entity B there
are zero, one, or many instances of entity A. An example: employees can be assigned to no more
than two projects at the same time, and projects must have assigned to them at least three
employees. A single employee can be assigned to many projects; conversely, a single project can
have many employees assigned to it. Here the cardinality of the relationship from employees to
projects is two, and the cardinality from project to employee is three.
Many-to-many relationships cannot be directly translated to relational tables but instead must be
transformed into two or more one-to-many relationships using associative entities.

Direction

The direction of a relationship indicates the originating entity of a binary relationship. The entity
from which a relationship originates is the parent entity; the entity where the relationship
terminates is the child entity. The direction of a relationship is determined by its connectivity. In a
one-to-one relationship the direction is from the independent entity to a dependent entity. If both
entities are independent, the direction is arbitrary. With one-to-many relationships, the entity
occurring once is the parent. The direction of many-to-many relationships is arbitrary.

Type
An identifying relationship is one in which one of the child entities is also a dependent entity. A
non-identifying relationship is one in which both entities are independent.

Existence
Existence denotes whether the existence of an entity instance is dependent upon the existence of
another, related, entity instance. The existence of an entity in a relationship is defined as either
mandatory or optional. If an instance of an entity must always occur for an entity to be included
in a relationship, then it is mandatory. An example of mandatory existence is the statement "every
project must be managed by a single department". If the instance of the entity is not required, it is
optional. An example of optional existence is the statement, "employees may be assigned to work
on projects".

Generalization Hierarchies
A generalization hierarchy is a form of abstraction that specifies that two or more entities that
share common attributes can be generalized into a higher level entity type called a supertype or
generic entity. The lower-level of entities become the subtype, or categories, to the supertype.
Subtypes are dependent entities.
Generalization occurs when two or more entities represent categories of the same real-world
object. For example, Wages_Employees and Classified_Employees represent categories of the same
entity, Employees. In this example, Employees would be the supertype; Wages_Employees and
Classified_Employees would be the subtypes.
Subtypes can be either mutually exclusive (disjoint) or overlapping (inclusive). A mutually
exclusive category is when an entity instance can be in only one category. The above example is a
mutually exclusive category. An employee can either be wages or classified but not both. An
overlapping category is when an entity instance may be in two or more subtypes. An example
would be a person who works for a university could also be a student at that same university. The
completeness constraint requires that all instances of the subtype be represented in the
supertype.
Generalization hierarchies can be nested. That is, a subtype of one hierarchy can be a supertype
of another. The level of nesting is limited only by the constraint of simplicity. Subtype entities may
be the parent entity in a relationship but not the child.

E - R Notation
There is no standard for representing data objects in ER diagrams. Each modeling methodology
uses its own notation. The original notation used by Chen is widely used in academic texts and
journals but rarely seen in either CASE tools or publications by non-academics. Today, there are
a number of notations in use; among the more common are Bachman, crow's foot, and IDEF1X.
All notational styles represent entities as rectangular boxes and relationships as lines connecting
boxes. Each style uses a special set of symbols to represent the cardinality of a connection. The
notation used in this document is from Martin. The symbols used for the basic ER constructs are:
(a) Entities are represented by labeled rectangles. The label is the name of the entity. Entity
names should be singular nouns.
(b) Relationships are represented by a solid line connecting two entities. The name of the
relationship is written above the line. Relationship names should be verbs.
(c ) Attributes, when included, are listed inside the entity rectangle. Attributes which are
identifiers are underlined. Attribute names should be singular nouns.
(d) Cardinality of many is represented by a line ending in a crow's foot. If the crow's foot is
omitted, the cardinality is one.
(e) Existence is represented by placing a circle or a perpendicular bar on the line. Mandatory
existence is shown by the bar (it looks like a 1) next to the entity for which an instance is required.
(f) Optional existence is shown by placing a circle next to the entity that is optional.
Examples of these symbols are shown in Figure 1 below:
E - R Notation

2.4 Requirements
The goals of the requirements analysis are:
• to determine the data requirements of the database in terms of primitive objects
• to classify and describe the information about these objects
• to identify and classify the relationships among the objects
• to determine the types of transactions that will be executed on the database and the
interactions between the data and the transactions
• to identify rules governing the integrity of the data
The modeler, or modelers, works with the end users of an organization to determine the data
requirements of the database. Information needed for the requirements analysis can be gathered
in several ways:
• Review of existing documents - such documents include existing forms and reports, written
guidelines, job descriptions, personal narratives, and memoranda. Paper documentation is a good
way to become familiar with the organization or activity you need to model.
• Interviews with end users - these can be a combination of individual or group meetings. Try to
keep group sessions to under five or six people. If possible, try to have everyone with the same
function in one meeting. Use a blackboard, flip charts, or overhead transparencies to record
information gathered from the interviews.
• Review of existing automated systems - if the organization already has an automated system,
review the system design specifications and documentation.
The requirements analysis is usually done at the same time as the data modeling. As information
is collected, data objects are identified and classified as entities, attributes, or relationship;
assigned names; and, defined using terms familiar to the end-users. The objects are then modeled
and analyzed using an ER diagram. The diagram can be reviewed by the modeler and the end-
users to determine its completeness and accuracy. If the model is not correct, it is modified, which
sometimes requires additional information to be collected. The review and edit cycle continues
until the model is certified as correct. Three points to keep in mind during the requirements
analysis are:

1. Talk to the end users about their data in "real-world" terms. Users do not think in terms of
entities, attributes, and relationships but about the actual people, things, and activities
they deal with daily.
2. Take the time to learn the basics about the organization and its activities that you want to
model. Having an understanding about the processes will make it easier to build the
model.
3. End-users typically think about and view data in different ways according to their function
within an organization. Therefore, it is important to interview the largest number of people
that time permits.


Developing the Basic Schema


Once entities and relationships have been identified and defined, the first draft of the entity
relationship diagram can be created. This section introduces the ER diagram by demonstrating
how to diagram binary relationships. Recursive relationships are also shown.

Binary Relationships
The figure below shows examples of how to diagram one-to-one, one-to-many, and many-to-many relationships.

Example of Binary Relationships

One-To-One
The figure shows an example of a one-to-one diagram. Reading the diagram from left to right represents the
relationship: every employee is assigned a workstation. Because every employee must have a
workstation, the symbol for mandatory existence, in this case the crossbar, is placed next to the
WORKSTATION entity. Reading from right to left, the diagram shows that not all workstations are
assigned to employees. This condition may reflect that some workstations are kept for spares or
for loans. Therefore, we use the symbol for optional existence, the circle, next to EMPLOYEE. The
cardinality and existence of a relationship must be derived from the "business rules" of the
organization. For example, if all workstations owned by an organization were assigned to
employees, then the circle would be replaced by a crossbar to indicate mandatory existence. One-
to-one relationships are rarely seen in "real-world" data models. Some practitioners advise that
most one-to-one relationships should be collapsed into a single entity or converted to a
generalization hierarchy.

One-To-Many
The figure shows an example of a one-to-many relationship between DEPARTMENT and PROJECT. In this
diagram, DEPARTMENT is considered the parent entity while PROJECT is the child. Reading from
left to right, the diagram represents the rule that departments may be responsible for many
projects. The optionality of the relationship reflects the "business rule" that not all departments in
the organization will be responsible for managing projects. Reading from right to left, the diagram
tells us that every project must be the responsibility of exactly one department.

Many-To-Many
The figure shows a many-to-many relationship between EMPLOYEE and PROJECT. An employee may be
assigned to many projects; each project must have many employees. Note that the association
between EMPLOYEE and PROJECT is optional because, at a given time, an employee may not be
assigned to a project. However, the relationship between PROJECT and EMPLOYEE is mandatory
because a project must have at least two employees assigned. Many-To-Many relationships can be
used in the initial drafting of the model but eventually must be transformed into two one-to-many
relationships. The transformation is required because many-to-many relationships cannot be
represented by the relational model. The process for resolving many-to-many relationships is
discussed in the next section.

Recursive relationships
A recursive relationship is one in which an entity is associated with itself. Figure 2 shows an example of a
recursive relationship. An employee may manage many employees and each employee is managed
by one employee.

2.5 Entity-Relationship Diagrams


Entities cannot be modeled unrelated to any other entity. Otherwise, when the model is
transformed to the relational model, there would be no way to navigate to that table. The
exception to this rule is a database with a single table.

Resolve Many-To-Many Relationships


Many-to-many relationships cannot be used in the data model because they cannot be
represented by the relational model. Therefore, many-to-many relationships must be resolved
early in the modeling process. The strategy for resolving a many-to-many relationship is to replace
the relationship with an association entity and then relate the two original entities to the
association entity. This strategy is demonstrated below. Figure (a) shows the many-to-many
relationship:

Employees may be assigned to many projects.


Each project must have assigned to it more than one employee.

In addition to the implementation problem, this relationship presents other problems. Suppose we
wanted to record information about employee assignments such as who assigned them, the start
date of the assignment, and the finish date for the assignment. Given the present relationship,
these attributes could not be represented in either EMPLOYEE or PROJECT without repeating
information. The first step is to convert the relationship "assigned to" into a new entity we will call
ASSIGNMENT. Then the original entities, EMPLOYEE and PROJECT, are related to this new entity,
preserving the cardinality and optionality of the original relationships.

Resolution of a Many-To-Many Relationship

Notice that the schema changes the semantics of the original relation to employees may be given
assignments to projects and projects must be done by more than one employee assignment. A
many to many recursive relationship is resolved in similar fashion.
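A minimal relational sketch of this resolution is shown below. The column names (EmpNo, ProjNo, AssignedBy, StartDate, FinishDate) are assumptions made for the example.

CREATE TABLE EMPLOYEE (
    EmpNo INTEGER PRIMARY KEY,
    Name  VARCHAR(30)
);

CREATE TABLE PROJECT (
    ProjNo INTEGER PRIMARY KEY,
    Title  VARCHAR(40)
);

-- associative entity resolving the many-to-many relationship
CREATE TABLE ASSIGNMENT (
    EmpNo      INTEGER REFERENCES EMPLOYEE (EmpNo),
    ProjNo     INTEGER REFERENCES PROJECT (ProjNo),
    AssignedBy VARCHAR(30),
    StartDate  DATE,
    FinishDate DATE,
    PRIMARY KEY (EmpNo, ProjNo)
);

Attributes of the assignment itself, such as who assigned it and its start and finish dates, now have a natural home in ASSIGNMENT rather than being repeated in EMPLOYEE or PROJECT.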
Transform Complex Relationships into Binary Relationships
Complex relationships are classified as ternary, an association among three entities, or n-ary, an
association among more than three, where n is the number of entities involved. For example, the
figure shows the relationship: employees can use different skills on any one or more projects, and
each project uses many employees with various skills. Complex relationships cannot be directly
implemented in the relational model, so they should be resolved early in the modeling process. The
strategy for resolving complex relationships is similar to that for resolving many-to-many
relationships: the complex relationship is replaced by an association entity, which is related
through binary relationships to each of the original entities.
Transforming a Complex Relationship


Eliminate redundant relationships


A redundant relationship is a relationship between two entities that is equivalent in meaning to
another relationship between those same two entities that may pass through an intermediate
entity. For example, Figure A shows a redundant relationship between DEPARTMENT and
WORKSTATION. This relationship provides the same information as the relationships
DEPARTMENT has EMPLOYEES and EMPLOYEEs assigned WORKSTATION. Figure B shows the
solution which is to remove the redundant relationship DEPARTMENT assigned WORKSTATIONS.
Removing A Redundant Relationship


2.6 Multiplicity of Binary E/R Relationships

There are some data models that limit relationships to being binary. The E/R model does not
require relationships to be binary. A multi-way relationship can be converted to binary
relationships as follows (see the sketch after this list):
A. Introduce a connecting entity set whose entities are tuples of the relationship set for the
multi-way relationship.
B. Introduce many-to-one relationships from the connecting entity set to each of the entity sets
that provide components of tuples in the original, multi-way relationship.
C. If an entity set plays more than one role, then it is the target of one relationship for each role.
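As a sketch of steps A and B (the SKILL entity and all column names are assumptions for the example; EMPLOYEE and PROJECT are taken to be defined as in the earlier sketch), the ternary relationship in which employees use skills on projects becomes a connecting entity set USES with a many-to-one relationship to each participating entity set:

CREATE TABLE SKILL (
    SkillCode   INTEGER PRIMARY KEY,
    Description VARCHAR(40)
);

-- connecting entity set: one row per tuple of the original multi-way relationship
CREATE TABLE USES (
    EmpNo     INTEGER REFERENCES EMPLOYEE (EmpNo),
    SkillCode INTEGER REFERENCES SKILL (SkillCode),
    ProjNo    INTEGER REFERENCES PROJECT (ProjNo),
    PRIMARY KEY (EmpNo, SkillCode, ProjNo)
);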

2.7 Design Principles


Data integrity is one of the cornerstones of the relational model. Simply stated, data integrity
means that the data values in the database are correct and consistent.
Data integrity is enforced in the relational model by entity and referential integrity rules.
Although not part of the relational model, most database software also enforces attribute integrity
through the use of domain information.

Entity Integrity
The entity integrity rule states that for every instance of an entity, the value of the primary key
must exist, be unique, and cannot be null. Without entity integrity, the primary key could not
fulfill its role of uniquely identifying each instance of an entity.

Referential Integrity
The referential integrity rule states that every foreign key value must match a primary key value
in an associated table. Referential integrity ensures that we can correctly navigate between related
entities.

Insert and Delete Rules


A foreign key creates a hierarchical relationship between two associated entities. The entity
containing the foreign key is the child, or dependent, and the table containing the primary key
from which the foreign key values are obtained is the parent.
In order to maintain referential integrity between the parent and child as data is inserted or
deleted from the database certain insert and delete rules must be considered.

Insert Rules
Insert rules commonly implemented are:
(A) Dependent. The dependent insert rule permits insertion of a child entity instance only if a
matching parent entity instance already exists.
(B) Automatic. The automatic insert rule always permits insertion of a child entity instance. If a
matching parent entity instance does not exist, it is created.
(C) Nullify. The nullify insert rule always permits the insertion of a child entity instance. If a
matching parent entity instance does not exist, the foreign key in the child is set to null.
(D) Default. The default insert rule always permits insertion of a child entity instance. If a matching
parent entity instance does not exist, the foreign key in the child is set to a previously defined value.
(E) Customized. The customized insert rule permits the insertion of a child entity instance only if
certain customized validity constraints are met.
(F) No Effect. This rule states that the insertion of a child entity instance is always permitted. No
matching parent entity instance need exist, and thus no validity checking is done.

Delete Rules
(A) Restrict. The restrict delete rule permits deletion of a parent entity instance only if there are no
matching child entity instances.
(B) Cascade. The cascade delete rule always permits deletion of a parent entity instance and
deletes all matching instances in the child entity.
(C) Nullify. The nullify delete rule always permits deletion of a parent entity instance. If any
matching child entity instances exist, the values of the foreign keys in those instances are set to
null.
(D) Default. The default delete rule always permits deletion of a parent entity instance. If any
matching child entity instances exist, the values of the foreign keys are set to a predefined default
value.
(E) Customized. The customized delete rule permits deletion of a parent entity instance only if
certain validity constraints are met.
(F) No Effect. The no effect delete rule always permits deletion of a parent entity instance. No
validity checking is done. Several of these rules map directly onto SQL referential actions, as
shown in the sketch below.
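As a minimal sketch (the table and column names are assumed for illustration), the restrict, cascade, nullify, and default delete rules correspond to the standard SQL referential actions NO ACTION (or RESTRICT), CASCADE, SET NULL, and SET DEFAULT on a foreign key:

CREATE TABLE DEPARTMENT (
    DeptNo INTEGER PRIMARY KEY,
    Name   VARCHAR(30)
);

CREATE TABLE PROJECT (
    ProjNo INTEGER PRIMARY KEY,
    Title  VARCHAR(40),
    DeptNo INTEGER DEFAULT 0,
    FOREIGN KEY (DeptNo) REFERENCES DEPARTMENT (DeptNo)
        ON DELETE SET NULL   -- the nullify rule; CASCADE, SET DEFAULT, or
                             -- NO ACTION would implement the other rules
);

The plain FOREIGN KEY clause by itself behaves like the dependent insert rule: a PROJECT row can be inserted only if a matching DEPARTMENT row already exists (or its DeptNo is null).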

Delete and Insert Guide


The choice of which rule to use is determined by the business rules of the organization. Some
basic guidelines for insert and delete rules are given below.
a. Avoid use of the nullify insert or delete rules. Generally, the parent entity in a parent-child
relationship has mandatory existence, and use of the nullify insert or delete rule would violate
this requirement.
b. Use either the automatic or the dependent insert rule for generalization hierarchies. Only these
rules preserve the rule that all instances in the subtypes must also be in the supertype.
c. Use the cascade delete rule for generalization hierarchies. This rule enforces the rule that only
instances in the supertype can appear in the subtypes.

Domains
A domain is the valid set of values for an attribute; it enforces that the values supplied by an
insert or update make sense. Each attribute in the model should be assigned domain information,
which includes the following (illustrated in the sketch after this list):
(a) Data Type - Basic data types are integer, decimal, or character. Most databases support
variants of these plus special data types for date and time.
(b) Length - The number of digits or characters in the value. For example, a value of 5 digits or 40
characters.
(c) Date Format - The format for date values, such as dd/mm/yy or yy/mm/dd.
(d) Range - The range specifies the lower and upper boundaries of the values the attribute may
legally have.
(e) Constraints - Special restrictions on allowable values. For example, the Beginning_Pay_Date
for a new employee must always be the first work day of the month of hire.
(f) Null support - Indicates whether the attribute can have null values.
(g) Default value (if any) - The value an attribute instance will have if a value is not entered.
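A minimal sketch of how such domain information might be declared in SQL (the table name, column names, and specific ranges are assumptions made for the example):

CREATE TABLE EMPLOYEE_PAY (
    EmpNo          INTEGER NOT NULL,                       -- null support: nulls not allowed
    Pay_Date       DATE,                                   -- data type with a date format
    Monthly_Salary DECIMAL(7,2)
        DEFAULT 0                                          -- default value
        CHECK (Monthly_Salary BETWEEN 0 AND 99999.99),     -- range
    Grade          CHAR(1)
        CHECK (Grade IN ('A', 'B', 'C'))                   -- constraint on allowable values
);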

Primary Key Domains


The values of primary keys must be unique and nulls are not allowed.

Foreign Key Domains


The data type, length, and format of a foreign key must be the same as those of the corresponding
primary key. The uniqueness property must be consistent with the relationship type. A one-to-one relationship
implies a unique foreign key; a one-to-many relationship implies a non-unique foreign key.

2.8 Avoiding Redundancy

If you are dismissed because of redundancy, this usually means that your employer has needed to
reduce his or her workforce. This may be either because the place where you work is closing
down, or because there is no longer the need (or no longer expected to be the need) to carry out the
particular kind of work that you do. Normally your job must have disappeared. It is not a
genuine redundancy if your employer immediately takes on a direct replacement for you. It does
not matter, however, if your employer is recruiting more workers for work of a different kind, or in
another location (unless you were required by contract to move to the new location). The definition
of redundancy therefore covers three basic situations:
• Where the employer ceases to carry on business (other than involving a transfer of an
undertaking) on a permanent or temporary basis;
• Where the employer ceases business in the place where the employee is employed;
• Where the employer's business no longer requires any employees or as many employees to
do a particular kind of work (whether generally or in the place where the employee was
employed).
If you are dismissed because of a need to reduce the work force, and one of the remaining
employees moves into your job, you will still qualify for a redundancy payment so long as no
vacancy exists in the area (type of work and location) where you worked.

2.9 Simplicity Counts


Simplicity and speed make Naïve-Bayes an ideal exploratory tool. The technique is based on a
simple concept: conditional probabilities derived from observed frequencies in the training data.
For example, consider trying to predict customer turnover where it is known that 75 percent of
the customers with monthly billings between $400 and $500 have left, and 68 percent of the
customers who have made more than four calls to customer service have left. Presented with a
customer who has $480 in average monthly billings, and who has made five calls to customer
service, a Naïve-Bayes model will predict that this customer has a high likelihood of leaving. For
simplicity, assume the online component consists of several online transactions and several ad-hoc
queries. Assume the batch component has an update job and a report job. Assume that the
application system being benchmarked will be a complete re-write and re-design

2.10 Extended ER Models

The Entity-Relationship (ER) Model is enjoying a remarkable popularity in industry. It has been
widely recognized that while the temporal aspects of data play a prominent role in database
applications, these aspects are difficult to capture using the ER model. Some industrial users
have responded to this deficiency by ignoring all temporal aspects in their ER diagrams and
simply supplementing the diagrams with phrases akin to ``full temporal support.'' The research
community has responded by developing about a dozen proposals for temporally extended ER
models. These existing temporally extended ER models were accompanied by few or no specific
criteria for designing them, making it difficult to appreciate their properties and to conduct an
insightful comparison of the models. A set of design criteria has therefore been defined for
evaluating temporally extended ER models; these criteria may be used for evaluating and
comparing the existing temporally extended ER models.

UNIT 3

REPRESENTING DATA ELEMENTS

3.1 Data Elements and Fields


3.2 Representing Relational Database Elements
3.3 Records
3.4 Representing Block and Record Addresses
3.5 Client-Server Systems
3.6 Logical and Structured Addresses
3.7 Record Modifications
3.8 Index Structures
3.9 Indexes on Sequential Files
3.10 Secondary Indexes
3.11 B-Trees
3.12 Hash Tables

3.1 Data Elements and Fields

In order to begin constructing the basic model, the modeler must analyze the information
gathered during the requirements analysis for the purpose of:
• classifying data objects as either entities or attributes
• identifying and defining relationships between entities
• naming and defining identified entities, attributes, and relationships
• documenting this information in the data document
To accomplish these goals the modeler must analyze narratives from users, notes from meeting,
policy and procedure documents, and, if lucky, design documents from the current information
system. Although it is easy to define the basic constructs of the ER model, it is not an easy task to
distinguish their roles in building the data model. What makes an object an entity or attribute?
For example, given the statement "employees work on projects", should employees be classified as
an entity or an attribute? Very often, the correct answer depends upon the requirements of the
database. In some cases, employee would be an entity; in others it would be an attribute.

While the definitions of the constructs in the ER Model are simple, the model does not address the
fundamental issue of how to identify them. Some commonly given guidelines are:
• entities contain descriptive information
• attributes either identify or describe entities
• relationships are associations between entities
These guidelines are discussed in more detail below.
• Entities
• Attributes
• Validating Attributes
• Derived Attributes and Code Values
• Relationships
• Naming Data Objects
• Object Definition
• Recording Information in Design Document

Entities
There are various definitions of an entity:
"Any distinguishable person, place, thing, event, or concept, about which information is kept"
"A thing which can be distinctly identified"
"Any distinguishable object that is to be represented in a database"

"...anything about which we store information (e.g. supplier, machine tool, employee, utility pole,
airline seat, etc.). For each entity type, certain attributes are stored".

These definitions contain common themes about entities:


Entities should not be used to distinguish between time periods. For example, the entities 1st
Quarter Profits, 2nd Quarter Profits, etc. should be collapsed into a single entity called Profits; an
attribute specifying the time period would be used to categorize by time.
Not everything the users want to collect information about will be an entity. A complex concept
may require more than one entity to represent it. Other "things" users think important may not
be entities.

Attributes
Attributes are data objects that either identify or describe entities. Attributes that identify entities
are called key attributes. Attributes that describe an entity are called non-key attributes. Key
attributes will be discussed in detail in a later section.
The process for identifying attributes is similar except now you want to look for and extract those
names that appear to be descriptive noun phrases.

Validating Attributes
Attribute values should be atomic, that is, present a single fact. Having disaggregated data allows
simpler programming, greater reusability of data, and easier implementation of changes.
Normalization also depends upon the "single fact" rule being followed. Common types of violations
include:
simple aggregation - a common example is Person Name which concatenates first name, middle
initial, and last name. Another is Address which concatenates, street address, city, and zip code.
When dealing with such attributes, you need to find out if there are good reasons for decomposing
them. For example, do the end-users want to use the person's first name in a form letter? Do they
want to sort by zip code?
complex codes - these are attributes whose values are codes composed of concatenated pieces of
information. An example is the code attached to automobiles and trucks. The code represents over
10 different pieces of information about the vehicle. Unless part of an industry standard, these
codes have no meaning to the end user. They are very difficult to process and update.
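As a small sketch (the table and column names are assumed for the example), storing the decomposed, atomic attributes rather than a concatenated Person Name or Address makes it straightforward to sort by zip code or to use the first name alone in a form letter:

CREATE TABLE PERSON (
    PersonID      INTEGER PRIMARY KEY,
    FirstName     VARCHAR(20),
    MiddleInitial CHAR(1),
    LastName      VARCHAR(30),
    StreetAddress VARCHAR(40),
    City          VARCHAR(25),
    ZipCode       CHAR(10)
);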

Derived Attributes and Code Values


Two areas where data modeling experts disagree are whether derived attributes and attributes
whose values are codes should be permitted in the data model. Derived attributes are those
created by a formula or by a summary operation on other attributes. Arguments against including
derived data are based on the premise that derived data should not be stored in a database and
therefore should not be included in the data model. The arguments in favor are:
• derived data is often important to both managers and users and therefore should be included in
the data model;
• it is just as important, perhaps more so, to document derived attributes just as you would other
attributes;
• including derived attributes in the data model does not imply how they will be implemented.
A coded value uses one or more letters or numbers to represent a fact. For
example, the value Gender might use the letters "M" and "F" as values rather than "Male" and
"Female". Those who are against this practice cite that codes have no intuitive meaning to the
end-users and add complexity to processing data. Those in favor argue that many organizations
have a long history of using coded attributes, that codes save space, and improve flexibility in that
values can be easily added or modified by means of look-up tables.

Relationships
Relationships are associations between entities. Typically, a relationship is indicated by a verb
connecting two or more entities. For example:
employees are assigned to projects. As relationships are identified they should be classified in
terms of cardinality, optionality, direction, and dependence. As a result of defining the
relationships, some relationships may be dropped and new relationships added. Cardinality
quantifies the relationships between entities by measuring how many instances of one entity are
related to a single instance of another. To determine the cardinality, assume the existence of an
instance of one of the entities. Then determine how many specific instances of the second entity
could be related to the first. Repeat this analysis reversing the entities. For example:
employees may be assigned to no more than three projects at a time; every project has at least
two employees assigned to it.
If a relationship can have a cardinality of zero, it is an optional relationship. If it must have a
cardinality of at least one, the relationship is mandatory. Optional relationships are typically
indicated by the conditional tense. For example: an employee may be assigned to a project.
Mandatory relationships, on the other hand, are indicated by words such as must have. For
example: a student must register for at least three courses each semester.
In the case of the specific relationship form (1:1 and 1:M), there is always a parent entity and a
child entity. In one-to-many relationships, the parent is always the entity with the cardinality of
one. In one-to-one relationships, the choice of the parent entity must be made in the context of
the business being modeled. If a decision cannot be made, the choice is arbitrary.

Naming Data Objects


The names should have the following properties:
• unique
• have meaning to the end-user
• contain the minimum number of words needed to uniquely and accurately describe the
object
Some authors advise against using abbreviations or acronyms because they might lead to
confusion about what they mean. Others believe using abbreviations or acronyms is acceptable
provided that they are universally used and understood within the organization.
You should also take care to identify and resolve synonyms for entities and attributes. This can
happen in large projects where different departments use different terms for the same thing.

Recording Information in Design Document


The design document records detailed information about each object used in the model. As you
name, define, and describe objects, this information should be placed in this document. If you are
not using an automated design tool, the document can be done on paper or with a word
processor. There is no standard for the organization of this document but the document should
include information about names, definitions, and, for attributes, domains.
Two documents used in the IDEF1X method of modeling are useful for keeping track of objects.
These are the ENTITY-ENTITY matrix and the ENTITY-ATTRIBUTE matrix.
The ENTITY-ENTITY matrix is a two-dimensional array for indicating relationships between
entities. The names of all identified entities are listed along both axes. As relationships are first
identified, an "X" is placed at the point where the row of one entity and the column of the other
intersect, to indicate a possible relationship between the entities involved. As the relationship is further classified, the "X"
is replaced with the notation indicating cardinality.
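A small illustrative ENTITY-ENTITY matrix, using entities and relationships that appear earlier in this unit (the cardinalities shown are assumptions for the example, and the matrix is symmetric about its diagonal):

               EMPLOYEE   PROJECT   DEPARTMENT
EMPLOYEE          -         M:N        N:1
PROJECT          M:N         -         N:1
DEPARTMENT       1:N        1:N         -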

The ENTITY-ATTRIBUTE matrix is used to indicate the assignment of attributes to entities. It is
similar in form to the ENTITY-ENTITY matrix except attribute names are listed on the rows.

3.2 Representing Relational Database Elements

The relational model was formally introduced by E. F. Codd in 1970 and has evolved since then through a
series of writings. The model provides a simple, yet rigorously defined, concept of how users
perceive data. The relational model represents data in the form of two-dimension tables. Each
table represents some real-world person, place, thing, or event about which information is
collected. A relational database is a collection of two-dimensional tables. The organization of data
into relational tables is known as the logical view of the database. That is, the form in which a
relational database presents data to the user and the programmer. The way the database software
physically stores the data on a computer disk system is called the internal view. The internal
view differs from product to product and does not concern us here.
A basic understanding of the relational model is necessary to effectively use relational database
software such as Oracle, Microsoft SQL Server, or even personal database systems such as Access
or Fox, which are based on the relational model. This document is an informal introduction to
relational concepts, especially as they relate to relational database design issues. It is not a
complete description of relational theory.

3.3 Records

Data is usually stored in the form of records. Each record consists of a collection of related data
values or items where each value is formed of one or more bytes and corresponds to a particular
field of the record. Records usually describe entities and their attributes. For example, an
EMPLOYEE and record represents an employee entity, and each field value in the record specifies
some attribute of that employee, such as NAME, BIRTHDATE, SALARY, or SUPERVISOR. A
collection of field names and their corresponding data types constitutes a record type or record
format definition. A data type, associated with each field, specifies the type of values a field can
take.

The data type of a field is usually one of the standard data types used in programming. These
include numeric (integer, long integer, or floating point), string of characters (fixed-length or
varying), Boolean (having 0 and 1 or TRUE and FALSE values only), and sometimes specially
coded date and time data types. The number of bytes required for each data type is fixed for a
given computer system. An integer may require 4 bytes, a long integer 8 bytes, a real number 4
bytes, a Boolean 1 byte, a date 10 bytes (assuming a format of YYYY-MM-DD), and a fixed-length
string of k characters k bytes. Variable-length strings may require as many bytes as there are
characters in each field value. For example, an EMPLOYEE record type may be defined, using C
programming language notation, as the following structure:

struct employee {
    char name[30];        /* employee name, fixed length       */
    char ssn[9];          /* social security number            */
    int  salary;
    int  jobcode;
    char department[20];  /* department name, fixed length     */
};
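
The byte counts given above are indicative only; a quick, hypothetical C sketch like the following can be used to check the actual field and record sizes on a given compiler and platform (the record size may include alignment padding added by the compiler):

#include <stdio.h>

struct employee {
    char name[30];
    char ssn[9];
    int  salary;
    int  jobcode;
    char department[20];
};

int main(void) {
    /* Field sizes are fixed at compile time; sizeof of the whole struct
       may be larger than the sum of the fields because of padding. */
    printf("name field:    %zu bytes\n", sizeof(((struct employee *)0)->name));
    printf("salary field:  %zu bytes\n", sizeof(int));
    printf("whole record:  %zu bytes\n", sizeof(struct employee));
    return 0;
}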

In recent database applications, the need may arise for storing data items that consist of large
unstructured objects, which represent images, digitized video or audio streams, or free text. These
are referred to as BLOBs (Binary Large Objects). A BLOB data item is typically stored separately
from its record in a pool of disk blocks, and a pointer to the BLOB is included in the
record.

3.4 Representing Block and Record Addresses

Fields (attributes) are represented by fixed- or variable-length sequences of bytes. Fields are put
together in fixed- or variable-length collections called "records" (tuples, or objects). A collection
of records that forms a relation or the extent of a class is stored as a collection of blocks,
called a file.
A tuple is stored in a record. The record size is the sum of the sizes of the fields in the record.
The schema is stored together with the records (or a pointer to it). It contains:
• the types of the fields
• the fields within the record
• the size of the record
• timestamps (last read, last updated)
The schema is used to access specific fields in the record.
Records are stored in blocks. Typically a block contains only one kind of record.
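
As a rough sketch of how a record might carry a pointer to its schema, the following C fragment stores field names, offsets and timestamps in a hypothetical RecordSchema structure and uses them to locate a field inside a raw record buffer; all names here (RecordSchema, get_field, the offsets chosen) are invented for illustration:

#include <stdio.h>
#include <string.h>
#include <time.h>

#define FIELD_COUNT 3

/* The schema is stored once and shared by all records of this type. */
struct RecordSchema {
    const char *field_names[FIELD_COUNT]; /* names/types of the fields   */
    size_t      offsets[FIELD_COUNT];     /* where each field starts     */
    size_t      record_size;              /* total size of the record    */
    time_t      last_read, last_updated;  /* timestamps                  */
};

/* A record is just a block of bytes; fields are found via the schema. */
static char *get_field(char *record, const struct RecordSchema *s, int i) {
    return record + s->offsets[i];
}

int main(void) {
    struct RecordSchema s = {
        { "name", "ssn", "salary" },
        { 0, 30, 40 },            /* name: 30 bytes, ssn: 10 bytes, salary: int */
        44, 0, 0
    };
    char record[44] = { 0 };

    strcpy(get_field(record, &s, 0), "John Brown");   /* write the name field */
    printf("name field = %s\n", get_field(record, &s, 0));
    return 0;
}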

3.5 Client-Server Systems

The principle is that the user has a client program and asks for information (or data) from the
server program. The server searches for the data and sends it back to the client.[1] Putting it
another way, the user is the client: he uses a client program to start a client process and sends a
message to a server program to perform a task or service. In fact a client-server system is a
special case of a co-operative computer system. All such systems are characterised by the use of
multiple processes that work together to form the system solution. (There are two types of
co-operative systems: client-server systems and peer-to-peer systems.) A client-server system
consists of three major components: a server with a relational database, a client with a user
interface, and a network hardware connection in between. Client/server is an open system with a
number of advantages such as interoperability, scalability, adaptability, affordability, data
integrity, accessibility, performance and security.

WHAT DO THE CLIENT PROGRAMS DO?

They usually deal with:
• managing the application's user-interface part
• confirming the data given by the user
• sending out requests to server programs
• managing local resources, like the monitor, keyboard and peripherals.
The client-based process is the application that the user interacts with. It contains solution-
specific logic and provides the interface between the user and the rest of the application system.
In this sense the graphical user interface (GUI) is one characteristic of a client system. It uses
tools, some of which are:
• Administration Tool: for specifying the relevant server information, creation of users, roles and
privileges, definition of file formats, document type definitions (DTD) and document status
information.
• Template Editor: for creating and modifying templates of documents.
• Document Editor: for editing instances of documents and for accessing component information.
• Document Browser: for retrieval of documents from a Document Server.
• Access Tool: provides the basic methods for information access.

WHAT DO THE SERVERS DO?

Their purpose is to fulfil clients' requests. In general they:

• receive requests
• execute database retrievals and updates
• manage data integrity
• send the results back to the client
• act as a software engine that manages shared resources like databases, printers,
communication links, etc.
When the server does this, it can use simple or complex logic. It can supply the information using
only its own resources, or it can employ some other machine on the network in a master-slave
fashion. [2]
In other words, there can be:
• single client, single server
• multiple clients, single server
• single client, multiple servers
• multiple clients, multiple servers [6]
In a client-server system, we can talk about the specialization of some components for particular
tasks. Specialization can provide very fast computation servers, high-throughput database
servers, resilient transaction servers, or servers for some other purpose; what is important is
that they are optimised for that task. It is not optimal to try to perform all of the tasks
together. Actually it is this specialization, and therefore optimization, that gives power to
client-server systems. [10]
These different server types for different tasks can be collected in two categories:
1. Simpler ones:
• disk server
• file server: the client can only request files, and the files are sent as they are; this needs
large bandwidth and slows down the network.

2. Advanced ones:
• database server
• transaction server
• application server

What Makes a Design a Client/Server Design?

There are many answers about what differentiates client/server architecture from some other
design. There is no single correct answer, but generally, an accepted definition describes a client

application as the user interface to an intelligent database engine—the server. Well-designed
client applications do not hard-code details of how or where data is physically stored, fetched, and
managed, nor do they perform low-level data manipulation. Instead, they communicate their data
needs at a more abstract level, the server performs the bulk of the processing, and the result set
isn't raw data but rather an intelligent answer.

Generally, a client/server application has these characteristics:

• Centralized Intelligent Servers


That is, the server in a client/server system is more than just a file repository where multiple
users share file sectors over a network. Client/server servers are intelligent, because they
carry out commands in the form of Structured Query Language (SQL) questions and return
answers in the form of result sets.
• Network Connectivity
Generally, a server is connected to its clients by way of a network. This could be a LAN, WAN,
or the Internet. However, most correctly designed client/server designs do not consume
significant resources on a wire. In general, short SQL queries are sent from a client and a
server returns brief sets of specifically focused data.
• Operationally Challenged Workstations
A client computer in a client/server system need not run a particularly complex application. It
must simply be able to display results from a query and capture criteria from a user for
subsequent requests. While a client computer is often more capable than a dumb terminal, it
need not house a particularly powerful processor—unless it must support other, more
hardware-hungry applications.
• Applications Leverage Server Intelligence
Client/server applications are designed with server intelligence in mind. They assume that the
server will perform the heavy lifting when it comes to physical data manipulation. Then, all
that is required of the client application is to ask an intelligent question—expecting an
intelligent answer from the server. The client application's job is to display the results
intelligently.
• Shared Data, Procedures, Support, Hardware
A client/server system depends on the behavior of the clients—all of them—to properly share
the server's—and the clients'— resources. This includes data pages, stored procedures, and
RAM as well as network capacity.
• System Resource Use Minimized
All client/server systems are designed to minimize the load on the overall system. This
permits an application to be scaled more easily—or at all.
• Maximize Design Tradeoffs
A well-designed client/server application should be designed to minimize overall systems load,
or designed for maximum throughput, or designed to avoid really bad response times—or
maybe a good design can balance all of these.

3.6 Logical and Structured Addresses

At the logical level, a structured document is made up of a number of different parts. Some parts
are optional, others are compulsory. Many of these document structures have a required order--
they cannot be inserted at arbitrary points in the document.
For example, a document must have a title and it must be the first element in the document. The
programming example is very good because it is very simple. At the logical level, sections are used
to break up the document into parts and sub-parts that help the reader to follow the structure of
the document and to navigate through it.

Sections are made up of a section title, followed by one or more text blocks and then, optionally,
one or more sub-sections. Sections are allowed in either a Chapter document (within chapters) or
a Simple document, and in Appendices.
Sections can contain other sections, i.e. they may be nested. Only 4 levels of section nesting are
recommended. Note that once you start entering nested (sub-)sections, you cannot enter any text
blocks after the sub-sections, i.e. all of the section's text blocks must come before any sub-
sections.
Sections are automatically numbered. A level 1 section has one number, a level two section is
numbered N.n, a level 3 section N.n.n and so on. If the section is contained within a chapter or
appendix, the section number is prefixed with the chapter number or appendix number.

3.7 Record Modifications

The domain record modification/transfer process is the complete responsibility of the domain
owner. If you need assistance modifying your domain record, please contact your domain registrar
for technical support. ColossalHost.com will not provide excessive support resources assisting
customers in the domain record modification process. It is important to understand that
ColossalHost.com has no more power to make modifications to a Subscriber's domain record than
a complete stranger would. The domain name owner is completely responsible for the information
(including the name servers) that is contained within the domain record.

3.8 Index Structures

Index structures in object-oriented database management systems should support selections not
only with respect to physical object attributes, but also with respect to derived attributes. A
simple example arises if we assume the object types Company, Division, and Employee, with the
relationship has-division from Company to Division and the relationship employs from Division to
Employee. In this case the index structure should support queries for companies that specify the
number of employees of the company.

3.9 Indexes on Sequential Files

An index is a data structure for locating records with a given search key efficiently. It also
facilitates a full scan of a relation.
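
As an illustrative sketch (not taken from the text), the fragment below models a sparse index on a sequential file: one entry per data block holding the key of the block's first (anchor) record, and a binary search over those entries to find the block that may contain a given key. The names IndexEntry and find_block are invented for this example:

#include <stdio.h>

/* One index entry per data block: the key of the block's first record. */
struct IndexEntry {
    int  first_key;   /* anchor key of the block    */
    long block_no;    /* address of the data block  */
};

/* Binary search: return the block whose key range may contain 'key'. */
static long find_block(const struct IndexEntry *idx, int n, int key) {
    int lo = 0, hi = n - 1, ans = 0;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (idx[mid].first_key <= key) { ans = mid; lo = mid + 1; }
        else                           { hi = mid - 1; }
    }
    return idx[ans].block_no;
}

int main(void) {
    struct IndexEntry idx[] = { {10, 0}, {40, 1}, {75, 2}, {120, 3} };
    printf("key 80 is in block %ld\n", find_block(idx, 4, 80));  /* block 2 */
    return 0;
}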


3.10 Secondary Indexes

Secondary indexes provide access to subsets of records. Both databases and tables provide
automatic secondary indexes. All secondary indexes are held on a separate database file (.sr6).
This is created when the first index is created and deleted if the last index is deleted. Each
secondary index is physically very similar to a standard database. It contains index blocks and
data blocks. The sizes of these blocks are calculated in a similar way to the block size calculations
for standard database blocks to ensure reasonably efficient processing given the size of the
secondary index key and the maximum number of records of that type. Each index potentially has
different block sizes. Each record in the data block in a secondary index has the secondary key as
the key and contains the standard database key as the data. Thus the size of these data blocks is
affected by the size of both keys.

All these index files (i.e., primary, secondary, and clustering indexes) are ordered files, and
have two fields of fixed length. One field contains data (in which its value is the same as a field
from data file) and the other field is a pointer to the data file, but

• In primary indexes, the data item of the index has the value of the primary key of
the first record (the anchor record) of the block in which the pointer points to.
• In secondary indexes, the data item of the index has a value of a secondary key and
the pointer points to a block, in which a record with such secondary keys is stored
in.
• In clustering indexes, the data item on the index has a value of a non-key field, and
the pointer points to the block in which the first record with such non-key fields is
stored in.

In primary index files, for every block in a data file, one record exists in the primary index
file. The number of records in the index file is equal to the number of blocks in the data
file. Hence, the primary indexes are non-dense.

In secondary index files, for every record in the data file, one record exists in the secondary
index file. The number of records in the secondary file is equal to the number of records in
the data file. Hence, the secondary indexes are dense.

In clustering index files, for every distinct clustering field, one record exists in the
clustering index file. The number of records in the clustering index is equal to the distinct
numbers for the clustering field in the data file. Hence, the clustering indexes are non-
dense.
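
To make the dense secondary index concrete, here is a small hypothetical C sketch: each index entry pairs a secondary-key value (a city) with the primary key (S#) of one data record, so the same city can appear in many entries, and a lookup scans the run of matching entries. The names SecEntry and the sample values are invented for this example:

#include <stdio.h>
#include <string.h>

/* Dense secondary index: one entry per data record.
   The entry's key is the secondary key (city); its data is the
   primary key of the corresponding record in the main file. */
struct SecEntry {
    char city[16];     /* secondary key      */
    char supplier[4];  /* primary key (S#)   */
};

int main(void) {
    /* Index entries are ordered on the secondary key; duplicates allowed. */
    struct SecEntry idx[] = {
        { "Madras", "S1" }, { "Madras", "S4" },
        { "Mumbai", "S2" }, { "Mumbai", "S3" }
    };
    const char *wanted = "Mumbai";
    for (int i = 0; i < 4; i++)                /* scan the run of matches */
        if (strcmp(idx[i].city, wanted) == 0)
            printf("%s -> %s\n", idx[i].city, idx[i].supplier);
    return 0;
}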

The Customers Table holds information on customers, such as their customer number, name
and address. Run the Database Desktop program from the Start Menu or select Tools-> Database
Desktop in Delphi. Open the Customers table copied in the previous step. By default the data in
the table is displayed as read only. Familiarise yourself with the data. Change to edit mode (Table-
>Edit Data) and add a new record. View the structure of the table (Table->Info Structure). Select
Table->Restructure to restructure the table. Add a secondary index to the table, by selecting
Secondary Indexes from the Table properties combo box. The secondary index is composed of
LastName and FirstName, in that order. Call the index CustomersNameIndex. The index will be
used to access the Customers table on customer name.

3.11 B-Trees

A B-tree is a specialized multiway tree designed especially for use on disk. In a B-tree each node
may contain a large number of keys. The number of sub trees of each node, then, may also be
large. A B-tree is designed to branch out in this large number of directions and to contain a lot of
keys in each node so that the height of the tree is relatively small. This means that only a small
number of nodes must be read from disk to retrieve an item. The goal is to get fast access to the
data, and with disk drives this means reading a very small number of records. Note that a large
node size (with lots of keys in the node) also fits with the fact that with a disk drive one can
usually read a fair amount of data at once.

A multiway tree of order m is an ordered tree where each node has at most m children. For each
node, if k is the actual number of children in the node, then k - 1 is the number of keys in the
node. If the keys and subtrees are arranged in the fashion of a search tree, then this is called a
multiway search tree of order m. For example, the following is a multiway search tree of order 4.
Note that the first row in each node shows the keys, while the second row shows the pointers to
the child nodes. Of course, in any useful application there would be a record of data associated
with each key, so that the first row in each node might be an array of records where each record
contains a key and its associated data. Another approach would be to have the first row of each
node contain an array of records where each record contains a key and a record number for the
associated data record, which is found in another file. This last method is often used when the
data records are large. The example software will use the first method.

What does it mean to say that the keys and subtrees are "arranged in the fashion of a search
tree"? Suppose that we define our nodes as follows:

typedef struct
{
int Count; // number of keys stored in the current node
ItemType Key[3]; // array to hold the 3 keys
long Branch[4]; // array of fake pointers (record numbers)
} NodeType;

Then a multiway search tree of order 4 has to fulfill the following conditions related to the
ordering of the keys:

• The keys in each node are in ascending order.


• At every given node (call it Node) the following is true:
o The subtree starting at record Node.Branch[0] has only keys that are less than
Node.Key[0].
o The subtree starting at record Node.Branch[1] has only keys that are greater than
Node.Key[0] and at the same time less than Node.Key[1].
o The subtree starting at record Node.Branch[2] has only keys that are greater than
Node.Key[1] and at the same time less than Node.Key[2].
o The subtree starting at record Node.Branch[3] has only keys that are greater than
Node.Key[2].
• Note that if less than the full number of keys are in the Node, these 4 conditions are
truncated so that they speak of the appropriate number of keys and branches.

This generalizes in the obvious way to multiway search trees with other orders.
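
The following sketch (illustrative only) applies these ordering rules to search a small multiway tree of order 4. It assumes ItemType is a long, uses -1 in Branch to mean "no child", and holds the nodes in an in-memory array that stands in for the disk file of records:

#include <stdio.h>

typedef long ItemType;            /* assumed key type for this sketch */

typedef struct {
    int      Count;               /* number of keys stored in the node     */
    ItemType Key[3];              /* the keys, in ascending order          */
    long     Branch[4];           /* record numbers of children, -1 = none */
} NodeType;

/* Nodes[0] is the root; the array stands in for the disk file. */
static NodeType Nodes[] = {
    { 2, { 50, 100, 0 }, {  1,  2,  3, -1 } },   /* record 0 (root)       */
    { 2, { 10,  30, 0 }, { -1, -1, -1, -1 } },   /* record 1: keys < 50   */
    { 1, { 70,   0, 0 }, { -1, -1, -1, -1 } },   /* record 2: 50..100     */
    { 1, { 120,  0, 0 }, { -1, -1, -1, -1 } }    /* record 3: keys > 100  */
};

/* Return 1 if 'key' is found, starting the search at record 'rec'. */
static int search(long rec, ItemType key) {
    if (rec < 0) return 0;                        /* ran off the tree      */
    NodeType *n = &Nodes[rec];
    int i = 0;
    while (i < n->Count && key > n->Key[i]) i++;  /* find position in node */
    if (i < n->Count && key == n->Key[i]) return 1;
    return search(n->Branch[i], key);             /* descend into subtree  */
}

int main(void) {
    printf("70 found: %d\n", search(0, 70));      /* prints 1 */
    printf("80 found: %d\n", search(0, 80));      /* prints 0 */
    return 0;
}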

A B-tree of order m is a multiway search tree of order m such that:

• All leaves are on the bottom level.


• All internal nodes (except the root node) have at least ceil(m / 2) (nonempty) children.
• The root node can have as few as 2 children if it is an internal node, and can obviously
have no children if the root node is a leaf (that is, the whole tree consists only of the root
node).
• Each leaf node must contain at least ceil(m / 2) - 1 keys.

Note that ceil(x) is the so-called ceiling function. Its value is the smallest integer that is greater
than or equal to x. Thus ceil(3) = 3, ceil(3.35) = 4, ceil(1.98) = 2, ceil(5.01) = 6, ceil(7) = 7, etc.

A B-tree is a fairly well-balanced tree by virtue of the fact that all leaf nodes must be at the
bottom. Condition (2) tries to keep the tree fairly bushy by insisting that each node have at least
half the maximum number of children. This causes the tree to "fan out" so that the path from root
to leaf is very short even in a tree that contains a lot of data.

Example B-Tree

The following is an example of a B-tree of order 5. This means that (other than the root node) all
internal nodes have at least ceil(5 / 2) = ceil(2.5) = 3 children (and hence at least 2 keys). Of
course, the maximum number of children that a node can have is 5 (so that 4 is the maximum
number of keys). According to condition 4, each leaf node must contain at least 2 keys. In practice
B-trees usually have orders a lot bigger than 5.


3.12 Hash Tables

Linked lists are handy ways of tying data structures together but navigating linked lists can be
inefficient. If you were searching for a particular element, you might easily have to look at the
whole list before you find the one that you need. Linux uses another technique, hashing to get
around this restriction. A hash table is an array or vector of pointers. An array, or vector, is
simply a set of things coming one after another in memory. A bookshelf could be said to be an
array of books. Arrays are accessed by an index, the index is an offset into the array. Taking the
bookshelf analogy a little further, you could describe each book by its position on the shelf; you
might ask for the 5th book.
A hash table is an array of pointers to data structures and its index is derived from information in
those data structures. If you had data structures describing the population of a village then you
could use a person's age as an index. To find a particular person's data you could use their age as
an index into the population hash table and then follow the pointer to the data structure
containing the person's details. Unfortunately many people in the village are likely to have the
same age and so the hash table pointer becomes a pointer to a chain or list of data structures
each describing people of the same age. However, searching these shorter chains is still faster
than searching all of the data structures.
As a hash table speeds up access to commonly used data structures, Linux often uses hash
tables to implement caches. Caches hold information that needs to be accessed quickly and
are usually a subset of the full set of information available. Data structures are put into a cache
and kept there because the kernel often accesses them. There is a drawback to caches in that
they are more complex to use and maintain than simple linked lists or hash tables. If the data
structure can be found in the cache (this is known as a cache hit), then all is well and good. If it
cannot then all of the relevant data structures must be searched and, if the data structure exists
at all, it must be added into the cache.
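
The village example above can be sketched in C as follows (illustrative only; the structure and function names are invented). The table is an array of pointers, the index is derived from the person's age, and people with the same age are chained together:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 16

/* One villager; people of the same age are chained together. */
struct Person {
    char name[32];
    int  age;
    struct Person *next;      /* next entry in the same hash chain */
};

static struct Person *table[TABLE_SIZE];   /* array of pointers */

static unsigned hash_age(int age) { return (unsigned)age % TABLE_SIZE; }

static void insert(const char *name, int age) {
    struct Person *p = malloc(sizeof *p);
    strcpy(p->name, name);
    p->age = age;
    p->next = table[hash_age(age)];        /* push onto the chain */
    table[hash_age(age)] = p;
}

static struct Person *lookup(const char *name, int age) {
    for (struct Person *p = table[hash_age(age)]; p; p = p->next)
        if (p->age == age && strcmp(p->name, name) == 0)
            return p;                      /* found in the chain */
    return NULL;                           /* not present        */
}

int main(void) {
    insert("Asha", 34);
    insert("Ravi", 34);                    /* same age: same chain */
    insert("Meena", 70);
    printf("Ravi found: %s\n", lookup("Ravi", 34) ? "yes" : "no");
    return 0;
}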

UNIT 4

THE RELATIONAL DATA MODEL

4.1 Basics of the Relational Model


4.2 Relation Instances
4.3 Functional Dependencies
4.4 Rules About Functional Dependencies
4.5 Design of Relational Database Schemas
4.6 Normalization
4.6.1 First Normal Form
4.6.2 Second Normal Form
4.6.3 Third Normal Form
4.6.4 Boyce-Codd Normal Form
4.6.5 Multi-valued dependency
4.6.6 Fifth Normal Form

4.1 Basics of the Relational Model
Relation:

In the relational model, data is represented as a two-dimensional table called a relation. Relations
have names and the columns have names called attributes. The elements in a column must be
atomic - an elementary type such as a number, string, date, or timestamp - and must come from a
single domain.
A relation r(R) is a mathematical relation of degree n on the domains dom(A1), dom(A2), ..., dom(An)
which is a subset of the Cartesian product of the domains that define R:

r(R) ⊆ dom(A1) × dom(A2) × ... × dom(An)


Example: An employee relation is a table of names, birth dates, social security numbers, and so on.
Figure 1: Relation Employee
Attributes: Name, BirthDate, SS No, ...
Sample tuple: John Brown, 10151934, 123453434
The contents of a relation are rarely static thus the addition or deletion of a row must be
efficient.

Relational Database :
A relational database is a finite set of relation schemas (called a database schema) and a
corresponding set of relation instances (called a database instance). The relational
database model represents data as two-dimensional tables called relations and
consists of three basic components:

1. a set of domains and a set of relations


2. operations on relations
3. integrity rules

Database schema :
A database schema is a set of relation schemas for the relations in a design. Changes to a
schema or database schema are expensive thus careful thought must go into the design of
a database schema.


1. Figure shows the deposit and customer tables for our banking example.

Figure : The deposit and customer relations.

o It has four attributes.


o For each attribute there is a permitted set of values, called the domain of that
attribute.
o E.g. the domain of bname is the set of all branch names.

Let D1 denote the domain of bname, and D2, D3 and D4 the remaining attributes' domains
respectively.

Then, any row of deposit consists of a four-tuple (v1, v2, v3, v4) where v1 ∈ D1, v2 ∈ D2,
v3 ∈ D3 and v4 ∈ D4.

In general, deposit contains only a subset of the set of all possible rows.

That is, deposit is a subset of D1 × D2 × D3 × D4.

In general, a table of n columns must be a subset of D1 × D2 × ... × Dn.

2. Mathematicians define a relation to be a subset of a Cartesian product of a list of domains.


You can see the correspondence with our tables. We will use the terms relation and tuple
in place of table and row from now on.

3. Some more formalities:


o Let the tuple variable t refer to a tuple of the relation deposit.
o We say t ∈ deposit to denote that the tuple t is in relation deposit.
o Then t[bname] = t[1] = the value of t on the bname attribute.
o So t[bname] = t[1] = ``Downtown'',
o and t[cname] = t[3] = ``Johnson''.

4. We'll also require that the domains of all attributes be indivisible units.
o A domain is atomic if its elements are indivisible units.
o For example, the set of integers is an atomic domain.
o The set of all sets of integers is not.
o Why? Integers do not have subparts, but sets do - the integers comprising them.
o We could consider integers non-atomic if we thought of them as ordered lists of
digits.

4.2 Relation Instances

A relation instance is a table with rows and named columns. The rows in a relation instance (or
just relation) are called tuples. The cardinality of the relation is the number of tuples in it. The
names of the columns are called attributes of the relation. The number of columns in a relation is
called the arity of the relation. The type constraints that a relation instance must satisfy are:

1. the attribute names must correspond to the attribute names of the corresponding schema and
2. the tuple values must correspond to the domain values specified in the corresponding schema.

4.3 Functional Dependencies

Consider a relation R that has two attributes A and B. The attribute B of the relation is
functionally dependent on the attribute A if and only if for each value of A no more than one value
of B is associated. In other words, the value of attribute A uniquely determines the value of B and
if there were several tuples that had the same value of A then all these tuples will have an
identical value of attribute B. That is, if t1 and t2 are two tuples in the relation R and
t1(A) = t2(A) then we must have t1(B) = t2(B).
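
The tuple-level condition above can be checked mechanically on a given relation instance. The sketch below is illustrative only; it uses the student attributes sno and sname that appear later in this section, and reports whether any pair of tuples in a small array violates the FD sno -> sname. Of course, the absence of a violation in one instance does not by itself establish that the dependency holds in general:

#include <stdio.h>
#include <string.h>

/* A tuple of a small student relation. */
struct Tuple {
    int  sno;         /* attribute A */
    char sname[16];   /* attribute B */
};

/* Return 1 if the FD sno -> sname holds in r, i.e. no two tuples
   agree on sno but differ on sname. */
static int fd_holds(const struct Tuple *r, int n) {
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (r[i].sno == r[j].sno &&
                strcmp(r[i].sname, r[j].sname) != 0)
                return 0;      /* violation found */
    return 1;
}

int main(void) {
    struct Tuple r[] = { {1, "Asha"}, {2, "Ravi"}, {1, "Asha"} };
    printf("sno -> sname holds: %d\n", fd_holds(r, 3));   /* 1 */
    r[2] = (struct Tuple){1, "Meena"};                     /* same sno, new name */
    printf("sno -> sname holds: %d\n", fd_holds(r, 3));   /* 0 */
    return 0;
}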

A and B need not be single attributes. They could be any subsets of the attributes of a relation R
(possibly single attributes). We may then write

R.A -> R.B

if B is functionally dependent on A (or A functionally determines B). Note that functional


dependency does not imply a one-to-one relationship between A and B although a one-to-one
relationship may exist between A and B.

A simple example of the above functional dependency is when A is a primary key of an entity (e.g.
student number) and B is some single-valued property or attribute of the entity (e.g. date of birth).
A -> B must then always hold. (Why?)

Functional dependencies also arise in relationships. Let C be the primary key of an entity and D
be the primary key of another entity. Let the two entities have a relationship. If the relationship is
one-to-one, we must have C -> D and D -> C. If the relationship is many-to-one, we would have C
-> D but not D -> C. For many-to-many relationships, no functional dependencies hold. For
example, if C is student number and D is subject number, there is no functional dependency
between them. If however, we were storing marks and grades in the database as well, we would
have

(student_number, subject_number) -> marks

and we might have

marks -> grades

The second functional dependency above assumes that the grades are dependent only on the
marks. This may sometime not be true since the instructor may decide to take other
considerations into account in assigning grades, for example, the class average mark.

For example, in the student database that we have discussed earlier, we have the following
functional dependencies:

sno -> sname


sno -> address
cno -> cname

cno -> instructor
instructor -> office

These functional dependencies imply that there can be only one name for each sno, only one
address for each student and only one subject name for each cno. It is of course possible that
several students may have the same name and several students may live at the same address. If
we consider cno -> instructor, the dependency implies that no subject can have more than one
instructor (perhaps this is not a very realistic assumption). Functional dependencies therefore
place constraints on what information the database may store. In the above example, one may be
wondering if the following FDs hold

sname -> sno


cname -> cno

Certainly there is nothing in the instance of the example database presented above that
contradicts the above functional dependencies. However, whether above FDs hold or not would
depend on whether the university or college whose database we are considering allows duplicate
student names and subject names. If it was the enterprise policy to have unique subject names
than cname -> cno holds. If duplicate student names are possible, and one would think there
always is the possibility of two students having exactly the same name, then sname -> sno does
not hold.

Functional dependencies arise from the nature of the real world that the database models. Often
A and B are facts about an entity where A might be some identifier for the entity and B some
characteristic. Functional dependencies cannot be automatically determined by studying one or
more instances of a database. They can be determined only by a careful study of the real world
and a clear understanding of what each attribute means.

We have noted above that the definition of functional dependency does not require that A and B
be single attributes. In fact, A and B may be collections of attributes. For example

(sno, cno) -> (mark, date)

When dealing with a collection of attributes, the concept of full functional dependence is an
important one. Let A and B be distinct collections of attributes from a relation R and let R.A ->
R.B. B is then fully functionally dependent on A if B is not functionally dependent on any subset of
A. The above example of students and subjects would show full functional dependence if mark
and date are not functionally dependent on either student number (sno) or subject number (cno)
alone. This implies that we are assuming that a student may take more than one subject and a
subject would be taken by many different students. Furthermore, it has been assumed that there
is at most one enrolment of each student in the same subject.

The above example illustrates full functional dependence. However the following dependence

(sno, cno) -> instructor

is not full functional dependence because cno -> instructor


holds.

As noted earlier, the concept of functional dependency is related to the concept of candidate key of
a relation since a candidate key of a relation is an identifier which uniquely identifies a tuple and
therefore determines the values of all other attributes in the relation. Therefore any subset X of
the attributes of a relation R that satisfies the property that all remaining attributes of the relation
are functionally dependent on it (that is, on X), then X is candidate key as long as no attribute can
be removed from X and still satisfy the property of functional dependence. In the example above,

the attributes (sno, cno) form a candidate key (and the only one) since they functionally determine
all the remaining attributes.

Functional dependence is an important concept and a large body of formal theory has been
developed about it. We discuss the concept of closure that helps us derive all functional
dependencies that are implied by a given set of dependencies. Once a complete set of functional
dependencies has been obtained, we will study how these may be used to build normalised
relations.

4.4 Rules About Functional Dependencies


• Let F be set of FDs specified on R
• Must be able to reason about FD’s in F
o Schema designer usually explicitly states only FD’s which are obvious
o Without knowing exactly what all tuples are, must be able to deduce other/all FD’s
that hold on R
o Essential when we discuss design of “good” relational schemas

4.5 Design of Relational Database Schemas

Database Scheme:

1. We distinguish between a database scheme (logical design) and a database instance


(data in the database at a point in time).
2. A relation scheme is a list of attributes and their corresponding domains.
3. The text uses the following conventions:
o italics for all names
o lowercase names for relations and attributes
o names beginning with an uppercase for relation schemes

These notes will do the same.

For example, the relation scheme for the deposit relation:

o Deposit-scheme = (bname, account#, cname, balance)

We may state that deposit is a relation on scheme Deposit-scheme by writing


deposit(Deposit-scheme).
If we wish to specify domains, we can write:

o (bname: string, account#: integer, cname: string, balance: integer).

Note that customers are identified by name. In the real world, this would not be allowed,
as two or more customers might share the same name.

Figure: Shows the E-R diagram for a banking enterprise.


Figure: E-R diagram for the banking enterprise

4. The relation schemes for the banking example used throughout the text are:
o Branch-scheme = (bname, assets, bcity)
o Customer-scheme = (cname, street, ccity)
o Deposit-scheme = (bname, account#, cname, balance)
o Borrow-scheme = (bname, loan#, cname, amount)

Note: some attributes appear in several relation schemes (e.g. bname, cname). This is
legal, and provides a way of relating tuples of distinct relations.

5. Why not put all attributes in one relation?

Suppose we use one large relation instead of customer and deposit:

o Account-scheme = (bname, account#, cname, balance, street, ccity)


o If a customer has several accounts, we must duplicate her or his address for each
account.
o If a customer has an account but no current address, we cannot build a tuple, as
we have no values for the address.
o We would have to use null values for these fields.
o Null values cause difficulties in the database.
o By using two separate relations, we can do this without using null values

Keys:

1. The notions of superkey, candidate key and primary key all apply to the relational
model.
2. For example, in Branch-scheme,
o {bname} is a superkey.
o {bname, bcity} is a superkey.
o {bname, bcity} is not a candidate key, as the superkey {bname} is contained in it.
o {bname} is a candidate key.
o {bcity} is not a superkey, as branches may be in the same city.
o We will use {bname} as our primary key.
3. The primary key for Customer-scheme is {cname}.
4. More formally, if we say that a subset K of R is a superkey for R, we are restricting
consideration to relations in which no two distinct tuples have the same values on all
attributes in K. In other words,
o if t1 and t2 are in r, and
o t1[K] = t2[K],
o then t1 = t2.

Anomalies:
Problems such as redundancy that occur when we try to cram too much into a single relation are
called anomalies. The principal kinds of anomalies that we encounter are:
• Redundancy. Information may be repeated unnecessarily in several tuples.
• Update Anomalies. We may change information in one tuple but leave the same information
unchanged in another.
• Deletion Anomalies. If a set of values becomes empty, we may lose other information as a side
effect.

4.6 Normalization

When designing a database, a data model is usually translated into a relational schema. The
important question is whether there is a design methodology or whether the process is arbitrary. A
simple answer to this question is affirmative. There are certain properties that a good database
design must possess, as dictated by Codd's rules. There are many different ways of designing a
good database. One such methodology is the method involving 'Normalization'. Normalization
theory is built around the concept of normal forms. Normalization reduces redundancy.
Redundancy is unnecessary repetition of data. It can cause problems with storage and retrieval of
data. During the process of normalization, dependencies can be identified which can cause
problems during deletion and updating. Normalization theory is based on the fundamental notion
of dependency. Normalization helps in simplifying the structure of schemas and tables.
To illustrate the normal forms, we will take an example of a database with the following logical
design:
Relation S {S#, SUPPLIERNAME, SUPPLYSTATUS, SUPPLYCITY}, Primary Key{S#}
Relation P { P#, PARTNAME, PARTCOLOR, PARTWEIGHT, SUPPLYCITY}, Primary Key{P#}
Relation SP { S#, SUPPLYCITY, P#, PARTQTY}, Primary Key{S#, P#}
Foreign Key{S#} Reference S
Foreign Key{P#} Reference P
SP
S# SUPPLYCITY P# PARTQTY
S1 Bombay P1 3000
S1 Bombay P2 2000
S1 Bombay P3 4000
S1 Bombay P4 2000
S1 Bombay P5 1000
S1 Bombay P6 1000
S2 Mumbai P1 3000
S2 Mumbai P2 4000
S3 Mumbai P2 2000
S4 Madras P2 2000
S4 Madras P4 3000
S4 Madras P5 4000
Let us examine the table above to find any design discrepancy. A quick glance reveals that some
of the data are repeated. That is data redundancy, which is of course undesirable. The fact that a
particular supplier is located in a particular city has been repeated many times. This redundancy
causes many other related problems. For instance, after an update a supplier may be shown to be
from Madras in one entry but from Mumbai in another. This further gives rise to many other
problems.

Therefore, for the above reasons, the tables need to be refined. This process of refinement of a
given schema into another schema or a set of schema possessing qualities of a good database is
known as Normalization.
Database experts have defined a series of Normal forms each conforming to some specified design
quality condition(s). We shall restrict ourselves to the first five normal forms for simplicity.
Each next level of normal form adds another condition. It is interesting to note that
the process of normalization is reversible. The following diagram depicts the relation between
various normal forms.

(Diagram: nested normal forms, with 1NF outermost and 5NF innermost: 1NF ⊇ 2NF ⊇ 3NF ⊇ 4NF ⊇ 5NF.)
The diagram implies that a relation in 5th Normal Form is also in 4th Normal Form, which itself is
in 3rd Normal Form, and so on. These normal forms are not the only ones. There may be 6th, 7th
and nth normal forms, but this is not of our concern at this stage.
Before we embark on normalization, however, there are a few more concepts that should be
understood.
Decomposition: Decomposition is the process of splitting a relation into two or more relations.
This is nothing but the projection process. Decompositions may or may not lose information. As
you will learn shortly, the normalization process involves breaking a given relation into one or
more relations, and these decompositions should be reversible as well, so that no information is
lost in the process. Thus, we will be interested more in decompositions that incur no loss of
information than in the ones in which information is lost.
Lossless decomposition: The decomposition which results in relations without losing any
information is known as lossless decomposition or nonloss decomposition. The decomposition
that results in loss of information is known as lossy decomposition.
Consider the relation S{S#, SUPPLYSTATUS, SUPPLYCITY} with some instances of the entries as
shown below.
S S# SUPPLYSTATUS SUPPLYCITY
S3 100 Madras
S5 100 Mumbai
Let us decompose this table into two as shown below:
(1) SX S# SUPPLYSTATUS SY S# SUPPLYCITY
S3 100 S3 Madras
S5 100 S5 Mumbai
(2) SX S# SUPPLYSTATUS SY SUPPLYSTATUS SUPPLYCITY
S3 100 100 Madras
S5 100 100 Mumbai
Let us examine these decompositions. In decomposition (1) no information is lost. We can still say
that S3’s status is 100 and location is Madras and also that supplier S5 has 100 as its status and
location Mumbai. This decomposition is therefore lossless.
In decomposition (2), however, we can still say that status of both S3 and S5 is 100. But the
location of suppliers cannot be determined by these two tables. The information regarding the
location of the suppliers has been lost in this case. This is a lossy decomposition.
Certainly, lossless decomposition is more desirable because otherwise the decomposition will be
irreversible. The decomposition process is in fact projection, where some attributes are selected
from a table. A natural question arises here: why is the first decomposition lossless while the
second one is lossy? How should a given relation be decomposed so that the resulting projections
are nonlossy? The answer to these questions lies in functional dependencies and may be given by
the following theorem.
Heath’s theorem: Let R{A, B, C} be a relation, where A, B and C are sets of attributes. If R satisfies
the FD A→B, then R is equal to the join of its projections on {A, B} and {A, C}.
Let us apply this theorem on the decompositions described above. We observe that relation S
satisfies two irreducible sets of FD’s
S# → SUPPLYSTATUS
S# → SUPPLYCITY
Now taking A as S#, B as SUPPLYSTATUS, and C as SUPPLYCITY, this theorem confirms that
relation S can be nonloss-decomposed into its projections on {S#, SUPPLYSTATUS} and {S#,
SUPPLYCITY} . Note, however, that the theorem does not say why projections {S#,
SUPPLYSTATUS} and {SUPPLYSTATUS, SUPPLYCITY} should be lossy. Yet we can see that one of
the FD’s is lost in this decomposition. While the FD S#→SUPPLYSTATUS is still represented by
projection on {S#, SUPPLYSTATUS}, but the FD S#→SUPPLYCITY has been lost.
An alternative criteria for lossless decomposition is as follows. Let R be a relation schema, and let
F be a set of functional dependencies on R. let R1 and R2 form a decomposition of R. this
decomposition is a lossless-join decomposition of R if at least one of the following functional
dependencies is in F+ (the closure of F):
R1 ∩ R2 → R1
R1 ∩ R2 → R2
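As a quick check against the decompositions of relation S given earlier: in decomposition (1),
R1 = {S#, SUPPLYSTATUS} and R2 = {S#, SUPPLYCITY}, so R1 ∩ R2 = {S#}; since S# → SUPPLYSTATUS
(and trivially S# → S#), the dependency R1 ∩ R2 → R1 is in F+, and the decomposition is lossless.
In decomposition (2), R1 = {S#, SUPPLYSTATUS} and R2 = {SUPPLYSTATUS, SUPPLYCITY}, so
R1 ∩ R2 = {SUPPLYSTATUS}, which determines neither R1 nor R2; the criterion fails, consistent with
the loss of information observed above.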
Functional Dependency Diagrams: This is a handy tool for representing the functional dependencies
existing in a relation. The diagram is very useful for its eloquence and for visualizing the FDs in
a relation.
Figure: FD diagram for the supplier-part example: S# → SUPPLIERNAME, SUPPLYSTATUS and SUPPLYCITY;
P# → PARTNAME, PARTCOLOR and PARTWEIGHT; {S#, P#} → PARTQTY.
Later in the unit you will learn how to use this diagram for normalization purposes.

4.6.1 First Normal Form:


A relation is in 1st Normal Form (1NF) if and only if, in every legal value of that relation, every
tuple contains exactly one value for each attribute.
Although the simplest, 1NF relations have a number of discrepancies and therefore 1NF is not the
most desirable form of a relation.
Let us take a relation (modified to illustrate the point in discussion) as
Rel1{S#, SUPPLYSTATUS, SUPPLYCITY, P#, PARTQTY} Primary Key{S#, P#}
FD{SUPPLYCITY → SUPPLYSTATUS}
Note that SUPPLYSTATUS is functionally dependent on SUPPLYCITY, meaning that a supplier's
status is determined by the location of that supplier – e.g. all suppliers from Madras must have
the same status. The primary key of the relation Rel1 is {S#, P#}. The FD diagram is shown below:


Figure: FD diagram for REL1: {S#, P#} → PARTQTY; S# → SUPPLYCITY; SUPPLYCITY → SUPPLYSTATUS.

For a good design the diagram should have arrows out of candidate keys only. The additional
arrows cause trouble.
Let us discuss some of the problems with this 1NF relation. For the purpose of illustration, let us
insert some sample tuples into this relation.
REL1 S# SUPPLYSTATUS SUPPLYCITY P# PARTQTY
S1 200 Madras P1 3000
S1 200 Madras P2 2000
S1 200 Madras P3 4000
S1 200 Madras P4 2000
S1 200 Madras P5 1000
S1 200 Madras P6 1000
S2 100 Mumbai P1 3000
S2 100 Mumbai P2 4000
S3 100 Mumbai P2 2000
S4 200 Madras P2 2000
S4 200 Madras P4 3000
S4 200 Madras P5 4000
The redundancies in the above relation cause many problems, usually known as update anomalies,
that is, problems in INSERT, DELETE and UPDATE operations. Let us see these problems due to
the supplier-city redundancy corresponding to the FD S#→SUPPLYCITY.
INSERT: In this relation, unless a supplier supplies at least one part, we cannot insert the
information regarding that supplier. Thus, a supplier located in Kolkata is missing from the
relation because he has not supplied any part so far.
DELETE: Let us see what problem we may face during deletion of a tuple. If we delete the tuple of
a supplier (when there is a single entry for that supplier), we not only delete the fact that the
supplier supplied a particular part but also the fact that the supplier is located in a particular
city. In our case, if we delete the entries corresponding to S#=S2, we lose the information that the
supplier is located at Mumbai. This is definitely undesirable. The problem here is that there is too
much information attached to each tuple, so deletion forces us to lose too much information.
UPDATE: If we modify the city of supplier S1 from Madras to Mumbai, we have to make sure that
all the entries corresponding to S#=S1 are updated, otherwise inconsistency will be introduced. As
a result some entries would suggest that the supplier is located at Madras while others would
contradict this fact.

4.6.2 Second Normal Form:

A relation is in 2NF if and only if it is in 1NF and every nonkey attribute is fully functionally
dependent on the primary key. Here it has been assumed that there is only one candidate key,
which is of course the primary key.
A relation in 1NF can always be decomposed into an equivalent set of 2NF relations. The reduction
process consists of replacing the 1NF relation by suitable projections.
We have seen the problems arising due to the under-normalization (1NF) of the relation. The
remedy is to break the relation into two simpler relations:
REL2{S#, SUPPLYSTATUS, SUPPLYCITY} and
REL3{S#, P#, PARTQTY}
The FD diagrams and sample relations are shown below.

Figure: FD diagrams: REL2: S# → SUPPLYCITY, SUPPLYCITY → SUPPLYSTATUS; REL3: {S#, P#} → PARTQTY.

REL2
S# SUPPLYSTATUS SUPPLYCITY
S1 200 Madras
S2 100 Mumbai
S3 100 Mumbai
S4 200 Madras
S5 300 Kolkata

REL3
S# P# PARTQTY
S1 P1 3000
S1 P2 2000
S1 P3 4000
S1 P4 2000
S1 P5 1000
S1 P6 1000
S2 P1 3000
S2 P2 4000
S3 P2 2000
S4 P2 2000
S4 P4 3000
S4 P5 4000
REL2 and REL3 are in 2NF, with primary keys {S#} and {S#, P#} respectively. This is because each
of the nonkey attributes of REL2 (SUPPLYSTATUS, SUPPLYCITY) is fully functionally dependent on
the primary key, that is S#. By a similar argument, REL3 is also in 2NF.
Evidently, these two relations have overcome all the update anomalies stated earlier.
Now it is possible to insert the facts regarding supplier S5 even though he has not supplied any
part, which was earlier not possible. This solves the insert problem. Similarly, the delete and
update problems are also over now.
These relations in 2NF are still not free from all anomalies. REL3 is free from most of the
problems we are going to discuss here; however, REL2 still carries some problems. The reason is
that the dependency of SUPPLYSTATUS on S#, though functional, is transitive via SUPPLYCITY.
Thus we see that there are two dependencies, S#→SUPPLYCITY and SUPPLYCITY→SUPPLYSTATUS,
which imply S#→SUPPLYSTATUS. This relation has a transitive dependency. We will see that this
transitive dependency gives rise to another set of anomalies.
INSERT: We are unable to insert the fact that a particular city has a particular status until we
have some supplier actually located in that city.
DELETE: If we delete the sole REL2 tuple for a particular city, we delete the information that that
city has that particular status.
UPDATE: The status for a given city still has redundancy. This causes the usual redundancy
problems related to updates.

4.6.3 Third Normal Form:

A relation is in 3NF if and only if it is in 2NF and every non-key attribute is non-transitively
dependent on the primary key.
To convert the 2NF relation into 3NF, once again, REL2 is split into two simpler relations, REL4
and REL5, as shown below.
REL4{S#, SUPPLYCITY} and
REL5{SUPPLYCITY, SUPPLYSTATUS}
The FD diagrams and sample relations are shown below.


Figure: FD diagrams: REL4: S# → SUPPLYCITY; REL5: SUPPLYCITY → SUPPLYSTATUS.

REL4
S# SUPPLYCITY
S1 Madras
S2 Mumbai
S3 Mumbai
S4 Madras
S5 Kolkata

REL5
SUPPLYCITY SUPPLYSTATUS
Madras 200
Mumbai 100
Kolkata 300
Evidently, the above relations REL4 and REL5 are in 3NF, because there are no transitive
dependencies. Every 2NF relation can be reduced to 3NF by decomposing it further and removing
any transitive dependency.

Dependency Preservation
The reduction process may suggest a variety of ways in which a relation may be decomposed
losslessly. Consider REL2, in which there was a transitive dependency and which we therefore
split into two 3NF projections, i.e.
REL4{S#, SUPPLYCITY} and
REL5{SUPPLYCITY, SUPPLYSTATUS}
Let us call this decomposition-1. An alternative decomposition may be:
REL4{S#, SUPPLYCITY} and
REL5{S#, SUPPLYSTATUS}
which we will call decomposition-2.
Both the decompositions decomposition-1 and decomposition-2 are 3NF and lossless. However,
decomposition-2 is less satisfactory than decomposition-1. For example, it is still not possible to
insert the information that a particular city has a particular status unless some supplier is
located in the city.
In decomposition-1, the two projections are independent of each other, but the same is not true of
the second decomposition. Here independence is in the sense that updates can be made to either
relation without regard to the other, provided the update is legal. Also, independent
decompositions preserve the dependencies of the database and no dependency is lost in the
decomposition process.
The concept of independent projections provides for choosing a particular decomposition when
there is more than one choice.

4.6.4 Boyce-Codd Normal Form:

The previous normal forms assumed that there was just one candidate key in the relation and
that this key was also the primary key. Another class of problems arises when this is not the case.
Very often there will be more than one candidate key in practical database design situations.
To be precise, 1NF, 2NF and 3NF did not deal adequately with the case of relations that:
• had two or more candidate keys, and
• whose candidate keys were composite, and
• whose candidate keys overlapped (i.e. had at least one attribute in common).
A relation is in BCNF (Boyce-Codd Normal Form) if and only if every nontrivial, left-irreducible FD
has a candidate key as its determinant.
Or
A relation is in BCNF if and only if all the determinants are candidate keys.
In other words, the only arrows in the FD diagram are arrows out of candidate keys. It has already
been explained that there will always be arrows out of candidate keys; the BCNF definition says
there are no others, meaning there are no arrows that can be eliminated by the normalization
procedure.
These two definitions are apparently different from each other. The difference between them is
that the second, simpler definition tacitly assumes that determinants are "not too big" and that
all FDs are nontrivial.

It should be noted that the BCNF definition is conceptually simpler than the old 3NF definition, in
that it makes no explicit reference to first and second normal forms as such, nor to the concept of
transitive dependence. Furthermore, although BCNF is strictly stronger than 3NF, it is still the
case that any given relation can be nonloss decomposed into an equivalent collection of BCNF
relations.
Thus, relations REL1 and REL2, which were not in 3NF, are not in BCNF either; also, relations
REL3, REL4, and REL5, which were in 3NF, are also in BCNF. Relation REL1 contains three
determinants, namely {S#}, {SUPPLYCITY}, and {S#, P#}; of these, only {S#, P#} is a candidate key,
so REL1 is not in BCNF. Similarly, REL2 is not in BCNF either, because the determinant
{SUPPLYCITY} is not a candidate key. Relations REL3, REL4, and REL5, on the other hand, are
each in BCNF, because in each case the sole candidate key is the only determinant in the
respective relations.
We now consider an example involving two disjoint - i.e., nonoverlapping - candidate keys.
Suppose that in the usual suppliers relation REL1{S#, SUPPLIERNAME, SUPPLYSTATUS,
SUPPLYCITY}, {S#} and {SUPPLIERNAME} are both candidate keys (i.e., for all time, it is the case
that every supplier has a unique supplier number and also a unique supplier name). Assume,
however, that attributes SUPPLYSTATUS and SUPPLYCITY are mutually independent - i.e., the
FD SUPPLYCITY→SUPPLYSTATUS no longer holds. Then the FD diagram is as shown below.

Figure: FD diagram for REL1 with two candidate keys: S# and SUPPLIERNAME each determine
SUPPLYSTATUS and SUPPLYCITY (and each other).
Relation REL1 is in BCNF. Although the FD diagram does look "more complex" than a 3NF diagram,
it is nevertheless still the case that the only determinants are candidate keys; i.e., the only
arrows are arrows out of candidate keys. So the message of this example is just that having more
than one candidate key is not necessarily bad.
For illustration we will assume that in our relations supplier names are unique. Consider REL6:
REL6{S#, SUPPLIERNAME, P#, PARTQTY}.
Since it contains two determinants, S# and SUPPLIERNAME, that are not candidate keys for the
relation, this relation is not in BCNF. A sample snapshot of this relation is shown below (note how
a supplier's name is repeated for every part supplied by that supplier):
REL6 S# SUPPLIERNAME P# PARTQTY
S1 Pooran P1 3000
S1 Pooran P2 2000
S2 Vinod P3 4000
S2 Vinod P4 2000
As is evident from the figure above, relation REL6 involves the same kind of redundancies as did
relations REL1 and REL2, and hence is subject to the same kind of update anomalies. For
example, changing the name of a supplier from Vinod to Rahul leads, once again, either to search
problems or to possibly inconsistent results. Yet REL6 is in 3NF by the old definition, because
that definition did not require an attribute to be irreducibly dependent on each candidate key if it
was itself a component of some candidate key of the relation, and so the fact that SUPPLIERNAME
is not irreducibly dependent on {S#, P#} was ignored.
The solution to the REL6 problems is, of course, to break the relation down into two projections,
in this case the projections are:
REL7{S#, SUPPLIERNAME} and
REL8{S#, P#, PARTQTY}
Or
REL7{S#, SUPPLIERNAME} and
REL8{SUPPLIERNAME, P#, PARTQTY}
Both of these projections are in BCNF. The original design, consisting of the single relation REL6,
is clearly bad; the problems with it are intuitively obvious, and it is unlikely that any competent
database designer would ever seriously propose it, even if he or she had no exposure to the ideas
of BCNF etc. at all.

Comparison of BCNF and 3NF

We have seen two normal forms for relational-database schemas: 3NF and BCNF. There is an
advantage to 3NF in that we know that it is always possible to obtain a 3NF design without
sacrificing a lossless join or dependency preservation. Nevertheless, there is a disadvantage to
3NF. If we do not eliminate all transitive dependencies, we may have to use null values to
represent some of the possible meaningful relationships among data items; the other difficulty is
the repetition of information.
If we are forced to choose between BCNF and dependency preservation with 3NF, it is generally
preferable to opt for 3NF. If we cannot test for dependency preservation efficiently, we either pay a
high penalty in system performance or risk the integrity of the data in our database. Neither of
these alternatives is attractive. With such alternatives, the limited amount of redundancy imposed
by transitive dependencies allowed under 3NF is the lesser evil. Thus, we normally choose to
retain dependency preservation and to sacrifice BCNF.

4.6.5 Multi-valued dependency:

Multi-valued dependency may be formally defined as:


Let R be a relation, and let A, B, and C be subsets of the attributes of R. Then we say that B is
multi-dependent on A - in symbols,
A →→B
(read "A multi-determines B," or simply "A double arrow B") - if and only if, in every possible legal
value of R, the set of B values matching a given (A value, C value) pair depends only on the A value
and is independent of the C value.
To elucidate the meaning of the above statement, let us take an example relation REL8 as shown
below:

REL8 COURSE TEACHERS BOOKS


Computer TEACHER BOOK
Dr. Wadhwa Graphics
Prof. Mittal UNIX
Mathematics TEACHER BOOK
Prof. Saxena Relational Algebra
Prof. Karmeshu Discrete Maths

Assume that for a given course there can exist any number of corresponding teachers and any
number of corresponding books. Moreover, let us also assume that teachers and books are quite
independent of one another; that is, no matter who actually teaches any particular course, the
same books are used. Finally, also assume that a given teacher or a given book can be associated
with any number of courses.
Let us try to eliminate the relation-valued attributes. One way to do this is simply to replace
relation REL8 by a relation REL9 with three scalar attributes COURSE, TEACHER, and BOOK as
indicated below.

REL9 COURSE TEACHER BOOK


Computer Dr. Wadhwa Graphics
Computer Dr. Wadhwa UNIX
Computer Prof. Mittal Graphics
Computer Prof. Mittal UNIX
Mathematics Prof. Saxena Relational Algebra
Mathematics Prof. Karmeshu Discrete Maths
Mathematics Prof. Karmeshu Relational Algebra

As you can see from the relation, each tuple of REL8 gives rise to m * n tuples in REL9, where m
and n are the cardinalities of the TEACHERS and BOOKS relations in that REL8 tuple. Note that
the resulting relation REL9 is "all key".
The meaning of relation REL9 is basically as follows: A tuple {COURSE:c, TEACHER:t, BOOK:x}
appears in REL9 if and only if course c can be taught by teacher t and uses book x as a reference.
Observe that, for a given course, all possible combinations of teacher and book appear: that is,
REL9 satisfies the (relation) constraint
if tuples (c, t1, x1), (c, t2, x2) both appear
then tuples (c, t1, x2), (c, t2, x1) both appear also
Now, it should be apparent that relation REL9 involves a good deal of redundancy, leading as
usual to certain update anomalies. For example, to add the information that the Computer course
can be taught by a new teacher, it is necessary to insert two new tuples, one for each of the two
books. Can we avoid such problems? Well, it is easy to see that:
1. The problems in question are caused by the fact that teachers and books are completely
independent of one another;
2. Matters would be much improved if REL9 were decomposed into its two projections call them
REL10 and REL11 - on {COURSE, TEACHER} and {COURSE, BOOK}, respectively.
To add the information that the Computer course can be taught by a new teacher, all we have to
do now is insert a single tuple into relation REL10. Thus, it does seem reasonable to suggest that
there should be a way of "further normalizing" a relation like REL9.
It is obvious that the design of REL9 is bad and the decomposition into REL10 and REL11 is
better. The trouble is, however, these facts are not formally obvious. Note in particular that REL9
satisfies no functional dependencies at all (apart from trivial ones such as COURSE → COURSE);
in fact, REL9 is in BCNF, since as already noted it is all key-any "all key" relation must
necessarily be in BCNF. (Note that the two projections REL10 and REL11 are also all key and
hence in BCNF.) The ideas of the previous normalization are therefore of no help with the problem
at hand.
The existence of "problem" BCNF relation like REL9 was recognized very early on, and the way to
deal with them was also soon understood, at least intuitively. However, it was not until 1977 that
these intuitive ideas were put on a sound theoretical footing by Fagin's introduction of the notion
of multi-valued dependencies, MVDs. Multi-valued dependencies are a generalization of functional
dependencies, in the sense that every FD is an MVD, but the converse is not true (i.e., there exist
MVDs that are not FDs). In the case of relation REL9 there are two MVDs that hold:
COURSE →→ TEACHER
COURSE →→ BOOK
Note the double arrows; the MVD A→→B is read as "B is multi-dependent on A" or, equivalently,
"A multi-determines B." Let us concentrate on the first MVD, COURSE→→TEACHER. Intuitively,
what this MVD means is that, although a course does not have a single corresponding teacher -
i.e., the functional dependence COURSE→TEACHER does not hold-nevertheless, each course
does have a well-defined set of corresponding teachers. By "well-defined" here we mean, more
precisely, that for a given course c and a given book x, the set of teachers t matching the pair (c,
x) in REL9 depends on the value c alone - it makes no difference which particular value of x we
choose. The second MVD, COURSE→→BOOK, is interpreted analogously.
It is easy to show that, given the relation R{A, B, C}, the MVD A→→B holds if and only if the MVD
A→→C also holds. MVDs always go together in pairs in this way. For this reason it is common to
represent them both in one statement, thus:
COURSE→→TEACHER | BOOK
Now, we stated above that multi-valued dependencies are a generalization of functional
dependencies, in the sense that every FD is an MVD. More precisely, an FD is an MVD in which
the set of dependent (right-hand side) values matching a given determinant (left-hand side) value
is always a singleton set. Thus, if A→B, then certainly A→→B.
Returning to our original REL9 problem, we can now see that the trouble with relation such as
REL9 is that they involve MVDs that are not also FDs. (In case it is not obvious, we point out that
it is precisely the existence of those MVDs that leads to the necessity of – for example - inserting
two tuples to add another Computer teacher. Those two tuples are needed in order to maintain
the integrity constraint that is represented by the MVD.) The two projections REL10 and REL11
do not involve any such MVDs, which is why they represent an improvement over the original

www.arihantinfo.com
68
RDBMS
design. We would therefore like to replace REL9 by those two projections, and an important
theorem proved by Fagin allows us to make exactly that replacement:
Theorem (Fagin): Let R{A, B, C} be a relation, where A, B, and C are sets of attributes. Then R is
equal to the join of its projections on {A, B} and {A, C} if and only if R satisfies the MVDs A→→B |
C.
At this stage we are equipped to define fourth normal form:
Fourth normal form: Relation R is in 4NF if and only if, whenever there exist subsets A and B of
the attributes of R such that the nontrivial MVD A→→B is satisfied (an MVD A→→B is trivial if
either A is a superset of B or the union of A and B is the entire heading), then all attributes of R are
also functionally dependent on A.
In other words, the only nontrivial dependencies (FDs or MVDs) in R are of the form Y→X (i.e.,
functional dependency from a superkey Y to some other attribute X). Equivalently: R is in 4NF if it
is in BCNF and all MVDs in R are in fact "FDs out of keys." It follows that 4NF implies BCNF.
Relation REL9 is not in 4NF, since it involves an MVD that is not an FD at all, let alone an FD
"out of a key." The two projections REL10 and REL11 are both in 4NF, however. Thus 4NF is an
improvement over BCNF, in that it eliminates another form of undesirable dependency. What is
more, 4NF is always achievable; that is, any relation can be nonloss decomposed into an
equivalent collection of 4NF relations.
You may recall that a relation R{A, B, C} satisfying the FDs A→B and B→C is better decomposed
into its projections on {A, B} and {B, C} rather than into those on {A, B} and {A, C}. The same holds
true if we replace the FDs by the MVDs A→→B and B→→C.
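As a concrete, hedged illustration of the 4NF decomposition discussed above, the SQL sketch below creates the all-key relation REL9 and its two 4NF projections REL10 and REL11. The column names (course, teacher, book) and the inserted teacher are chosen for illustration only and are not part of the original example.

-- REL9: all key, but it mixes two independent multi-valued facts about a course
CREATE TABLE REL9 (
    course  VARCHAR2(30),
    teacher VARCHAR2(30),
    book    VARCHAR2(30),
    PRIMARY KEY (course, teacher, book)
);

-- 4NF projections: each table records one independent fact
CREATE TABLE REL10 (
    course  VARCHAR2(30),
    teacher VARCHAR2(30),
    PRIMARY KEY (course, teacher)
);
CREATE TABLE REL11 (
    course  VARCHAR2(30),
    book    VARCHAR2(30),
    PRIMARY KEY (course, book)
);

-- Adding a (hypothetical) new teacher for the Computer course now needs one row, not two
INSERT INTO REL10 VALUES ('Computer', 'Prof. Verma');

By Fagin's theorem, REL9 can always be recovered as the natural join of REL10 and REL11, so nothing is lost by the decomposition.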

4.6.6 Fifth Normal Form:

It seems from our discussion so far in that the sole operation necessary or available in the further
normalization process is the replacement of a relation in a nonloss way by exactly two of its
projections. This assumption has successfully carried us as far as 4NF. It comes perhaps as a
surprise, therefore, to discover that there exist relations that cannot be nonloss-decomposed into
two projections but can be nonloss-decomposed into three (or more). Using an unpleasant but
convenient term, we will describe such a relation as "n-decomposable" (for some n > 2), meaning
that the relation in question can be nonloss-decomposed into n projections but not into m for any
m < n. (A relation that can be nonloss-decomposed into two projections is thus "2-decomposable".)
The phenomenon of n-decomposability for n > 2 was first noted by Aho, Beeri, and Ullman; the
particular case n = 3 was also studied by Nicolas.
Consider relation REL12 from the suppliers-parts-projects database ignoring attribute QTY for
simplicity for the moment. A sample snapshot of the same is shown below. It may be pointed out
that relation REL12 is all key and involves no nontrivial FDs or MVDs at all, and is therefore in
4NF. The snapshot of the relations also shows:
a. The three binary projections REL13, REL14, and REL15 corresponding to the REL12 relation
value displayed on the section of the adjoining diagram;
b. The effect of joining the REL13 and REL14 projections (over P#);
c. The effect of joining that result and the REL15 projection (over J# and S#).
REL12  S#  P#  J#
       S1  P1  J2
       S1  P2  J1
       S2  P1  J1
       S1  P1  J1

REL13  S#  P#       REL14  P#  J#       REL15  J#  S#
       S1  P1              P1  J2              J2  S1
       S1  P2              P1  J1              J1  S1
       S2  P1              P1  J1              J1  S2

Join Dependency:
Let R be a relation, and let A, B, ..., Z be subsets of the attributes of R. Then we say that R
satisfies the Join Dependency (JD)
*{ A, B, ..., Z}

(read "star A, K ..., Z") if and only if every possible legal value of R is equal to the join of its
projections on A, B,..., Z.
For example, if we agree to use SP to mean the subset {S#, P#} of the set of attributes of REL12,
and similarly for PJ and JS, then relation REL12 satisfies the JD * {SP, PJ, JS}.
We have seen, then, that relation REL12, with its JD * {REL13, REL14, REL15}, can be 3-
decomposed. The question is, should it be? And the answer is "Probably yes." Relation REL12
(with its JD) suffers from a number of problems over update operations, problems that are
removed when it is 3-decomposed.
Fagin's theorem, to the effect that R{A, B, C} can be non-loss-decomposed into its projections on
{A, B} and {A, C] if and only if the MVDs A→→B and A→→C hold in R, can now be restated as
follows:
R{A, B, C} satisfies the JD*{AB, AC} if and only if it satisfies the MVDs A→→B | C.
Since this theorem can be taken as a definition of multi-valued dependency, it follows that an
MVD is just a special case of a JD, or (equivalently) that JDs are a generalization of MVDs.
Thus, to put it formally, we have
A→→B | C ≡ * {AB, AC}
Note that joint dependencies are the most general form of dependency possible (using, of course,
the term "dependency" in a very special sense). That is, there does not exist a still higher form of
dependency such that JDs are merely a special case of that higher form - so long as we restrict
our attention to dependencies that deal with a relation being decomposed via projection and
recomposed via join.
Coming back to the running example, we can see that the problem with relation REL12 is that it
involves a JD that is not an MVD, and hence not an FD either. We have also seen that it is
possible, and probably desirable, to decompose such a relation into smaller components - namely,
into the projections specified by the join dependency. That decomposition process can be repeated
until all resulting relations are in fifth normal form, which we now define:
Fifth normal form: A relation R is in 5NF - also called projection-join normal form (PJ/NF) - if and
only if every nontrivial join dependency that holds for R is implied by the candidate keys of R.
Let us understand what it means for a JD to be "implied by candidate keys."
Relation REL12 is not in 5NF; it satisfies a certain join dependency, namely the JD * {SP, PJ, JS}, that is
certainly not implied by its sole candidate key (that key being the combination of all of its
attributes). Stated differently, relation REL12 is not in 5NF, because (a) it can be 3-decomposed
and (b) that 3-decomposability is not implied by the fact that the combination {S#, P#, J#} is a
candidate key. By contrast, after 3-decomposition, the three projections SP, PJ, and JS are each
in 5NF, since they do not involve any (nontrivial) JDs at all.
Now let us understand through an example, what it means for a JD to be implied by candidate
keys. Suppose that the familiar suppliers relation REL1 has two candidate keys, {S#} and
{SUPPLIERNAME}. Then that relation satisfies several join dependencies - for example, it satisfies
the JD
*{ { S#, SUPPLIERNAME, SUPPLYSTATUS }, { S#, SUPPLYCITY } }
That is, relation REL1 is equal to the join of its projections on {S#, SUPPLIERNAME,
SUPPLYSTATUS} and {S#, SUPPLYCITY), and hence can be nonloss-decomposed into those
projections. (This fact does not mean that it should be so decomposed, of course, only that it
could be.) This JD is implied by the fact that {S#} is a candidate key (in fact it is implied by
Heath's theorem) Likewise, relation REL1 also satisfies the JD
* {{S#, SUPPLIERNAME}, {S#, SUPPLYSTATUS}, {SUPPLIERNAME, SUPPLYCITY}}
This JD is implied by the fact that {S#} and {SUPPLIERNAME} are both candidate keys.
To conclude, we note that it follows from the definition that 5NF is the ultimate normal form with
respect to projection and join (which accounts for its alternative name, projection-join normal
form). That is, a relation in 5NF is guaranteed to be free of anomalies that can be eliminated by
taking projections. For a relation in 5NF, the only join dependencies are those that are implied
by candidate keys, and so the only valid decompositions are ones that are based on those
candidate keys. (Each projection in such a decomposition will consist of one or more of those
candidate keys, plus zero or more additional attributes.) For example, the suppliers relation
REL1 is in 5NF. It can be further decomposed in several nonloss ways, as we saw earlier, but
every projection in any such decomposition will still include one of the original candidate keys,
and hence there does not seem to be any particular advantage in that further reduction.
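As a hedged sketch of the 3-decomposition discussed in this section (the column names s_no, p_no and j_no are assumptions, standing in for S#, P#, J#), the SQL below materialises the three projections of REL12 and then recovers REL12 by joining all three; joining only two of the projections would produce spurious tuples.

-- The three binary projections of REL12(S#, P#, J#)
CREATE TABLE REL13 AS SELECT DISTINCT s_no, p_no FROM REL12;   -- SP
CREATE TABLE REL14 AS SELECT DISTINCT p_no, j_no FROM REL12;   -- PJ
CREATE TABLE REL15 AS SELECT DISTINCT j_no, s_no FROM REL12;   -- JS

-- Recover REL12: join SP and PJ over P#, then join the result with JS over (J#, S#)
SELECT DISTINCT r13.s_no, r13.p_no, r14.j_no
FROM   REL13 r13
JOIN   REL14 r14 ON r14.p_no = r13.p_no
JOIN   REL15 r15 ON r15.j_no = r14.j_no AND r15.s_no = r13.s_no;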

Unit 5

Relational Algebra

5.1 Basics of Relational Algebra


5.2 Set Operations on Relations
5.3 Extended Operators of Relational Algebra
5.4 Constraints on Relations
5.5 Modification of the Database
5.6 Views
5.7 Relational Calculus
5.7.1 Tuple Relational Calculus
5.7.2 Domain Relational Calculus

5.1 Basics of Relational Algebra

Relational algebra is a procedural query language, which consists of a set of operations that take
one or two relations as input and produce a new relation as their result. The fundamental
operations that will be discussed in this tutorial are: select, project, union, and set difference.

Each operation will be applied to tables of a sample database. Each table is otherwise known as a
relation and each row within the table is referred to as a tuple. The sample database consists of
tables one might see in a bank; it contains the following 6
relations:

The account relation

branch-name account-number balance


Downtown A-101 500
Mianus A-215 700
Perryridge A-102 400
Round Hill A-305 350
Brighton A-201 900
Redwood A-222 700
Brighton A-217 750

In addition to defining the database structure and constraints, a data model must include a set of
operations to manipulate the data. Basic sets of relational model operations constitute the
relational algebra. These operations enable the user to specify basic retrieval requests. The result
of retrieval is a new relation, which may have been formed from one or more relations. The
algebra operations thus produce new relations, which can be further manipulated using
operations of the same algebra. A sequence of relational algebra operations forms a relational
algebra expression, whose result will also be a relation.
The relational algebra operations are usually divided into two groups. One group includes set
operations from mathematical set theory; these are applicable because each relation is defined to
be a set of tuples. Set operations include UNION, INTERSECTION, SET DIFFERENCE, and
CARTESIAN PRODUCT. The other group consists of operations developed specifically for
relational databases; these include SELECT, PROJECT, and JOIN, among others. The SELECT
and PROJECT operations are discussed first, because they are the simplest. Then we discuss set
operations. Finally, we discuss JOIN and other complex operations. The relational database
shown in Figure 2.2 is used for our examples.
Some common database requests cannot be performed with the basic relational algebra
operations, so additional operations are needed to express the requests.
EMPLOYEE

FNAME MINIT LNAME SSN BDATE ADDRESS SEX SALARY SUPERSSN DNO

DEPARTMENT

DNAME DNUMBER MGRSSN MGRSTARTDATE

DEPT_LOCATIONS

DNUMBER DLOCATION

PROJECT

PNAME PNUMBER PLOCATION DNUM

WORKS_ON

ESSN PNO HOURS

DEPENDENT

ESSN DEPENDENT_NAME SEX BDATE RELATIONSHIP

Figure : Schema diagram for the COMPANY relational database schema; the primary keys
are underlined.

5.2 Set Operations on Relations

a) Select Operation

The select operation is a unary operation, which means it operates on one relation. Its
function is to select tuples that satisfy a given predicate. To denote selection, the lowercase
Greek letter sigma (σ) is used. The predicate appears as a subscript to σ. The argument
relation is given in parentheses following the σ.

For example, to select those tuples of the loan relation where the branch is "Perryridge," we
write:

σ branch-name = "Perryridge" (loan)

The results of the query are the following:

Branch-name loan-number amount


Perryridge L-15 1500
Perryridge L-16 1300

Comparisons like =, , <, >, can also be used in the selection predicate. An example query
using a comparison is to find all tuples in which the amount lent is more than $1200 would
be written:

amount > 1200 (loan)

Let Figure be the borrow and branch relations in the banking example.

Figure : The borrow and branch relations.

The new relation created as the result of this operation consists of one tuple.

We allow comparisons using =, ≠, <, ≤, > and ≥ in the selection predicate.

We also allow the logical connectives ∨ (or) and ∧ (and). For example:

Figure : The client relation.

Suppose there is one more relation, client, shown in Figure 3.4, with the scheme
(customer-name, banker-name); we might write

σ customer-name = banker-name (client)

to find clients who have the same name as their banker.
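For readers who already know SQL, the selection operator corresponds directly to the WHERE clause. The statements below are only a sketch, assuming tables loan(branch_name, loan_number, amount) and client(customer_name, banker_name):

-- sigma branch-name = "Perryridge" (loan)
SELECT * FROM loan WHERE branch_name = 'Perryridge';

-- sigma amount > 1200 (loan)
SELECT * FROM loan WHERE amount > 1200;

-- sigma customer-name = banker-name (client)
SELECT * FROM client WHERE customer_name = banker_name;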

b) Project Operation

The project operation is a unary operation that returns its argument relation with certain
attributes left out. Since a relation is a set, any duplicate rows are eliminated. Projection is
denoted by the Greek letter pi (π). The attributes that we wish to appear in the result are listed
as a subscript to π. The argument relation follows in parentheses. For example, the query to
list all loan numbers and the amount of the loan is written as:

π loan-number, amount (loan)


The result of the query is the following:

loan-number amount
L-17 1000
L-23 2000
L-15 1500
L-14 1500
L-93 500
L-11 900
L-16 1300

Another, more complicated, example query, to find those customers who live in Harrison,
is written as:

π customer-name (σ customer-city = "Harrison" (customer))

For example, to obtain a relation showing customers and branches, but ignoring amount
and loan#, we write

π customer-name, branch-name (borrow)

We can perform these operations on the relations resulting from other operations.

To get the names of customers having the same name as their bankers, we write

π customer-name (σ customer-name = banker-name (client))

Think of select as taking rows of a relation, and project as taking columns of a relation.
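In SQL, projection corresponds to the SELECT list; DISTINCT is needed because SQL tables, unlike relations, may retain duplicate rows. A sketch of the two projection examples above (the column names are assumptions):

-- pi loan-number, amount (loan)
SELECT DISTINCT loan_number, amount FROM loan;

-- pi customer-name (sigma customer-city = "Harrison" (customer))
SELECT DISTINCT customer_name FROM customer WHERE customer_city = 'Harrison';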

c) Union

The union operation yields the results that appear in either or both of two relations. It is a
binary operation denoted by the symbol ∪.

An example query would be to find the name of all bank customers who have either an
account or a loan or both. To find this result we will need the information in the depositor
relation and in the borrower relation. To find the names of all customers with a loan in the
bank we would write:

π customer-name (borrower)

and to find the names of all customers with an account in the bank, we would write:

π customer-name (depositor)

Then by using the union operation on these two queries we have the query we need to obtain
the wanted results. The final query is written as:

π customer-name (borrower) ∪ π customer-name (depositor)

The result of the query is the following:

Customer-name
Johnson
Smith
Hayes
Turner
Jones
Lindsay
Jackson
Curry
Williams
Adams

The union operation is denoted by ∪ as in set theory. It returns the union (set union) of two
compatible relations.

For a union operation to be legal, we require that

o the two relations must have the same number of attributes.


o The domains of the corresponding attributes must be the same.

To find all customers of the SFU branch, we must find everyone who has a loan or an
account or both at the branch.
We need both borrow and deposit relations for this:

As in all set operations, duplicates are eliminated, giving the relation of Figure (a).

Figure : The union and set-difference operations.
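SQL exposes set union directly through the UNION operator, which, like the algebra operator, requires union-compatible operands and removes duplicates. A hedged sketch of the customers-with-a-loan-or-an-account query, assuming borrower and depositor tables that each carry a customer_name column:

-- pi customer-name (borrower)  UNION  pi customer-name (depositor)
SELECT customer_name FROM borrower
UNION
SELECT customer_name FROM depositor;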

d) Set Difference
The set-difference operation, denoted by -, results in finding tuples that are in one relation
but are not in another. The expression r - s results in a relation containing those tuples in r
but not in s.

For example, a) the query to find all customers of the bank who have an account but not a
loan is written as:
π customer-name (depositor) - π customer-name (borrower)

The result of the query is the following:


Customer-name
Johnson
Turner
Lindsay
For a set difference to be valid, it must be taken between compatible relations, just as in the
union operation.

b) To find customers of the SFU branch who have an account there but no loan, we write

The result is shown in Figure (b).

We can do more with this operation. Suppose we want to find the largest account balance
in the bank.
Strategy:

o Find a relation, say temp, containing all balances other than the largest.

o Compute the set difference of the deposit relation's balances and temp.

To find temp, we write

This resulting relation contains all balances except the largest one. (See Figure (a)).
Now we can finish our query by taking the set difference:

Figure (b) shows the result.

Figure : Find the largest account balance in the bank.
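The largest-balance strategy above can be phrased in SQL using the set-difference operator (MINUS in Oracle, EXCEPT in standard SQL). This is only a sketch, assuming an account(branch_name, account_number, balance) table like the one shown at the start of this unit:

-- Step 1: balances that are NOT the largest
-- (a balance is not the largest if some other balance is strictly greater)
SELECT a.balance
FROM   account a, account b
WHERE  a.balance < b.balance;

-- Step 2: the largest balance = all balances MINUS the "not largest" ones
SELECT balance FROM account
MINUS
SELECT a.balance
FROM   account a, account b
WHERE  a.balance < b.balance;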

e) Cartesian Product Operation

The cartesian product of two relations r and s is denoted by a cross (×), written r × s.

The result of r × s is a new relation with a tuple for each possible pairing of tuples from r
and s.

In order to avoid ambiguity, the attribute names have attached to them the name of the
relation from which they came. If no ambiguity will result, we drop the relation name.

The result is a very large relation. If r has n1 tuples and s has n2 tuples,
then r × s will have n1 * n2 tuples.

The resulting scheme is the concatenation of the schemes of r and s, with relation names
added as mentioned.

To find the clients of banker Johnson and the city in which they live, we need information
in both client and customer relations. We can get this by writing

σ banker-name = "Johnson" (client × customer)

However, the customer.cname column contains customers of bankers other than Johnson.
(Why?)

We want rows where client.cname = customer.cname. So we can write

σ client.cname = customer.cname (σ banker-name = "Johnson" (client × customer))

to get just these tuples.

Finally, to get just the customer's name and city, we need a projection:

π customer.cname, customer.ccity (σ client.cname = customer.cname (σ banker-name = "Johnson" (client × customer)))
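A hedged SQL rendering of the same three-step pipeline (Cartesian product, selection, projection); the client(cname, banker_name) and customer(cname, ccity) column names are assumptions:

SELECT DISTINCT customer.cname, customer.ccity   -- projection
FROM   client, customer                          -- Cartesian product
WHERE  client.banker_name = 'Johnson'            -- selection: Johnson's clients only
AND    client.cname = customer.cname;            -- selection: match client to customer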

f) The Rename Operation

The rename operation solves the problems that occurs with naming when performing the
cartesian product of a relation with itself.

Suppose we want to find the names of all the customers who live on the same street and in
the same city as Smith.

We can get the street and city of Smith by writing

π street, ccity (σ cname = "Smith" (customer))
To find other customers with the same information, we need to reference the customer
relation again:

where the selection predicate requires the street and ccity values to be equal.

Problem: how do we distinguish between the two street values appearing in the Cartesian
product, as both come from a customer relation?
Solution: use the rename operator, denoted by the Greek letter rho (ρ).

We write

ρ x (E)

to get the result of expression E under the name x.

If we use this to rename one of the two customer relations we are using, the ambiguities
will disappear.
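SQL achieves the effect of the rename operator with table aliases. A sketch of the "same street and city as Smith" query, assuming a customer(cname, street, ccity) table:

-- rho(c2, customer): the second copy of customer is referred to as c2
SELECT c2.cname
FROM   customer c1, customer c2
WHERE  c1.cname  = 'Smith'
AND    c2.street = c1.street
AND    c2.ccity  = c1.ccity;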

5.3 Extended Operators of Relational Algebra

General expressions are formed out of smaller subexpressions using

o select σp (p a predicate)
o project πS (S a list of attributes)
o rename ρx (x a relation name)
o union

o set difference
o cartesian product

1. The Set Intersection Operation

Set intersection is denoted by ∩, and returns a relation that contains tuples that are in
both of its argument relations.

It does not add any power, as r ∩ s = r - (r - s).

To find all customers having both a loan and an account at the SFU branch, we write

2. The Natural Join Operation

Often we want to simplify queries on a cartesian product.

For example, to find all customers having a loan at the bank and the cities in which they
live, we need borrow and customer relations:

Our selection predicate obtains only those tuples pertaining to only one cname.

This type of operation is very common, so we have the natural join, denoted by the ⋈ sign.
Natural join combines a cartesian product and a selection into one operation. It performs a
selection forcing equality on those attributes that appear in both relation schemes.
Duplicates are removed as in all relation operations.

To illustrate, we can rewrite the previous query as

The resulting relation is shown in Figure 3.7.

Figure: Joining borrow and customer relations.

We can now make a more formal definition of natural join.

o Consider R and S to be sets of attributes.

o We denote attributes appearing in both relations by R ∩ S.
o We denote attributes in either or both relations by R ∪ S.
o Consider two relations r(R) and s(S).
o The natural join of r and s, denoted by r ⋈ s, is a relation on scheme R ∪ S.

o It is a projection onto R ∪ S of a selection on r × s, where the predicate requires
r.A = s.A for each attribute A in R ∩ S.

Formally,

r ⋈ s = π R ∪ S (σ r.A1 = s.A1 ∧ ... ∧ r.An = s.An (r × s))

where R ∩ S = {A1, ..., An}.
To find the assets and names of all branches which have depositors living in Stamford, we
need customer, deposit and branch relations:

Note that ⋈ is associative.


To find all customers who have both an account and a loan at the SFU branch:

This is equivalent to the set intersection version we wrote earlier. We see now that there
can be several ways to write a query in the relational algebra.
If two relations r and s have no attributes in common, then r ⋈ s = r × s.
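SQL offers the natural join directly, and it can also be spelled out as a product plus selection, exactly as in the definition above. A sketch, assuming borrow(branch_name, loan_number, cname, amount) and customer(cname, street, ccity) tables:

-- borrow natural-join customer, keeping the customer name and city
SELECT DISTINCT cname, ccity
FROM   borrow NATURAL JOIN customer;

-- The same query as product + selection (cname is the only common attribute)
SELECT DISTINCT b.cname, c.ccity
FROM   borrow b, customer c
WHERE  b.cname = c.cname;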

3. The Division Operation

Division, denoted ÷, is suited to queries that include the phrase ``for all''.

Suppose we want to find all the customers who have an account at all branches located in
Brooklyn.

Strategy: think of it as three steps.

We can obtain the names of all branches located in Brooklyn by

r1 = π bname (σ bcity = "Brooklyn" (branch))

We can also find all cname, bname pairs for which the customer has an account by

r2 = π cname, bname (deposit)

Now we need to find all customers who appear in r2 with every branch name in r1.

The divide operation provides exactly those customers, which is simply r2 ÷ r1.

Formally,

o Let r(R) and s(S) be relations.

o Let S ⊆ R.
o The relation r ÷ s is a relation on scheme R - S.
o A tuple t is in r ÷ s if for every tuple ts in s there is a tuple tr in r satisfying both of the
following:

tr[S] = ts[S] and tr[R - S] = t[R - S]

o These conditions say that the R - S portion of a tuple t is in r ÷ s if and only if there
are tuples with that R - S portion and the S portion in r for every value of the S
portion in relation s.

We will look at this explanation in class more closely.


The division operation can be defined in terms of the fundamental operations.

Read the text for a more detailed explanation.
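SQL has no division operator; the usual workaround for a "for all" query is a doubly negated NOT EXISTS. The sketch below expresses "customers who have an account at every branch located in Brooklyn"; the deposit(branch_name, cname, ...) and branch(branch_name, branch_city, ...) column names are assumptions:

-- Customers c such that there is NO Brooklyn branch
-- at which c does NOT have an account
SELECT DISTINCT d.cname
FROM   deposit d
WHERE  NOT EXISTS (
         SELECT 1
         FROM   branch b
         WHERE  b.branch_city = 'Brooklyn'
         AND    NOT EXISTS (
                  SELECT 1
                  FROM   deposit d2
                  WHERE  d2.cname       = d.cname
                  AND    d2.branch_name = b.branch_name));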

4. The Assignment Operation

Sometimes it is useful to be able to write a relational algebra expression in parts using a
temporary relation variable (as we did with r1 and r2 in the division example).

The assignment operation, denoted ←, works like assignment in a programming language.

We could rewrite our division definition as

temp1 ← π R-S (r)
temp2 ← π R-S ((temp1 × s) - π R-S,S (r))
result = temp1 - temp2

No extra relation is added to the database, but the relation variable created can be used in
subsequent expressions. Assignment to a permanent relation would constitute a
modification to the database

The set intersection operation is denoted by the symbol ∩. It is not a fundamental

operation; however, it is a more convenient way to write r - (r - s).
An example query of the operation, to find all customers who have both a loan and an
account, can be written as:
π customer-name (borrower) ∩ π customer-name (depositor)
The results of the query are the following:

Customer-name
Hayes
Jones
Smith

It has been shown that the set of relational algebra operations {σ, π, U, –, x} is a complete set;
that is, any of the other relational algebra operations can be expressed as a sequence of
operations from this set. For example, the INTERSECTION operation can be expressed by
using UNION and DIFFERENCE as follows:
R ∩ S ≡ ( R ∪ S ) – ((R – S ) ∪ ( S – R ))
Although, strictly speaking, INTERSECTION is not required, it is inconvenient to specify this
complex expression every time we wish to specify an intersection. As another example, a
JOIN operation can be specified as a CARTESIAN PRODUCT followed by a SELECT operation,
as we discussed:
<condition>S ≡ σ <condition> (R x S )
Similarly, a NATURAL JOIN can be specified as a CARTESIAN PRODUCT proceeded by
RENAME and followed by SELECT and PROJECT operations. Hence, the various JOIN
operations are also not strictly necessary for the expressive power of the relational algebra;
however, they are very important because they are convenient to use and are very commonly
applied in database applications


Assignment - the operation denoted by ←, which is used to assign expressions to a
temporary relation variable.

Cartesian product - the operation denoted by a cross (×), which allows combination of
information from any two relations.

Division - the operation denoted by ÷, used in queries whose results involve
the phrase "for all".

Natural join - the operation, denoted by ⋈, that combines a Cartesian product and
a selection on the result of that Cartesian product into a single
operation.

Project - the operation denoted by the Greek letter pi (π), which is used to
return an argument relation with certain attributes left out.

Rename - the operation denoted by the Greek letter rho (ρ), which allows the
results of a relational-algebra expression to be assigned a name, which
can later be used to refer to them.
Select - the operation denoted by the Greek letter sigma (σ), which enables a
selection of tuples that satisfy a given predicate.

Set difference - the operation denoted by -, which allows for finding tuples that are in one
relation but are not in another.

Set-intersection - the operation denoted by ∩, which results in the tuples that are in both
of the relations the operation is applied to.
Union - an operation on relations that yields all tuples appearing in either or
both of two compatible relations. Denoted by the symbol ∪.

5.4 Constraints on Relations

Representation of Relations

We can regard a relation in two ways: as a set of values and as a set of maps from attributes to
values.
Let be the schema of a relation R, and let be the domain of
the relation. Then R is a subset of and each tuple of the relation contains a set of values, one
drawn from each of the domains , each of which contains a unique null element, denoted .

We can also regard each element of the relation as a map from R to , so that if is

an element of an instance r of R, we can write and get the expected result.


Integrity Constraints on Relations

We define a (candidate) key on a relation schema R to be a subset of the

attributes of R such that for any instance for all we have for

. A primary key is a candidate key in which none of the . We designate one candidate


key to be the primary key of R, . We write to signify the projection of t onto the primary

key . Where there is no chance of confusion we will write for .


We require every relation to satisfy its primary key, and all its candidate keys. Let C be the set of

all candidate keys of R and let be the primary key of R: we require to satisfy

and

Operations on Relations:

Taking the view of a relation as a set, we can apply the normal set operations to relations over the
same domain. If the domains differ, this is not possible. We have, of course, the normal algebraic
structure to the operations: the null relation over a domain is well defined, and the null tuple is
the sole element of the null relation.
We also have three relational operators we wish to consider: select, join and project. First we

define for each relation R the domain of conditional expressions on relations, which map

an element of an instance to true or false.

Select:

Now we define by

is a relation over the same domain as R and is a subset of R. We notice that we can use

the same primary key for both R and and that must satisfy this key, since if there

exist then R does not satisfy .

Join:

The join operation is defined by

where we have

What is the schema for this? The key? Does it satisfy it?

Project:

We define by the post-condition

and by requiring that if is the domain of R then the domain of

is .
If we view R as a set of map we can view the projection operator as restricting each of
these maps to a smaller domain.


The schema of is A. If then forms a key for A. Otherwise, if there are no


nulls in any tuple of the projection of R we can use A as a primary key. Otherwise we cannot
in general identify a primary key.

Insertion:

For each relation R we define an insertion operation:

The insertion operation satisfies the invariant, since it will refuse to break it.

Update:
For each relation R we define an update operation.

Update also preserves the invariant.

Deletion:

We define the deletion of a tuple by:

5.5 Modification of the Database

The operations of the relational model can be categorised into retrievals and updates; here we will
discuss the update operations.
There are three basic update operations on relations:
(1) insert, (2) delete, and (3) modify.

A) The Insert Operation :


It is used to insert a new tuple (row) or tuples in a relation. It can violate any type of constraint
(key, referential, domain). A domain constraint is violated if an attribute value is given that
does not appear in the corresponding attribute domain. A key constraint is violated if the key
value in the new tuple already exists in another tuple of the relation r(R). A referential
constraint is violated if a foreign key value in the new tuple does not exist in the
referenced relation.
Suppose we have a table student(std_id number(4), std_name varchar2(10), std_course
varchar2(5), std_fee number(7,2)). Then, to insert values in this table, we will use the Insert
operation as:
Insert into student values( 1, ‘John’, ‘Msc’, 5000.50)
The relational algebra expresses an insertion by
r r U E
where r is relation and E is relational algebra expression. We express the insertion of a single
tuple by letting E be a constant relation containing one tuple. Suppose that we wish to insert
the fact that Smith has $1200 in account A-973 at Perryridge branch we write
account ← account ∪ {(A-973, "Perryridge", 1200)}
depositor depositor U {(“Smith”, A-973)}

1. To insert a tuple for Smith who has $1200 in account 9372 at the SFU branch.

2. To provide all loan customers in the SFU branch with a $200 savings account.

B) The Delete Operation


The Delete operation is used to delete tuples. The Delete operation can violate only referential
integrity, if the tuple being deleted is referenced by the foreign keys from other tuple in the
database. To specify deletion, conditions on the attributes of the relation select the tuple (or
tuples) to be deleted.
Delete the student tuple with std_id = 2:
Delete from student where std_id = 2;
We can delete only whole tuples; we cannot delete values on only particular attributes. In
relational algebra a deletion is expressed by
r r – E
where r is a relation and E is a relational algebra query.

Some examples:

1. Delete all of Smith's account records.

2. Delete all loans with loan numbers between 1300 and 1500.

3. Delete all accounts at Branches located in Needham.
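The three deletions above can be written in SQL as follows; this is only a sketch, in which the depositor, loan, account and branch column names are assumptions based on the earlier banking examples:

-- 1. Delete all of Smith's account records
DELETE FROM depositor WHERE customer_name = 'Smith';

-- 2. Delete all loans with loan numbers between 1300 and 1500
DELETE FROM loan WHERE loan_number BETWEEN 1300 AND 1500;

-- 3. Delete all accounts at branches located in Needham
DELETE FROM account
WHERE  branch_name IN (SELECT branch_name
                       FROM   branch
                       WHERE  branch_city = 'Needham');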

c) The Update Operation


Update (or Modify) is used to change the values of some attributes in existing tuples.
Whenever an update operation is applied, the integrity constraints specified on the relational
database scheme should not be violated. It is necessary to specify a condition on the attributes
of the relation to select the tuple (or tuples) to be modified.
Update the std_fee of the student tuple with std_id = 1 to 3000.50.

In SQL
Update student set std_fee = 3000.50 where std_id = 1

Updating allows us to change some values in a tuple without necessarily changing all.

We use the update operator with the form

where r is a relation with attribute A, which is assigned the value of expression E.

The expression E is any arithmetic expression involving constants and attributes in
relation r.

Some examples:

1. To increase all balances by 5 percent.

This statement is applied to every tuple in deposit.

2. To make two different rates of interest payment, depending on balance amount:

Domain constraint: data types (and checks) determine what values are valid for a particular column.
Referential constraint: a foreign key value must match a row in the referenced table; it maintains the
relationships between rows in multiple tables.
Entity constraint: every row in a table can be uniquely identified; the primary key cannot be null.
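All three kinds of constraint can be declared directly in SQL table definitions. A minimal sketch, reusing the student example together with a hypothetical course table:

CREATE TABLE course (
    course_id NUMBER(4) PRIMARY KEY              -- entity constraint
);

CREATE TABLE student (
    std_id    NUMBER(4) PRIMARY KEY,             -- entity constraint
    std_name  VARCHAR2(10),
    course_id NUMBER(4) REFERENCES course,       -- referential constraint
    std_fee   NUMBER(7,2) CHECK (std_fee >= 0)   -- domain (value) constraint
);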

5.6 Views

1. We have assumed up to now that the relations we are given are the actual relations stored
in the database.
2. For security and convenience reasons, we may wish to create a personalized collection of
relations for a user.
3. We use the term view to refer to any relation, not part of the conceptual model, that is
made visible to the user as a ``virtual relation''.
4. As relations may be modified by deletions, insertions and updates, it is generally not
possible to store views. (Why?) Views must then be recomputed for each query referring to
them.

View Definition

1. A view is defined using the create view command:

create view v as <query expression>

where <query expression> is any legal query expression.

The view created is given the name v.

2. To create a view all-customer of all branches and their customers:

3. Having defined a view, we can now use it to refer to the virtual relation it creates. View
names can appear anywhere a relation name can.
4. We can now find all customers of the SFU branch by writing
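A hedged SQL sketch of such a view and a query against it, assuming deposit and borrow tables that both carry branch_name and customer_name columns:

-- all-customer: every (branch, customer) pair, whether the customer
-- holds an account or a loan at that branch
CREATE VIEW all_customer AS
    SELECT branch_name, customer_name FROM deposit
    UNION
    SELECT branch_name, customer_name FROM borrow;

-- All customers of the SFU branch, via the view
SELECT customer_name FROM all_customer WHERE branch_name = 'SFU';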

Updates Through Views and Null Values

1. Updates, insertions and deletions using views can cause problems. The modifications on a
view must be transformed to modifications of the actual relations in the conceptual model
of the database.
2. An example will illustrate: consider a clerk who needs to see all information in the borrow
relation except amount.

Let the view loan-info be given to the clerk:

3. Since SQL allows a view name to appear anywhere a relation name may appear, the clerk
can write:

This insertion is represented by an insertion into the actual relation borrow, from which
the view is constructed.

However, we have no value for amount. A suitable response would be

o Reject the insertion and inform the user.


o Insert (``SFU'',3,``Ruth'',null) into the relation.

The symbol null represents a null or place-holder value. It says the value is unknown or
does not exist.

4. Another problem with modification through views: consider the view

This view lists the cities in which the borrowers of each branch live.

Now consider the insertion


Using nulls is the only possible way to do this (see Figure 3.22 in the textbook).

If we do this insertion with nulls, now consider the expression the view actually
corresponds to:

As comparisons involving nulls are always false, this query misses the inserted tuple.

To understand why, think about the tuples that got inserted into borrow and customer.
Then think about how the view is recomputed for the above query.

5.7 Relational Calculus


Relational calculus is an alternative to relational algebra. In contrast to the algebra, which is
procedural, the calculus is nonprocedural, or declarative, in that it allows us to describe the set of
answers without being explicit about how they should be computed. Relational calculus has had
a big influence on the design of commercial query languages such as SQL and, especially, Query-
by-Example (QBE).
The variant of the calculus that we present in detail is called the tuple relational calculus (TRC);
variables in TRC take on tuples as values. In another variant, called the domain relational
calculus (DRC), the variables range over field values. TRC has had more of an influence on SQL,
while DRC has strongly influenced QBE.

5.7.1 Tuple Relational Calculus


A tuple variable is a variable that takes on tuples of a particular relation schema as values. That
is, every value assigned to a given tuple variable has the same number and type of fields. A tuple
relational calculus query has the form { T | p(T) }, where T is a tuple variable and p(T) denotes a
formula that describes T; we will shortly define formulas and queries rigorously. The result of this
query is the set of all tuples t for which the formula p(T) evaluates to true when T = t. The language
for writing formulas p(T) is thus at the heart of TRC and is
essentially a simple subset of first-order logic. As a simple example, consider the following query.
(Q1) Find all sailors with a rating above 7.
{ S | S ∈ Sailors ∧ S.rating > 7 }

Sid sname Rating Age


22 Dustin 7 45.0
31 Lubber 8 55.5
52 Horatio 5 40.0

Instance S3 of Sailors
When this query is evaluated on instance S3 of the Sailors relation, the tuple variable S is
instantiated successively with each tuple, and the test S.rating>7 is applied. The answer contains
those instances of S that pass this test. On instance S3 of Sailors, the answer contains Sailors
tuples with sid 31.
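For comparison, the same request in SQL is essentially a direct transcription of the TRC formula (using the Sailors columns shown in the instance above):

-- { S | S in Sailors AND S.rating > 7 }
SELECT * FROM Sailors WHERE rating > 7;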

Syntax of TRC Queries

We now define these concepts formally, beginning with the notion of a formula. Let Rel be a
relation name, R and S be tuple variable, a an attribute of R, and b an attribute of S. Let op
denote an operator in the set {<, >, =, ≤, ≥, ≠}. An atomic formula is one of the following:

 R ∈ Rel
 R.a op S.b
 R.a op constant, or constant op R.a
A formula is recursively defined to be one of the following, where p and q are themselves formulas,
and p(R) denotes a formula in which the variable R appears:

• Any atomic formula

• ¬p, p ∧ q, p ∨ q, or p ⇒ q

• ∃R (p(R)), where R is a tuple variable

• ∀ R (pR), where R is a tuple variable

In the last two clauses above, the quantifiers ∃ and ∀ are said to bind the variable R. A variable is
said to be free in a formula or sub formula (a formula contained in a larger formula) if the (sub)
formula does not contain an occurrence of a quantifier that binds it.
We observe that every variable in a TRC formula appears in a sub formula that is atomic, and
every relation schema specifies a domain for each field; this observation ensures that each
variable in a TRC formula has a well-defined domain from which values for the variable are
drawn. That is, each variable has a well-defined type, in the programming language sense.
Informally, an atomic formula R ∈ Rel gives R the type of tuples in Rel, and comparisons such as
R.a op S.b and R.a op constant induce type restrictions on the field R.a. If a variable R does not
appear in an atomic formula of the form R ∈ Rel (i.e., it appears only in atomic formulas that are
comparisons), we will follow the convention that the type of R is a type whose fields include all
(and only) fields of R that appear in the formula.
We will not define types of variables formally, but the type of a variable should be clear in most
cases, and the important point to note is that comparisons of values having different types should
always fail. (In discussions of relational calculus, the simplifying assumption is often made that
there is a single domain or constants and that this is the domain associated with each field of
each relation.)
A TRC query is defined to be an expression of the form { T | p(T) }, where T is the only free variable in
the formula p.

Semantics of TRC Queries


What does a TRC query mean? More precisely, what is the set of answer tuples for a given TRC
query? The answer to a TRC query { T | p(T) }, as we noted earlier, is the set of all tuples t for which
the formula p(T) evaluates to true when the tuple value t is assigned to the free variable T.
A query is evaluated on a given instance of the database. Let each free variable in a formula F be
bound to a tuple value. For the given assignment of tuples to variables, with respect to the given
database instance, F evaluates to (or simply ‘is’) true if one of the following holds:

• F is an atomic formula R ∈ Rel, and R is assigned a tuple in the instance of relation Rel.

• F is a comparison R.a op S.b, R.a op constant, or constant op R.a, and the tuples assigned to
R and S have field values R.a and S.b that make the comparison true.

• F is of the form ¬p, and p is not true; or of the form p ∧ q, and both p and q are true; or of the
form p ∨ q, and one of them is true; or of the form p ⇒ q, and q is true whenever p is true.

• F is of the form ∃R(p(R)), and there is some assignment of tuples to the free variables in p(R),
including the variable R, that makes the formula p(R) true.

• F is of the form ∀ R (p(R)), and there is some assignment of tuples to the free variables in p(R)
that makes the formula p (R) true no matter what tuple is assigned to R.
Examples of TRC Queries
We now illustrate the calculus through several examples, using the instances B1 of Boats, R2 of
Reserves, and S3 of Sailors as shown below:

Sid sname Rating Age


22 Dustin 7 45.0
31 Lubber 8 55.5
52 Horatio 5 40.0

Instance S3 of Sailors

Bid Bname Btype


101 Titanic Carrier
102 Ganga Passenger
103 Love Racing
104 Manoj Racing
Instance B1 of Boats

Sid Bid Day

22 101 10/10/98
52 101 9/5/98
22 102 10/10/98
52 102 9/8/98
31 102 11/10/98
22 103 10/8/98
52 103 9/8/98
31 103 11/6/98
22 104 10/7/98
31 104 11/12/98

Instance R2 of Reserves

We will use parentheses as needed to make our formulas unambiguous. Often, a formula p (R)
includes a condition R ∈ Rel, and the meaning of the phrases some tuple R and for all tuples R is
intuitive. We will use the notation ∃ R ∈ Rel (p(R) for ∃R(R ∈Rel ∧ p(R)).

Similarly, we use the notation ∀R ∈ Rel (p(R)) for ∀ R (R∈Rel ⇒p(R)).
(Q2) find the names and ages of sailors with a rating above 7.

{ P | ∃S ∈ Sailors (S.rating > 7 ∧ P.name = S.sname ∧ P.age = S.age) }


This query illustrates a useful convention: P is considered to be a tuple variable with exactly two
fields, which are called name and age, because these are the only fields of P that are mentioned
and P does not range over any of the relations in the query; that is, there is no subformula of the
form P ∈ Relname. The result of this query is a relation with two fields, name and age. The atomic
formulas P.name = S.sname and P.age = S.age give values to the fields of the answer tuple P. On
instance S3 of Sailors, the answer is the set containing the single tuple (Lubber, 55.5).
(Q3) Find the sailor name, boat id, and reservation date for each reservation.

{P  ∃ R ∈ Reserves ∃ S ∈ Sailors

(R.sid = S.sid ∧ P.bid = R.bid ∧ P.day = R.day ∧ P.sname = S.sname)


For each Reserves tuple, we look for a tuple in Sailors with the same sid. Given a pair of such
tuples, we construct an answer tuple P with fields sname, bid, and day by copying the
corresponding fields from these two tuples. This query illustrates how we can combine values
from different relations in each answer tuple. The answer to this query on instances B1, R2, and
S3 is shown in figure given below:

Sname Bid Day

Dustin 101 10/11/98


Dustin 102 10/10/98
Dustin 103 10/08/98
Dustin 104 10/07/98
Lubber 102 11/10/98
Lubber 103 11/06/98
Lubber 104 11/12/98
Horatio 101 09/05/98
Horatio 102 09/08/98
Horatio 103 09/08/98

(Q4) find the names of sailors who have reserved boat 103.

{p ∃ S ∈ Sailors ∃ R ∈ Reserves (R.sid = S.sid ∧R.bid= 103 ∧P.sname = S.sname)}


This query can be read as follows: “Retrieve all sailor tuples for which there exists a tuple in
Reserves, having the same value in the sid field, and with bid = 103”. That is, for each sailor
tuple, we look for a tuple in Reserves that shows that this sailor has reserved boat 103. The
answer tuple P contains just one field, sname.
(Q5) Find the names of sailors who have reserved a red boat.

{(P  ∃S ∈ Sailors ∃ R ∈Reserves (R.sid = S. sid ∆∧ FP. Sname = S. sname

∧ ∃ B ∈Boads (B.bid = R.bid ∧ B.color = ‘red’))}

This query can be read as follows: “Retrieve all sailor tuples S for which there exist tuples R in
Reserves and B in Boats such that S.sid = R.sid, R.bid = B.bid, and B.color =’red’.” Another way to
write this query, which corresponds more closely to this reading, is as follows:

{ P | ∃S ∈ Sailors ∃R ∈ Reserves ∃B ∈ Boats

(R.sid = S.sid ∧ B.bid = R.bid ∧ B.color = ‘red’ ∧ P.sname = S.sname)}


(Q6) Find the names of sailors who have reserved at least two boats.

{P | ∃ S ∈Sailors ∃ R1 ∈ Reserves ∃ R2 ∈ Reserves

(S.sid = R1.sid ∧ R1.sid = R2.sid ∧ R1.bid ≠ R2.bid ∧ P.sname = S.sname) }


Contrast this query with the algebra version and see how much simpler the calculus version is. In
part, this difference is due to the cumbersome renaming of fields in the algebra version, but the
calculus version really is simpler.
(Q7) Find the names of sailors who have reserved all boats.

{P | ∃ S ∈ Sailors ∀B ∈ Boats

(∃R ∈ Reserves (S.sid = R.sid ∧R.bid = B.bid ∧ P.sname = S.sname)) }


This query was expressed using the division operator in relational algebra. Notice how easily it is
expressed in the calculus. The calculus query directly reflects how we might express the query in
English: “Find sailors S such that for all boats B there is a Reserves tuple showing that sailor S
has reserved boat B.”
(Q 8 ) Find sailors who have reserved all red boats.

{ S | S ∈ Sailors ∧ ∀ B ∈ Boats

(B.color =’red’ ⇒ (∃R ∈ Reserves (S.sid = R.sid ∧ R.bid = B.bid))}


This query can be read as follows: For each candidate (sailor), if a boat is red, the sailor must
have reserved it. That is, for a candidate sailor, a boat being red must imply the sailor having
reserved it. Observe that since we can return an entire sailor tuple as the answer instead of just
the sailor’s name, we have avoided introducing a new free variable (e.g., the variable P in the
previous example) to hold the answer values. On instances B1, R2, and S3, the answer contains
the Sailors tuples with sid 22 and 31.
We can write this query without using implication, by observing that an expression of the form p
⇒ q is logically equivalent to ¬p ∨ q:

{ S | S ∈ Sailors ∧∀ B ∈Boats

(B.color ≠ ‘red’ v (∃ R ∈ Reserves (S.sid = R.sid ∧R.bid = B.bid)))}


This query should be read as follows: “Find sailors S such that for all boats B, either the boat is
not red or a Reserves tuple shows that sailor S has reserved boat B.”

5.7.2 Domain Relational Calculus


A domain variable is a variable that ranges over the values in the domain of some attribute (e.g.,
the variable can be assigned an integer if it appears in an attribute whose domain is the set of
integers). A DRC query has the form { (x1, x2, ..., xn) | p(x1, x2, ..., xn) }, where each xi is either a
domain variable or a constant and p(x1, x2, ..., xn) denotes a DRC formula whose only free
variables are the variables among the xi, 1 ≤ i ≤ n. The result of this query is the set of all tuples
(x1, x2, ..., xn) for which the formula evaluates to true.

A DRC formula is defined in a manner that is very similar to the definition of a TRC formula. The
main difference is that the variables are now domain variables. Let op denote an operator in the
set { <,>, = ≤, ≠} and let X and Y be domain variables. An atomic formula in DRC is one of the
following:

• (∃, x2,…,xn) ∈Rel, where Rel is a relation with n attributes; each xi, 1 ≤ i n is either a variable
or a constant.

• X op Y

• X op constant, or constant op X
A formula is recursively defined to be one of the following, where p and q are themselves formulas,
and p(X) denotes a formula in which the variable X appears:

• Any atomic formula

• ¬p, p ∧ q, p ∨ q, or p ⇒ q

• ∃ X (p(X)), where X is a domain variable

• ∀ X (p(X), where X is a domain variable


the reader is invited to compare this definition with the definition of TRC formulas and see how
closely these two definitions correspond. We will not define the semantics of DRC formulas
formally; this is left as an exercise for the reader.
Examples of DRC Queries
We now illustrate DRC through several examples. The reader is invited to compare these with the
TRC versions.
(Q2) Find all sailors with a rating above 7.

{ (I,N,T,A) | (I, N, T, A) ∈ Sailors ∧ T > 7 }


This differs from the TRC version in giving each attribute a (variable) name. The condition (I, N, T,
A) ∈ Sailors ensures that the domain variables I, N, T, and A are restricted to be fields of the same
tuple. In comparison with the TRC query, we can say T>7 instead of S.rating > 7, but we must
specify the tuple (I, N,T, A) in the result, rather than just S.
(Q4) Find the names of sailors who have reserved boat 103.

{ (N) | ∃ I, T, A ((I,N,T,A) ∈ Sailors

∧∃ Ir, Br, D((Ir, Br, D ) ∈ Reserves ∧ Ir = I ∧ Br = 103)) }


Notice that only the sname field is retained in the answer and that only N is a free variable. We
use the notation ∃ Ir, Br, D(..) as a shorthand for ∃Ir (∃Br (∃D(..))).
Very often, all the quantified variables appear in a single relation, as in this example. An even
more compact notation in this case is ∃(Ir, Br, D) ∈ Reserves. With this notation, which we will
use henceforth, the above query would be as follows:

{ (N) | ∃ I, T, A ((I, N, T, A) ∈ Sailors

∧ ∃(Ir, Br, D) ∈ Reserves (Ir = I ∧ Br = 103 )) }


The comparison with the corresponding TRC formula should now be straightforward. This query
can also be written as follows; notice the repetition of variable I and the use of the constant 103:

{ (N) | ∃ I, T, A((I,N,T,A) ∈Sailors

∧ ∃D ((I, 103, D ) ∈Reserves))}

(Q5) Find the names of sailors who have reserved a red boat.

{ (N) | ∃ I,T, A((I, N, T, A) ∈ Sailors

∧ ∃(I, Br, D) ∈ Reserves ∧ ∃(Br, BN, 'red') ∈ Boats) }


(Q6) Find the names of sailors who have reserved at least two boats.

{(N) | ∃I, T, A ((I, N, T, A) ∈ Sailors ∧ ∃Br1, Br2, D1, D2 ((I, Br1, D1) ∈ Reserves ∧ (I, Br2, D2) ∈
Reserves ∧ Br1 ≠ Br2))}
Notice how the repeated use of variable I insures that the same sailor has reserved both the boats
in question.
(Q7) Find the names of sailors who have reserved all boats.
{ (N) | ∃I, T, A((I,N, T, A) ∈ Sailors ∧
∀B, BN, C(¬((B, BN, C) ∈ Boats) V
(∃(Ir, Br, D) ∈Reserves (I= Ir ∧ Br = B))))}
This query can be read as follows: “Find all values of N such that there is some tuple (I,N,T,A) in
Sailors satisfying the following condition: for every (B,BN,C), either this is not a tuple in Boats or
there is some tuple (Ir, Br, D ) in Reserves that proves that Sailor I has reserved boat B.” The ∀
quantifier allows the domain variables B, BN, and C to range over all values in their respective
attribute domains, and the pattern ‘ ¬((B, BN, C) ∈ Boats) V’ is necessary to restrict attention to
those values that appear in tuples of boats. This pattern is common in DRC formulas, and the
notation ∀ (B, BN, C) ∈ Boats can be used shorthand instead. This is similar to the notation
introduced earlier for ∃. With this notation the query would be written as follows:

{ (N) | ∃ I, T, A((I,N,T,A) ∈ Sailors ∧ ∀ (B,BN,C ) ∈ Boats

(∃( Ir, Br, D) ∈ Reserves ( I = Ir ∧ Br = B)))}


(Q8) Find sailors who have reserved all red boats.

{ (I, N, T, A) | (I,N,T,A) ∈Sailors ∧∀( B, BN, C) ∈Boats

(C =’red’ ⇒ ∃( Ir, Br, D) ∈ Reserves (I = Ir ∧ Br = B))}


Here, we find all sailors such that for every red boat there is a tuple in Reserves that shows the
sailor has reserved it.
SQL language is a " Query language", it contains many other capabilities besides querying a data
base. It includes features for defining the structure of the data, for modifying the data in data
base, and for specifying security constrains. SQL has clearly established itself as the standard
relational-data base language.

UNIT 6

SQL

6.1 Use Of SQL


6.2 DDL Statements
6.3 DML Statements
6.4 View Definitions
6.5 Constraints and Triggers
6.6 Keys and Foreign Keys
6.7 Constraints on Attributes and Tuples
6.8 Modification of Constraints
6.9 Cursors
6.10 Dynamic SQL

6.1 Introduction Of SQL

At the heart of every DBMS is a language that is similar to a programming language, but different
in that it is designed specifically for communicating with a database. One powerful language is
SQL. IBM developed SQL in the late 1970s and early 1980s as a way to standardize query
language across the many mainframe and microcomputer platforms that company produced.
SQL differs significantly from programming languages. Most programming languages are still
procedural: a procedural language consists of commands that tell the computer what to do --
instruction by instruction, step by step. SQL is not a programming language itself; it is a data
access language. SQL may be embedded in traditional procedural programming languages (like
COBOL). An SQL statement is not really a command to the computer. Rather, it is a description of
some of the data contained in a database. SQL is nonprocedural because it does not give step-by-
step commands to the computer or database. SQL describes data, and instructs the database to
do something with the data.
For example:

SELECT [Name], [Company_Name]
FROM Contacts
WHERE City = 'Kansas City' AND [Name] LIKE 'R%'

6.2 DDL Statements

Data Definition Language is a set of SQL commands used to create, modify and delete database
structures (not data). These commands would not normally be used by a general user, who should
be accessing the database via an application. They are normally used by the DBA (to a limited
extent), a database designer or an application developer. These statements are immediate; they are
not susceptible to ROLLBACK commands. You should also note that if you have executed several
DML updates, then issuing any DDL command will COMMIT all the updates, as every DDL
command implicitly issues a COMMIT to the database. Anybody using DDL must have
the CREATE object privilege and a tablespace area in which to create objects.
In an Oracle database, objects can be created at any time, whether users are on-line or not. The
tablespace need not be specified, as Oracle will pick up the user defaults (defined by the DBA) or the
system defaults. Tables will expand automatically to fill disk partitions (provided this has been set
up in advance by the DBA). Table structures may be modified on-line, although this can have dire
effects on an application, so be careful.

The following examples use a small set of sample library data; you may want to keep a copy of this data to hand for convenience.

Creating our two example tables

CREATE TABLE BOOK (


ISBN NUMBER(10),
TITLE VARCHAR2(200),
AUTHOR VARCHAR2(50),
COST NUMBER(8,2),
LENT_DATE DATE,
RETURNED_DATE DATE,
TIMES_LENT NUMBER(6),
SECTION_ID NUMBER(3)
)
CREATE TABLE SECTION (
SECTION_ID NUMBER(3),
SECTION_NAME CHAR(30),
BOOK_COUNT NUMBER(6)
)

The two commands above create our two sample tables and demonstrate the basic table
creation command. The CREATE keyword is followed by the type of object that we want
created (TABLE, VIEW, INDEX etc.), and that is followed by the name we want the object to
be known by. Between the outer brackets lie the parameters for the creation, in this case
the names, data-types and sizes of each field.

A NUMBER is a numeric field; the size is the precision, i.e. the maximum number of digits the
field can hold, not the maximum externally displayed width (a precision of 10 can hold a very
large number). A precision split with a comma denotes the total number of digits followed by the
number of digits after the decimal point (in this case a currency field with two decimal places).

A VARCHAR2 is a variable-length string field from 0-n where n is the specified size. Oracle
only takes up the space required to hold the actual value in the field; it doesn't allocate the entire
storage space unless required to by a maximum-sized field value (Max size 2000).

A CHAR is a fixed length string field (Max size 255).

A DATE is an internal date/time field (normally 7 bytes long).

A LONG or LONG RAW field (not shown) is used to hold large binary objects (Word
documents, AVI files etc.). No size is specified for these field types. (Max size 2Gb).

Creating our two example tables with constraints

Constraints are used to enforce table rules and prevent data dependent deletion (enforce
database integrity). You may also use them to enforce business rules (with some
imagination).

Our two example tables do have some rules which need enforcing, specifically both tables
need to have a prime key (so that the database doesn't allow replication of data). And the
Section ID needs to be linked to each book to identify which library section it belongs to
(the foreign key). We also want to specify which columns must be filled in and possibly
some default values for other columns. Constraints can be at the column or table level.

Constraint        Description

NULL / NOT NULL   NOT NULL specifies that a column must have some value. NULL
                  (the default) allows NULL values in the column.

DEFAULT           Specifies some default value if no value is entered by the user.

UNIQUE            Specifies that the column(s) must have unique values.

PRIMARY KEY       Specifies that the column(s) are the table prime key and must have
                  unique values. An index is automatically generated for the column.

FOREIGN KEY       Specifies that the column(s) are a table foreign key and will use the
                  referential uniqueness of the parent table. An index is automatically
                  generated for the column. Foreign keys allow deletion cascades and
                  table / business rule validation.

CHECK             Applies a condition to an input column value.

DISABLE           You may suffix DISABLE to any other constraint to make Oracle
                  ignore the constraint; the constraint will still be available to
                  applications / tools and you can enable the constraint later if
                  required.

CREATE TABLE SECTION (


SECTION_ID NUMBER(3) CONSTRAINT S_ID CHECK (SECTION_ID > 0),
SECTION_NAME CHAR(30) CONSTRAINT S_NAME NOT NULL,
BOOK_COUNT NUMBER(6),
CONSTRAINT SECT_PRIME PRIMARY KEY (SECTION_ID))

CREATE TABLE BOOK (


ISBN NUMBER(10) CONSTRAINT B_ISBN CHECK (ISBN BETWEEN 1 AND 2000),
TITLE VARCHAR2(200) CONSTRAINT B_TITLE NOT NULL,
AUTHOR VARCHAR2(50) CONSTRAINT B_AUTH NOT NULL,
COST NUMBER(8,2) DEFAULT 0.00 DISABLE,
LENT_DATE DATE,
RETURNED_DATE DATE,
TIMES_LENT NUMBER(6),
SECTION_ID NUMBER(3),
CONSTRAINT BOOK_PRIME PRIMARY KEY (ISBN),
CONSTRAINT BOOK_SECT FOREIGN KEY (SECTION_ID) REFERENCES SECTION(SECTION_ID))

We have now created our tables with constraints. Column-level constraints go directly after the
column definition to which they refer; table-level constraints go after the last column definition.
Table-level constraints are normally used (and must be used) for compound (multi-column) foreign
and prime key definitions; the example table-level constraints could have been placed as column
definitions if that was your preference (there would have been no difference to their function). The
CONSTRAINT keyword is followed by a unique constraint name and then the constraint definition.
The constraint name is used to manipulate the constraint once the table has been created. You
may omit the CONSTRAINT keyword and constraint name if you wish, but you will then have no
easy way of enabling / disabling the constraint without deleting the table and rebuilding it; Oracle
does give default names to constraints not explicitly named - you can check these by selecting from
the USER_CONSTRAINTS data dictionary view. Note that the CHECK constraint accepts any
clause that would be valid in a SELECT WHERE clause (enclosed in brackets); any value inbound
to this column is validated before the table is updated and accepted / rejected via the CHECK
clause. Note that the order in which the tables are created has changed; this is because we
now reference the SECTION table from the BOOK table. The SECTION table must exist before we
create the BOOK table, or we will receive an error when we try to create the BOOK table. The
foreign key constraint cross-references the field SECTION_ID in the BOOK table to the field (and
primary key) SECTION_ID in the SECTION table (REFERENCES keyword).

If we wish we can introduce cascading validation and some constraint violation logging to our
tables.

CREATE TABLE AUDIT (

ROW_ID ROWID,
OWNER VARCHAR2(30),
TABLE_NAME VARCHAR2(30),
CONSTRAINT VARCHAR2(30))

CREATE TABLE SECTION (


SECTION_ID NUMBER(3) CONSTRAINT S_ID CHECK (SECTION_ID > 0),
SECTION_NAME CHAR(30) CONSTRAINT S_NAME NOT NULL,
BOOK_COUNT NUMBER(6),
CONSTRAINT SECT_PRIME PRIMARY KEY (SECTION_ID), EXCEPTIONS INTO AUDIT)

CREATE TABLE BOOK (


ISBN NUMBER(10) CONSTRAINT B_ISBN CHECK (ISBN BETWEEN 1 AND 2000),
TITLE VARCHAR2(200) CONSTRAINT B_TITLE NOT NULL,
AUTHOR VARCHAR2(50) CONSTRAINT B_AUTH NOT NULL,
COST NUMBER(8,2) DEFAULT 0.00 DISABLE,
LENT_DATE DATE,
RETURNED_DATE DATE,
TIMES_LENT NUMBER(6),
SECTION_ID NUMBER(3),
CONSTRAINT BOOK_PRIME PRIMARY KEY (ISBN),
CONSTRAINT BOOK_SECT FOREIGN KEY (SECTION_ID) REFERENCES SECTION(SECTION_ID)
ON DELETE CASCADE)

Oracle (and any other decent RDBMS) would not allow us to delete a section which had books
assigned to it as this breaks integrity rules. If we wanted to get rid of all the book records assigned
to a particular section when that section was deleted we could implement a DELETE CASCADE.
The delete cascade operates across a foreign key link and removes all child records associated
with a parent record (we would probably want to reassign the books rather than delete them in
the real world).

To log constraint violations I have created a new table (AUDIT) and stated that all exceptions on
the SECTION table should be logged in this table; you can then view the contents of this table
with standard SELECT statements. The AUDIT table must have the structure shown but can be
called anything.

It is possible to record a description or comment against a newly created or existing table or


individual column by using the COMMENT command. The comment command writes your table /
column description into the data dictionary. You can query column comments by selecting against
dictionary views ALL_COL_COMMENTS and USER_COL_COMMENTS.

You can query table comments by selecting against dictionary views ALL_TAB_COMMENTS and
USER_TAB_COMMENTS. Comments can be up to 255 characters long.
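
For example (reusing the sample BOOK table; the comment text itself is, of course, arbitrary),
comments could be recorded as follows:

COMMENT ON TABLE JD11.BOOK IS 'Library book catalogue, one row per title'

COMMENT ON COLUMN JD11.BOOK.COST IS 'Purchase price of the book'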

Altering tables and constraints

Modification of database object structure is executed with the ALTER statement.

You can modify a constraint as follows :-

Add new constraint to column or table.

Remove constraint.

Enable / disable constraint.

You cannot change a constraint definition.

You can modify a table as follows :-

Add new columns.

Modify existing columns.

You cannot delete an existing column.

An example of adding a column to a table is given below :-

ALTER TABLE JD11.BOOK ADD (REVIEW VARCHAR2(200))

This statement adds a new column (REVIEW) to our book table, to enable library members to
browse the database and read short reviews of the books.

If we want to add a constraint to our new column we can use the following ALTER statement :-

ALTER TABLE JD11.BOOK MODIFY(REVIEW NOT NULL)

Note that we can't specify a constraint name with the above statement. If we wanted to further
modify a constraint (other than enable / disable) we would have to drop the constraint and then
re-apply it, specifying any changes.

Assuming that we decide that 200 bytes is insufficient for our review field we might then want to
increase its size. The statement below demonstrates this :-

ALTER TABLE JD11.BOOK MODIFY (REVIEW VARCHAR2(400))

We could not decrease the size of the column if the REVIEW column contained any data.

ALTER TABLE JD11.BOOK DISABLE CONSTRAINT B_AUTH


ALTER TABLE JD11.BOOK ENABLE CONSTRAINT B_AUTH

The above statements demonstrate disabling and enabling a constraint. Note that if, between
disabling a constraint and re-enabling it, data was entered into the table that included NULL values
in the AUTHOR column, then you wouldn't be able to re-enable the constraint. This is because the
existing data would break the constraint integrity. You could update the column to replace NULL
values with some default and then re-enable the constraint.

Dropping (deleting) tables and constraints

To drop a constraint from a table we use the ALTER statement with a DROP clause. Some
examples follow :-

ALTER TABLE JD11.BOOK DROP CONSTRAINT B_AUTH

The above statement will remove the not null constraint (defined at table creation) from the
AUTHOR column. The value following the CONSTRAINT keyword is the name of constraint.
ALTER TABLE JD11.BOOK DROP PRIMARY KEY

The above statement drops the primary key constraint on the BOOK table.

ALTER TABLE JD11.SECTION DROP PRIMARY KEY CASCADE

The above statement drops the primary key on the SECTION table. The CASCADE option drops
the foreign key constraint on the BOOK table at the same time.

Use the DROP command to delete database structures like tables. Dropping a table removes the
structure, data, privileges, views and synonyms associated with the table (you cannot roll back a
DROP, so be careful). You can specify a CASCADE option to ensure that constraints referring to the
dropped table within other tables (foreign keys) are also removed by the DROP.

DROP TABLE SECTION

The above statement attempts to drop the SECTION table on its own; because the BOOK table
still holds an enabled foreign key reference to SECTION, Oracle will reject the drop. The CASCADE
CONSTRAINTS option used below removes the referencing foreign key constraint as part of the drop.

DROP TABLE SECTION CASCADE CONSTRAINTS

6.3 DML Statements

Data manipulation language is the area of SQL that allows you to change data within the
database. It consists of three command statement groups: INSERT, UPDATE and DELETE.

Inserting new rows into a table

We insert new rows into a table with the INSERT INTO command. A simple example is given
below.

INSERT INTO JD11.SECTION VALUES (SECIDNUM.NEXTVAL, 'Computing', 0)

The INSERT INTO command is followed by the name of the table (and owning schema if
required); this in turn is followed by the VALUES keyword, which denotes the start of the
value list. The value list comprises all the values to insert into the specified columns. We
have not specified the columns we want to insert into in this example, so we must provide a
value for each and every column in the correct order. The correct order of values can be
determined by doing a SELECT * or DESCRIBE against the required table; the order in which
the columns are displayed is the order of the values that you specify in the value list. If we
want to specify columns individually (when not filling all values in a new row) we can do
this with a column list specified before the VALUES keyword. Our example is reworked
below; note that we can specify the columns in any order - our values are now in the order
that we specified for the column list.

INSERT INTO JD11.SECTION (SECTION_NAME, SECTION_ID) VALUES ('Computing',


SECIDNUM.NEXTVAL)

In the above example we haven't specified the BOOK_COUNT column so we don't provide a value
for it, this column will be set to NULL which is acceptable since we don't have any constraint on
the column that would prevent our new row from being inserted.

The SQL required to generate the data in the two test tables is given below.

INSERT INTO JD11.SECTION
(SECTION_NAME, SECTION_ID)
VALUES
('Fiction', 10);

INSERT INTO JD11.SECTION


(SECTION_NAME, SECTION_ID)
VALUES
('Romance', 5);

INSERT INTO JD11.SECTION


(SECTION_NAME, SECTION_ID)
VALUES
('Science Fiction', 6);

INSERT INTO JD11.SECTION


(SECTION_NAME, SECTION_ID)
VALUES
('Science', 7);

INSERT INTO JD11.SECTION


(SECTION_NAME, SECTION_ID)
VALUES
('Reference', 9);

INSERT INTO JD11.SECTION


(SECTION_NAME, SECTION_ID)
VALUES
('Law', 11);

INSERT INTO JD11.BOOK


(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT, SECTION_ID)
VALUES
(21, 'HELP', 'B.Baker', 20.90, '20-AUG-97', NULL, 10, 9);

INSERT INTO JD11.BOOK


(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT, SECTION_ID)
VALUES
(87, 'Killer Bees', 'E.F.Hammond', 29.90, NULL, NULL, NULL, 9);

INSERT INTO JD11.BOOK


(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT, SECTION_ID)
VALUES
(90, 'Up the creek', 'K.Klydsy', 15.95, '15-JAN-97', '21-JAN-97', 1, 10);

INSERT INTO JD11.BOOK


(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT, SECTION_ID)
VALUES
(22, 'Seven seas', 'J.J.Jacobs', 16.00, '21-DEC-97', NULL, 19, 5);

INSERT INTO JD11.BOOK


(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT, SECTION_ID)
VALUES
(91, 'Dirty steam trains', 'J.SP.Smith', 8.25, '14-JAN-98', NULL, 98, 9);

INSERT INTO JD11.BOOK


(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT, SECTION_ID)
VALUES
(101, 'The story of trent', 'T.Wilbury', 17.89, '10-JAN-98', '16-JAN-98', 12, 6);

INSERT INTO JD11.BOOK


(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT, SECTION_ID)
VALUES
(8, 'Over the past again', 'K.Jenkins', 19.87, NULL, NULL, NULL, 10);

INSERT INTO JD11.BOOK


(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT, SECTION_ID)
VALUES
(79, 'Courses for horses', 'H.Harriot', 10.34, '17-JAN-98', NULL, 12, 9);

INSERT INTO JD11.BOOK


(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT, SECTION_ID)
VALUES
(989, 'Leaning on a tree', 'M.Kilner', 19.41, '12-NOV-97', '22-NOV-97', 56, 11);

Changing row values with UPDATE

The UPDATE command allows you to change the values of rows in a table, you can include a
WHERE clause in the same fashion as the SELECT statement to indicate which row(s) you want
values changed in. In much the same way as the INSERT statement you specify the columns you
want to update and the new values for those specified columns. The combination of WHERE
clause (row selection) and column specification (column selection) allows you to pinpoint exactly
the value(s) you want changed. Unlike the INSERT command the UPDATE command can change
multiple rows so you should take care that you are updating only the values you want changed
(see the transactions discussion for methods of limiting damage from accidental updates).

An example is given below, this example will update a single row in our BOOK table :-

UPDATE JD11.BOOK SET TITLE = 'Leaning on a wall', AUTHOR = 'J.Killner', TIMES_LENT =


0, LENT_DATE = NULL, RETURNED_DATE = NULL WHERE ISBN = 989

We specify the table to be updated after the UPDATE keyword. Following the SET keyword we
specify a comma delimited list of column names / new values, each column to be updated must
be specified here (note that you can set columns to NULL by using the NULL keyword instead of a
new value). The WHERE clause follows the last column / new value specification and is
constructed in the same way as for the SELECT statement, use the WHERE clause to pinpoint
which rows to be updated. If you don't specify a WHERE clause on an UPDATE command all rows
will be updated (this may or may not be the desired result).

Deleting rows with DELETE

The DELETE command allows you to remove rows from a table, you can include a WHERE clause
in the same fashion as the SELECT statement to indicate which row(s) you want deleted - in
nearly all cases you should specify a WHERE clause, running a DELETE without a WHERE clause
deletes ALL rows from the table. Unlike the INSERT command, the DELETE command can affect
multiple rows, so you should take great care that you are deleting only the rows you want removed
(see the transactions discussion for methods of limiting damage from accidental deletions).

An example is given below, this example will delete a single row in our BOOK table :-

DELETE FROM JD11.BOOK WHERE ISBN = 989

The DELETE FROM command is followed by the name of the table from which a row will be
deleted, followed by a WHERE clause specifying the column / condition values for the deletion.

DELETE FROM JD11.BOOK WHERE ISBN <> 989

This delete removes all records from the BOOK table except the one specified. Remember that if
you omit the WHERE clause all rows will be deleted.

6.4 View Definitions

DBMaker provides several convenient methods of customizing and speeding up access to your
data. Views and synonyms are supported to allow user-defined views and names for database
objects. Indexes provide a much faster method of retrieving data from a table when you use a
column with an index in a query.

Managing Views:

DBMaker provides the ability to define a virtual table, called a view, which is based on existing
tables and is stored in the database as a definition and a user-defined view name. The view
definition is stored persistently in the database, but the actual data that you will see in the view is
not physically stored anywhere. Rather, the data is stored in the base tables from which the view's
rows are derived. A view is defined by a query which references one or more tables (or other
views).
Views are a very helpful mechanism for using a database. For example, you can define complex
queries once and use them repeatedly without having to re-invent them over and over.
Furthermore, views can be used to enhance the security of your database by restricting access to
a predetermined set of rows and/or columns of a table.
Since views are derived by querying tables, it is not always possible to determine which rows of the
underlying tables should be updated. Due to this limitation, views in DBMaker can only be queried;
users cannot update, insert into, or delete from views.

Creating Views :

Each view is defined by a name together with a query that references tables or other views. You
can specify a list of column names for the view different from those in the original table when
creating a view. If you do not specify any new column names, the view will use the column names
from the underlying tables.

For example, if you want users to see only three columns of the table Employees, you can create a
view with the SQL command shown below. Users can then view only the FirstName, LastName
and Telephone columns of the table Employees through the view empView.

dmSQL> create view empView (FirstName, LastName, Telephone) as


select FirstName, LastName, Phone from Employees;

The query that defines a view cannot contain the ORDER BY clause or UNION operator.

Dropping Views :

You can drop a view when it is no longer required. When you drop a view, only the definition
stored in system catalog is removed. There is no effect on the base tables that the view was
derived from. To drop a view, execute the following command:
dmSQL> DROP VIEW empView;

Managing Synonyms

A synonym is an alias, or alternate name, for any table or view. Since a synonym is simply an
alias, it requires no storage other than its definition in the system catalog.
Synonyms are useful for simplifying a fully qualified table or view name. DBMaker normally
identifies tables and views with fully qualified names that are composites of the owner and object
names. By using a synonym anyone can access a table or view through the corresponding
synonym without having to use the fully qualified name. Because a synonym has no owner name,
all synonyms in the database must be unique so DBMaker can identify them.

Creating Synonyms

You can create a synonym with the following SQL command:


dmSQL> create synonym Employees for Employees;
If the owner of the table Employees is the SYSADM, this command creates an alias named
Employees for the table SYSADM.Employees. All database users can directly reference the table
SYSADM.Employees through the synonym Employees.

Dropping Synonyms

You can drop a synonym that is no longer required. When you drop a synonym, only its definition
is removed from the system catalog.
The following SQL command drops the synonym Employees:
dmSQL> drop synonym Employees;

Managing Indexes

An index provides support for fast random access to a row. You can build indexes on a table to
speed up searching. For example, when you execute the query SELECT NAME FROM
EMPLOYEES WHERE NUMBER = 10005, it is possible to retrieve the data in a much shorter time
if there is an index created on the NUMBER column.
An index can be composed of more than one column, up to a maximum of 16 columns. Although
a table can have up to 252 columns, only the first 127 columns can be used in an index.
An index can be unique or non-unique. In a unique index, no more than one row can have the
same key value, with the exception that any number of rows may have NULL values. If you create
a unique index on a non-empty table, DBMaker will check whether all existing keys are distinct or
not. If there are duplicate keys, DBMaker will return an error message. After creating a unique
index on a table, you can insert a row in this table and DBMaker will certify that there is no
existing row that already has the same key as the new row.
When creating an index, you can specify the sort order of each index column as ascending or
descending. For example, suppose there are five keys in a table with the values 1, 3, 9, 2, and 6.
In ascending order the sequence of keys in the index is 1, 2, 3, 6, and 9, and in descending order
the sequence of keys in the index is 9, 6, 3, 2, and 1.
When you execute a query, the index order will occasionally affect the order of the data output.
For example, if you have a table named FRIEND_TABLE with NAME and AGE columns, the output
will appear as below when you execute the query SELECT NAME, AGE FROM FRIEND_TABLE
WHERE AGE > 20 using a descending index on the AGE column.
name age
---------------- ----------------
Jeff 49
Kevin 40
Jerry 38
Hughes 30
Cathy 22

As for tables, when you create an index you can specify the fillfactor for it. The fill factor denotes
how dense the keys will be in the index pages. The legal fill factor values are in the range from 1%
to 100%, and the default is 100%. If you often update data after creating the index, you can set a
loose fill factor in the index, for example 60%. If you never update the data in this table, you can
leave the fill factor at the default value of 100%.
Before creating indexes on a table, it is recommended that you load all your data first, especially if
you have a large amount of data for that table. If you create an index before loading the data into
a table, the indexes will be updated each time you load a new row. As you can see, it is far more
efficient to create an index after loading a large amount of data than to create an index before
loading the data.

Creating Indexes :

To create an index on a table, you must specify the index name and index columns. You can
specify the sort order of each column as ascending (ASC) or descending (DESC).

The default sort order is ascending.


For example, the following SQL command creates an index IDX1 on the column NUMBER of table
EMPLOYEES in descending order.

dmSQL> create index idx1 on Employees (Number desc);

Also, if you want to create a unique index you have to explicitly specify it. Otherwise DBMaker
implicitly creates non-unique indexes. The following example shows you how
to create a unique index idx1 on the column Number of the table Employees:
dmSQL> create unique index idx1 on Employees (Number);

The next example shows you how to create an index with a specified fill factor:
dmSQL> create index idx2 on Employees(Number, LastName DESC) fillfactor 60;

Dropping Indexes:

You can drop indexes using the DROP INDEX statement. In general, you might need to drop an
index if it becomes fragmented, which reduces its efficiency. Rebuilding the index will create a
denser, unfragmented index.
If the index is a primary key and is referred to by other tables, it cannot be dropped.
The following SQL command drops the index idx1 from the table Employees.
dmSQL> drop index idx1 from Employees;

6.5 Constraints and Triggers

Constraints are declarations of conditions about the database that must remain true. These
include attributed-based, tuple-based, key, and referential integrity constraints. The system
checks for the violation of the constraints on actions that may cause a violation, and aborts the
action accordingly. Information on SQL constraints can be found in the textbook. The Oracle

implementation of constraints differs from the SQL standard, as documented in Oracle 9i SQL
versus Standard SQL.

Triggers are a special PL/SQL construct similar to procedures. However, a procedure is executed
explicitly from another block via a procedure call, while a trigger is executed implicitly whenever
the triggering event happens. The triggering event is either an INSERT, DELETE, or UPDATE
command. The timing can be either BEFORE or AFTER. The trigger can be either row-level or
statement-level, where the former fires once for each row affected by the triggering statement and
the latter fires once for the whole statement.

Deferring Constraint Checking

Sometimes it is necessary to defer the checking of certain constraints, most commonly in the
"chicken-and-egg" problem. Suppose we want to say:

CREATE TABLE chicken (cID INT PRIMARY KEY, eID INT REFERENCES egg(eID));

CREATE TABLE egg(eID INT PRIMARY KEY, cID INT REFERENCES chicken(cID));

But if we simply type the above statements into Oracle, we'll get an error. The reason is that the
CREATE TABLE statement for chicken refers to table egg, which hasn't been created yet! Creating
egg won't help either, because egg refers to chicken.

To work around this problem, we need SQL schema modification commands. First, create chicken
and egg without foreign key declarations:

CREATE TABLE chicken(cID INT PRIMARY KEY, eID INT);

CREATE TABLE egg(eID INT PRIMARY KEY, cID INT);

Then, we add foreign key constraints:

ALTER TABLE chicken ADD CONSTRAINT chickenREFegg


FOREIGN KEY (eID) REFERENCES egg(eID)
INITIALLY DEFERRED DEFERRABLE;

ALTER TABLE egg ADD CONSTRAINT eggREFchicken


FOREIGN KEY (cID) REFERENCES chicken(cID)
INITIALLY DEFERRED DEFERRABLE;
INITIALLY DEFERRED DEFERRABLE tells Oracle to do deferred constraint checking. For
example, to insert (1, 2) into chicken and (2, 1) into egg, we use:

INSERT INTO chicken VALUES(1, 2);

INSERT INTO egg VALUES(2, 1);

COMMIT;

Because we've declared the foreign key constraints as "deferred", they are only checked at the
commit point. (Without deferred constraint checking, we cannot insert anything into chicken and
egg, because the first INSERT would always be a constraint violation.)

Finally, to get rid of the tables, we have to drop the constraints first, because Oracle won't allow
us to drop a table that's referenced by another table.

ALTER TABLE egg DROP CONSTRAINT eggREFchicken;

ALTER TABLE chicken DROP CONSTRAINT chickenREFegg;

DROP TABLE egg;

DROP TABLE chicken;

Basic Trigger Syntax:

Below is the syntax for creating a trigger in Oracle (which differs slightly from standard SQL
syntax):
CREATE [OR REPLACE] TRIGGER <trigger_name>

{BEFORE|AFTER} {INSERT|DELETE|UPDATE} ON <table_name>

[REFERENCING [NEW AS <new_row_name>] [OLD AS <old_row_name>]]

[FOR EACH ROW [WHEN (<trigger_condition>)]]

<trigger_body>
Some important points to note:

• You can create only BEFORE and AFTER triggers for tables. (INSTEAD OF triggers are only
available for views; typically they are used to implement view updates.)

• You may specify up to three triggering events using the keyword OR. Furthermore,
UPDATE can be optionally followed by the keyword OF and a list of attribute(s) in
<table_name>. If present, the OF clause defines the event to be only an update of the
attribute(s) listed after OF. Here are some examples:
• ... INSERT ON R ...

• ... INSERT OR DELETE OR UPDATE ON R ...

... UPDATE OF A, B OR INSERT ON R ...

• If FOR EACH ROW option is specified, the trigger is row-level; otherwise, the trigger is
statement-level.

• Only for row-level triggers:


o The special variables NEW and OLD are available to refer to new and old tuples
respectively. Note: In the trigger body, NEW and OLD must be preceded by a colon
(":"), but in the WHEN clause, they do not have a preceding colon! See example
below.
o The REFERENCING clause can be used to assign aliases to the variables NEW and
OLD.
o A trigger restriction can be specified in the WHEN clause, enclosed by parentheses.
The trigger restriction is a SQL condition that must be satisfied in order for Oracle
to fire the trigger. This condition cannot contain subqueries. Without the WHEN
clause, the trigger is fired for each row.

• <trigger_body> is a PL/SQL block, rather than sequence of SQL statements. Oracle has
placed certain restrictions on what you can do in <trigger_body>, in order to avoid
situations where one trigger performs an action that triggers a second trigger, which then
triggers a third, and so on, which could potentially create an infinite loop. The restrictions
on <trigger_body> include:
o You cannot modify the same relation whose modification is the event triggering the
trigger.
o You cannot modify a relation connected to the triggering relation by another
constraint such as a foreign-key constraint.

Trigger Example

We illustrate Oracle's syntax for creating a trigger through an example based on the following two
tables:
CREATE TABLE T4 (a INTEGER, b CHAR(10));

CREATE TABLE T5 (c CHAR(10), d INTEGER);


We create a trigger that may insert a tuple into T5 when a tuple is inserted into T4. Specifically,
the trigger checks whether the new tuple has a first component 10 or less, and if so inserts the
reverse tuple into T5:

CREATE TRIGGER trig1


AFTER INSERT ON T4
REFERENCING NEW AS newRow
FOR EACH ROW
WHEN (newRow.a <= 10)
BEGIN
INSERT INTO T5 VALUES(:newRow.b, :newRow.a);
END trig1;
.
run;
Notice that we end the CREATE TRIGGER statement with a dot and run, as for all PL/SQL
statements in general. Running the CREATE TRIGGER statement only creates the trigger; it does
not execute the trigger. Only a triggering event, such as an insertion into T4 in this example,
causes the trigger to execute.

Displaying Trigger Definition Errors

As for PL/SQL procedures, if you get a message


Warning: Trigger created with compilation errors.
you can see the error messages by typing
show errors trigger <trigger_name>;
Alternatively, you can type, SHO ERR (short for SHOW ERRORS) to see the most recent
compilation error. Note that the reported line numbers where the errors occur are not accurate.

Viewing Defined Triggers

To view a list of all defined triggers, use:

select trigger_name from user_triggers;

For more details on a particular trigger:

select trigger_type, triggering_event, table_name, referencing_names, trigger_body


from user_triggers
where trigger_name = '<trigger_name>';

Dropping Triggers

To drop a trigger:
drop trigger <trigger_name>;
Disabling Triggers

To disable or enable a trigger:

alter trigger <trigger_name> {disable|enable};

Aborting Triggers with Error

Triggers can often be used to enforce constraints. The WHEN clause or body of the trigger can
check for the violation of certain conditions and signal an error accordingly using the Oracle built-
in function RAISE_APPLICATION_ERROR. The action that activated the trigger (insert, update, or
delete) would be aborted. For example, the following trigger enforces the constraint Person.age >=
0:

create table Person (age int);

CREATE TRIGGER PersonCheckAge

AFTER INSERT OR UPDATE OF age ON Person

FOR EACH ROW

BEGIN
IF (:new.age < 0)

THEN

RAISE_APPLICATION_ERROR(-20000, 'no negative age allowed');

END IF;

END;
.
RUN;

If we attempted to execute the insertion:

insert into Person values (-3);

we would get the error message:

ERROR at line 1:

ORA-20000: no negative age allowed

ORA-06512: at "MYNAME.PERSONCHECKAGE", line 3

ORA-04088: error during execution of trigger 'MYNAME.PERSONCHECKAGE'

and nothing would be inserted. In general, the effects of both the trigger and the triggering
statement are rolled back.
6.6 Keys and Foreign Keys

The word "key" is much used and abused in the context of relational database design. In pre-
relational databases (hierarchtical, networked) and file systems (ISAM, VSAM, et al) "key" often
referred to the specific structure and components of a linked list, chain of pointers, or other
physical locator outside of the data. It is thus natural, but unfortunate, that today people often
associate "key" with a RDBMS "index". We will explain what a key is and how it differs from an
index.

According to Codd, Date, and all other experts, a key has only one meaning in relational theory: it
is a set of one or more columns whose combined values are unique among all occurrences in a
given table. A key is the relational means of specifying uniqueness.

Why Are Keys Important?


Keys are crucial to a table structure for the following reasons:

• They ensure that each record in a table is precisely identified. As you already know, a table
represents a singular collection of similar objects or events. (For example, a CLASSES table
represents a collection of classes, not just a single class.) The complete set of records within the
table constitutes the collection, and each record represents a unique instance of the table's
subject within that collection. You must have some means of accurately identifying each instance,
and a key is the device that allows you to do so.

• They help establish and enforce various types of integrity. Keys are a major component of table-
level integrity and relationship-level integrity. For instance, they enable you to ensure that a table
has unique records and that the fields you use to establish a relationship between a pair of tables
always contain matching values.

• They serve to establish table relationships. As you'll learn in Chapter 10, you'll use keys to
establish a relationship between a pair of tables.

There are only three types of relational keys (foreign keys are another issue and discussed
separately):

Candidate Key

As stated above, a candidate key is any set of one or more columns whose combined values are
unique among all occurrences (i.e., tuples or rows). Since a null value is not guaranteed to be
unique, no component of a candidate key is allowed to be null.
There can be any number of candidate keys in a table (as demonstrated elsewhere). Relational
pundits are not in agreement whether zero candidate keys is acceptable, since that would
contradict the (debatable) requirement that there must be a primary key.

Primary Key

The primary key of any table is any candidate key of that table which the database designer
arbitrarily designates as "primary". The primary key may be selected for convenience,
comprehension, performance, or any other reasons. It is entirely proper (albeit often inconvenient)
to change the selection of primary key to another candidate key.

Alternate Key

The alternate keys of any table are simply those candidate keys which are not currently selected
as the primary key. According to {Date95} (page 115), "... exactly one of those candidate keys [is]
chosen as the primary key [and] the remainder, if any, are then called alternate keys." An
alternate key is a function of all candidate keys minus the primary key.
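
As a small sketch of these ideas (the STUDENT table and its columns are illustrative, not part of
the sample library schema), both SNO and EMAIL below are candidate keys; SNO is designated the
primary key, and EMAIL therefore becomes an alternate key, enforced with a UNIQUE constraint:

CREATE TABLE STUDENT (
SNO   NUMBER(6) CONSTRAINT STU_PRIME PRIMARY KEY,
EMAIL VARCHAR2(100) NOT NULL CONSTRAINT STU_ALT UNIQUE,
SNAME VARCHAR2(50))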

6.7 Constraints on Attributes and Tuples

Not Null Constraints - declared on an attribute to forbid NULL values in that column:

presC# INT REFERENCES MovieExec(cert#) NOT NULL

Attribute-Based CHECK Constraints - a condition attached to a single attribute, checked whenever
a value is inserted into or updated in that column:

presC# INT REFERENCES MovieExec(cert#)
    CHECK (presC# >= 100000)
gender CHAR(1) CHECK (gender IN ('F', 'M')),
presC# INT CHECK (presC# IN (SELECT cert# FROM MovieExec))

Tuple-Based CHECK Constraints - a condition declared as a separate element of the table schema
and checked against each whole tuple:

CREATE TABLE MovieStar(
    name CHAR(30) PRIMARY KEY,
    address VARCHAR(255),
    gender CHAR(1),
    birthdate DATE,
    CHECK (gender = 'F' OR name NOT LIKE 'Ms.%'));

6.8 Modification of Constraints

Constraints can be considered part of the corresponding ER model; constraint definitions are
stored in metadata tables and kept separate from stored procedures (in fact, SQL Server stores
the Transact-SQL creation script in the syscomments table for each view, rule, default, trigger,
CHECK constraint, DEFAULT constraint, and stored procedure). For instance, a CHECK column
constraint on column f1 will be stored in the syscomments.text field as a SQL fragment: ([f1] > 1).
The implementation of constraints can therefore be modified independently of the stored procedure
implementation and, given a proper design, modification of constraints does not affect the
implementation of stored procedures (or related Transact-SQL scripts).

Moreover, the ER model and corresponding constraints can be mapped to any other RDBMS that
supports a similar metadata format (which is, basically, true for most database systems).
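
As a sketch of modifying a constraint without touching any stored procedure (the table name t1 and
the constraint name ck_t1_f1 are illustrative, echoing the f1 example above), the usual pattern is to
drop the old definition and add a new one:

ALTER TABLE t1 DROP CONSTRAINT ck_t1_f1
ALTER TABLE t1 ADD CONSTRAINT ck_t1_f1 CHECK (f1 > 1)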

6.9 Cursors

A cursor is a bit image on the screen that indicates either the movement of a pointing device or the
place where text will next appear. Xlib enables clients to associate a cursor with each window they
create. After making the association between cursor and window, the cursor is visible whenever it
is in the window. If the cursor indicates movement of a pointing device, the movement of the
cursor in the window automatically reflects the movement of the device.

Xlib and VMS DECwindows provide fonts of predefined cursors. Clients that want to create their
own cursors can either define a font of shapes and masks or create cursors using pixmaps.

This section describes the following:

• Creating cursors using the Xlib cursor font, a font of shapes and masks, and pixmaps
• Associating cursors with windows
• Managing cursors
• Freeing memory allocated to cursors when clients no longer need them

Create CURSOR

Xlib enables clients to use predefined cursors or to create their own cursors. To create a
predefined Xlib cursor, use the CREATE FONT CURSOR routine. Xlib cursors are predefined in
ECW$INCLUDE:CURSORFONT.H. See the X and Motif Quick Reference Guide for a list of the
constants that refer to the predefined Xlib cursors.
The following example creates a sailboat cursor, one of the predefined Xlib cursors, and associates
the cursor with a window:

Cursor fontcursor;
.
.
.

fontcursor = XCreateFontCursor(dpy, XC_sailboat);

XDefineCursor(dpy, win, fontcursor);

The DEFINE CURSOR routine makes the sailboat cursor automatically visible when the pointer is
in window win.

To create client-defined cursors, either create a font of cursor shapes or define cursors using
pixmaps. In each case, the cursor consists of the following components:

• Shape---Defines the cursor as it appears without modification in a window


• Mask---Acts as a clip mask to define how the cursor actually appears in a window
• Background color---Specifies RGB values used for the cursor background
• Foreground color---Specifies RGB values used for the cursor foreground
• Hotspot---Defines the position on the cursor that reflects movements of the pointing device

6.10 Dynamic SQL

Dynamic SQL is an enhanced form of Structured Query Language (SQL) that, unlike standard (or
static) SQL, facilitates the automatic generation and execution of program statements. This can be
helpful when it is necessary to write code that can adjust to varying databases, conditions, or
servers. It also makes it easier to automate tasks that are repeated many times.
Dynamic SQL statements are stored as strings of characters that are entered when the program
runs. They can be entered by the programmer or generated by the program itself, but unlike static
SQL statements, they are not embedded in the source program. Also in contrast to static SQL
statements, dynamic SQL statements can change from one execution to the next.

Let's go back and review the reasons we use stored procedures and what happens when we use
dynamic SQL. As a starting point we will use this procedure:

CREATE PROCEDURE general_select @tblname nvarchar(127),
                                @key     key_type AS   -- key_type is char(3)
EXEC('SELECT col1, col2, col3
      FROM ' + @tblname + '
      WHERE keycol = ''' + @key + '''')

Alternatively, we could build the same SELECT statement in client code and send it directly to SQL Server.

1. Permissions

If you cannot give users direct access to the tables, you cannot use dynamic SQL; it is as simple
as that. In some environments, you may assume that users can be given SELECT access. But
unless you know for a fact that permissions is not an issue, don't use dynamic SQL for INSERT,
UPDATE and DELETE statements. I should hasten to add this applies to permanent tables. If you
are only accessing temp tables, there are never any permission issues.

2. Caching Query Plans


As we have seen, SQL Server caches the query plans for both bare SQL statements and stored
procedures, but is somewhat more accurate in reusing query plans for stored procedures. In
SQL 6.5 you could clearly say that dynamic SQL was slower, because there was a recompile each
time. In later versions of SQL Server, the waters are more muddy.
3. Minimizing Network Traffic
In the two previous sections we have seen that dynamic SQL in a stored procedure is not any
better than bare SQL statements from the client. With network traffic it is a different matter.
There is never any network cost for dynamic SQL in a stored procedure. If we look at our example
procedure general_select, neither is there much to gain. The bare SQL code takes about as many
bytes as the procedure call.
But say that you have a complex query which joins six tables with complex conditions, and one of
the tables is one of sales0101, sales0102 etc., depending on which period the user wants data
about. This is a bad table design, which we will return to, but assume for the moment that you are
stuck with it. If you solve this with a stored procedure that uses dynamic SQL, you only need to pass
the period as a parameter and don't have to pass the query each time. If the query is only passed
once an hour the gain is negligible. But if the query is passed every fifth second and the network
is so-so, you are likely to notice a difference.
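
A minimal sketch of that idea (the procedure name, the customers table and its columns are
hypothetical; only the sales0101/sales0102 naming comes from the discussion above): the client
passes just the period, and the procedure builds the query around it.

CREATE PROCEDURE get_period_sales @period char(4) AS
EXEC('SELECT s.custid, s.amount, c.name
      FROM sales' + @period + ' s
      JOIN customers c ON c.custid = s.custid')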

4. Using Output Parameters


If you write a stored procedure only to gain the benefit of an output parameter, you do not in any
way affect this by using dynamic SQL. Then again, you can get OUTPUT parameters without
writing your own stored procedures, since you can call sp_executesql directly from the client.
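
For instance (a sketch only; sales0101 is the hypothetical period table from above, and the amount
and region columns are assumptions), sp_executesql can return a value through an OUTPUT
parameter without any user-written procedure:

DECLARE @total money
EXEC sp_executesql
     N'SELECT @total = SUM(amount) FROM sales0101 WHERE region = @region',
     N'@region char(2), @total money OUTPUT',
     @region = 'NW', @total = @total OUTPUT
SELECT @total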

5. Encapsulating Logic
There is not much to add to what we said in our first round on stored procedures. I would like to
point out, however, that once you have decided to use stored procedures, you should keep all the
secrets about the SQL inside the stored procedures, so passing table names as in general_select is
not a good idea. (The exception here being sysadmin utilities.)

6. Keeping Track of what Is Used


Dynamic SQL is contradictory to this aim. Any use of dynamic SQL will hide a reference, so that it
will not show up in sysdepends. Neither will the reference reveal itself when you build the
database without the referenced object. Still, if you refrain from passing table or column names as
parameters, you at least only have to search the SQL code to find out whether a table is used.
Thus, if you use dynamic SQL, confine table and column names to the procedure code.

UNIT 7

NORMAL FORMS

7.1 First Normal Form


7.2 Second Normal Form
7.3 Third Normal Form
7.4 BCNF
7.5 Fourth Normal Form
7.6 Fifth Normal Form
7.7 Difference between 4NF and 5NF

Normalize to Reduce Data Redundancy:

Data normalization is a process in which data attributes within a data model are organized to
increase the cohesion of entity types. In other words, the goal of data normalization is to reduce
and even eliminate data redundancy, an important consideration for application developers
because it is incredibly difficult to store objects in a relational database that maintains the same
information in several places. The table below summarizes the three most common normalization
rules, describing how to put entity types into a series of increasing levels of normalization. Higher
levels of data normalization (Date 2000) are beyond the scope of this book. With respect to
terminology, a data schema is considered to be at the level of normalization of its least normalized
entity type. For example, if all of your entity types are at second normal form (2NF) or higher then
we say that your data schema is at 2NF.

Data Normalization Rules.

Level                      Rule

First normal form (1NF)    An entity type is in 1NF when it contains no repeating groups of data.

Second normal form (2NF)   An entity type is in 2NF when it is in 1NF and when all of its non-key
                           attributes are fully dependent on its primary key.

Third normal form (3NF)    An entity type is in 3NF when it is in 2NF and when all of its attributes
                           are directly dependent on the primary key.

7.1 First Normal Form (1NF)

Let's consider an example. An entity type is in first normal form (1NF) when it contains no
repeating groups of data. In the unnormalized Order0NF table, for example, there are several
repeating attributes - the ordered item information repeats nine times and the contact
information is repeated twice, once for shipping information and once for billing information.
Although this initial version of orders could work, what happens when an order has more than
nine order items? Do you create additional order records for them? What about the vast majority
of orders that only have one or two items? Do we really want to waste all that storage space in the
database for the empty fields? Likely not. Furthermore, do you want to write the code required to
process the nine copies of item information, even if it is only to marshal it back and forth between
the appropriate number of objects? Once again, likely not.
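
A minimal sketch of the repair (the table names echo the Order0NF/OrderItem1NF discussion; the
column names are illustrative, since the original figures are not reproduced here) moves the
repeating item attributes into a child table so an order can have any number of items:

CREATE TABLE Order1NF (
    OrderID   NUMBER(10) PRIMARY KEY,
    OrderDate DATE)

CREATE TABLE OrderItem1NF (
    OrderID   NUMBER(10) REFERENCES Order1NF(OrderID),
    ItemSeq   NUMBER(3),
    ItemName  VARCHAR2(50),
    UnitPrice NUMBER(8,2),
    Quantity  NUMBER(5),
    PRIMARY KEY (OrderID, ItemSeq))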

7.2 Second Normal Form (2NF)

The 1NF schema can be normalized further into second normal form (2NF). An
entity type is in second normal form (2NF) when it is in 1NF and when every non-key attribute,
that is, any attribute that is not part of the primary key, is fully dependent on the primary key. This
was definitely not the case with the OrderItem1NF table, therefore we need to introduce the new
table Item2NF. The problem with OrderItem1NF is that item information, such as the name and
price of an item, does not depend upon an order for that item. For example, if Hal Jordan orders
three widgets and Oliver Queen orders five widgets, the facts that the item is called a “widget” and
that the unit price is $19.95 are constant. This information depends on the concept of an item, not
the concept of an order for an item, and therefore should not be stored in the order items table –
therefore the Item2NF table was introduced. OrderItem2NF retains the TotalPriceExtended
column, a calculated value that is the number of items ordered multiplied by the price of the item.
The value of the SubtotalBeforeTax column within the Order2NF table is the total of the
total price extended values for each of its order items.
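
A sketch of the split described above (column names again illustrative): item facts move to Item2NF,
and OrderItem2NF keeps only what depends on the whole (order, item) key.

CREATE TABLE Item2NF (
    ItemNumber NUMBER(10) PRIMARY KEY,
    ItemName   VARCHAR2(50),
    UnitPrice  NUMBER(8,2))

CREATE TABLE OrderItem2NF (
    OrderID            NUMBER(10),
    ItemNumber         NUMBER(10) REFERENCES Item2NF(ItemNumber),
    Quantity           NUMBER(5),
    TotalPriceExtended NUMBER(10,2),   -- quantity multiplied by the item's unit price
    PRIMARY KEY (OrderID, ItemNumber))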

7.3 Third Normal Form (3NF)

An entity type is in third normal form (3NF) when it is in 2NF and when all of its attributes are
directly dependent on the primary key. A better way to word this rule might be that the attributes
of an entity type must depend on all portions of the primary key; in that sense 3NF is an issue
only for tables with composite keys. In this case there is a problem with the OrderPayment2NF
table: the payment type description (such as “Mastercard” or “Check”) depends only on the
payment type, not on the combination of the order id and the payment type.
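
A sketch of the fix (table and column names illustrative): the description moves into a lookup table
keyed on the payment type alone.

CREATE TABLE PaymentType3NF (
    PaymentTypeID NUMBER(3) PRIMARY KEY,
    Description   VARCHAR2(30))          -- e.g. 'Mastercard', 'Check'

CREATE TABLE OrderPayment3NF (
    OrderID       NUMBER(10),
    PaymentTypeID NUMBER(3) REFERENCES PaymentType3NF(PaymentTypeID),
    Amount        NUMBER(10,2),
    PRIMARY KEY (OrderID, PaymentTypeID))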

Beyond 3NF

The 3NF data schema can still be improved upon, at least from the point of view of data
redundancy, by removing attributes that can be calculated/derived from other ones. In this case
we could remove the SubtotalBeforeTax column within the Order3NF table and the
TotalPriceExtended column of OrderItem3NF.

Why data normalization? The advantage of having a highly normalized data schema is that
information is stored in one place and one place only, reducing the possibility of inconsistent
data. Furthermore, highly normalized data schemas in general are closer conceptually to object-
oriented schemas because the object-oriented goals of promoting high cohesion and loose
coupling between classes result in similar solutions (at least from a data point of view). This
generally makes it easier to map your objects to your data schema. Unfortunately, normalization
usually comes at a performance cost. With the original single-table (0NF) schema all the data for a
single order is stored in one row (assuming orders of up to nine order items), making it very easy
to access. You could quickly determine the total amount of an order by reading the single row from
the Order0NF table. To do so with the fully normalized schema you would need to read data from
a row in the Order table, data from all the rows from the OrderItem table for that order and data
from the corresponding rows in the Item table for each order item. For this query, the single-table
schema very likely provides better performance.

Denormalize to Improve Performance

Normalized data schemas, when put into production, often suffer from performance problems.
This makes sense – the rules of data normalization focus on reducing data redundancy, not on
improving performance of data access. An important part of data modeling is to denormalize
portions of your data schema to improve database access times. For example, the denormalized
order data schema described below looks nothing like the normalized schema developed above. To
understand why the differences between the schemas exist you must consider the performance
needs of the application. The primary goal of this system is to process new orders from online
customers as quickly as possible. To do this customers need to be able to search for items and add
them to their order quickly, remove items from their order if need be, then have their final order
totaled and recorded quickly. The secondary goal of the system is to process, ship, and bill the
orders afterwards.

A Denormalized Order Data Schema (UML notation).

To denormalize the data schema the following decisions were made:

1. To support quick searching of item information the Item table was left alone.
2. To support the addition and removal of order items to an order the concept of an
OrderItem table was kept, albeit split in two to support outstanding orders and fulfilled
orders. New order items can easily be inserted into the OutstandingOrderItem table, or
removed from it, as needed.
3. To support order processing the Order and OrderItem tables were reworked into pairs to
handle outstanding and fulfilled orders respectively. Basic order information is first stored
in the OutstandingOrder and OutstandingOrderItem tables and then when the order has
been shipped and paid for the data is then removed from those tables and copied into the
FulfilledOrder and FulfilledOrderItem tables respectively. Data access time to the two
tables for outstanding orders is reduced because only the active orders are being stored
there. On average an order may be outstanding for a couple of days, whereas for financial
reporting reasons may be stored in the fulfilled order tables for several years until
archived. There is a performance penalty under this scheme because of the need to delete
outstanding orders and then resave them as fulfilled orders, clearly something that would
need to be processed as a transaction.
4. The contact information for the person(s) the order is being shipped and billed to was also
denormalized back into the Order table, reducing the time it takes to write an order to the
database because there is now one write instead of two or three. The retrieval and deletion
times for that data would also be similarly improved.

7.4 Boyce-Codd Normal Form

The relation student(sno, sname, cno, cname) has all attributes participating in candidate keys
since all the attributes are assumed to be unique. We therefore had the following candidate keys:
(sno, cno)
(sno, cname)
(sname, cno)
(sname, cname)

Since the relation has no non-key attributes, the relation is in 2NF and also in 3NF, in spite of the
relation suffering the problems that we discussed at the beginning of this chapter.
The difficulty in this relation is being caused by dependence within the candidate keys. The
second and third normal forms assume that all attributes not part of the candidate keys depend
on the candidate keys but do not deal with dependencies within the keys. BCNF deals with such
dependencies.
A relation R is said to be in BCNF if whenever X -> A holds in R, and A is not in X, then X is a
candidate key for R.
It should be noted that most relations that are in 3NF are also in BCNF. Infrequently, a 3NF
relation is not in BCNF and this happens only if

1. the candidate keys in the relation are composite keys (that is, they are not single attributes),
2. there is more than one candidate key in the relation, and
3. the keys are not disjoint, that is, some attributes in the keys are common.

BCNF differs from 3NF only when there is more than one candidate key and the keys
are composite and overlapping. Consider, for example, the relationship
enrol (sno, sname, cno, cname, date-enrolled)

Let us assume that the relation has the following candidate keys:

(sno, cno)
(sno, cname)
(sname, cno)
(sname, cname)
(we have assumed sname and cname are unique identifiers). The relation is in 3NF but not in
BCNF because there are dependencies

sno -> sname


cno -> cname

where attributes that are part of a candidate key are dependent on part of another candidate key.
Such dependencies indicate that although the relation is about some entity or association that is
identified by the candidate keys e.g. (sno, cno), there are attributes that are not about the whole
thing that the keys identify. For example, the above relation is about an association (enrolment)
between students and subjects and therefore the relation needs to include only one identifier to
identify students and one identifier to identify subjects. Providing two identifiers about students
(sno, sname) and two keys about subjects (cno, cname) means that some information about
students and subjects that is not needed is being provided. This provision of information will
result in repetition of information and the anomalies that we discussed at the beginning of this
chapter. If we wish to include further information about students and courses in the database, it
should not be done by putting the information in the present relation but by creating new
relations that represent information about entities student and subject.
These difficulties may be overcome by decomposing the above relation in the following three
relations:

(sno, sname)
(cno, cname)
(sno, cno, date-of-enrolment)

We now have a relation that only has information about students, another only about subjects
and the third only about enrolments. All the anomalies and repetition of information have been
removed.
The formal definition of BCNF appears at the beginning of the corresponding subsection of the textbook. Functional
dependencies in a BCNF relation schema may be classified into two categories:

1. the ones whose left side is a candidate key and


2. trivial ones.

Following the definition, the textbook gives a database of several relations and determines
whether they are in the BCNF. The discussion may help you to gain more concrete understanding
of the BCNF.

It explains how to decompose a non-BCNF schema into BCNF schemas. It is relatively easy to
understand. You should read it carefully.
A database design may change over time due to real-world demands. The original database design
might allow that each loan be taken by only one customer. Then loan-number functionally
determines the other attributes of Borrow-schema: loan-number is now a superkey and the schema
Borrow-schema is in BCNF. Suppose now that the database design is changed so that a loan may be
taken by several customers, as in the example in the textbook. The schema is then no longer in
BCNF. The above discussion shows that when the definitions of a database are changed, its normal
form may also change. Thus, it is essential that the person who is allowed to change the database
definitions, especially the database administrator, understands database design principles. It is
rather difficult to compute F+; one may obtain each candidate dependency in the loop first and then
test whether it is in F+.

7.5 Fourth Normal Form

We have considered an example of Programmer(Emp name, qualification, languages) and discussed


the problems that may arise if the relation is not normalised further. We also saw how the relation
could be decomposed into P1(Emp name, qualifications) and P2(Emp name, languages) to overcome
these problems. The decomposed relations are in fourth normal form (4NF) which we shall now
define.

We are now ready to define 4NF. A relation R is in 4NF if, whenever a multivalued dependency
X ->> Y holds, then either

(a) the dependency is trivial, or

(b) X is a candidate key for R.

As noted earlier, the dependency X ->> ø or X ->> Y in a relation R(X, Y) is trivial, since it must
hold for every R(X, Y). Similarly, (X, Y) ->> Z must hold for all relations R(X, Y, Z) with only these
three attributes.

In fourth normal form, we have a relation that has information about only one entity. If a relation
has more than one multivalue attribute, we should decompose it to remove difficulties with
multivalued facts.

Intuitively R is in 4NF if all dependencies are a result of keys. When multivalued dependencies
exist, a relation should not contain two or more independent multivalued attributes. The
decomposition of a relation to achieve 4NF would normally result in not only reduction of
redundancies but also avoidance of anomalies.

1. We saw that BC-schema was in BCNF, but still was not an ideal design as it suffered from
repetition of information. We had the multivalued dependency cname ->> street ccity, but
no non-trivial functional dependencies.
2. We can use the given multivalued dependencies to improve the database design by
decomposing it into fourth normal form.
3. A relation schema R is in 4NF with respect to a set D of functional and multivalued
dependencies if, for all multivalued dependencies in D+ of the form X ->> Y, where X ⊆ R and
Y ⊆ R, at least one of the following holds:
o X ->> Y is a trivial multivalued dependency.
o X is a superkey for schema R.
4. A database design is in 4NF if each member of the set of relation schemas is in 4NF.
5. The definition of 4NF differs from the BCNF definition only in the use of multivalued
dependencies.
o Every 4NF schema is also in BCNF.
o To see why, note that if a schema is not in BCNF, there is a non-trivial functional
dependency X -> Y holding on R, where X is not a superkey.
o Since X -> Y implies X ->> Y, by the replication rule, R cannot be in 4NF.
6. We have an algorithm similar to the BCNF algorithm for decomposing a schema into 4NF:

result := {R};
done := false;
compute D+;

while (not done) do

    if (there is a schema Ri in result that is not in 4NF)

    then begin
        let X ->> Y be a nontrivial multivalued dependency that holds on Ri
        such that X -> Ri is not in D+, and X and Y are disjoint;
        result := (result - Ri) ∪ (Ri - Y) ∪ (X, Y);
    end

    else done := true;

7. If we apply this algorithm to BC-schema:

cname ->> loan# is a nontrivial multivalued dependency and cname is not a superkey for the
schema.
We then replace BC-schema by two schemas:
Cust-loan-schema = (cname, loan#)

Customer-schema = (cname, street, ccity)



These two schemas are in 4NF.

8. We can show that our algorithm generates only lossless-join decompositions.

Let R be a relation schema and D a set of functional and multivalued dependencies on R.

Let R1 and R2 form a decomposition of R.
This decomposition is lossless-join if and only if at least one of the following multivalued
dependencies is in D+:

R1 ∩ R2 ->> R1 - R2
R1 ∩ R2 ->> R2 - R1

We saw similar criteria for functional dependencies. This says that for every lossless-join
decomposition of R into two schemas R1 and R2, one of the two above dependencies must
hold. You can see, by inspecting the algorithm, that this must be the case for every
decomposition.

9. Dependency preservation is not as simple to determine as with functional dependencies.

Let R be a relation schema.
Let R1, R2, ..., Rn be a decomposition of R.
Let D be the set of functional and multivalued dependencies holding on R.
The restriction of D to Ri is the set Di consisting of:
All functional dependencies in D+ that include only attributes of Ri.
All multivalued dependencies of the form X ->> (Y ∩ Ri), where X ⊆ Ri and X ->> Y is in D+.
A decomposition of schema R is dependency preserving with respect to a set D of functional and
multivalued dependencies if, for every set of relations r1(R1), r2(R2), ..., rn(Rn) such that for all i,
ri satisfies Di, there exists a relation r(R) that satisfies D and for which ri is the projection of r
onto Ri, for all i.

10. What does this formal statement say? It says that a decomposition is dependency
preserving if for every set of relations on the decomposition schema satisfying only the
restrictions on D there exists a relation r on the entire schema R that the decomposed
schemas can be derived from, and that r also satisfies the functional and multivalued
dependencies.
11. We'll do an example using our decomposition algorithm and check the result for
dependency preservation.

Let R = (A, B, C, G, H, I).
Let D be: A ->> B, B ->> HI, CG ->> H.

R is not in 4NF, as we have A ->> B and A is not a superkey. The algorithm causes us to
decompose using this dependency into

R1 = (A, B)
R2 = (A, C, G, H, I)

R1 is now in 4NF, but R2 is not. Applying the multivalued dependency CG ->> H (how did we get
this?), our algorithm then decomposes R2 into

R3 = (C, G, H)
R4 = (A, C, G, I)

R3 is now in 4NF, but R4 is not. Why? As A ->> HI is in D+ (why?), the restriction of this
dependency to R4 gives us A ->> I. Applying this dependency in our algorithm finally decomposes
R4 into

R5 = (A, I)
R6 = (A, C, G)

The algorithm terminates, and our decomposition is R1, R3, R5 and R6.

12. Let's analyze the result.

Figure: Projection of a relation r onto the 4NF decomposition of R, giving relations r1, r2, r3 and
r4 (diagram not reproduced here).

This decomposition is not dependency preserving, as it fails to preserve B ->> HI.

The figure shows four relations that may result from projecting a relation onto the four schemas of
our decomposition. The restriction of D to (A, B) is A ->> B and some trivial dependencies. We can
see that r1 satisfies A ->> B, as there are no pairs with the same A value. Also, r2 satisfies all
functional and multivalued dependencies in its restriction of D, since no two of its tuples have the
same value on any attribute. We can say the same for r3 and r4. So our decomposed version
satisfies all the dependencies in the restriction of D. However, there is no relation r on
(A, B, C, G, H, I) that satisfies D and decomposes into r1, r2, r3 and r4. A second figure shows such
a relation r; r does not satisfy B ->> HI. Any relation s containing r and satisfying B ->> HI must
include an additional tuple, but the projection of that tuple onto one of the decomposed schemas is
not present in the corresponding projection of r. Thus our decomposition fails to detect the
violation of B ->> HI.

Figure: A relation r(R) that does not satisfy B ->> HI (diagram not reproduced here).

13. We have seen that if we are given a set of functional and multivalued dependencies, it is
best to find a database design that meets the three criteria:
o 4NF.
o Dependency Preservation.
o Lossless-join.
14. If we only have functional dependencies, the first criterion is just BCNF.
15. We cannot always meet all three criteria. When this occurs, we compromise on 4NF and
accept BCNF, or even 3NF if necessary, to ensure dependency preservation.

7.6 Fifth Normal Form (5NF)

The normal forms discussed so far required that a relation R which is not in the given normal
form be decomposed into two relations to meet the requirements of that normal form. In some rare
cases, a relation can have problems like redundant information and update anomalies but cannot
be decomposed into two relations to remove those problems. In such cases it may be possible to
decompose the relation into three or more relations using the 5NF.

The fifth normal form deals with join dependencies, which are a generalisation of multivalued
dependencies. The aim of fifth normal form is to have relations that cannot be decomposed further;
a relation in 5NF cannot be non-loss decomposed into several smaller relations.

A relation R satisfies the join dependency (R1, R2, ..., Rn) if and only if R is equal to the join of its
projections on R1, R2, ..., Rn, where the Ri are subsets of the set of attributes of R. A relation R is
in 5NF (or project-join normal form, PJNF) if for all join dependencies at least one of the following holds.

(a) (R1, R2, ..., Rn) is a trivial join-dependency (that is, one of Ri is R)
(b) Every Ri is a candidate key for R.

An example of 5NF can be provided by the example below that deals with departments, subjects
and students.

Dept subject student


Comp. Sc. CP1000 John Smith
Mathematics MA1000 John Smith
Comp. Sc. CP2000 Arun Kumar
Comp. Sc. CP3000 Reena Rani
Physics PH1000 Raymond Chew
Chemistry CH2000 Albert Garcia

The above relation says that Comp. Sc. offers subjects CP1000, CP2000 and CP3000 which are
taken by a variety of students. No student takes all the subjects and no subject has all students
enrolled in it and therefore all three fields are needed to represent the information.

The above relation does not show MVDs since the attributes subject and student are not
independent; they are related to each other and the pairings have significant information in them.
The relation can therefore not be decomposed into the two relations

(dept, subject) and
(dept, student)

without losing some important information. The relation can, however, be decomposed into the
following three relations:

(dept, subject)
(dept, student)
(subject, student)

and it can now be shown that this decomposition is lossless. A small check of the lossless claim on
the sample data follows.
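
As a quick illustration (not part of the original text), the following Python sketch projects the sample dept/subject/student relation above onto the three two-attribute schemas and joins the projections back together; for this instance the join reproduces exactly the original six tuples, so no information is lost.

spc = {
    ("Comp. Sc.",   "CP1000", "John Smith"),
    ("Mathematics", "MA1000", "John Smith"),
    ("Comp. Sc.",   "CP2000", "Arun Kumar"),
    ("Comp. Sc.",   "CP3000", "Reena Rani"),
    ("Physics",     "PH1000", "Raymond Chew"),
    ("Chemistry",   "CH2000", "Albert Garcia"),
}

dept_subject = {(d, su) for d, su, st in spc}
dept_student = {(d, st) for d, su, st in spc}
subj_student = {(su, st) for d, su, st in spc}

# Natural join of the three projections on their common attributes.
rejoined = {
    (d, su, st)
    for d, su in dept_subject
    for d2, st in dept_student if d2 == d
    for su2, st2 in subj_student if su2 == su and st2 == st
}

print(rejoined == spc)   # True: the decomposition is lossless for this instance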

7.7 Difference between Fourth and fifth normal form

The fourth normal form requires that no independent one-to-many (multivalued) relationships
exist between key columns and non-key columns. The fifth normal form carries this process to its
logical conclusion, breaking tables into the smallest possible pieces in order to eliminate all
redundant data in them. Tables normalized to this extreme consist of little more than a primary key
and one or two dependent attributes.

UNIT 8

QUERY EXECUTION

8.1.Introduction to Physical-Query-Plan Operators


8.2.One-Pass Algorithms for Database Operations
8.3.Nested-Loop Joins
8.4.Two-Pass Algorithms Based on Sorting
8.5.Two-Pass Algorithms Based on Hashing
8.6.Index-Based Algorithms
8.7.Buffer Management
8.8.Parallel Algorithms for Relational Operations
8.9.Using Heuristics in Query Optimization
8.10.Basic Algorithms for Executing Query Operations

8.1. Introduction to Physical-Query-Plan Operators

Each physical-query-plan operator requests a tuple at a time from its children, performs some
operation on it, and returns the result to its parent. The "tuples" passed between operators are
evaluations (bindings of variables to values).

Example: the logical query plan for the path expression "A.B x, x.C y" is a Chain operator over two
Discover operators, Discover (A,"B",x) and Discover (x,"C",y). From this logical plan several
physical plans can be produced, built from operators such as NLJ (nested-loop join), Name, Scan,
Lindex, Bindex and Pindex - for instance a Lindex plan (plan 1), a Bindex plan (plan 2), a Scan
plan, and a Pindex plan Pindex ("A.B x, x.C y", y) that evaluates the whole path expression with a
single path index. (Diagrams of the logical and physical query plans are not reproduced here.)

Physical query plan operators:


Project
Select
NLJ -Nested Loop Join
HashJoin
Compound
Scan -Get child
Lindex
Pindex
Bindex
Vindex
Name
Once
CreateTemp
Set
Deconstruct
ForEach
Aggregation -exists etc.

8.2. One-Pass Algorithms for Database Operations

Read data from disk only once. Usually, at least one operand must fit in memory (exceptions:σ, π).

One-pass Duplicate Elimination:

1. Keep a main-memory temporary data structure T for tuples.
2. Read the next tuple from the input table.
3. If the tuple is not in T, add it to the output table and to T.
4. Otherwise, do nothing - it is a duplicate.
5. Go to step 2.

Complexity: O(n²) for a primitive data structure T; this can be sped up to O(n log n) (binary
search tree) or O(n) (hash table). Memory requirement: there must be enough main-memory
space for |δ(R)| tuples. A sketch of this algorithm follows.
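
Below is a minimal Python sketch of the one-pass duplicate-elimination algorithm just described, assuming the distinct tuples fit in memory as required; read_blocks is a hypothetical stand-in for the DBMS reading R one block at a time.

def one_pass_distinct(read_blocks):
    seen = set()                 # T: tuples seen so far (must fit in memory)
    for block in read_blocks():  # step 2: read the next block of the input
        for t in block:
            if t not in seen:    # step 3: first time we meet this tuple
                seen.add(t)
                yield t          # send it to the output table
            # step 4: otherwise do nothing - it is a duplicate

blocks = lambda: [[("a", 1), ("b", 2)], [("a", 1), ("c", 3)]]
print(list(one_pass_distinct(blocks)))   # [('a', 1), ('b', 2), ('c', 3)]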

Other
Most operators can be implemented as one pass operators, as long as there is enough
main memory space.
• grouping
• set union/intersection/difference
• product
• natural join

Implementing joins as one-pass algorithms:

To join R ⋈ S:
1. Read S into memory and store it in a searchable data structure (e.g., a search tree or hash
table) keyed on the join attributes.
2. Read one tuple of R at a time. For each tuple t, (a) find the tuples of S that match (join)
with t, and (b) for each match, add the joined tuple to the output table.
Cost: B(R) + B(S) disk I/Os. Memory requirement: B(S) blocks (plus one tuple of R at a time) must
fit in main memory. A sketch follows.
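
The following is a hedged sketch of the one-pass join just described, with relations represented as lists of Python dictionaries and the join attribute passed by name; these representations are illustrative assumptions, not a DBMS interface.

from collections import defaultdict

def one_pass_join(R, S, key):
    index = defaultdict(list)              # searchable structure holding all of S
    for s in S:                            # step 1: read S into memory
        index[s[key]].append(s)
    for r in R:                            # step 2: one tuple of R at a time
        for s in index.get(r[key], []):    # (a) matching S tuples
            yield {**s, **r}               # (b) output the joined tuple

emp  = [{"dept": "CS", "name": "Smith"}, {"dept": "Econ", "name": "Jones"}]
dept = [{"dept": "CS", "office": 404}, {"dept": "Econ", "office": 200}]
print(list(one_pass_join(emp, dept, "dept")))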

8.3. Nested-Loop Joins

Join Operation:

• Join operations bring together two relations and combine their attributes and tuples in
a specific fashion.

• The generic (theta) join operator is ⋈ with a join condition.

• It takes as arguments the attributes from the two relations that are to be joined.
• For example, assume we have the EMP relation as above and a separate DEPART
relation with (Dept, MainOffice, Phone):
EMP ⋈ (Dept = Dept) DEPART

• The join condition can use any comparison operator (=, <, <=, >, >=, <>).

• When the join condition operator is =, then we call this an equi-join.
• Note that the attributes in common are repeated.

A simple tuple-at-a-time way to evaluate such joins is the nested-loop join sketched below.
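
A tuple-based nested-loop join compares every tuple of one relation with every tuple of the other and outputs the pairs that satisfy the join condition. The sketch below uses the EMP and DEPART examples from this section; the dictionary representation of tuples is an assumption for illustration only.

def nested_loop_join(outer, inner, condition):
    for r in outer:                # one pass over the outer relation
        for s in inner:            # full scan of the inner relation per outer tuple
            if condition(r, s):    # the theta condition, e.g. equality on Dept
                yield {**r, **s}

EMP    = [{"Name": "Smith", "Office": 400, "Dept": "CS"},
          {"Name": "Jones", "Office": 220, "Dept": "Econ"}]
DEPART = [{"Dept": "CS", "MainOffice": 404}, {"Dept": "Econ", "MainOffice": 200}]

equi  = list(nested_loop_join(EMP, DEPART, lambda r, s: r["Dept"] == s["Dept"]))
theta = list(nested_loop_join(EMP, DEPART,
                              lambda r, s: r["Dept"] == s["Dept"] and r["Office"] < s["MainOffice"]))
print(equi)
print(theta)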

Join Examples:

Assume we have the EMP relation from above and the following DEPART relation:
Dept MainOffice Phone

CS 404 555-1212

Econ 200 555-1234

Fin 501 555-4321

Hist 100 555-9876

• Find all information on every employee including their department info:


EMP ⋈ (Dept = Dept) DEPART
Results:
Name Office Dept Salary Dept MainOffice Phone

Smith 400 CS 45000 CS 404 555-1212

Jones 220 Econ 35000 Econ 200 555-1234

Green 160 Econ 50000 Econ 200 555-1234

Brown 420 CS 65000 CS 404 555-1212

Smith 500 Fin 60000 Fin 501 555-4321


• Find all information on every employee including their department info where the
employee works in an office numbered less than the department main office:

EMP ⋈ (office < mainoffice) AND (dept = dept) DEPART


Results:

Name Office Dept Salary Dept MainOffice Phone

Smith 400 CS 45000 CS 404 555-1212

Green 160 Econ 50000 Econ 200 555-1234


Smith 500 Fin 60000 Fin 501 555-4321

Natural Join:

• Notice in the generic join operation, any attributes in common (such as dept above) are
repeated.
• The Natural Join operation removes these duplicate attributes.
• The natural join operator is: *
• We can also assume using * that the join condition will be = on the two attributes in
common.
• Example: EMP * DEPART
Results:

Name Office Dept Salary MainOffice Phone

Smith 400 CS 45000 404 555-1212

Jones 220 Econ 35000 200 555-1234

Green 160 Econ 50000 200 555-1234

Brown 420 CS 65000 404 555-1212

Smith 500 Fin 60000 501 555-4321

Outer Join:

• In the Join operations so far, only those tuples where an attribute value matches are
included in the output relation.
• The Outer join includes other tuples as well according to a few rules.
• Three types of outer joins:

1. Left Outer Join includes all tuples in the left hand relation and includes only
those matching tuples from the right hand relation.

2. Right Outer Join includes all tuples in the right hand relation and includes
only those matching tuples from the left hand relation.

3. Full Outer Join includes all tuples in the left hand relation and from the right
hand relation.

• Examples:

Assume we have two relations: PEOPLE and MENU:

PEOPLE: MENU:


Name Age Food Food Day

Alice 21 Hamburger Pizza Monday

Bill 24 Pizza Hamburger Tuesday

Carl 23 Beer Chicken Wednesday

Dina 19 Shrimp Pasta Thursday

Tacos Friday

• PEOPLE ⟕ MENU (left outer join):
Name Age Food Day

Alice 21 Hamburger Tuesday

Bill 24 Pizza Monday

Carl 23 Beer NULL

Dina 19 Shrimp NULL

• PEOPLE ⟖ MENU (right outer join):
Name Age Food Day

Bill 24 Pizza Monday

Alice 21 Hamburger Tuesday

NULL NULL Chicken Wednesday

NULL NULL Pasta Thursday

NULL NULL Tacos Friday

• PEOPLE ⟗ MENU (full outer join):

Name Age Food Day

Alice 21 Hamburger Tuesday

Bill 24 Pizza Monday

Carl 23 Beer NULL

Dina 19 Shrimp NULL

NULL NULL Chicken Wednesday

NULL NULL Pasta Thursday


NULL NULL Tacos Friday

Outer Union

• The Outer Union operation is applied to partially union compatible relations.


• Operator is: *
• Example: PEOPLE * MENU

Name Age Food Day

Alice 21 Hamburger NULL

Bill 24 Pizza NULL

Carl 23 Beer NULL

Dina 19 Shrimp NULL

NULL NULL Hamburger Monday

NULL NULL Pizza Tuesday

NULL NULL Chicken Wednesday

NULL NULL Pasta Thursday

NULL NULL Tacos Friday

8.4. Two-pass algorithms based on sorting

Binary operations: R ∩ S, R ∪ S, R − S
Idea: sort R, sort S, then do the right thing.

A closer look:
Step 1: split R into sorted runs of size M, then split S into sorted runs of size M. Cost: 2B(R) + 2B(S).
Step 2: merge the (at most M/2) runs from R and the (at most M/2) runs from S, outputting a tuple on a case-by-case basis.
Total cost: 3B(R) + 3B(S)
Assumption: B(R) + B(S) <= M²

Join R ⋈ S

Start by sorting both R and S on the join attribute.

Cost: 4B(R) + 4B(S) (because the sorted relations need to be written back to disk). Then read both
relations in sorted order and match tuples.
Cost: B(R) + B(S)

Difficulty: many tuples in R may match many in S. If at least one set of matching tuples fits in M
buffers, we are OK; otherwise a nested loop is needed, at higher cost.

Total cost: 5B(R) + 5B(S)

Assumption: B(R) <= M², B(S) <= M²

Algorithm

If the number of tuples in R matching those in S is small (or vice versa), we can compute the join
during the merge phase.
Total cost: 3B(R) + 3B(S)
Assumption: B(R) + B(S) <= M²
This algorithm is known as sort join, merge join, or sort-merge join; a sketch of its merge phase
follows at the end of this section.
Recall: δ(R), duplicate elimination.
Step 1. Partition R into buckets.
Step 2. Apply δ to each bucket (each bucket may be read into main memory).
Cost: 3B(R)
Assumption: B(R) <= M²
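
The following in-memory Python sketch illustrates the merge phase of the sort-merge join described above. A real two-pass implementation would first create sorted runs on disk; here both inputs are simply lists of (key, payload) pairs, an assumption made to keep the example short.

def sort_merge_join(R, S, key):
    R = sorted(R, key=key)
    S = sorted(S, key=key)
    i = j = 0
    while i < len(R) and j < len(S):
        kr, ks = key(R[i]), key(S[j])
        if kr < ks:
            i += 1
        elif kr > ks:
            j += 1
        else:
            # Collect the full group of equal keys on both sides, then output
            # their cross product (the "many tuples match many" case).
            i2 = i
            while i2 < len(R) and key(R[i2]) == kr:
                i2 += 1
            j2 = j
            while j2 < len(S) and key(S[j2]) == kr:
                j2 += 1
            for r in R[i:i2]:
                for s in S[j:j2]:
                    yield (r, s)
            i, j = i2, j2

pairs = list(sort_merge_join([(1, "a"), (2, "b"), (2, "c")],
                             [(2, "x"), (3, "y")],
                             key=lambda t: t[0]))
print(pairs)   # [((2, 'b'), (2, 'x')), ((2, 'c'), (2, 'x'))]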

8.5. Two-Pass Algorithms Based on Hashing

It is easy to store records in files. It is much harder to find what we are looking for. Speaking as
someone whose desk is always untidy (!) I closely identify with this problem. As I mark students'
assignments, I tend to add my copy of the PT3 document to a growing pile. This is easy. Then
when someone phones me, I like to find their latest PT3 so I can remember how they got on. This
is not difficult but it is time consuming. I have to work my way through the pile - we say
sequentially or serially - until I find it.
Now what I could do is buy one of these concertina-type filing folders which has a pocket for each
letter of the alphabet. All the PT3s for students starting with the letter A could go in the first
pocket, B in the second and so on. What we are doing is storing the records in what we call
buckets. We decide which bucket by looking at the data. We say that we generate the bucket
number by applying a hashing algorithm to the data. So my 'algorithm' is to take the first letter of
the surname and turn that into a number from 0 to 25. The Pascal to do this might be:

bucket_number := ord(surname[1]) - ord('A');

However I am sure that you can see that this 'hashing algorithm' is not a very good one. It is easy
to work out, but it will probably mean that some buckets become very full and others (like Z!)
rarely get used. Does this matter? Well, having figured out which bucket to look in - we then need
to search through it looking for our record. If a bucket becomes very full, this search could be
lengthy and we haven't really gained anything. What we need is an algorithm which distributes
the records as evenly as possible.
I used to work in a busy hospital. There were hundreds of thousands of patients. Each patient
had a bulky paper file which was stored in the records office. Now - how could they store the
records? Well, each patient was given a six digit number and the records were filed in what they
called 'terminal digit' order. So if my number was 123456 then my notes would be in 'bucket' 56.
Can you see that this algorithm ensured a very even distribution of the files? The mathematical
term for this technique is called 'modulo' arithmetic. So, 123456 modulo 100 is 56. Effectively we
divide the number by 100 and use the remainder. This technique of modulo arithmetic is often
used to distribute records as evenly as possible. The bucket that we aim to store the record in is
called the 'home' bucket.

Example: Let's imagine we have a small file with 5 buckets and we only allow two records per
bucket. Each record belongs to a person and we will store records depending on the first letter of
the surname as follows - A to E in bucket zero, F to J in bucket one, K to O in bucket two, P to T
in bucket three and U to Z in bucket four. If we now store Adam, Minto and Smith, then Adam
goes into bucket 0, Minto into bucket 2 and Smith into bucket 3.

If we now store Penny and Steven, then Penny can go into bucket 3. When we try to store Steven,
the home bucket is 3 but it is full, so Steven will overflow into bucket 4.

Adding Yule and Queen now puts Yule into its home bucket 4, while Queen finds both its home
bucket 3 and the next bucket 4 full and overflows into bucket 0. Note how the 'next' bucket to 4 is
actually 0. A short sketch of this bucket scheme follows.
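
The bucket scheme in the example can be sketched in a few lines of Python: five buckets of capacity two, a home bucket computed from the first letter of the surname, and overflow into the "next" bucket, wrapping from 4 back to 0. This is an illustration only, not the text's implementation.

BUCKETS = 5
CAPACITY = 2

def home_bucket(surname):
    # A-E -> 0, F-J -> 1, K-O -> 2, P-T -> 3, U-Z -> 4
    return min((ord(surname[0].upper()) - ord('A')) // 5, BUCKETS - 1)

def insert(buckets, surname):
    b = home_bucket(surname)
    for step in range(BUCKETS):              # try the home bucket, then overflow
        slot = (b + step) % BUCKETS          # note: the bucket after 4 is 0
        if len(buckets[slot]) < CAPACITY:
            buckets[slot].append(surname)
            return slot
    raise RuntimeError("file is full")

buckets = [[] for _ in range(BUCKETS)]
for name in ["Adam", "Minto", "Smith", "Penny", "Steven", "Yule", "Queen"]:
    insert(buckets, name)
print(buckets)
# [['Adam', 'Queen'], [], ['Minto'], ['Smith', 'Penny'], ['Steven', 'Yule']]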

8.6. Index-Based Algorithms

If there is an index, it can be used when implementing relational operations. Clustering indexes:
all tuples with the same index key appear together (in as few blocks as possible).
Example: index-based selection σC(R), where the condition C is of the form ai = x. This is easy with
an index on attribute ai, and very efficient if it is a clustering index. B-trees also support efficient
selection for range conditions, such as 7 <= ai <= 47.

Index-based join algorithms:

Assume we have a sorted clustering index on the join value, e.g., a B-tree. A join using such an
index can be implemented similarly to the sorting-based join algorithm, but now we have the
sorted order to start with, so pass 1 is not necessary. Algorithm: just go through the two sorted
lists and join the tuples. This works in one pass as long as there are at most (about) M blocks of
tuples with equal join values. A sketch of index-based selection follows.
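
A minimal sketch of index-based selection follows: a dictionary maps each value of the indexed attribute to its matching tuples, so a selection on a = x touches only those tuples, and a sorted list of keys stands in for a B-tree to answer range conditions such as 7 <= a <= 47. The relation and attribute names are illustrative assumptions.

import bisect
from collections import defaultdict

def build_index(R, attr):
    idx = defaultdict(list)
    for t in R:
        idx[t[attr]].append(t)
    return idx

def select_eq(index, x):
    return index.get(x, [])                   # sigma a = x (R)

def select_range(index, low, high):
    keys = sorted(index)                      # the B-tree's sorted key order
    lo = bisect.bisect_left(keys, low)
    hi = bisect.bisect_right(keys, high)
    return [t for k in keys[lo:hi] for t in index[k]]

R = [{"a": 7, "b": "p"}, {"a": 20, "b": "q"}, {"a": 50, "b": "r"}]
idx = build_index(R, "a")
print(select_eq(idx, 20))          # [{'a': 20, 'b': 'q'}]
print(select_range(idx, 7, 47))    # tuples with 7 <= a <= 47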

8.7. Buffer Management

A buffer is best described as a temporary, in-memory copy of data whose permanent home is a file
on disk. A text editor such as Emacs works the same way: when you open a file you are actually
opening a buffer that holds your changes, and when you save, the editor overwrites the file on disk
with the contents of the buffer. You can revert to the saved version of a buffer by choosing "Revert
Buffer" from the "Files" menu, which discards any changes made since the last save.
As the Instrumented Kernel intercepts events, it stores them in a circular linked list of buffers. As
each buffer fills, the Instrumented Kernel sends a signal to the data-capturing program that the
buffer is ready to be read.

Buffer specifications:
Each buffer is of a fixed size and is divided into a fixed number of slots:
Event buffer slots per buffer 1024
Event buffer slot size 16 bytes
Buffer size 16 K

Although the size of the buffers is fixed, the maximum number of buffers used by a system is
limited only by the amount of memory. (The tracelogger utility uses a default setting of 32 buffers,
or about 500 K of memory.) The buffers share kernel memory with the application(s) and the
kernel automatically allocates memory at the request of the data-capture utility. The kernel
allocates the buffers in contiguous physical memory space. If the data-capture program requests a

larger block than is available contiguously, the Instrumented Kernel will return an error message.
For all intents and purposes, the number of events the Instrumented Kernel generates is infinite.
Except for severe filtering or logging for only a few seconds, the Instrumented Kernel will probably
exhaust the circular linked list of buffers, no matter how large it is. To allow the Instrumented
Kernel to continue logging indefinitely, the data-capture program must continuously pipe (empty)
the buffers.

Full buffers and the high-water mark

As each buffer becomes full (more on that shortly), the Instrumented Kernel sends a signal to the
data-capturing program to save the buffer. Because the buffer size is fixed, the kernel sends only
the buffer address; the length is constant.
The Instrumented Kernel can't flush a buffer or change buffers within an interrupt. If the
interrupt wasn't handled before the buffer became 100% full, some of the events may be lost. To
ensure this never happens, the Instrumented Kernel requests a buffer flush at the high-water
mark.
The high-water mark is set at an efficient, yet conservative, level of about 70%. Most interrupt
routines require fewer than 300 event buffer slots (approximately 30% of 1024 event buffer slots),
so there's virtually no chance that any events will be lost. (The few routines that use extremely
long interrupts should include a manual buffer-flush request in their code.)
Therefore, in a normal system, the kernel logs about 715 events of the fixed maximum of 1024
events before notifying the capture program.

Buffer overruns

The Instrumented Kernel is both the very core of the system and the controller of the event
buffers.
When the Instrumented Kernel is busy, it logs more events. The buffers fill more quickly and the
Instrumented Kernel requests buffer-flushes more often. The data-capture program handles each
buffer-flush request; the Instrumented Kernel switches to the next buffer and continues logging
events. In an extremely busy system, the data-capture program may not be able to flush the
buffers as quickly as the Instrumented Kernel fills them.

8.8. Parallel Algorithms for Relational Operations

One of the important issues concerning the implementation of parallel Data Base Management
Systems (DBMS) is the parallelization of query execution. This section describes the organization of
the parallel query executor in a prototype parallel DBMS, the Omega system. The Omega system
has a three-level hierarchical hardware architecture, characterized by reliability and high
performance.
The execution model utilizes the producer/consumer paradigm and a data-driven/data-flow
mechanism for efficient data exchange between operators. Each operation of the query tree is
represented as a single lightweight process (a thread). In the Omega system each process is taken
as a root thread (only one process can run on each processor module).
Any thread may initialize any number of daughter threads. Thus, the threads form a hierarchy,
which is supported by the thread manager. A dynamic priority value, calculated with the help of a
factor function of the thread, is used to pass control among the threads. In order to implement
intra-operation parallelism, the stream model utilizes a special exchange operator, which
encapsulates all the parallelism of the query executor.

Query executor of the Omega system


The query executor of the Omega system is a virtual machine capable of executing physical
queries, that is, queries expressed in terms of the physical algebra. At the level of the physical
algebra, any kind of parallelism in query execution is implemented explicitly; in particular, the
arguments and results of physical-algebra operations are fragments of relations. Parallel
operations based on relation partitioning are implemented on the higher levels of the system
hierarchy.

8.9. Using Heuristics in Query Optimization


In this lesson, we discuss optimization techniques that apply heuristic rules to modify the internal
representation of a query, which is usually in the form of a query tree or a query graph data
structure to improve its expected performance. The parser of a high-level query first generates an
initial internal representation, which is then optimized according to heuristic rules. Following
that, query execution plan is generated to execute groups of operations based on the access paths
available on the files involved in the query.
One of the main heuristic rules is to apply SELECT and PROJECT operations before applying the
JOIN or other binary operations. This is because the size of the file resulting from a binary
operation, such as JOIN, is usually a multiplicative function of the sizes of the input files. The
SELECT and PROJECT operations reduce the size of a file and hence, should be applied before a
join or other binary operation.
Notation for Query Trees and Query Graphs
A query tree is a tree data structure that corresponds to a relational algebra expression. It represents
the input relations of the query as leaf nodes of the tree, and represents the relational algebra
operations as internal nodes. An execution of the query tree consists of execution of an internal
node operation whenever its operands are available and then replacing that internal node by the
relation that results from executing the operation. The execution terminates when the root node is
executed and produces the result relation for the query.

Figure 4.1 shows a query tree for query block Q2 (given below). For every project located in 'Stafford',
retrieve the project number, the controlling department number, and the department manager’s
last name, address, and birth date. This query is specified on the relational schema of Figure
4.1(a) and corresponds to the following relational algebraic expression:

πPNUMBER, DNUM, LNAME, ADDRESS, BDATE (((σPLOCATION='Stafford' (PROJECT)) ⋈DNUM=DNUMBER (DEPARTMENT)) ⋈MGRSSN=SSN (EMPLOYEE))

This corresponds to the following SQL query:

Q2: SELECT P.PNUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE
FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E
WHERE P.DNUM=D.DNUMBER AND D.MGRSSN=E.SSN AND P.PLOCATION='Stafford';

Figure 4.1: Query trees and query graph for Q2 - (a) the query tree corresponding to the relational
algebraic expression above, (b) the initial (canonical) query tree for the SQL form of Q2, and (c) the
query graph for Q2 (diagrams not reproduced here).

In Figure 4.1(a), the three relations PROJECT, DEPARTMENT and EMPLOYEE are represented by leaf nodes
P, D and E, while the internal tree nodes represent operations of the expression. When this query
tree is executed, the node marked (1) in Figure 4.1 (a) must begin execution before node (2)
because some resulting tuples of operation (1) must be available before we can begin execution
operation (2). Similarly, node (2) must begin executing and producing results before node (3) can
start execution, and so on.
As we can see, the query tree represents a specific order of operations for executing a query. A
more neutral representation of a query is the query graph notation. Figure(c) shows the query
graph for query Q2. Relations in the query are represented by relation nodes, which are displayed
as single circles. Constant values, typically from the query selection conditions are represented by
the constant nodes, which are displayed as double circles. Selection and join conditions are
represented by the graph edges, as shown in Fig.(c). Finally, the attributes to be retrieved from
each relation are displayed in square brackets above each relation.

The query graph representation does not indicate an order in which the operations are performed. There is
only a single graph corresponding to each query. Although some optimization techniques were
based on query graphs, it is now generally accepted that query trees are preferable because, in
practice, the query optimizer needs to show the order of operations for query execution, which is
not possible in query graphs.
Heuristic Optimization of Query Trees
In general, many different relational algebra expressions and hence many different query trees
can be equivalent; that is, they can correspond to the same query. The query parser will typically
generate a standard initial query tree to correspond to an SQL query, without doing any
optimization. In Fig4.1(b) the CARTESIAN PRODUCT of the relations specified in the FROM
clause is first applied; then the selection and join conditions of the WHERE clause are applied,
followed by the projection on the SELECT clause attributes. Such a canonical query tree
represents a relational algebraic expression that is very inefficient if executed directly, because of
the CARTESIAN PRODUCT (x) operations. For example, if the PROJECT, DEPARTMENT, and
EMPLOYEE relations had record sizes of 100, 50, and 150 bytes and contained 100, 200 and 500
tuples, respectively, the result of the CARTESIAN PRODUCT would contain 10 million tuples of
record size 300 bytes each. However, the query tree in Figure 4.1(b) is in a simple standard form
that can be easily created. It is now the job of the heuristic query optimizer to transform this
initial query tree into a final query tree that is efficient to execute.
The optimizer must include rules for equivalence among relational algebra expressions that can
be applied to the initial tree. The heuristic query optimization rules then utilize these equivalence
expressions to transform the initial tree into the final, optimized query tree. We discuss general
transformation rules and show how they may be used in an algebraic heuristic optimizer.
Example of Transforming a Query. Consider the following query Q on the database of Figure
2.1(chapter 2). “Find the last names of employees born after 1957 who work on a project named
‘Aquarius’ “. This query can be specified in SQL as follows:
Q: SELECT LNAME
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE PNAME = ‘Aquarious’ AND PNUMBER = PNO AND ESSN = SSN
AND BDATE > ‘1957-12-31’;
The initial query tree for Q is shown in Figure 4.2(a). Executing this tree directly first creates a
very large file containing the CARTESIAN PRODUCT of the entire EMPLOYEE, WORKS_ON, and
PROJECT files. However, this query needs only one record from the PROJECT relation for the
‘Aquarius’ project and only the EMPLOYEE records for the those whose date of birth is after
‘1957-12-31’. Figure 4.2(b) shows an improved query tree that first applies the SELECT operations
to reduce the number of tuples that appear in the CARTESIAN PRODUCT.
A further improvement is achieved by switching the positions of the EMPLOYEE and PROJECT
relations in the tree, as shown in Figure 4.2( c). This uses the information that PNUMBER
is a key attribute of the PROJECT relation, and hence the SELECT operation on the PROJECT
relation will retrieve a single record only. We can further improve the query tree by replacing any
CARTESIAN PRODUCT operation that is followed by a selection representing a join condition with a
JOIN operation, as shown in Figure 4.2(d). Another improvement is to keep only the attributes
needed by subsequent operations in the intermediate relations, by including PROJECTION (π)
operations as early as possible in the query tree, as shown in Figure 4.2(e). This reduces the
attributes (columns) of the intermediate relations, whereas the SELECT operations reduce the
number of tuples (records).
As the preceding example demonstrates, a query tree can be transformed step by step into
another query tree that is more efficient to execute. However, we must make sure that the
transformation steps always lead to an equivalent query tree. To do this, the query optimizer must
know which transformation rules preserve this equivalence. We discuss some of these
transformation rules next.

Figure 4.2(a): Initial (canonical) query tree for SQL query Q (diagram not reproduced here).
Figure 4.2(b): Improved query tree for Q - moving SELECT operations down the query tree (diagram not reproduced here).

Figure 4.2(c): Applying the more restrictive SELECT operation first (diagram not reproduced here).

Figure 4.2(d): Replacing CARTESIAN PRODUCT and SELECT operations with JOIN operations (diagram not reproduced here).

Figure 4.2(e): Moving PROJECT operations down the query tree (diagram not reproduced here).

General Transformation Rules for Relational Algebra Operations


There are many rules for transforming relational algebra operations into equivalent ones. Here we
are interested in the meaning of the operations and the resulting relations. Hence, if two relations
have the same set of attributes in a different order but the two relations represent the same

information, we consider the relations equivalent. We now state some transformation rules that
are useful in query optimization, without proving them:

1. Cascade of σ: A conjunctive selection condition can be broken up into a cascade (that is, a
sequence) of individual σ operations.

σc1 AND c2 AND ... AND cn (R) = σc1(σc2(...(σcn(R))...))

2. Commutativity of σ: The σ operation is commutative.


σc1(σc2(R))=σc2(σc1(R))

3. Cascade of π: In a cascade (sequence) of π operations, all but the last one can be ignored:
πList1 (πList2 (...(πListn (R))...)) = πList1(R)

4. Commuting σ with π: If the selection condition c involves only those attributes A1, ..., An in the
projection list, the two operations can be commuted:

πA1, A2, ..., An (σc(R)) = σc (πA1, A2, ..., An (R))


5. Commutativity of ⋈ (and ×): The join operation is commutative, as is the × operation:
R ⋈ S = S ⋈ R
R × S = S × R
Notice that, although the order of attributes may not be the same in the relations resulting
from the two joins (or the two Cartesian products), the "meaning" is the same because the order
of attributes is not important in the alternative definition of a relation.

6. Commuting σ with ⋈ (or ×): If all the attributes in the selection condition c involve only
the attributes of one of the relations being joined, say R, the two operations can be
commuted as follows:

σc (R ⋈ S) = (σc (R)) ⋈ S

Alternatively, if the selection condition c can be written as (c1 AND c2), where condition c1
involves only the attributes of R and condition c2 involves only the attributes of S, the
operations commute as follows:

σc (R ⋈ S) = (σc1 (R)) ⋈ (σc2 (S))

The same rules apply if the ⋈ is replaced by a × operation.

7. Commuting π with ⋈ (or ×): Suppose that the projection list is L = (A1, ..., An, B1, ..., Bm),
where A1, ..., An are attributes of R and B1, ..., Bm are attributes of S. If the join condition c
involves only attributes in L, the two operations can be commuted as follows:

πL (R ⋈c S) = (πA1, ..., An (R)) ⋈c (πB1, ..., Bm (S))

If the join condition c contains additional attributes not in L, these must be added to the
projection list, and a final π operation is needed. For example, if attributes An+1, ..., An+k of R
and Bm+1, ..., Bm+p of S are involved in the join condition c but are not in the projection list
L, the operations commute as follows:

πL (R ⋈c S) = πL ((πA1, ..., An, An+1, ..., An+k (R)) ⋈c (πB1, ..., Bm, Bm+1, ..., Bm+p (S)))

For ×, there is no condition c, so the first transformation rule always applies by replacing
⋈c with ×.

8. Commutativity of set operations: The set operations ∪ and ∩ are commutative (set difference − is not).

9. Associativity of ⋈, ×, ∪, and ∩: These four operations are individually associative; that
is, if θ stands for any one of these four operations (throughout the expression), we have:

R θ (S θ T) = (R θ S) θ T

10. Distributing σ over set operations: The σ operation commutes with ∪, ∩, and −. If θ
stands for any one of these three operations (throughout the expression), we have:

σc (R θ S) = (σc (R)) θ (σc (S))

11. The π operation commutes with ∪:

πL (R ∪ S) = (πL (R)) ∪ (πL (S))

12. Converting a (σ, ×) sequence into ⋈: If the condition c of a σ that follows a × corresponds to a
join condition, convert the (σ, ×) sequence into a ⋈ as follows:

(σc (R × S)) = (R ⋈c S)
There are other possible transformations. For example, a selection or join condition c can
be converted into an equivalent condition by using the following rules (DeMorgan’s laws):
NOT (c1 AND c2)= (NOT c1) OR (NOT c2)
NOT (c1 OR c2) = (NOT c1) AND (NOT c2)

Outline of Heuristic Algebraic Optimization Algorithm


We can now outline the steps of an algorithm that utilizes some of the above rules to transform an
initial query tree into an optimized tree that is more efficient to execute (in most cases). The
algorithm will lead to transformations similar to those discussed in our example of Figure 4.2. The
steps of the algorithm are as follows:
1. Using Rule 1, break up any SELECT operation with conjunctive conditions into a cascade of
SELECT operations. This permits a greater degree of freedom in moving SELECT operations
down different branches of the tree.
2. Using Rules 2, 4, 6, and 10 concerning the commutativity of SELECT with other operations,
move each SELECT operation as far down the query tree as is permitted by the attributes
involved in the select condition.
3. Using Rules 5 and 9 concerning commutativity and associativity of binary operations,
rearrange the leaf nodes of the tree using the following criteria. First, position the leaf node
relations with the most restrictive SELECT operations so they are executed first in the query
tree representation. The definition of most restrictive SELECT can mean either the one that
produces a relation with the fewest tuples or with the smallest absolute size. Another
possibility is to define the most restrictive SELECT as the one with the smallest selectivity;
this is more practical because estimates of selectivity are often available in the DBMS catalog.
Second, make sure that the ordering of leaf nodes does not cause CARTESIAN PRODUCT
operations. For example, if the two relations with the most restrictive SELECT do not have a
direct join condition between them, it may be desirable to change the order of leaf nodes to
avoid Cartesian products.
4. Using Rule 12, combine a CARTESIAN PRODUCT operation with a subsequent SELECT
operation in the tree into a JOIN operation, if the condition represents a join condition.
5. Using Rules 3, 4, 7, and 11 concerning the cascading of PROJECT and the commuting of
PROJECT with other operations, break down and move lists of projection attributes down the
tree as far as possible by creating new PROJECT operations as needed. Only those attributes
needed in the query result and in subsequent operations in the query tree should be kept
after each PROJECT operation.
6. Identify sub-trees that represent groups of operations that can be executed by a single
algorithm.
In our example, Figure 4.2(b) shows the tree of Figure 4.2(a) after applying Steps 1 and 2 of the
algorithm; Figure 4.2(c) shows the tree after Step 3; Figure 4.2(d) after Step 4; and Figure 4.2(e)
after Step 5. In Step 6 we may group together the operations in the sub-tree whose root is the
operation πESSN into a single algorithm. We may also group the remaining operations into
another sub-tree, where the tuples resulting from the first algorithm replace the sub-tree whose
root is the operation πESSN, because the first grouping means that this sub-tree is executed first.

8.10. Basic Algorithms for Executing Query Operations


Converting Query Trees into Query Execution Plans
An execution plan for a relational algebraic expression represented as a query tree includes
information about the access methods available for each relation as well as the algorithms to be
used in computing the relational operators represented in the tree. As a simple example, consider
query Q1 whose corresponding relational algebraic expression is:

πFNAME, LNAME, ADDRESS (σDNAME='Research' AND DNUMBER=DNO (DEPARTMENT × EMPLOYEE))


The query tree is shown in Figure 4.3. To convert this into an execution plan, the optimizer might
choose an index search for the SELECT operation (assuming one exists), a table scan as the access
method for EMPLOYEE, a nested-loop join algorithm for the join, and a scan of the JOIN result for
the PROJECT operator. In addition, the approach taken for executing the query may specify a
materialized or a pipelined evaluation.
With materialized evaluation, the result of an operation is stored as a temporary relation (that is, the
result is physically materialized). For instance, the join operation can be computed and the entire
result stored as a temporary relation, which is then read as input by the algorithm that computes
the PROJECT operation, which would produce the query result table. On the other hand, with
pipelined evaluation, as the resulting tuples of an operation are produced, they are forwarded
directly to the next operation in the query sequence. For example, as the selected tuples from
DEPARTMENT are produced by the SELECT operation, they are placed in a buffer; the JOIN
operation algorithm would then consume the tuples from the buffer, and those tuples that result
from the JOIN operation are pipelined to the projection operation algorithm. The advantage of
pipelining is the cost saving in not having to write the intermediate results to disk and not
having to read them back for the next operation. A sketch of the two approaches follows.
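
The sketch below contrasts the two approaches for the plan just described (select on DEPARTMENT, join with EMPLOYEE, project), using Python generators to model pipelining and lists to model materialized temporary relations. The sample data and column names are assumptions for illustration, not the textbook's.

DEPARTMENT = [{"DNAME": "Research", "DNUMBER": 5, "MGRSSN": "333"},
              {"DNAME": "Sales",    "DNUMBER": 4, "MGRSSN": "444"}]
EMPLOYEE   = [{"FNAME": "John", "LNAME": "Smith", "ADDRESS": "Houston", "DNO": 5}]

def select_research(depts):
    for d in depts:                       # pipelined: yield tuples one by one
        if d["DNAME"] == "Research":
            yield d

def join_on_dno(depts, emps):
    emp_by_dno = {}
    for e in emps:
        emp_by_dno.setdefault(e["DNO"], []).append(e)
    for d in depts:                       # consume selected tuples as they arrive
        for e in emp_by_dno.get(d["DNUMBER"], []):
            yield {**d, **e}

def project(rows, cols):
    for r in rows:
        yield {c: r[c] for c in cols}

# Pipelined evaluation: nothing is written to a temporary table; each operator
# pulls tuples from the one below it.
pipelined = project(join_on_dno(select_research(DEPARTMENT), EMPLOYEE),
                    ["FNAME", "LNAME", "ADDRESS"])
print(list(pipelined))

# Materialized evaluation: force each intermediate result into a list (standing
# in for a temporary relation on disk) before the next operator runs.
tmp1 = list(select_research(DEPARTMENT))
tmp2 = list(join_on_dno(tmp1, EMPLOYEE))
print(list(project(tmp2, ["FNAME", "LNAME", "ADDRESS"])))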

Figure 4.3: Query tree for query Q1 (diagram not reproduced here).


Unit 9
The Query Compiler

9.1.Parsing
9.2.Algebraic Laws for Improving Query Plans
9.3.From Parse Trees to Logical Query Plans
9.4.Estimating the Cost of Operations
9.5.Introduction to Cost-Based Plan Selection
9.6.Completing the Physical-Query-Plan
9.7.Coping With System Failures
9.8.Issues and Models for Resilient Operation
9.9.Redo Logging
9.10.Undo/Redo Logging
9.11.Protecting Against Media Failures

9.1. Parsing

One of the most powerful features of Rexx is its ability to parse text values. If you are like many
others who are learning Rexx you may be unfamiliar with the word parse. Perhaps you recall
parsing sentences during your schooling, but you think that was quite some time ago. Webster's
New World Dictionary contains the following definition.
parse vt., vi. parsed, pars'ing

1. To separate (a sentence) into its parts, explaining the grammatical form, function, and
interrelation of each part.
2. To describe the form, part of speech, and function of (a word in a sentence)
The above definition has little in common with the Rexx parsing capability. The key phrase is: "to
separate into its parts". For the word parse is computer science parlance for the act of separating
computer input into meaningful parts for subsequent processing actions.
Rexx is one of few languages which provides parsing as a fundamental instruction. Most
languages merely provide lower level string separation capabilities, leaving the preparation of
parsing capabilities as user developed endeavors. Within Rexx, these capabilities are immediately
available, and this is very powerful.

Preparing to parse

Let us learn about parsing by analyzing the following reduction of Descartes' famous quote:
I think I am
Here is a program that parses the words in the phrase. When a value consists of words that are
separated by only one space, and there are no leading or trailing spaces, the value is easy to parse
into a known number of words as follows.
parse value 'I think I am' with word1 word2 word3 word4
say "'"word1"'"
say "'"word2"'"
say "'"word3"'"
say "'"word4"'"
This shows:
'I'
'think'
'I'
'am'

Here is another program that parses the above phrase.


phrase = 'I think I am'
do while phrase <> ''
parse var phrase word phrase
say "'"word"'"
end

Let Rexx know what you mean

When the value that is being parse contains punctuation that partitions the values into
meaningful components, you can easily assign these parts to variables. Consider the following
example:
parse value 'I think, therefore I am (I think)' with precondition ', ' consequence ' (' qualifier ')'
say 'precondition' precondition
say 'consequence' consequence
say 'qualifier' qualifier

This shows:
precondition I think
consequence therefore I am
qualifier I think

Suppose the value consists of a sequence of fields separated by tabs. You can easily assign these
to variables as follows:
tab = '09'x /* this is an Ascii tab character */

parse var request ,


Company (tab) ,
Sales (tab) ,
CostOfGoods (tab) ,
NetIncome (tab) ,
Cash (tab) ,
AccountsReceivable (tab) ,
AccountsReceivablePrior (tab) ,
Inventory (tab) ,
InventoryPrior (tab) ,
OtherCurrentAssets (tab) ,
PropertyEquipment (tab) ,
AccumulatedDepreciation (tab) ,
OtherAssets (tab) ,
TotalAssetsPrior (tab) ,
CurrentLiabilities (tab) ,
LongTermDebt (tab) ,
OtherLiabilities (tab) ,
PreferredStock (tab) ,
CommonStock (tab) ,
RetainedEarnings (tab) ,
StockholdersEquity (tab) ,
StockValue (tab) ,
SharesOutstanding (tab) ,
PreferredDividends (tab)

Multiple value assignment


You might have seen Rexx programs that have multiple assignment instructions on a single line,
especially in books. Your programs will be easier to understand if the assignments are placed on
separate lines.

Consider the following example.


drop a3; a33 = 7; k = 3; fred='K'; list.5 = '?'
The parse instruction can perform multiple assignments. The above assignments can be
accomplished as follows:
drop a3 /* the parse instruction can not drop values */
parse value 7 3 'K' '?' with a33 k fred list.5

How does parsing work ?

The parse statement divides a source string into constitutent parts and assigns these to variables,
as directed by the parsing template.

The following picture introduces how parsing is performed, with multiple space dividers between
the variables to assign.

While the template is processed from left to right, several current positions in the source string are
maintained. The motion of these positions is guided by the division specifiers within the template.
In the picture above, the positions are those that would be in effect after the template's verb term
is processed. The object term will be processed next. The previous start position locates the 'l' in
'likes'. The current end position locates the space between 'likes' and 'peaches'. The next start
position locates the 'p' in 'peaches'. With these positions established the value 'likes' is assigned to
variable verb. When the object term is processed, it is the only term remaining. Consequently, the
remainder of the source string is assigned to the object variable -- it receives the value: 'peaches
and cream'.
If a relative position division specifier followed the verb term, the verb variable would receive that
many characters after the previous start position and all positions would be advanced to that
relative position. Study the following effect:
parse value 'Sam likes peaches and cream' with subject verb +2 object
say 'subject:' subject
say 'verb:' verb
say 'object:' object
This shows:
subject: Sam
verb: li
object: kes peaches and cream

The following is another illustration that shows how parsing is performed, with a literal pattern
divider between the variables to assign.


The literal pattern in this example is a quoted comma -- ',' . The previous start position locates the
't' in 'think'. The current end position locates the ','. The next start position locates the space
between the comma and the 't' in 'therefore'. With these positions established the value 'I think' is
assigned to variable precondition. When the consequence term is processed, it is the only term
remaining. Consequently, the remainder of the source string is assigned to the consequence
variable -- it receives the value: ' therefore I am'. This value contains a leading space.

9.2. Algebraic Laws for Improving Query Plans

The Oracle database has three different optimizer modes. The default optimizer mode is RULE-
based, and this can be changed using the ALTER SESSION command. To obtain a query plan for a
specific query, execute the EXPLAIN PLAN command. The result of the EXPLAIN PLAN will be
inserted into a plan_table; therefore, before executing the EXPLAIN PLAN command, the
plan_table must be created. To view the result of the EXPLAIN PLAN command, simply query the
plan_table (using a simple SELECT statement). Some important column names are:
statement_id, operation, options, object_name, id, parent_id and cost.

TUTORIAL QUESTION

Given the following EXPLAIN PLAN statement:


EXPLAIN PLAN SET STATEMENT_ID = 'Q'
INTO plan_table
FOR
SELECT s.idno, s.fname, s.lname
FROM student s, enrol e, subject su
WHERE s.idno=e.idno AND su.subno=e.subno
AND su.subno='CSE3000' AND mark >= 80;
Below is the result inserted into the plan_table. The original showed, for each row of the plan, the OPERATION, OPTIONS, OBJECT_NAME, ID and PARENT_ID columns; the row values themselves are not reproduced here.

PLAN_TABLE

CREATE TABLE plan_table


(statement_id VARCHAR2(2),
timestamp DATE,
remarks VARCHAR2(80),
operation VARCHAR2(30),
options VARCHAR2(30),
object_node VARCHAR2(128),
object_owner VARCHAR2(30),
object_name VARCHAR2(30),
object_instance NUMERIC,
object_type VARCHAR2(30),
optimizer VARCHAR2(255),
search_columns NUMERIC,
id NUMERIC,
parent_id NUMERIC,
position NUMERIC,
cost NUMERIC,
cardinality NUMERIC,
bytes NUMERIC,
other_tag VARCHAR2(255),
other LONG);

INDEX
CREATE INDEX student_index ON student (lastname);
CREATE INDEX enrol_index ON enrol (mark);
DROP INDEX student_index;
DROP INDEX enrol_index;

OPTIMIZATION GOALS
The default optimizer mode can be changed by executing one of the following statements.
ALTER SESSION SET OPTIMIZER_MODE=ALL_ROWS;
ALTER SESSION SET OPTIMIZER_MODE=FIRST_ROWS;
ALTER SESSION SET OPTIMIZER_MODE=RULE;

EXPLAIN PLAN
EXPLAIN PLAN
SET STATEMENT_ID = 'Q1'
INTO plan_table
FOR
SELECT * FROM student WHERE id=14506302;
The above statement generates the plan for the query and inserts it into the plan_table.
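A minimal sketch of how the stored plan could then be viewed, assuming the plan_table defined above (the column names come from that CREATE TABLE statement):

-- read back the plan rows for statement 'Q1'; children follow their parents
SELECT id, parent_id, operation, options, object_name, cost
FROM plan_table
WHERE statement_id = 'Q1'
ORDER BY id;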

SPOOL
The following script creates a spool file called results.txt that captures all the output displayed on the screen from the time spooling is turned on until it is turned off.
Spool On
set pause off
set echo on
spool results.txt
Spool Off
set echo off
spool off
set pause on

9.3. From Parse Trees to Logical Query Plans

The parse tree is transformed into an expression tree of relational algebra, which is a logical query
plan. The logical query plan must be turned into a physical query plan.

Syntax Analysis and Parse Trees


Nodes of a parse tree:
Atoms: lexical elements such as keywords, names of attributes or relations, constants, parentheses and operators. An atom has no children.
Syntactic categories: names for families of query subparts that all play a similar role in a query. Notation: a name inside triangular brackets, e.g. <Query>.

A Grammar for a Simple SQL


A real SQL grammar would have a much more complex structure for queries

Queries
<Query> ::= <SFW>
<Query> ::= ( <Query> )

Select-From-Where
<SFW> ::= SELECT <SelList> FROM <FromList>
          WHERE <Condition>

Simple Subset of SQL


Select-Lists
<SelList> ::= <Attribute> , <SelList>
<SelList> ::= <Attribute>

From-Lists
<FromList> ::= <Relation> , <FromList>
<FromList> ::= <Relation>

Conditions
<Condition> ::= <Condition> AND <Condition>
<Condition> ::= <Tuple> IN <Query>
<Condition> ::= <Attribute> = <Attribute>
<Condition> ::= <Attribute> LIKE <Pattern>
<Tuple> ::= <Attribute>

Base Syntactic Categories


<Attribute>, <Relation>, <Pattern>
They are not defined by grammatical rules, but by rules about the atoms for which they can stand.
For example, one child of <Attribute> can be any string of characters that can be interpreted as
the name of an attribute.

Example
StarsIn(title, year, starName)
MovieStar(name, address, gender, birthdate)

Find the movies with stars born in 1960


SELECT title
FROM StarsIn
WHERE starName IN
(
SELECT name
FROM MovieStar
WHERE birthdate LIKE ‘%1960’

);

Example: Parse Tree

The parse tree for this query has <Query> at its root with a single child <SFW>. Its <SelList> expands to the attribute title, its <FromList> to the relation StarsIn, and its <Condition> to <Tuple> IN ( <Query> ), where <Tuple> expands to the attribute starName and the nested <Query> is the parse tree of the subquery SELECT name FROM MovieStar WHERE birthdate LIKE '%1960'.

Example
Find the movies with stars born in 1960

SELECT title FROM StarsIn, MovieStar


WHERE starName = name AND birthdate LIKE ‘%1960’;

Algebraic Laws for Improving Query Plans

The goal is to obtain an equivalent expression tree (a logical query plan) that may have a more efficient physical query plan. The main classes of laws are:
• Commutative and associative laws
• Laws involving selection (in particular, pushing selections down)
• Laws involving projection
• Duplicate elimination
• Grouping and aggregation

Commutative and Associative Laws


A commutative law about an operator says that it does not matter in which order you present the arguments of the operator. An associative law about an operator says that we may group two uses of the operator either from the left or from the right.

Property
When an operator is both associative and commutative, any number of operands connected by this operator can be grouped and ordered as we wish without changing the result.
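As a brief illustration, these are the standard commutative and associative laws for products, natural joins and unions of relations (standard relational-algebra identities stated for reference, not reproduced from this text):

R \times S = S \times R, \qquad R \bowtie S = S \bowtie R, \qquad R \cup S = S \cup R

(R \bowtie S) \bowtie T = R \bowtie (S \bowtie T), \qquad (R \cup S) \cup T = R \cup (S \cup T)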

9.4. Estimating the Cost of Operations

Query Optimization: Estimating Costs and Cost-Based Plans

Cost-Based Query Optimization

Cost-based query optimization is the process of selecting an optimal or near-optimal query execution plan by computing or estimating the cost of executing each candidate plan and choosing the best one.

Problems: too many plans (exponential growth); cost is not exactly computable unless the query is executed; statistics on databases are often not constant and/or unreliable; network cost cannot be anticipated...

Estimating the Cost of Operations

Estimating cost is useful both for improving logical plans and for choosing physical plans. Cost cannot be computed exactly, so it must be estimated. For most algorithms implementing operators, the cost is roughly proportional to the size of the (input) relations.
Estimating the Size of Operation Results
• Sizes of base relations are known (data dictionary)
• Statistics can be gathered (see below)
• The size of a join can be estimated from the sizes of the underlying relations and the number of duplicate values

Example

Consider the natural join R ⋈ S. Relation R has 100,000 tuples and relation S has 200 tuples; the join is on a common attribute A. The number of distinct values in R.A is 100, uniformly distributed, so each value occurs 1,000 times. A is a primary key of S, so at most 100 values of S.A can match values of R.A. Therefore T(R ⋈ S) = 1,000 * 100 = 100,000 tuples.
Notation: |R| ≡ T(R), the number of tuples of R.

Estimating the Size of a Projection (π)

• Computation is trivial if projection is a bag operator.


• Only interesting part: can compute the size of relation in bytes (by looking at the size of each
tuple).
• For set-semantics: computation must be done like for a δ-operator (i.e., statistics must be
available).

Estimating the Size of a Selection

Concept of selectivity:
sel(σ) = (number of output tuples) / (number of input tuples)

Simple case: the select-condition is an equality with a constant (A = c):
T(σ(R)) ≈ T(R) / V(R,A)
where V(R,A) is the number of distinct values in attribute R.A.
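As a small worked instance of this estimate (the numbers are illustrative, not taken from the text):

T(R) = 10000, \quad V(R,A) = 50 \;\Rightarrow\; T(\sigma_{A=c}(R)) \approx \frac{10000}{50} = 200 \text{ tuples}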

Estimating Selectivities

More difficult cases:

• A < c: add the selectivities of all values < c, or just assume some constant ratio (for example 1/3)
• A ≠ c: assume no tuples are "selected out" ⇒ T(σ(R)) = T(R)
• R.A = S.A: better statistics are needed (see the join estimates below)
• conjunctions (AND) and disjunctions (OR) of conditions: use standard statistical techniques

Estimating the Size of a Join

Simple: T (R × S) = T (R) × T (S).


Equijoin R ⋈(R.Y=S.Y) S: under many assumptions, the following holds:
T(R ⋈ S) ≈ T(R) · T(S) / max(V(R,Y), V(S,Y))
Remember: the size of the tuples might also be an issue.
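As a small worked instance of the equijoin estimate (illustrative numbers only, not from the text):

T(R) = 1000, \; T(S) = 2000, \; V(R,Y) = 20, \; V(S,Y) = 50 \;\Rightarrow\; T(R \bowtie S) \approx \frac{1000 \cdot 2000}{\max(20,50)} = 40000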

Some Research Results

• use sampling to determine join sizes


• collect histograms (i.e., more sophisticated versions of “V(R,A)”)
• gather statistics by “listening in” to queries
• number of distinct values in each attribute
• domains of values (minimum, maximum, range)
• update statistics as database gets updated (“incrementally”)

9.5. Introduction to Cost-Based Plan Selection

The number of disk I/O's is influenced by:
• the particular logical operators chosen to implement the query,
• the sizes of intermediate relations,
• the physical operators used to implement the logical operators,
• the ordering of similar operations, and
• the method of passing arguments from one physical operator to the next.

Obtaining Estimates (1/2)

A modern DBMS generally allows the user or administrator to explicitly request the gathering of statistics, which are then used in query optimization:
• T(R) and V(R,a): obtained by scanning the entire relation R
• B(R): count the actual number of blocks used (if R is clustered), or estimate it by dividing T(R) by the average number of tuples per block

Obtaining Estimates (2/2)

DBMSs may compute a histogram of the values for a given attribute. If V(R,A) is not too large, the histogram consists of the number (or fraction) of the tuples having each value of attribute A.

Three common types of histograms
Equal-width
Equal-height
Most-frequent-values

One advantage of keeping a histogram is that the sizes of joins can be estimated more accurately.

Example (1/2)

Consider the join: R(a,b) ⋈ S(b,c)


Histogram for R.b
1: 200, 0: 150, 5: 100, others: 550
Histogram for S.b
0: 100, 1: 80, 2: 70, others: 250
Assume, V(R,b) = 14 and V(S,b) = 13
550 tuples of R and 250 tuples of S are divided among eleven values and ten values, respectively

Example (2/2)
The 150 tuples of R with b = 0 join with the 100 tuples of S having b = 0, to yield 15,000 tuples.
With b = 1, 200 * 80 = 16,000 tuples.
With b = 2, R contributes on average 50 tuples, so 50 * 70 = 3,500 tuples.
With b = 5, S contributes on average 25 tuples, so 100 * 25 = 2,500 tuples.
Each of the nine other b-values common to both relations contributes 50 * 25 = 1,250 tuples.
The estimate of the output size is therefore
15,000 + 16,000 + 3,500 + 2,500 + 9*1,250 = 48,250

* Note that the simpler estimate T(R) * T(S) / max(V(R,b), V(S,b)) = 1000 * 500 / 14 ≈ 35,714 would be obtained without the histogram.

Example (1/2)
Consider two relations
Jan(day, temp)
July(day, temp)
The query is:
SELECT Jan.day, July.day
FROM Jan, July
WHERE Jan.temp = July.temp;

(The original figure showed histograms of Jan.temp and July.temp by temperature range: 0-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89 and 90-99, with a count column for each relation; the counts are not reproduced here.)
“Find pairs of days in January and July that had the same temperature”

Query Plan: all grandparents of ann from the available sources.


answer(X) :- parent(X,Z), parent(Z,ann)
parent(X,Y) :- v1(X,Y)
parent(X,Y) :- v2(X,Y)

Functional Dependencies

Suppose the virtual relations:


conference(Paper, Conference), year(Paper, Year), location(Conference, Year, Location)

Functional dependencies
conference: Paper -> Conference
year: Paper -> Year
location: Conference, Year -> Location
Information sources
v1(P,C,Y) :- conference(P,C), year(P,Y)
v2(P,L) :- conference(P,C), year(P,Y), location(C,Y,L)
Query: q(L):- location(ijcai, 1991, L)
Answer: answer(L) :- v1(P, ijcai, 1991), v2(P, L)

Definition (inverse rule): Let v be a source description of the form v(X) :- p1(X1), ..., pn(Xn), where X and the Xi are tuples of variables. Then, for j = 1, ..., n, the rule pj(Xj') :- v(X) is an inverse rule of v. Xj' is obtained from Xj as follows: if a term in Xj is a constant or a variable that also appears in X, it is unchanged in Xj'; otherwise it is one of the variables appearing in the body of v but not in X, and it is replaced by a function term f(X). The purpose of the inverse rules is to recover tuples of the virtual relations from the source relations.

9.6. Completing the Physical-Query-Plan

Estimating sizes of relations


The sizes of intermediate results are important for the choices made when planning query execution. The time for most operations grows roughly linearly with the size of the (largest) argument, so the total size can even be used as a crude estimate of the running time.

Statistics for computing estimates


The book suggests several statistics on relations that may be used to (heuristically) estimate the
size of intermediate results.
T(R): # tuples in R
S(R): # bytes in each R tuple
B(R): # blocks to hold all R tuples
V(R, A): # distinct values in R for attribute A

Size estimates for W = R1 × R2

T(W) = T(R1) · T(R2)
S(W) = S(R1) + S(R2)

Question: How good are these estimates?

Size estimate for W = σA=a(R)

S(W) = S(R)
T(W) = ?

Some possible assumptions

1. Values in the select expression A = a (or at least one of them) are uniformly distributed over the possible V(R,A) values.
2. As above, but with a uniform distribution over a domain with DOM(R,A) values.
3. Zipfian distribution of values.

Selection cardinality
SC(R,A) = expected number of records that satisfy an equality condition on R.A

Under assumption 1: SC(R,A) = T(R) / V(R,A)
Under assumption 2: SC(R,A) = T(R) / DOM(R,A)
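A small worked comparison of the two assumptions (illustrative numbers only, not from the text):

T(R) = 10000,\; V(R,A) = 100,\; DOM(R,A) = 1000

SC(R,A) = \frac{10000}{100} = 100 \;\text{(assumption 1)}, \qquad SC(R,A) = \frac{10000}{1000} = 10 \;\text{(assumption 2)}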

Size estimate for W = σA≥a(R)

T(W) = ?
Suggestion #1: T(W) = T(R)/2
Suggestion #2: T(W) = T(R)/3
Suggestion #3 (not in book): be consistent with the equality estimate.

Example: consistency with the second equality estimate.
Attribute A of R ranges from Min = 1 to Max = 20, so DOM(R,A) = 20.
W = σA≥15(R)
f = (20 − 14)/20 (the fraction of the range satisfying the condition: the 6 values 15..20 out of 20)
T(W) = f · T(R)

Problem session
Consider the natural join operation on two relations R1 and R2 with join attribute A.
If the values of A are uniformly distributed over DOM(R1,A) = DOM(R2,A) values, what is the expected size of R1 ⋈ R2?
What can you say if the values of A are instead uniform over, respectively, V(R1,A) and V(R2,A) values?
What if A is a primary key of R1 and/or R2?

Crude estimate

Assume values are uniformly distributed over the domain. Consider W = R1 ⋈ R2 with R1(A,B,C) and R2(A,D) joined on A. A given R1 tuple matches T(R2)/DOM(R2,A) tuples of R2, so

T(W) = T(R1) · T(R2) / DOM(R2,A) = T(R1) · T(R2) / DOM(R1,A)

General crude estimate

Let W = R1 ⋈ R2 ⋈ R3 ⋈ ... ⋈ Rk, all joined on attribute A. Then

T(W) = T(R1) · T(R2) · ... · T(Rk) / DOM(R1,A)^(k−1)

This estimate is symmetric with respect to the relations, a rare property.

"Better" size estimate for W = R1 ⋈ R2

Assumption: containment of value sets
If V(R1,A) ≤ V(R2,A), then every A value in R1 is also in R2.
If V(R2,A) ≤ V(R1,A), then every A value in R2 is also in R1.
(Relations R1(A,B,C) and R2(A,D), as before.)

Computing T(W) when V(R1,A) ≤ V(R2,A)

Take one tuple of R1; it matches T(R2)/V(R2,A) tuples of R2, so
T(W) = T(R1) · T(R2) / V(R2,A)

In general:
T(W) = T(R1) · T(R2) · ... · T(Rk) · min{V(R1,A), ..., V(Rk,A)} / (V(R1,A) · V(R2,A) · ... · V(Rk,A))
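A small worked instance of this estimate for three relations (illustrative numbers only, not from the text):

T(R1)=1000,\; T(R2)=2000,\; T(R3)=3000,\; V(R1,A)=20,\; V(R2,A)=50,\; V(R3,A)=30

T(W) \approx \frac{1000 \cdot 2000 \cdot 3000 \cdot \min\{20,50,30\}}{20 \cdot 50 \cdot 30} = \frac{6 \cdot 10^{9} \cdot 20}{30000} = 4 \cdot 10^{6}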

General estimate
Let W = R1 ⋈ R2 ⋈ R3 ⋈ ... ⋈ Rk. The underlying assumptions are the same as above.

The previous estimates are easily extended to several join attributes A1, ..., Aj under a new assumption: values in different attributes are independent. Under assumption 1, the joint values in the attributes are uniformly distributed over V(R,A1) · V(R,A2) · ... · V(R,Aj) values; under assumption 2, they are uniform over DOM(R,A1) · DOM(R,A2) · ... · DOM(R,Aj) values.

Other estimates use similar ideas

π_AB(R) (see Sec. 16.4.2)
σ(A=a ∧ B=b)(R)
Union, intersection, duplicate elimination, difference, ...

Improved estimates through histograms

Idea: maintain more information than just V(R,A).
Histogram: the number of values in each of a number of intervals, e.g.
< 11   11−17   18−22   23−30   31−41   > 42

9.7. Coping With System Failures

Integrity or consistency constraints: predicates that the data must satisfy.

Examples:
- x is a key of relation R
- the functional dependency x → y holds in R
- Domain(x) = {Red, Blue, Green}
- a is a valid index for attribute x of R
- no employee should make more than twice the average salary

SQL uses constraints or triggers to enforce integrity, for example:
CHECK (age > 16 AND age <= 99)
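A minimal sketch of such a constraint in a table definition (the table and column names here are illustrative, not taken from the text):

-- age is restricted by a CHECK constraint; the DBMS rejects violating rows
CREATE TABLE employee (
    emp_id  NUMERIC PRIMARY KEY,
    name    VARCHAR2(30),
    age     NUMERIC CHECK (age > 16 AND age <= 99),
    salary  NUMERIC
);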

Definitions
Consistent state: satisfies all constraints
Consistent DB: DB in consistent state
Ideally: database should reflect real world
(The original figure compared the values stored in the DB for elements A1 and A2, e.g. 500 and 600, with the corresponding real-world values, showing that the two can diverge.)

Example: Salary(A1) = Salary(A2) (constraint)

Deposit $100 in A1: A1 ← A1 + 100
Deposit $100 in A2: A2 ← A2 + 100
The DB may not always be consistent!

Transaction: a collection of actions that preserve consistency.

Consistent DB --T--> Consistent DB'

Big assumption:
If T starts with a consistent state and T executes in isolation, then T leaves a consistent state.

Correctness (informally):
If we stop running transactions, the DB is left consistent.
Each transaction sees a consistent DB.

How can constraints be violated?
• Transaction bug
• DBMS bug
• Hardware failure, e.g., a disk crash alters the balance of an account
• Data sharing, e.g.: T1 gives a 10% raise to programmers while T2 changes programmers to systems analysts

Coping with system failures


Question 1: What is our failure model?

Events can be classified as desired or undesired, and undesired events as expected or unexpected; they may involve the CPU, memory, or disk.

Desired events: see the product manuals...
Undesired expected events: system crash (memory lost; CPU halts and resets).
Undesired unexpected events: everything else. Examples: disk data is lost; memory is lost without a CPU halt; the CPU dies...

Is this model reasonable?


Approach: Add low level checks + redundancy
E.g., Replicate disk storage (stable store), Memory parity, CPU checks.

Question 2: the storage hierarchy. Each database element x may have a copy on disk and a copy in memory.

Operations:
Input(x): copy the block containing x from disk to memory
Output(x): copy the block containing x from memory to disk
Read(x,t): do Input(x) if necessary; t ← value of x in the block
Write(x,t): do Input(x) if necessary; value of x in the block ← t

Key problem: unfinished transactions

Example constraint: A = B
T1: A ← A·2
    B ← B·2

T1: Read(A,t); t ← t·2
    Write(A,t);
    Read(B,t); t ← t·2
    Write(B,t);
    Output(A);
    Output(B);

(The original figure traced the memory and disk copies: A and B both start at 8; after the writes the memory copies are 16; if a failure occurs after Output(A) but before Output(B), the disk holds A = 16 and B = 8, violating A = B.)

Need atomicity: execute all actions of a transaction or none at all.
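A minimal SQL sketch of the same idea, assuming an illustrative table accounts(name, balance) that is not part of the original example:

-- both updates form one transaction: after a crash, either both
-- doubled values are on disk or neither is
UPDATE accounts SET balance = balance * 2 WHERE name = 'A';
UPDATE accounts SET balance = balance * 2 WHERE name = 'B';
COMMIT;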

9.8. Issues and Models for Resilient Operation

Methods, models, etc. to assess both the vulnerability and the resiliency of social, political, and economic systems across different units of analysis (e.g., individuals, organizations, institutions in both the public and private sector, as well as nongovernmental organizations), geographic scales, and phases of the emergency management cycle (e.g., preparedness, response, recovery, and mitigation).

Assessment of direct, (psychological, social, economic), indirect, and ripple effects resulting from
the September 11 attacks

Risk factors affecting both impacts and outcomes. Relationships and Connections Between
Human and Physical (engineered) Systems. Research is needed to identify ways in which the built
environment and human and organizational behavior interact to either amplify or reduce
vulnerability. Topics for study include:
• Models, methods, and data focusing on the interface between human and physical (engineered) systems, and in particular ways in which these systems can be better integrated. Examples include building designs and emergency plans that enhance life safety by protecting building occupants and facilitating emergency egress.
• Risk communication, pre-event planning, and post-event response management to protect lives and property and to encourage appropriate self-protective behavior.

Institutional Arrangements
Additional research is needed to address institutional, multiorganizational, and organizational
dimensions of pre-event mitigation and planning and post-event response and recovery.
Research focusing on the following areas is needed:
• Capability and adaptability of institutions (e.g., governmental and private-sector entities
and entities responsible for infrastructure maintenance) to deal with vulnerability both
before and after a disaster.
• Interorganizational and intergovernmental relations, including dynamics of multi- agency
decision-making and challenges associated with horizontal (among organizations) and
vertical (among different governmental levels) integration in major crises.
• Communications and information sharing among individuals, groups, and organizations,
especially with respect to the various phases of the emergency management cycle (e.g.,
preparedness, response, recovery, and mitigation).

• Social, political, legal, administrative, and other factors that influence institutional
behavior and response in large-scale and near-catastrophic events

Decision-making and Risk


Additional research is needed to improve decision-making at different units of analysis. Topics for
further research include:
• Models and approaches to characterize tradeoffs and decision processes employed by
individuals, organizations, and institutions across the emergency management cycle.
• Decision processes at various levels of analysis across the entire hazards/emergency
management cycle and those that provide linkages among the various stages of the cycle.

CROSS-CUTTING ISSUES
Many research needs span both engineering and social science disciplines. These areas of
convergence include the need for:
• Improved theories, models, methods, and analytical tools, including tools that are capable
of integrating data both spatially and temporally.
• Strategies to ensure maximum data availability, access, and sharing.
• Research focusing on documenting and analyzing both successes and failures in engineered and human systems (e.g., robust and redundant structures and systems, successful organizational coping and adaptation in crises).
• Research to better understand similarities and dissimilarities among varied disaster
agents--natural, technological, and terrorism-related disasters.
• Studies that address the needs of a wide range of users and target audiences (e.g.,
organizations charged with responsibility for managing response, recovery and
reconstruction activities).

Structures and Physical Systems

1. Analytical models / Simulation of performance. This capability has been developed in other
areas and can be applied to structures.

1.A. Data from the World Trade Center collapse is needed to validate such models and
simulations. The design and operation should be considered under normal and extreme
events. Data from other buildings and cases should also be included.

2. Analytical models / Simulation of building systems. This area refers to the electrical,
mechanical aspects of buildings. Examples include temperature, air flow, and other aspects.

2.A. Data from the World Trade Center is needed to validate these models and
simulations. Design and operation under normal and extreme events should be included.

3. Analytical models / Simulation of emergency management and human response. Such tools
can be used in planning and execution.

3.A. Data from the World Trade Center should be used to validate these models and
simulations.

4. Analytical models of information flows, including sharing of information. This research topic
consists of looking at what was done in terms of data sharing and what could be done better in
the future.

4. A. An area of research within this topic is the availability and incentives for sharing
information. Being able to demonstrate the consequences of lack of sharing. How access
to information can be preserved while respecting security needs.

5. Debris field and collateral damage. This research area addresses questions related to where
the collapsed pieces are likely to go and what the structure of the collapsed material is likely to
be.

5. The area of analysis includes both the surface and subsurface, and this also includes
infrastructure.

6. Structure of collapsed buildings. This refers to three areas: safety and removal; prediction of
void
spaces; and strategies for search and rescue.

7. Environmental consequences. This area includes, but is not limited to: airborne/plume model;
water borne and land based pollution; evolution of source over time; model validation (WTC and
other crises for urban terrains); and NBC applications.

8. Intelligent buildings and bridges. This research addresses the role of advanced technologies on
intelligent structures/buildings and their future performance goals.

9. Distributed networks. Given New York City’s unique energy network, an important research
question relates to what an event similar to the attack on the World Trade Center would do in a
setting with a different energy network configuration. This area also refers to strategies for
resilient networks and complex adaptive systems, such as energy, communications, water, and
others.

9.A. The World Trade Center and other cases can be used to understand what worked and
why.

10. Overarching ‘tools’ for making risk-informed decisions. This includes databases of networks,
models and processes. The main research question is how models of structures, networks, and
processes can be integrated into risk models and risk management.

11. Fragility curves for organizations collapse. This area of research refers to the application of
models from physical systems to organizations. An example could be how organizations perform
under different levels of stress.

12. Interdependencies between and among infrastructure systems.

13. Cost/consequence models. Issues related to costs and benefits should be considered for
normal and extreme events, as well as for response efforts.

14. Damage/update/reanalyze for real performance deterioration.

9.9. Redo Logging

A transaction on the current database transforms it from the current state to a new state. This is the so-called DO operation. The undo and redo operations are functions of the recovery subsystem of the database system used in the recovery process. The undo operation undoes or reverses the actions (possibly partially executed) of a transaction and restores the database to the state that existed before the start of the transaction. The redo operation redoes the actions of a transaction and restores the database to the state it would be in at the end of the transaction. The undo operation is also called into play when a transaction decides to terminate itself.

The undo and redo operations for a given transaction are required to be idempotent; that is, for any state of the database reached as a result of a transaction, performing one of these operations once is equivalent to performing it any number of times. Thus:
undo(any action) = undo(undo(..undo(any action)..))
redo(any action) = redo(redo(..redo(any action)..))

The reason for the requirement that undo and redo be idempotent is that the recovery process, while redoing the actions of a transaction, may itself fail without a trace, and this type of failure can occur any number of times before the recovery is completed successfully.

9.10. Undo/Redo Logging

A transaction that discovers an error while it is in progress, and consequently needs to abort itself and roll back any changes made by it, uses the transaction undo operation, which removes all database changes, partial or otherwise, made by the transaction.

Redo
Redo involves performing the changes made by a transaction that committed before a system crash. With the write-ahead log strategy, a committed transaction implies that the log records for the transaction have already reached stable storage. Since the redo operation is idempotent, it is safe to redo the partial or complete modifications made by such a transaction.

Undo
Transactions that are partially complete at the time of a system crash with loss of volatile storage need to have their changes undone. The global undo operation, initiated by the recovery system, involves undoing the partial or complete updates made by all transactions that were uncommitted at the time of the failure.

The Redo-Logging Rule


The order in which material associated with one transaction gets written to disk:
1. The log records indicating changed database elements
2. The "COMMIT" log record
3. The changed database elements themselves
Example

Step  Action        Log
 1                  <START T>
 2    READ(A,t)
 3    t := t*2
 4    WRITE(A,t)    <T,A,16>
 5    READ(B,t)
 6    t := t*2
 7    WRITE(B,t)    <T,B,16>
 8                  <COMMIT T>
 9    FLUSH LOG
10    OUTPUT(A)
11    OUTPUT(B)

(The original table also traced the value of t and the memory and disk copies of A and B in columns M-A, M-B, D-A and D-B; those values are not reproduced here.)

9.11. Protecting Against Media Failures

Transaction
A sequence of database operations that has the ACID properties.

Syntax in ESQL/C
Start: most (but not all) SQL statements start a transaction. Statements that do not initiate a transaction include Connect, Disconnect, Set, Commit, Rollback, Declare, Get Diagnostics, ...
End: COMMIT WORK or ROLLBACK WORK.
Commit indicates the successful end of a transaction; Rollback indicates abnormal termination of a transaction.
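A minimal sketch of these transaction boundaries (the account table and its columns are illustrative, not from the text):

-- transfer $100 between two accounts as one transaction
UPDATE account SET balance = balance - 100 WHERE acc_no = 1;
UPDATE account SET balance = balance + 100 WHERE acc_no = 2;
COMMIT WORK;      -- success: make both updates permanent
-- on an error the application would instead issue:
-- ROLLBACK WORK; -- abnormal end: undo everything since the transaction began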

ACID Properties
Atomicity: either all actions in a transaction occur successfully or nothing has happened (the all-or-nothing property).
Consistency: any successful transaction commits only legal results; a transaction is a correct transformation of the state, i.e., from one valid state to another valid state.
Isolation: events within a transaction must be hidden from other transactions running concurrently; the actions carried out by a transaction against a shared database cannot become visible to other transactions until the transaction commits.
Durability: once a transaction has completed and committed, the system must guarantee that its results survive any subsequent failures.

Failure Modes
Transaction failure: a transaction aborts; needs transaction rollback.
System failure: loss or corruption of volatile storage (main memory), e.g. power outage, OS failure, ...; needs a system restart.
Media (catastrophic) failure: any part of the stable storage (disk) is destroyed, e.g. head crash, disk controller error, ...; needs roll-forward.

The Primitive Operations of Transactions


How transactions interact with databases: there are three address spaces involved:
• the space of disk blocks holding the database elements,
• the virtual or main memory address space managed by the buffer manager, and
• the local address space of the transaction.
* Transactions don't access the disk holding the database elements directly.

INPUT(X): Copy the disk block containing database element X to a memory buffer
READ(X,t): Copy the database element X to the transaction’s local variable t
If the block containing database element X is not in memory buffer, then first execute INPUT(X)
WRITE(X,t): Copy the value of local variable t to database element X in a memory buffer
OUTPUT(X): Copy the buffer containing X to disk.

Recovery Techniques
This is a very complex area with no formal (mathematical) model of recovery. Implementations and techniques are completely dependent on other features (concurrency control, disk management, buffer management, index management, etc.) of a particular system, and much of the work did not get documented well.

Shadowing Approach
A logical page is read from a physical page P (the shadow version) and, after modification, is written to another physical page P' (the current version).
During a checkpoint, the shadow versions are discarded and the current versions become the shadow versions.
On failure, recovery is performed with the log and the shadow versions.
UNDO is very simple (+)
A lot of disk space is needed (-)
Hard to cluster pages on disk (-)
Hard to support record-level locking (-)
Not adopted in modern commercial systems.

Logging Approach
In-place update in buffer and disk. All updates are logged in a “linear file” called log. Outperform
shadowing in general. Widely used in various systems.

Log Concept
A history of all changes to the state
Log + old state gives new state
Log + new state gives old state
Log is a sequential file
Complete log is the complete history

DO-REDO-UNDO
Redo proceeds forward in the log (FIFO) while undo proceeds backward (LIFO).
DO:   old state + log record -> new state
REDO: old state + log record -> new state
UNDO: new state + log record -> old state

Unit 10
Concurrency Control

10.1 Serial and Serializable Schedules
10.2 Conflict-Serializability
10.3 Enforcing Serializability by Locks
10.4 Locking Systems With Several Lock Modes
10.5 Architecture for a Locking Scheduler
10.6 Managing Hierarchies of Database Elements
10.7 Concurrency Control by Timestamps
10.8 Concurrency Control by Validation

10.1. Serial and Serializable Schedules

When two or more transactions are running concurrently, the steps of the transactions would
normally be interleaved. The interleaved execution of transactions is decided by the database
scheduler, which receives a stream of user requests that arise from the active transactions. A
particular sequencing (usually interleaved) of the actions of a set of transactions is called a
schedule. A serial schedule is a schedule in which all the operations of one transaction are
completed before another transaction can begin (that is, there is no interleaving).

Serial execution means no overlap of transaction operations.


If T1 and T2 transactions are executed serially:

RT1(X) WT1(X) RT1(Y) WT1(Y) RT2(X) WT2(X)


or
RT2(X) WT2(X) RT1(X) WT1(X) RT1(Y) WT1(Y)
The database is left in a consistent state.

The basic idea:

Each transaction leaves the database in a consistent state if run individually


If the transactions are run one after the other, then the database will be consistent
For the first schedule:

Database           T1                  T2
x=100, y=50        read(x)  [x=100]
                   x:=x*5   [x=500]
x=500, y=50        write(x)
                   read(y)  [y=50]
                   y:=y-5   [y=45]
x=500, y=45        write(y)
                                       read(x)  [x=500]
                                       x:=x+8   [x=508]
x=508, y=45                            write(x)

For the second schedule:


Database           T1                  T2
x=100, y=50                            read(x)  [x=100]
                                       x:=x+8   [x=108]
x=108, y=50                            write(x)
                   read(x)  [x=108]
                   x:=x*5   [x=540]
x=540, y=50        write(x)
                   read(y)  [y=50]
                   y:=y-5   [y=45]
x=540, y=45        write(y)

Serializable Schedules

Let T be a set of n transactions T1, T2, ..., Tn . If the n transactions are executed serially (call
this execution S), we assume they terminate properly and leave the database in a consistent
state. A concurrent execution of the n transactions in T (call this execution C) is called
serializable if the execution is computationally equivalent to a serial execution. There may be
more than one such serial execution. That is, the concurrent execution C always produces
exactly the same effect on the database as some serial execution S does. (Note that S is some
serial execution of T, not necessarily the order T1, T2, ..., Tn ). A serial schedule is always
correct since we assume transactions do not depend on each other and furthermore, we
assume, that each transaction when run in isolation transforms a consistent database into a
new consistent state and therefore a set of transactions executed one at a time (i.e. serially)
must also be correct.

Example

1. Given the following schedule, draw a serialization (or precedence) graph and find if the
schedule is serializable.

Solution:

There is a simple technique for testing a given schedule S for serializability. The testing is based on constructing a directed graph in which each of the transactions is represented by one node, and an edge from Ti to Tj exists if any of the following conflicting operations appear in the schedule:

• Ti executes WRITE(X) before Tj executes READ(X), or
• Ti executes READ(X) before Tj executes WRITE(X), or
• Ti executes WRITE(X) before Tj executes WRITE(X).

If the graph has a cycle, the schedule is not serializable.

10.2. Conflict-Serializability

• Two operations conflict if:


o they are issued by different transactions,
o they operate on the same data item, and
o at least one of them is a write operation

• Two executions are conflict-equivalent, if in both executions all conflicting operations have
the same order

• An execution is conflict-serializable if it is conflict-equivalent to a serial history

Conflict graph

Execution is conflict-serializable iff the conflict graph is acyclic


Example: a schedule over transactions T1, T2 and T3:
W1(a) R2(a) R3(b) W2(c) R3(c) W3(b) R1(b)

Example

Schedule Conflict Graph


Nodes: transactions
Directed edges: conflicts between operations

Serializability (examples)
• H1: w1(x,1), w2(x,2), w3(x,3), w2(y,1),r1(y)
• H1 is view-serializable, since it is view- equivalent to H2 below:
o H2: w2(x,2), w2(y,1), w1(x,1), r1(y), w3(x,3)
• However, H1 is not conflict-serializable, since its conflict graph contains a cycle: w1(x,1)
occurs before w2(x,2), but w2(x,2), w2(y,1) occurs before r1(y)
• No serial schedule that is conflict-equivalent to H1 exists

Execution Order vs. Serialization Order


• Consider the schedule H3 below: H3: w1(x,1), r2(x), c2, w3(y,1), C3, r1(y), C1
• H3 is conflict equivalent to a serial execution in which T3 is followed by T1, followed by T2
• This is despite the fact that T2 was executed completely and committed, before T3 even
started

Recoverability of a Schedule

• A transaction T1 reads from transaction T2, if T1 reads a value of a data item that was
written into the database by T2
• A schedule H is recoverable, iff no transaction in H is committed, before every transaction
it read from is committed
• The schedule below is serializable, but not recoverable: H4: r1(x), w1(x), r2(x), w2(y) C2, C1

Cascadelessness of a Schedule

• A schedule H is cascadeless (avoids cascading aborts), iff no transaction in H reads a value
that was written by an uncommitted transaction
• The schedule below is recoverable, but not cascadeless: H4: r1(x), w1(x), r2(x), C1, w2(y)
C2

Strictness of a Schedule

• A schedule H is strict if it is cascadeless and no transaction in H writes a value that was


previously written by an uncommitted transaction
• The schedule below is cascadeless, but not strict: H5: r1(x), w1(x), r2(y), w2(x), C1, C2
• Strictness permits the recovery from before images logs

Strong Recoverability of a Schedule

• A schedule H is strongly recoverable, iff for every pair of transactions in H, their


commitment order is the same as the order of their conflicting operations.
• The schedule below is strict, but not strongly recoverable: H6: r1(x) w2(x) C2 C1

Rigorousness of a Schedule

• A schedule H is rigorous if it is strict and no transaction in H writes a data item until all transactions that previously read this item either commit or abort
• The schedule below is strongly recoverable, but not rigorous: H7: r1(x) w2(X) C1 C2
• A rigorous schedule is serializable and has all properties defined above

10.3. Enforcing Serializability by Locks

Database servers support transactions: sequences of actions that are either all processed or none
at all, i.e. atomic. To allow multiple concurrent transactions access to the same data,
most database servers use a two-phase locking protocol. Each transaction locks sections of the
data that it reads or updates to prevent others from seeing its uncommitted changes. Only when
the transaction is committed or rolled back can the locks be released. This was one of the earliest
methods of concurrency control, and is used by most database systems.
Transactions should be isolated from other transactions. The SQL standard's default isolation
level is serialisable. This means that a transaction should appear to run alone and it should not
see changes made by others while they are running. Database servers that use two-phase locking
typically have to reduce their default isolation level to read committed because running a
transaction as serialisable would mean they'd need to lock entire tables to ensure the data
remained consistent, and such table-locking would block all other users on the server. So
transaction isolation is often traded for concurrency. But losing transaction isolation has
implications for the integrity of your data. For example, if we start a transaction to read the
amounts in a ledger table without isolation, any totals calculated would include amounts
updated, inserted or deleted by other users during our reading of the rows, giving an unstable
result.
Database research in the early 1980s discovered a better way of allowing concurrent access to data. Storing multiple versions of rows would allow transactions to see a stable snapshot of the
data. It had the advantage of allowing isolated transactions without the drawback of locks. While
one transaction was reading a row, another could be updating the row by creating a new
version. This solution at the time was thought to be impractical: storage space was expensive,
memory was small, and storing multiple copies of the data seemed unthinkable.
Of course, Moore's Law has meant that disk space is now inexpensive and memory sizes have
dramatically increased. This, together with improvements in processor power, has meant that
today we can easily store multiple versions and gain the benefits of high concurrency and
transaction isolation without locking.

Unfortunately the locking protocols of popular database systems, many of which were designed well over a decade ago, form the core of those systems, and replacing them seems to have been impossible, despite recent research again finding that storing multiple versions is better than a single version with locks.
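A minimal SQL sketch of lock-based isolation in this style (the account table and its columns are illustrative; SELECT ... FOR UPDATE asks the server to lock the selected rows until the transaction ends):

-- the selected row is locked, so concurrent transactions that want to
-- update it must wait until this transaction commits or rolls back
SELECT balance FROM account WHERE acc_no = 1 FOR UPDATE;
UPDATE account SET balance = balance - 100 WHERE acc_no = 1;
COMMIT;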

10.4. Locking Systems With Several Lock Modes

Several object-oriented databases, which were more recently developed, have incorporated OCC (optimistic concurrency control) within their designs to gain the performance advantages inherent within this technological approach.
Though optimistic methods were originally developed for transaction management the concept is
equally applicable for more general problems of sharing resources and data. The methods have
been incorporated into several recently developed Operating Systems, and many of the newer
hardware architectures provide instructions to support and simplify the implementation of these
methods.
Optimistic Concurrency Control does not involve any locking of rows as such, and therefore
cannot involve any deadlocks. Instead it works by dividing the transaction into phases.

• Build-up commences the start of the transaction. When a transaction is


started a consistent view of the database is frozen based on the state after
the last committed transaction. This means that the application will see this
consistent view of the database during the entire transaction. This is
accomplished by the use of an internal Transaction Cache, which contains
information about all ongoing transactions in the system. The application
"sees" the database through the Transaction Cache. During the Build-up
phase the system also builds up a Read Set documenting the accesses to
the database, and a Write Set of changes to be made, but does not apply
any of these changes to the database. The Build-up phase ends with the
calling of the COMMIT command by the application.
• The Commit involves using the Read Set and the Transaction Cache to
detect access conflicts with other transactions. A conflict occurs when
another transaction alters data in a way that would alter the contents of the
Read Set for the transaction that is checked. Other transactions that were
committed during the checked transaction's Build-up phase or during this
check phase can cause a conflict. If a transaction conflict is detected, the
checked transaction is aborted. No rollback is necessary, as no changes
have been made to the database. An error code is returned to the
application, which can then take appropriate action. Often this will be to
retry the transaction without the user being aware of the conflict.
• If no conflicts are detected the operations in the Write Set for the
transaction are moved to another structure, called the Commit Set that is
to be secured on disk. All operations for one transaction are stored on the
same page in the Commit Set (if the transaction is not very large). Before
the operations in the Commit Set are secured on permanent storage, the
system checks if there is any other committed transactions that can be
stored on the same page in the Commit Set. After this, all transactions
stored on the Commit Set page are written to disk (to the transaction
databank TRANSDB) in one single I/O operation. This behavior is called a
Group Commit, which means that several transactions are secured
simultaneously. When the Commit Set has been secured on disk (in one I/O
operation), the application is informed of the success of the COMMIT
command and can resume its operations.
• During the Apply phase the changes are applied to the database, i.e. the
databanks and the shadows are updated. The Background threads in the
Database Server carry out this phase. Even though the changes are applied
in the background, the transaction changes are visible to all applications
through the Transaction Cache. Once this phase is finished the transaction

is fully complete. If there is any kind of hardware failure that means that
SQL is unable to complete this phase, it is automatically restarted as soon
as the cause of the failure is corrected.

Most other DBMSs offer pessimistic concurrency control. This type of concurrency control protects a user's reads and updates by acquiring locks on rows (or possibly database pages, depending on the implementation). This leads to applications becoming 'contention bound', with performance limited by other transactions. These locks may force other users to wait if they try to access the
locked items. The user that 'owns' the locks will usually complete their work, committing the
transaction and thereby freeing the locks so that the waiting users can compete to attempt to
acquire the locks.
Optimistic Concurrency Control (OCC) offers a number of distinct advantages including:
• Complicated locking overhead is completely eliminated. Scalability is affected in locking
systems as many simultaneous users cause locking graph traversal costs to escalate.
• Deadlocks cannot occur, so the performance overheads of deadlock detection are avoided
as well as the need for possible system administrator intervention to resolve them.
• Programming is simplified as transaction aborts only occur at the Commit command
whereas deadlocks can occur at any point during a transaction. Also it is not necessary for
the programmer to take any action to avoid the potentially catastrophic effects of
deadlocks, such as carrying out database accesses in a particular order. This is
particularly important as potential deadlock situations are rarely detected in testing, and
are only discovered when systems go live.
• Data cannot be left inaccessible to other users as a result of a user taking a break or being
excessively slow in responding to prompts. Locking systems leave locks set in these
circumstances denying other users access to the data.
• Data cannot be left inaccessible as a result of client processes failing or losing their
connections to the server.
• Delays caused by locking systems being overly cautious are avoided. This can arise as a
result of larger than necessary lock granularity, but there are also several other
circumstances when locking causes unnecessary delays even when using fine granularity
locking.
• Removes the problems associated with the use of ad-hoc tools.
• Through the Group Commit concept, which is applied in SQL, the number of I/Os needed
to secure committed transactions to the disk is reduced to a minimum. The actual updates
to the database are performed in the background, allowing the originating application to
continue.
• The ROLLBACK statement is supported but, because nothing is written to the actual
database during the transaction Build-up phase, this involves only a re-initialization of
structures used by the transaction control system.
• Another significant transaction feature in SQL is the concept of Read-Only transactions,
which can be used for transactions that only perform read operations to the database.
When performing a Read-Only transaction, the application will always see a consistent
view of the database. Since consistency is guaranteed during a Read-Only transaction no
transaction check is needed and internal structures used to perform transaction checks
(i.e. the Read Set) is not needed, and for this reason no Read Set is established for a Read-
Only transaction. This has significant positive effects on performance for these
transactions. This means that a Read-Only transaction always succeeds, unaffected of
changes performed by other transactions. A Read-Only transaction also never disturbs any
other transactions going on in the system. For example, a complicated long-running query
can execute in parallel with OLTP transactions.

10.5. Architecture for a Locking Scheduler

Architecture Features
• Memory Usage
• Shared Memory

File system
• Page Replacement Problems
• Page eviction
• Simplistic NRU replacement
• Clock algorithm can evict accessed pages
• Sub-optimal reaction to variable load or load spikes after inactivity


• Improvements:
• Finer-grained SMP Locking
• Unification of buffer and page caches
• Support for larger memory configurations
• SYSV shared memory code replaced
• Page aging reintroduced
• Active & inactive page lists
• Optimized page flushing
• Controlled background page aging
• Aggressive read ahead

SMP locking optimizations, Use of global “kernel_lock” was minimized. More subsystem based
spinlock are used. More spinlocks embedded in data structures.
Semaphores used to serialize address space access.
More of a spinlock hierarchy established. Spinlock granularity tradeoffs.

Kernel multi-threading improvements

Multiple threads can access address space data structures simultaneously.


Single mem->msem semaphore was replaced with multiple reader/single writer semaphore.
Reader lock is now acquired for reading per address space data structures.
Exclusive write lock is acquired when altering per address space data structures.
32 bit UIDs and GIDs
Increase from 16 to 32 bit UIDs allow up to 4.2 billion users.
Increase from 16 to 32 bit GIDs allow up to 4.2 billion groups.
64 bit virtual address space, Architectural limit of the virtual address space was
expanded to a full 64 bits.
IA64 currently implements 51 bits (16 disjoint 47 bit regions)
Alpha currently implements 43 bits (2 disjoint 42 bit regions)
S/390 currently implements 42 bits
Future Alpha is expanded to 48 bits (2 disjoint 47 bit regions)
Unified file system cache
Single page cache was unified from previous
Page cache read write functionality. Reduces memory consumption by eliminating double buffered
copies of file system data.
Eliminates overhead of searching two levels of data cache.

10.6. Managing Hierarchies of Database Elements

These storage types are sometimes called the storage hierarchy. It consists of the archival database, the physical database, the archival log, and the current log.
Physical database: the online copy of the database that is stored in nonvolatile storage and used by all active transactions.

Current database: the current version of the database, made up of the physical database plus the modifications implied by the buffers in volatile storage.
(The original figure showed the storage hierarchy: application programs with their data buffers and log buffers in volatile storage; the physical database and the current log, written at checkpoints, on nonvolatile and stable storage; and the archive copy of the database together with the archive log on stable storage.)

Archival database in stable storage: this is the copy of the database at a given time. It contains the entire database in a quiescent mode and could have been made by a simple dump routine that dumps the physical database onto stable storage. In a global recovery, all transactions that have been executed on the database since the time of archiving have to be redone: the archival database is a copy of the database in a quiescent state, and only the committed transactions since the time of archiving are applied to it.
Current log: the log information required for recovery from system failures involving loss of volatile information.
Archival log: used for failures involving loss of nonvolatile information.
The online or current database is made up of all the records that are accessible to the DBMS during its operation. The current database consists of the data stored in nonvolatile storage plus the updates in volatile storage that have not yet been propagated to nonvolatile storage.

10.7. Concurrency Control by Timestamps

One of the important properties of transactions is that their effect on shared data is serially equivalent. This means that any data touched by a set of transactions must be in such a state that the results could have been obtained if all the transactions had executed serially (one after another) in some order (it does not matter which). What is invalid is for the data to be in some form that cannot be the result of serial execution (e.g. two transactions modifying data concurrently).

One easy way of achieving this guarantee is to ensure that only one transaction executes at a
time. We can accomplish this by using mutual exclusion and having a “transaction” resource that
each transaction must have access to. However, this is usually overkill and does not allow us to
take advantage of the concurrency that we may get in distributed systems (for instance, it is
obviously
overkill if two transactions don’t even access the same data). What we would like to do is allow
multiple transactions to execute simultaneously but keep them out of each other’s way and
ensure serializability. This is called concurrency control.

Locking

We can use exclusive locks on a resource to serialize execution of transactions that share
resources. A transaction locks an object that it is about to use. If another transaction requests the
same object and it is locked, the transaction must wait until the object is unlocked.
To implement this in a distributed system, we rely on a lock manager - a server that issues locks
on resources. This is exactly the same as a centralized mutual exclusion server: a client can
request a lock and then send a message releasing a lock on a resource (by resource in this
context, we mean some specific block of data that may be read or written). One thing to watch out
for, is that we still need to preserve serial execution: if two transactions are accessing the same
set of objects, the results must be the same as if the transactions executed in some order
(transaction A cannot modify some data while transaction B modifies some other data and then
transaction A accesses that
modified data -- this is concurrent modification). To ensure serial ordering on resource access, we
impose a restriction that states that a transaction is not allowed to get any new locks after it has
released a lock. This is known as two-phase locking. The first phase of the transaction is a
growing phase in which it acquires the locks it needs. The second phase is the shrinking phase
where locks are released.

Strict two-phase locking

A problem with two-phase locking is that if a transaction aborts, some other transaction may have
already used data from an object that the aborted transaction modified and then unlocked. If this
happens, any such transactions will also have to be aborted. This situation is known as
cascading aborts. To avoid this, we can strengthen our locking by requiring that a transaction
will hold all its locks to the very end: until it commits or aborts rather than releasing the lock
when the object is no longer needed. This is known as strict two-phase locking.

Locking granularity

A typical system will have many objects and typically a transaction will access only a small
amount of data at any given time (and it will frequently be the case that a transaction will not
clash with other transactions). The granularity of locking affects the amount of concurrency we
can achieve. If we
can have a smaller granularity (lock smaller objects or pieces of objects) then we can generally
achieve higher concurrency. For example, suppose that all of a bank’s customers are locked for
any transaction that needs to modify a single customer datum: concurrency is severely limited
because any other transactions that need to access any customer data will be blocked. If,
however, we use a customer record as the granularity of locking, transactions that access
different customer records will be capable of running concurrently.

Multiple readers/single writer

There is no harm having multiple transactions read from the same object as long as it has not
been modified by any of the transactions. This way we can increase concurrency by having
multiple transactions run concurrently if they are only reading from an object. However, only one
transaction should be allowed to write to an object. Once a transaction has modified an object, no
other transactions should be allowed to read or write the modified object. To support this, we now
use two locks: read locks and write locks. Read locks are also known as shared locks (since they
can be shared by multiple transactions) If a transaction needs to read an object, it will request a
read lock from the lock manager. If a transaction needs to modify an object, it will request a write
lock from the lock manager. If the lock manager cannot grant a lock, then the transaction will
wait until it can
get the lock (after the transaction with the lock committed or aborted). To summarize lock
granting:
If a transaction has:      another transaction may obtain:
no locks                   read lock or write lock
read lock                  read lock (must wait for a write lock)
write lock                 must wait for read or write locks

Increasing concurrency: two-version locking

Two-version locking is an optimistic concurrency control scheme that allows one transaction to
write tentative versions of objects while other transactions read from committed versions of the
same objects. Read operations only wait if another transaction is currently committing the same
object. This scheme allows more concurrency than read-write locks, but writing transactions risk
waiting (or rejection) when they attempt to commit. Transactions cannot commit their write
operations immediately if other uncommitted transactions have read the same objects.
Transactions that request to commit in this situation have to wait until the reading transactions
have completed.

Two-version locking

The two-version locking scheme requires three types of locks: read, write, and commit locks.
Before an object is read, a transaction must obtain a read lock; before an object is written, it must
obtain a write lock (as with two-phase locking). Neither of these locks will be granted if there is a
commit lock on the object. When the transaction is ready to commit:
- all of the transaction's write locks are changed to commit locks;
- if any objects used by the transaction have outstanding read locks, the transaction must wait
until the transactions that set these locks have completed and the locks are released.
Comparing the performance of two-version locking with strict two-phase locking (read/write locks):
- read operations in two-version locking are delayed only while transactions are being committed,
rather than during the entire execution of transactions (the commit protocol usually takes far less
time than the transaction itself);
- but read operations of one transaction can delay the committing of other transactions.

Problems with locking

Locks are not without drawbacks. Locks carry overhead: a lock manager is needed to keep track of
them, and there is overhead in requesting them; even read-only operations must still request locks.
The use of locks can result in deadlock, so we need software in place to detect or avoid deadlock.
Locks can also decrease the potential concurrency in a system by forcing a transaction to hold its
locks for the duration of the transaction (until a commit or abort).

Optimistic concurrency control

Kung and Robinson (1981) proposed an alternative technique for achieving concurrency control,
called optimistic concurrency control. It is based on the observation that, in most applications,
the chance of two transactions accessing the same object is low. We therefore allow transactions
to proceed as if there were no possibility of conflict with other transactions: a transaction does not
have to obtain or check for locks. This is the working phase. Each transaction keeps a tentative
version (private workspace) of the objects it updates, a copy of the most recently committed
version; write operations record new values as tentative values. Before a transaction can commit,
a validation is performed on all the data items to see whether the data conflicts with operations of
other transactions. This is the validation phase. If validation fails, the transaction is aborted and
restarted later. If it succeeds, the changes in the tentative version are made permanent. This is the
update phase. Optimistic control is deadlock free and allows maximum parallelism (at the expense
of possibly restarting transactions).
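
A toy Python sketch of the three phases follows, assuming a simple version-number check during validation; the class and field names are illustrative assumptions, not the full published algorithm.

class OptimisticTxn:
    def __init__(self, store):
        self.store = store                 # shared dict: key -> (version, value)
        self.read_versions = {}            # key -> version seen in the working phase
        self.tentative = {}                # private workspace of tentative updates

    def read(self, key):
        version, value = self.store[key]
        self.read_versions[key] = version
        return self.tentative.get(key, value)

    def write(self, key, value):           # working phase: record a tentative value only
        self.tentative[key] = value

    def commit(self):
        # Validation phase: every item we read must still be at the version we saw.
        for key, seen in self.read_versions.items():
            if self.store[key][0] != seen:
                return False                # conflict: abort and restart later
        # Update phase: install tentative values and bump versions.
        for key, value in self.tentative.items():
            version = self.store[key][0] if key in self.store else 0
            self.store[key] = (version + 1, value)
        return True

if __name__ == "__main__":
    db = {"A": (0, 100)}
    t = OptimisticTxn(db)
    t.write("A", t.read("A") + 50)
    print(t.commit(), db)                   # True, A updated to 150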

Timestamp ordering

Reed presented another approach to concurrency control in 1983, called timestamp ordering.
Each transaction is assigned a unique timestamp when it begins (from a physical or logical clock).
Each object in the system has a read timestamp and a write timestamp associated with it (two
timestamps per object). The read timestamp is the timestamp of the last committed transaction
that read the object; the write timestamp is the timestamp of the last committed transaction that
modified the object (note that these are taken from the transaction timestamp, i.e., the start of
that transaction). The rule of timestamp ordering is:
- if a transaction wants to write an object, it compares its own timestamp with the object's read
and write timestamps; if the object's timestamps are older, then the ordering is good.
- if a transaction wants to read an object, it compares its own timestamp with the object's write
timestamp; if the object's write timestamp is older than the current transaction, then the ordering
is good.
If a transaction attempts to access an object and does not detect proper ordering, the transaction
is aborted and restarted (improper ordering means that a newer transaction came in and modified
the data before the older one could access it, or read data that the older one wants to modify).
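
A small Python sketch of the basic timestamp-ordering rule follows; the object bookkeeping and the abort policy are simplified assumptions made for this example.

class TSObject:
    def __init__(self):
        self.read_ts = 0    # timestamp of the last committed reader
        self.write_ts = 0   # timestamp of the last committed writer

def try_read(txn_ts, obj):
    if txn_ts < obj.write_ts:
        raise RuntimeError("abort: a newer transaction already wrote this object")
    obj.read_ts = max(obj.read_ts, txn_ts)

def try_write(txn_ts, obj):
    if txn_ts < obj.read_ts or txn_ts < obj.write_ts:
        raise RuntimeError("abort: a newer transaction already read or wrote this object")
    obj.write_ts = txn_ts

if __name__ == "__main__":
    x = TSObject()
    try_read(5, x)      # ok: the object's timestamps are older
    try_write(7, x)     # ok
    try:
        try_write(6, x) # aborts: transaction 7 already wrote x
    except RuntimeError as e:
        print(e)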

10.8. Concurrency Control by Validation

Validation (or certification) techniques let a transaction proceed without waiting, applying all
updates to local copies. At the end, a validation phase checks whether any updates violate
serializability. If the transaction is certified, it is committed and its updates are made permanent;
if not, it is aborted and restarted later.
There are three phases:
read phase
validation phase
write phase

Validation Test

Each transaction T is associated with three timestamps:

start(T): when T started
val(T): when T finished its read phase and started its validation
finish(T): when T finished its write phase

Validation test for Ti: for each Tj that is committed or in its validation phase, at least one of the
following must hold:
1. finish(Tj) < start(Ti)
2. writeset(Tj) ∩ readset(Ti) = ∅ and finish(Tj) < val(Ti)
3. writeset(Tj) ∩ readset(Ti) = ∅, writeset(Tj) ∩ writeset(Ti) = ∅, and val(Tj) < val(Ti)
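
Written out directly in Python, the test above looks roughly like this; the dictionary layout for a transaction's timestamps and read/write sets is an assumption made for the example.

def validates(ti, committed_or_validating):
    """Return True if Ti passes the validation test against every Tj."""
    for tj in committed_or_validating:
        if tj["finish"] is not None and tj["finish"] < ti["start"]:
            continue                                        # condition 1
        if (not (tj["writeset"] & ti["readset"])
                and tj["finish"] is not None
                and tj["finish"] < ti["val"]):
            continue                                        # condition 2
        if (not (tj["writeset"] & ti["readset"])
                and not (tj["writeset"] & ti["writeset"])
                and tj["val"] < ti["val"]):
            continue                                        # condition 3
        return False                                        # Ti must abort
    return True

if __name__ == "__main__":
    t1 = {"start": 1, "val": 5, "finish": 6,
          "readset": {"x"}, "writeset": {"x"}}
    t2 = {"start": 2, "val": 7, "finish": None,
          "readset": {"x"}, "writeset": {"y"}}
    print(validates(t2, [t1]))   # False: T1 wrote x, which T2 read, and T1 did
                                 # not finish before T2 started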

Unit 11
More About Transaction Management

11.1 Introduction of Transaction Management

11.2 Serializability and Recoverability
11.3 View Serializability
11.4 Resolving Deadlocks
11.5 Distributed Databases
11.6 Distributed Commit
11.7 Distributed Locking

11.1 Introduction of Transaction Management

The synchronization primitives we have seen so far are not as high-level as we might want them to
be since they require programmers to explicitly synchronize, avoid deadlocks, and abort if
necessary. Moreover, the high-level constructs such as monitors and path expressions do not give
users of shared objects flexibility in defining the unit of atomicity. We will study here a high-level
technique, called concurrency control, which automatically ensures that concurrently interacting
users do not execute inconsistent commands on shared objects. A variety of concurrency models
defining different notions of consistency have been proposed. These models have been developed
in the context of database management systems, operating systems, CAD tools, collaborative
software engineering, and collaboration systems. We will focus here on the classical database
models and the relatively newer operating system models.
Transaction processing is a type of computer processing in which the computer responds
immediately to user requests; each request is considered to be a transaction. Automatic teller
machines for banks are an example of transaction processing.
The opposite of transaction processing is batch processing, in which a batch of requests is stored
and then executed all at one time. Transaction processing requires interaction with a user,
whereas batch processing can take place without a user being present.

The RDBMS must be able to support a centralized warehouse containing detail data, provide
direct access for all users, and enable heavy-duty, ad hoc analysis. Yet, for many companies just
starting a warehouse project, it seems a natural choice to simply use the corporate standard
database that has already proven itself for mission-critical work. This approach was especially
common in the early days of data warehousing, when most people expected a warehouse to do
little more than provide canned reports.

But decision-support requirements have evolved far beyond canned reports and known queries.
Today's data warehouses must give organizations the in-depth and accurate information they
need to personalize customer interactions at all touch points and convert browsers to buyers. An
RDBMS designed for transaction processing can't keep up with the demands placed on data
warehouses: support for high concurrency, mixed-workload, detail data, fast query response, fast
data load, ad hoc queries, and high-volume data mining.

The notion of concurrency control is closely tied to the notion of a ``transaction''. A transaction
defines a set of ``indivisible'' steps, that is, commands with the Atomicity, Consistency, Isolation,
and Durability (ACID) properties:

Atomicity: Either all or none of the steps of the transaction occur so that the invariants of the
shared objects are maintained. A transaction is typically aborted by the system in response to
failures but it may be aborted also by a user to ``undo'' the actions. In either case, the user is
informed about the success or failure of the transaction.
Consistency: A transaction takes a shared object from one legal state to another, that is,
maintains the invariant of the shared object.


Isolation: Events within a transaction are hidden from other concurrently executing transactions.
Techniques for achieving isolation are called synchronization schemes. They determine how
transactions are scheduled, that is, what the relationships are between the times at which the
different steps of these transactions execute. Isolation is required to ensure that concurrent
transactions do not cause an illegal state in the shared object and to prevent cascaded rollbacks
when a transaction aborts.

Durability: Once the system tells the user that a transaction has completed successfully, it
ensures that values written by the database system persist until they are explicitly overwritten by
other transactions.

Consider the schedules S1, S2, S3, S4 and S5 given below. Draw the precedence graph for each
schedule and state whether each schedule is (conflict) serializable or not. If a schedule is
serializable, write down the equivalent serial schedule(s). A programmatic sketch for building such
precedence graphs follows the schedules.

S1 : read1(X), read3(X), write1(X), read2(X), write3(X).


S2 : read1(X), read3(X), write3(X), write1(X), read2(X).
S3 : read3(X), read2(X), write3(X), read1(X), write1(X).
S4 : read3(X), read2(X), read1(X), write3(X), write1(X).
S5 : read2(X), write3(X), read2(Y), write4(Y), write3(Z), read1(Z), write4(X), read1(X),
write2(Y), read1(Y).
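
The following Python sketch builds the precedence graph for a schedule and tests it for a cycle, which is the usual way to decide conflict serializability; the tuple encoding of operations is an assumption made for this example.

from itertools import combinations

def precedence_graph(schedule):
    """schedule: ordered list of (txn, op, item), e.g. (1, 'r', 'X') for read1(X)."""
    edges = set()
    for (t1, op1, x1), (t2, op2, x2) in combinations(schedule, 2):
        # A conflict is two operations of different transactions on the same
        # item, at least one of which is a write; the edge goes earlier -> later.
        if t1 != t2 and x1 == x2 and "w" in (op1, op2):
            edges.add((t1, t2))
    return edges

def has_cycle(edges):
    nodes = {n for e in edges for n in e}
    color = {n: 0 for n in nodes}          # 0 = unvisited, 1 = on stack, 2 = done
    def dfs(u):
        color[u] = 1
        for (a, b) in edges:
            if a == u and (color[b] == 1 or (color[b] == 0 and dfs(b))):
                return True
        color[u] = 2
        return False
    return any(color[n] == 0 and dfs(n) for n in nodes)

if __name__ == "__main__":
    # S1 : read1(X), read3(X), write1(X), read2(X), write3(X)
    s1 = [(1, "r", "X"), (3, "r", "X"), (1, "w", "X"), (2, "r", "X"), (3, "w", "X")]
    g = precedence_graph(s1)
    print(sorted(g), "conflict serializable:", not has_cycle(g))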

11.2 Serializability and Recoverability

Serializability is the classical concurrency scheme. It ensures that a schedule for executing
concurrent transactions is equivalent to one that executes the transactions serially in some order.
It assumes that all accesses to the database are done using read and write operations. A schedule
is called ``correct'' if we can find a serial schedule that is ``equivalent'' to it. Given a set of
transactions T1...Tn, two schedules S1 and S2 of these transactions are equivalent if the following
conditions are satisfied:

Read-Write Synchronization: If a transaction reads a value written by another transaction in one
schedule, then it also does so in the other schedule.

Write-Write Synchronization: If a transaction overwrites the value of another transaction in one
schedule, it also does so in the other schedule.

Recoverability for changes to the other control file records sections is provided by maintaining all
the information in duplicate. Two physical blocks represent each logical block. One contains the
current information, and the other contains either an old copy of the information, or a pending
version that is yet to be committed. To keep track of which physical copy of each logical block
contains the current information, Oracle maintains a block version bitmap with the database
information entry in the first record section of the control file.

Recovery is an algorithmic process and should be kept as simple as possible, since complex
algorithms are likely to introduce errors. Therefore, an encoding scheme should be designed
around a set of principles intended to make recovery possible with simple algorithms. For
processes such as tag removal, simple mappings are more straightforward and less error prone
than, say, algorithms which require rearrangement of the sequence of elements, or which are
context-dependent, etc. Therefore, in order to provide a coherent and explicit set of recovery
principles, various recovery algorithms and related encoding principles need to be worked out,
taking into account such things as:
• The role and nature of mappings (tags to typography, normalized characters, spellings, etc., with
the original, ...);
• The encoding of rendition characters and rendition text;
• Definitions and separability of the source and annotation (such as linguistic annotation,
notes, etc.);
• Linkage of different views or versions of a text.

11.3 View Serializability

Serializability: here we assume serializability is the underlying consistency criterion.


Serializability requires that concurrent execution of transactions have the same effect as a serial
schedule. In a serial schedule, transactions are executed one at a time. Given that the database is
initially consistent and each transaction is a unit of consistency, serial execution of all
transactions will give each transaction a consistent view of the database and will leave the
database in a consistent state. Since serializable execution is computationally equivalent to serial
execution, serializable execution is also deemed correct.
View Serializability: a subclass of serializability, called view serializability, is identified based
on the observation that two schedules are equivalent if they have the same effects. The effects of a
schedule are the values it produces, which are functions of the values it reads. Two schedules H1
and H2 are defined to be view equivalent if they contain the same operations, have the same
reads-from relationships, and have the same final writes; these conditions are stated precisely
below.

Database Concurrency Control: with multiple users and concurrent accesses, problems can arise
if there is no control. Example:

- Transaction 1: withdraw $500 from account A.
- Transaction 2: deposit $800 to account A.

T1: read(A); A = A - 500; write(A)
T2: read(A); A = A + 800; write(A)

Initially A = 1000. Either serial order gives A = 1300 (T1 T2: A = 1300; T2 T1: A = 1300). Two
interleavings that are not equivalent to any serial order:

T1: read(A)     T1: A = A - 500
T2: read(A)     T2: A = A + 800
T2: write(A)
T1: write(A)    A = 500, inconsistent

T2: read(A)
T1: read(A)     T1: A = A - 500
T1: write(A)    T2: A = A + 800
T2: write(A)    A = 1800, inconsistent


Transactions and Schedules: a transaction is a sequence of operations, and a transaction is
atomic. Let T = {T1, ..., Tn} be a set of transactions. A schedule of T is a sequence of the operations
in T1, ..., Tn such that, for each 1 ≤ i ≤ n:
1. each operation in Ti appears exactly once, and
2. operations in Ti appear in the same order as in Ti.
A schedule of T is serial if for all i, j with i ≠ j, either
1. all the operations in Ti appear before all the operations in Tj, or
2. all the operations in Tj appear before all the operations in Ti.
Assumption: each Ti is correct when executed individually, so any serial schedule is valid.
Objective: accept schedules "equivalent" to a serial schedule (serializable schedules). What do we
mean by "equivalent"?

View Serializability: "equivalent" means having the same effects. The effects of a history are the
values produced by the Write operations of unaborted transactions. We do not know anything
about the computation of each transaction, so we assume that if each transaction's Reads read the
same values in two histories, then all of its Writes write the same values in both histories. If, for
each data item x, the final Write on x is the same in both histories, then the final value of all data
items will be the same in both histories.
Two histories H and H' are view equivalent if:
1. they are over the same set of transactions and have the same operations;
2. for any unaborted Ti, Tj and for any x, if Ti reads x from Tj in H then Ti reads x from Tj in H';
and
3. for each x, if wi[x] is the final write of x in H then it is also the final write of x in H'.
Assume there is a transaction Tb which initializes the values of all the data objects. A schedule is
view serializable if it is view equivalent to a serial schedule. Examples:
- r3[x] w4[x] w3[x] w6[x]: T3 reads from Tb; the final write for x is w6[x]; view equivalent to
T3 T4 T6.
- r3[x] w4[x] r7[x] w3[x] w7[x]: T3 reads from Tb; T7 reads from T4; the final write for x is w7[x];
view equivalent to T3 T4 T7.
- r3[x] w4[x] w3[x]: T3 reads from Tb; the final write for x is w3[x]; not serializable.
- w1[x] r2[x] w2[x] r1[x]: T2 reads from T1; T1 reads from T2; the final write for x is w2[x]; not
serializable.

Test for view serializability: Tb issues writes for all data objects (the first transaction), and Tf
reads the values of all data objects (the last transaction). Construction of the labeled precedence
graph:
1. Add an edge Ti →0 Tj if transaction Tj reads from Ti.
2. For each data item Q such that Tj reads Q from Ti and Tk executes write(Q) with Tk ≠ Tb
(i ≠ j ≠ k), do the following: (a) if Ti = Tb and Tj ≠ Tf, then insert the edge Tj →0 Tk.
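
A small Python sketch of the view-equivalence check defined above follows, comparing operations, reads-from relationships and final writes; representing Tb as transaction 0 is an assumption made for this example.

def reads_from(history):
    """history: ordered list of (txn, 'r' or 'w', item)."""
    last_writer = {}                     # item -> txn of the most recent write
    rf = set()
    for txn, op, item in history:
        if op == "r":
            rf.add((txn, item, last_writer.get(item, 0)))  # 0 stands for Tb
        else:
            last_writer[item] = txn
    return rf

def final_writes(history):
    fw = {}
    for txn, op, item in history:
        if op == "w":
            fw[item] = txn
    return fw

def view_equivalent(h1, h2):
    return (sorted(h1) == sorted(h2)                 # same operations
            and reads_from(h1) == reads_from(h2)     # same reads-from
            and final_writes(h1) == final_writes(h2))  # same final writes

if __name__ == "__main__":
    h      = [(3, "r", "x"), (4, "w", "x"), (3, "w", "x"), (6, "w", "x")]
    serial = [(3, "r", "x"), (3, "w", "x"), (4, "w", "x"), (6, "w", "x")]  # T3 T4 T6
    print(view_equivalent(h, serial))    # True, matching the first example above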

11.4. Resolving Deadlocks

Use the Monitor Display to understand and resolve deadlocks. This section demonstrates how this
is done. The steps assume that the deadlock is easy to recreate with the tested application

running within Optimize It Thread Debugger. If this is not the case, use the Monitor Usage
Analyzer instead.
To resolve a deadlock:
1. Recreate the deadlock.
2. Switch to Monitor Display.
3. Identify which thread is not making progress. Usually, the thread is yellow because it is
blocking on a monitor. Call that thread the blocking thread.
4. Select the Connection button to identify where the blocking thread is not making progress.
Double-click on the method to display the source code for the blocking method, as well as the
methods calling the blocking method. This provides some context for where the deadlock occurs.
5. Identify which thread owns the unavailable monitor. Call this the locking thread.
6. Identify why the locking thread does not release the monitor. This can happen in the following
cases:

• The locking thread is itself trying to acquire a monitor owned directly or indirectly by
the blocking thread. In this case, a bug exists since both the locking and the blocking
threads enter monitors in a different order. Changing the code to always enter
monitors in the same order will resolve this deadlock.

• The locking thread is not releasing the monitor because it remains busy executing
the code. In this case, the locking thread is green because it uses some CPU. This type
of bug is not a real deadlock. It is an extreme contention issue caused by the locking
thread holding the monitor for too long, sometimes called thread starvation.

• The locking thread is waiting for an I/O operation. In this case the locking thread is
purple. It is dangerous for a thread to perform an I/O operation while holding a
monitor, unless the only purpose of that monitor is to protect the objects used to
perform the I/O. A blocking I/O operation may never occur, causing the program to
hang. Often these situations can be resolved by releasing the monitor before
performing the I/O.

• The locking thread is waiting for another monitor. In this case, the locking thread is red. It is
equally dangerous to wait for a monitor while holding another monitor. The monitor may never be
notified, causing a deadlock. Often this situation can be resolved by releasing the monitor that the
blocking thread wants to acquire before waiting on the monitor.
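
The first case, entering monitors in different orders, is the classic deadlock. Below is a minimal Python sketch of the usual fix, acquiring locks in one agreed global order; the ordering table is an illustrative convention and is not part of any tool mentioned above.

import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
LOCK_ORDER = {id(lock_a): 1, id(lock_b): 2}   # one fixed rank per lock

def acquire_in_order(*locks):
    # Every thread sorts by the same rank, so no two threads can hold one
    # lock each while waiting for the other.
    for lock in sorted(locks, key=lambda l: LOCK_ORDER[id(l)]):
        lock.acquire()

def release_all(*locks):
    for lock in locks:
        lock.release()

def worker(name):
    acquire_in_order(lock_a, lock_b)          # same order in every thread
    try:
        print(name, "holds both locks")
    finally:
        release_all(lock_a, lock_b)

if __name__ == "__main__":
    threads = [threading.Thread(target=worker, args=(n,)) for n in ("T1", "T2")]
    for t in threads: t.start()
    for t in threads: t.join()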

11.5 Distributed Databases

To support the data collection needs of large networks, a distributed multitier architecture is
necessary. The advantage of this approach is that massive scalability can be implemented in an
incremental fashion, easing the operational and financial burden. Web NMS Server has been
designed to scale from a single-server solution to a distributed architecture supporting very large
scale needs. The architecture supports one or more relational databases (RDBMS).

A single back-end server collects the data and stores it in a local or remote database. The system
readily supports tens of thousands of managed objects. For example, on a 400 MHz Pentium
Windows NT system, the polling engine can collect over 4000 collected variables/minute,
including storage in the MySQL database on Windows.


The bottleneck for data collection is usually the database inserts, which limits the number of
entries per second that can be inserted into the database. As we discuss below, with a distributed
database, considerably higher performance is possible through distributing the storage of data
into multiple databases.

Based on tests with different modes, one central database on commodity hardware can handle up
to 100 collected variables/second; with distributed databases, this can be scaled much higher.
The achievable rate depends on the number of databases and the number and type of servers
used. With distributed databases, there is often a need to aggregate data in a single central store
for reporting and other purposes. Multiple approaches are feasible here:
• Roll-up data periodically to the central database from the different distributed databases.
• Use Native database distribution for centralized views, e.g. Oracle SQL Net. This is vendor
dependent, but can provide easy consolidation of data from multiple databases.
• Aggregate data using JDBC only when creating a report. This would require the report
writer to take care of collecting the data from the different databases for the report.

The need arises:
• to reduce the burden on the Poll Engine in Web NMS Server;
• to facilitate faster data collection.

The solution is Distributed Polling. You can adopt this technique when you are able to distinguish
the network elements geographically. You can form a group of network elements and decide to
have one Distributed Poller for them.

This section describes Distributed Polling architecture available with Web NMS Server. It
discusses the design, and the choices available in implementing the distributed solution. It
provides guidelines on setting up the components of the distributed system.

The architecture is very simple.

• You have Web NMS server running in one machine and Distributed Poller running in
other machines, one in each.
• Each Poller is identified by a name and has an associated database (labelled as Secondary
RDBMS in the diagram)
• You create PolledData and specify the Poller name if you want to perform data collection
for that PolledData via the distributed poller. In case you want Web NMS Polling Engine to

collect the data, you don't specify any Poller name. By default, PolledData will not be
associated with any of the Pollers.
• Once you associate the PolledData with the Poller and start the Poller, data collection is
done by the poller and the collected data is stored in the Poller database (Secondary RDBMS).

Major features of a DDB are:

o Data stored at a number of sites, each site logically single processor


o Sites are interconnected by a network rather than a multiprocessor configuration
o DDB is logically a single database (although each site is a database site)
o DDBMS has full functionality of a DBMS

To the user, the distributed database system should appear exactly like a non-distributed
database system.
Advantages of distributed database systems are:

o local autonomy (in enterprises that are distributed already)


o improved performance (since data is stored close to where needed and a query may
be split over several sites and executed in parallel)
o improved reliability/availability (should one site go down)
o economics
o expandability
o shareability

Disadvantages of distributed database systems are:

o complexity (greater potential for bugs in software)


o cost (software development can be much more complex and therefore costly. Also,
exchange of messages and additional computations involve increased overheads)
o distribution of control (no single database administrator controls the DDB)
o security (since the system is distributed the chances of security lapses are greater)
o difficult to change (since all sites have control of their own sites)
o lack of experience (enough experience is not available in developing distributed
systems)

Replication improves availability since the system would continue to be fully functional even if a
site goes down. Replication also allows increased parallelism since several sites could be operating
on the same relations at the same time. Replication does result in increased overheads on update.
Fragmentation may be horizontal, vertical or hybrid (or mixed). Horizontal fragmentation splits a
relation by assigning each tuple of the relation to a fragment of the relation. Often horizontal
fragmentation is based on predicates defined on that relation.
Vertical fragmentation splits the relation by decomposing it into several subsets of its attributes.
Relation R produces fragments R1, R2, ..., Rn, each of which contains a subset of the attributes
of R as well as the primary key of R. The aim of vertical fragmentation is to put together those
attributes that are accessed together.
Mixed fragmentation uses both vertical and horizontal fragmentation.
To obtain a sensible fragmentation design, it is necessary to know some information about the
database as well as about applications. It is useful to know the predicates used in the application
queries - at least the 'important' ones.
The aim is to have applications use only one fragment.
Fragmentation must provide completeness (all information in a relation must be available in the
fragments), reconstruction (the original relation should be able to be reconstructed from the
fragments) and disjointedness (no information should be stored twice unless absolutely essential,
for example, the key needs to be duplicated in vertical fragmentation).
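
As a concrete illustration, the Python sketch below fragments a tiny account relation horizontally and vertically and then checks completeness and lossless reconstruction; the sample tuples and attribute names are assumptions made for the example.

account = [
    {"account_number": "A-305", "branch_name": "Hillside",   "balance": 500},
    {"account_number": "A-177", "branch_name": "Valleyview", "balance": 205},
]

# Horizontal fragmentation: a selection predicate per fragment.
hillside   = [t for t in account if t["branch_name"] == "Hillside"]
valleyview = [t for t in account if t["branch_name"] == "Valleyview"]
assert sorted(map(str, hillside + valleyview)) == sorted(map(str, account))  # completeness

# Vertical fragmentation: project attribute subsets, keeping the primary key
# in every fragment so the original tuples can be rebuilt by a join.
frag1 = [{k: t[k] for k in ("account_number", "branch_name")} for t in account]
frag2 = [{k: t[k] for k in ("account_number", "balance")} for t in account]

rebuilt = [{**a, **b} for a in frag1 for b in frag2
           if a["account_number"] == b["account_number"]]
assert sorted(map(str, rebuilt)) == sorted(map(str, account))                # reconstruction
print("fragmentation is complete and lossless")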
Transparency involves the user not having to know how a relation is stored in the DDB; it is the
system capability to hide the details of data distribution from the user.
Autonomy is the degree to which a designer or administrator of one site may be independent of
the remainder of the distributed system.
It is clearly undesirable for users to have to know which fragment of a relation they require in
order to process the query they are posing. Similarly, users should not need to know which copy of
a replicated relation or fragment they need to use. It should be up to the system to figure out
which fragment or fragments of a relation a query requires and which copy of a fragment the
system will use to process the query. This is called replication and fragmentation transparency.
A user should also not need to know where the data is located and should be able to refer to a
relation by name, which can then be translated by the system into a full name that includes the
location of the relation. This is location transparency.

Global query optimization is complex because of


• cost models
• fragmentation and replication
• large solution space from which to choose
Computing the cost itself can be complex, since the cost is a weighted combination of the I/O, CPU
and communications costs. Often one of two cost models is used: one may wish to minimize either
the total cost (time) or the response time. Fragmentation and replication add further complexity to
finding an optimum query plan.

Date's 12 Rules for Distributed Databases


The fundamental principle is that a distributed RDBMS should in all other respects behave exactly
like a non-distributed RDBMS; this is sometimes called Rule 0.

Distributed Database Characteristics:


According to Oracle, these are the database characteristics and how Oracle 7 technology meets
each point:

1. Local autonomy The data is owned and managed locally. Local operations remain purely local.
One site (node) in the distributed system does not depend on another site to function successfully.

2. No reliance on a central site All sites are treated as equals. Each site has its own data
dictionary.

3. Continuous operation Incorporating a new site has no effect on existing applications and does
not disrupt service.

4. Location independence Users can retrieve and update data independent of the site.

5. Partitioning [fragmentation] independence Users can store parts of a table at different locations.
Both horizontal and vertical partitioning of data is possible.

6. Replication independence Stored copies of data can be located at multiple sites. Snapshots, a
type of database object, can provide both read-only and updatable copies of tables. Symmetric
replication using triggers makes readable and writable replication possible.

7. Distributed query processing Users can query a database residing on another node. The query is
executed at the node where the data is located.

8. Distributed transaction management A transaction can update, insert, or delete data from
multiple databases. The two-phase commit mechanism in Oracle ensures the integrity of
distributed transactions. Row-level locking ensures a high level of data concurrency.

9. Hardware independence Oracle7 runs on all major hardware platforms.

10. Operating system independence A specific operating system is not required. Oracle7 runs
under a variety of operating systems.


11. Network independence Oracle's SQL*Net supports most popular networking software.
Network independence allows communication across homogeneous and heterogeneous networks.
Oracle's MultiProtocol Interchange enables applications to communicate with databases across
multiple network protocols.

12. DBMS independence DBMS independence is the ability to integrate different databases.
Oracle's Open Gateway technology supports ODBC-enabled connections to non-Oracle databases.

11.6 Distributed Commit

To create a new user, test, and a corresponding default schema you must be connected as the
ADMIN user and then use:

CREATE USER test;


CREATE SCHEMA test AUTHORIZATION test; --sets the default schema
COMMIT;
and then connect to the new user/schema using:
CONNECT TO '' USER 'test'

Notice that the COMMIT was needed before the CONNECT because re-connecting would otherwise
rollback any uncommitted changes.
In this example the sequence of events is as follows:
The coordinator at Client A registers automatically with the Transaction Manager database at
Server B, using TM_DATABASE=TMB.
The application requester at Client A issues a DUOW request to Servers C and E. For example, the
following REXX script illustrates this:
/**/
'set DB2OPTIONS=+c' /* in order to turn off autocommit */
'db2 set client connect 2 syncpoint twophase'
'db2 connect to DBC user USERC using PASSWRDC'
'db2 create table twopc (title varchar(50), artno smallint not null)'
'db2 insert into twopc (title,artno) values("testCCC",99)'
'db2 connect to DBE user USERE using PASSWRDE'
'db2 create table twopc (title varchar(50), artno smallint not null)'
'db2 insert into twopc (title,artno) values("testEEE",99)'
'commit'
exit (0);

When the commit is issued, the coordinator at the application requester sends prepare requests to
the SPM for the updates requested at servers C and E.
The SPM is running on Server D, as part of DB2 Connect, and it sends the prepare requests to
servers C and E. Servers C and E in turn acknowledge the prepare requests.
The SPM sends back an acknowledgement to the coordinator at the application requester.
The coordinator at the application requester sends a request to the transaction manager at Server
B for the servers that have acknowledged, and the transaction manager decides whether to
commit or roll back. The transaction manager logs the commit decision, and the updates are
guaranteed from this point. The coordinator issues commit requests, which are processed by the
SPM and forwarded to servers C and E, as were the prepare requests. Servers C and E commit
and report success to the SPM. The SPM then returns the commit result to the coordinator, which
updates the TMB with the commit results.
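
Stripped of the SPM and logging details, the commit decision follows the usual two-phase commit pattern. The Python sketch below shows only that core prepare/commit exchange; the Participant class and its methods are assumptions made for this example, not the DB2 interfaces described above.

class Participant:
    def __init__(self, name, will_prepare=True):
        self.name, self.will_prepare = name, will_prepare
    def prepare(self):
        return self.will_prepare              # vote YES or NO in phase 1
    def commit(self):
        print(self.name, "committed")
    def rollback(self):
        print(self.name, "rolled back")

def two_phase_commit(participants):
    # Phase 1: ask every participant to prepare and collect the votes.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if every vote was YES, otherwise roll back everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "rolled back"

if __name__ == "__main__":
    print(two_phase_commit([Participant("Server C"), Participant("Server E")]))
    print(two_phase_commit([Participant("Server C"),
                            Participant("Server E", will_prepare=False)]))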

Two-phase Commit RDBMS Scenario


11.7 Distributed Locking

The intent of this white paper is to convey information regarding database locks as they apply to
transactions in general and the more specific case of how they are implemented by the Progress
server. We’ll begin with a general overview discussing why locks are needed and how they affect
transactions. Transactions and locking are outlined in the SQL standard so no introduction
would be complete without discussing the guidelines set forth here. Once we have a grasp on the
general concepts of locking we’ll dive into lock modes, such as table and record locks and their
effect on different types of
database operations. Next, the subject of timing will be introduced, when locks are obtained and
when they are released. From here we’ll get into lock contention and deadlocks, which are
multiple operations or transactions all attempting to get locks on the same resource at the same
time. And to conclude our discussion on locking we’ll take a look at how we can see locks in our
application so we know which transactions obtain which types of locks. Finally, this white paper
describes differences in locking behavior
between previous and current versions of Progress and differences in locking behavior when both
4GL and SQL92 clients are accessing the same resources.

Locks

The answer to why we lock is simple; if we didn’t there would be no consistency. Consistency
provides us with successive, reliable, and uniform results without which applications such as
banking and reservation systems, manufacturing, chemical, and industrial data collection and
processing could not exist. Imagine a banking application where two clerks attempt to update an
account balance at the same time: one credits the account and the other debits the account.
While one clerk reads the account balance of $200 to credit the account $100, the other clerk has
already completed the debit of $100 and updated the account balance to $100. When the first
clerk finishes the credit of $100 to the balance of $200 and updates the balance to $300 it will be
as if the debit never happened. Great for the customer; however the bank wouldn’t be in business
for long.

What objects are we locking?

What database objects get locked is not as simple to answer as why they’re locked. From a user
perspective, objects such as the information schema, user tables, and user records are locked
while being accessed to maintain consistency. There are other lower level objects that require
locks that are handled by the RDBMS; however, they are not visible to the user. For the purposes
of this discussion we will focus on the objects that the user has visibility of and control over.

Transactions
Now that we know why and what we lock, let’s talk a bit about when we lock. A transaction is a
unit of work; there is a well-defined beginning and end to each unit of work. At the beginning of
each transaction certain locks are obtained and at the end of each transaction they are released.
During any given transaction, the RDBMS, on behalf of the user, can escalate, deescalate, and
even release locks as required. We’ll talk about this in more detail later when we discuss lock
modes. The aforementioned is all-true in the case of a normal, successful transaction; however in
the case of an abnormally terminated transaction things are handled a bit differently. When a
transaction fails, for any reason, the action performed by the transaction needs to be backed out,
the change undone. To accomplish this most RDBMS use what are known as “save points.” A save
point marks the last known good point prior to the abnormal termination;
typically this is the beginning of the transaction. It’s the RDBMS’s job to undo the changes back
to the previous save point as well as ensuring the proper locks are held until the transaction is
completely undone.
So, as you can see, transactions that are in the process of being undone (rolled back) are still
transactions nonetheless and still need locks to maintain data consistency.

Locking certain objects for the duration of a transaction ensures database consistency and
isolation from other concurrent transactions, preventing the banking situation we described
previously. Transactions are the basis for the ACID properties:
• ATOMICITY guarantees that all operations within a transaction are performed or none of them
are performed.
• CONSISTENCY is the concept that allows an application to define consistency points and
validate the correctness of data transformations from one state to the next.
• ISOLATION guarantees that concurrent transactions have no effect on each other.
• DURABILITY guarantees that all transaction updates are preserved.

Unit 12
Database System Architectures

12.1 Centralized And Client-Server Architectures

12.2 Server System Architectures
12.3 Parallel Systems
12.4 Distributed Systems
12.5 Network Types

12.1. Centralized And Client-Server Architectures

Centralized Systems:

Centralized systems run on a single computer system and do not interact with other computer
systems. A modern, general-purpose computer system has one to a few CPUs and a number of
device controllers connected through a common bus that provides access to shared memory.
A single-user system (e.g., a personal computer or workstation) is a desk-top unit for a single
user, usually with only one CPU and one or two hard disks; the OS may support only one user.
A multi-user system has more disks, more memory, multiple CPUs, and a multi-user OS. It serves
a large number of users who are connected to the system via terminals. Such systems are often
called server systems.

Client-Server Systems:

In this system, server systems satisfy requests generated at client systems.


Database functionality can be divided into:

a) Back-end: manages access structures, query evaluation and optimization, concurrency control
and recovery.
b) Front-end: consists of tools such as forms, report-writers, and graphical user interface
facilities.

The interface between the front-end and the back-end is through SQL or through an application
program interface.

Advantages of replacing mainframes with networks of workstations or personal computers
connected to back-end server machines:
– better functionality for the cost
– flexibility in locating resources and expanding facilities
– better user interfaces
– easier maintenance

12.2. Server System Architectures

Server systems can be broadly categorized into two kinds:


– transaction servers which are widely used in relational database systems, and
– data servers, used in object-oriented database systems

A) Transaction Servers

- Also called query server systems or SQL server systems: clients send requests to the server
system, where the transactions are executed, and results are shipped back to the client.
- Requests are specified in SQL and communicated to the server through a remote procedure call
(RPC) mechanism.
- Transactional RPC allows many RPC calls to collectively form a transaction.
- Open Database Connectivity (ODBC) is an application program interface standard from Microsoft
for connecting to a server, sending SQL requests, and receiving results.

Transaction Server Process Structure:


A typical transaction server consists of multiple processes accessing data in shared memory.
1. Server processes
• These receive user queries (transactions), execute them and send results back
• Processes may be multithreaded, allowing a single process to execute several user queries
concurrently
• Typically there are multiple multithreaded server processes

2. Lock manager process: this process implements lock manager functionality, which includes
lock grant, lock release, and deadlock detection.

3. Database writer process
• Outputs modified buffer blocks to disks continually
4. Log writer process
• Server processes simply add log records to the log record buffer
• The log writer process outputs log records to stable storage
5. Checkpoint process
• Performs periodic checkpoints
6. Process monitor process
• Monitors other processes, and takes recovery actions if any of the other processes fail,
e.g. aborting any transactions being executed by a server process and restarting it


Shared memory contains shared data:
• buffer pool
• lock table
• log buffer
• cached query plans (reused if the same query is submitted again)
All database processes can access shared memory. To ensure that no two processes access the
same data structure at the same time, database systems implement mutual exclusion using either
• operating system semaphores, or
• atomic instructions such as test-and-set.
To avoid the overhead of inter-process communication for lock request/grant, each database
process operates directly on the lock table data structure (Section 16.1.4) instead of sending
requests to the lock manager process:
• Mutual exclusion is ensured on the lock table using semaphores or, more commonly, atomic
instructions.
• If a lock can be obtained, the lock table is updated directly in shared memory.
• If a lock cannot be immediately obtained, a lock request is noted in the lock table and the
process (or thread) then waits for the lock to be granted.
• When a lock is released, the releasing process updates the lock table to record the release of the
lock, as well as the grant of the lock to waiting requests (if any).
• A process/thread waiting for a lock may either:
  - continually scan the lock table to check for a lock grant, or
  - use the operating system semaphore mechanism to wait on a semaphore; the semaphore
identifier is recorded in the lock table, and when a lock is granted the releasing process signals
the semaphore to tell the waiting process/thread to proceed.

• The lock manager process is still used for deadlock detection.
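
A loose Python sketch of this arrangement follows: one mutex protects the lock table, and a condition variable stands in for the operating-system semaphore a waiting process would block on. The class and method names are assumptions made for this example.

import threading

class LockTable:
    def __init__(self):
        self._mutex = threading.Lock()        # mutual exclusion on the table itself
        self._owner = {}                      # item -> owning transaction
        self._cond = threading.Condition(self._mutex)

    def lock(self, item, txn):
        with self._cond:
            while item in self._owner:        # request noted; wait to be signalled
                self._cond.wait()
            self._owner[item] = txn           # grant recorded directly in the table

    def unlock(self, item):
        with self._cond:
            del self._owner[item]
            self._cond.notify_all()           # wake waiters so one can be granted

if __name__ == "__main__":
    table = LockTable()
    table.lock("X", "T1")
    t = threading.Thread(target=lambda: (table.lock("X", "T2"), table.unlock("X")))
    t.start()
    table.unlock("X")                         # T2 is then granted the lock
    t.join()
    print("done")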

B) Data Servers:

Used in LANs, where there is a very high speed connection between the clients and the server, the
client machines are comparable in processing power to the server machine, and the tasks to be
executed are compute intensive.
Ship data to client machines where processing is performed, and then ship results back to the
server machine. This architecture requires full back-end functionality at the clients.
Used in many object-oriented database systems

Issues:
- Page-Shipping versus Item-Shipping
– Locking
– Data Caching
– Lock Caching

a) Page-Shipping versus Item-Shipping


– A smaller unit of shipping means more messages
– Worth pre-fetching related items along with requested item
– Page shipping can be thought of as a form of pre-fetching

b) Locking
– Overhead of requesting and getting locks from server is high due to message delays
– Can grant locks on requested and prefetched items; with page shipping, transaction is granted
lock on whole page.
– Locks on the page can be deescalated to locks on items in the page when there are lock
conflicts. Locks on unused items can then be returned to server.

c) Data Caching
– Data can be cached at client even in between transactions
– But check that data is up-to-date before it is used (cache coherency)
– Check can be done when requesting lock on data item

d) Lock Caching
– Locks can be retained by client system even in between transactions
– Transactions can acquire cached locks locally, without contacting server
– Server calls back locks from clients when it receives conflicting lock request. Client returns lock
once no local transaction is using it.
– Similar to deescalation, but across transactions.

12.3. Parallel Systems

Parallel database systems consist of multiple processors and multiple disks connected by a fast
interconnection network. A coarse-grain parallel machine consists of a small number of powerful
processors; a massively parallel or fine-grain machine utilizes thousands of smaller processors.
Two main performance measures are:
– throughput: the number of tasks that can be completed in a given time interval
– response time: the amount of time it takes to complete a single task from the time it is
submitted
A) Speed-Up and Scale-Up
Speedup: a fixed-sized problem executing on a small system is given to a system which is N times
larger.
Measured by: speedup = (small system elapsed time) / (large system elapsed time).
Speedup is linear if this ratio equals N.
Scaleup: increase the size of both the problem and the system; an N-times larger system is used
to perform an N-times larger job.
Measured by: scaleup = (small system, small problem elapsed time TS) / (big system, big problem
elapsed time TL).
Scaleup is linear if this ratio equals 1.

(Figures: speedup, linear and sublinear, plotted as speed against resources; scaleup, linear and
sublinear, plotted against problem size, with resources increasing in proportion to problem size.)
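
A quick numeric illustration of the two measures (the timings below are made-up example values):

def speedup(small_system_time, large_system_time):
    return small_system_time / large_system_time

def scaleup(small_problem_small_system_time, big_problem_big_system_time):
    return small_problem_small_system_time / big_problem_big_system_time

# A job takes 100 s on 1 processor and 30 s on 4 processors: speedup 3.33 (< 4, sublinear).
print(speedup(100, 30))
# A 4x larger job on a 4x larger system takes 110 s instead of 100 s: scaleup 0.91 (< 1, sublinear).
print(scaleup(100, 110))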

Batch and Transaction Scaleup:
Batch scaleup:
– A single large job; typical of most database queries and scientific simulation.
– Use an N-times larger computer on an N-times larger problem.
Transaction scaleup:
– Numerous small queries submitted by independent users to a shared database; typical of
transaction processing and timesharing systems.
– N times as many users submit requests (hence, N times as many requests) to an N-times larger
database, on an N-times larger computer.
– Well-suited to parallel execution.

Factors Limiting Speedup and Scaleup:


Startup costs: Cost of starting up multiple processes may dominate computation time, if
the degree of parallelism is high.
Interference: Processes accessing shared resources (e.g., system bus, disks, or locks)
compete with each other, thus spending time waiting on other processes, rather than
performing useful work.
Skew: Increasing the degree of parallelism increases the variance in the service times of tasks
executing in parallel. Overall execution time is determined by the slowest of the tasks executing in
parallel.

B) Interconnection Network Architectures


Bus. System components send data on and receive data from a single communication
bus; does not scale well with increasing parallelism.
Mesh. Components are arranged as nodes in a grid, and each component is connected to all
adjacent components; the number of communication links grows with the number of components,
so the mesh scales better. But it may require up to 2√n hops to send a message to a node (or √n
with wraparound connections at the edge of the grid).
Hypercube. Components are numbered in binary; components are connected to one another if
their binary representations differ in exactly one bit. Each of the n components is connected to
log(n) other components and can reach any other component via at most log(n) links; this reduces
communication delays.
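
A tiny Python sketch of hypercube connectivity, listing each component's log2(n) neighbours by flipping one bit at a time (purely an illustration, not tied to any particular system):

def neighbours(node, dimensions):
    # Flip each of the `dimensions` bits once; each flip gives one neighbour.
    return [node ^ (1 << bit) for bit in range(dimensions)]

if __name__ == "__main__":
    n_components, dims = 8, 3           # 8 = 2**3 components, so 3 neighbours each
    for node in range(n_components):
        print(format(node, "03b"), "->",
              [format(m, "03b") for m in neighbours(node, dims)])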

C) Parallel Database Architectures


1. Shared memory -- processors share a common memory
2. Shared disk -- processors share a common disk
3. Shared nothing -- processors share neither a common memory nor common disk
4. Hierarchical -- hybrid of the above architectures


a) Shared Memory
1. Processors and disks have access to a common memory, typically via a bus or
through an interconnection network.
2. Extremely efficient communication between processors — data in shared memory can be
accessed by any processor without having to move it using software.
3. Downside – architecture is not scalable beyond 32 or 64 processors since the bus or the
interconnection network becomes a bottleneck
a. Widely used for lower degrees of parallelism (4 to 8).

b) Shared Disk
1. All processors can directly access all disks via an interconnection network, but the
processors have private memories.
• The memory bus is not a bottleneck.
• The architecture provides a degree of fault-tolerance: if a processor fails, the other processors
can take over its tasks, since the database is resident on disks that are accessible from all
processors.
2. Examples: IBM Sysplex and DEC clusters (now part of Compaq) running Rdb (now Oracle Rdb)
were early commercial users.
3. Downside: the bottleneck now occurs at the interconnection to the disk subsystem.
4. Shared-disk systems can scale to a somewhat larger number of processors, but
communication between processors is slower.

C) Shared Nothing
1. Node consists of a processor, memory, and one or more disks. Processors at one node
communicate with another processor at another node using an interconnection network. A
node functions as the server for the data on the disk or disks the node owns.
2. Examples: Teradata, Tandem, Oracle-nCUBE
3. Data accessed from local disks (and local memory accesses) do not pass through
interconnection network, thereby minimizing the interference of resource sharing.
4. Shared-nothing multiprocessors can be scaled up to thousands of processors without
interference.

5. Main drawback: cost of communication and non-local disk access; sending data involves
software interaction at both ends.

d) Hierarchical
1. Combines characteristics of shared-memory, shared-disk, and shared-nothing
architectures.
2. Top level is a shared-nothing architecture – nodes connected by an interconnection
network, and do not share disks or memory with each other.
3. Each node of the system could be a shared-memory system with a few processors.
4. Alternatively, each node could be a shared-disk system, and each of the systems sharing a
set of disks could be a shared-memory system.
5. The complexity of programming such systems can be reduced by distributed virtual-memory
architectures, also called non-uniform memory architecture (NUMA).

12.4. Distributed Systems
– Data is spread over multiple machines (also referred to as sites or nodes).
– A network interconnects the machines.
– Data is shared by users on multiple machines.

Differentiate between local and global transactions


– A local transaction accesses data in the single site at which the transaction was initiated.
– A global transaction either accesses data in a site different from the one at which the
transaction was initiated or accesses data in several different sites.

Distributed Databases
1. Homogeneous distributed databases
 Same software/schema on all sites, data may be partitioned among sites
 Goal: provide a view of a single database, hiding details of distribution
2. Heterogeneous distributed databases
 Different software/schema on different sites
 Goal: integrate existing databases to provide useful functionality
3. Differentiate between local and global transactions
 A local transaction accesses data in the single site at which the transaction was
initiated.
 A global transaction either accesses data in a site different from the one at which
the transaction was initiated or accesses data in several different sites.

Trade-offs in Distributed Systems

– Sharing data: users at one site are able to access data residing at other sites.
– Autonomy: each site is able to retain a degree of control over data stored locally.

– Higher system availability through redundancy: data can be replicated at remote sites, and the
system can function even if a site fails.
– Disadvantage: added complexity is required to ensure proper coordination among sites:
  – software development cost.
  – greater potential for bugs.
  – increased processing overhead.

Implementation Issues for Distributed Databases


1. Atomicity is needed even for transactions that update data at multiple sites
• A transaction cannot be committed at one site and aborted at another
2. The two-phase commit protocol (2PC) is used to ensure atomicity
• Basic idea: each site executes the transaction until just before commit, and then leaves the
final decision to a coordinator
• Each site must follow the decision of the coordinator, even if there is a failure while waiting
for the coordinator's decision
• To do so, the updates of the transaction are logged to stable storage and the transaction is
recorded as "waiting"
• More details in Section 19.4.1
3. 2PC is not always appropriate: other transaction models, based on persistent messaging and
workflows, are also used
4. Distributed concurrency control (and deadlock detection) is required
5. Replication of data items is required for improving data availability

12.5. Network Types

Local-area networks (LANs) – composed of processors that are distributed over small
geographical areas, such as a single building or a few adjacent buildings.

Wide-area networks (WANs) – composed of processors distributed over a large geographical area.
Discontinuous connection– WANs, such as those based on periodic dial-up (using, e.g., UUCP),
that are connected only for part of the time.
Continuous connection – WANs, such as the Internet, where hosts are connected to the network
at all times.
WANs with continuous connection are needed for implementing distributed database systems
Groupware applications such as Lotus Notes can work on WANs with discontinuous connection:
– Data is replicated.
– Updates are propagated to replicas periodically.
– No global locking is possible, and copies of data may be independently updated.
– Non-serializable executions can thus result. Conflicting updates may have to be detected,
and resolved in an application dependent manner.


Unit 13
Distributed Databases

13.1 Homogeneous And Heterogeneous Database
13.2 Distributed Data Storage
13.3 Distributed Transaction
13.4 Commit Protocols
13.5 Concurrency Control In Distributed Databases
13.6 Availability

13.1. Homogeneous And Heterogeneous Databases

All computer systems have limits. These limitations can be seen in the amount of memory the
system can address, the number of hard disk drives which can be connected to it or the number
of processors it can run in parallel. In practice this means that, as the quantity of information in a
database becomes larger, a single system can no longer cope with all the information that needs
to be stored, sorted and queried.

Although it is (currently) still possible to build bigger and faster computer systems, it is often not
a cost-effective solution to upgrade the hardware every few months. Instead, it is more affordable
to have several database servers that appear to the users to be a single system, and which split
the tasks between themselves. By doing this we can use commodity machines at affordable prices.
This has the added advantage that systems are not simply discarded as soon as a newer version
arrives, but can be added to and then replaced as they become obsolete.

These are called distributed databases, and have the common characteristics that they are stored
on two or more computers, called nodes, and that these nodes are connected over a network.

There are two classifications for distributed databases, homogeneous and heterogeneous.

Homogeneous databases all use the same DBMS software and have the same applications on
each node. They have a common schema (a file specifying the structure of the database), and can
have varying degrees of local autonomy. They can be based on any DBMS which supports this
function, but it is not possible to have more than one DBMS type in the system.

Local autonomy specifies how the system appears to work from the users' and programmers'
perspective. For example, we can have a system with little or no local autonomy, where all
requests are sent to a central node, called the gateway. From here they are assigned to whichever

node holds the information or application required. This is typically seen on the web with mirror
sites for popular locations, where several nodes hold exactly the same information and
applications in order to improve throughput and access times.

This has the disadvantage that the gateway into the system has to have a very large network
connection and a lot of processing power to keep up with requests and with routing the data back
from the nodes to the users.

At the other end of the scale, we have heterogeneous databases, which have a very high degree of
local autonomy. Each node in the system has its own local users, applications and data, deals
with them itself, and only connects to other nodes for information it does not have.

This type of distributed database is often just called a federated system, or a federation. It is
becoming more popular with organizations because of its scalability, the reduced cost of adding
extra nodes when necessary, and the ability to mix software packages. Unlike homogeneous
systems, heterogeneous systems can include different database management systems, which
makes them appealing to organizations that need to incorporate legacy systems and data into new
systems.

Distributed Database System


-A distributed database system consists of loosely coupled sites that share no physical
component
-Database systems that run on each site are independent of each other
-Transactions may access data at one or more sites
Homogeneous Distributed Databases
In a homogeneous distributed database
 All sites have identical software
 Are aware of each other and agree to cooperate in processing user requests.
 Each site surrenders part of its autonomy in terms of right to change schemas or
software
 Appears to user as a single system
In a heterogeneous distributed database
 Different sites may use different schemas and software
 Difference in schema is a major problem for query processing
 Difference in software is a major problem for transaction processing
 Sites may not be aware of each other and may provide only
limited facilities for cooperation in transaction processing

13.2. Distributed Data Storage

Assume relational data model


Replication
 System maintains multiple copies of data, stored in different sites, for faster
retrieval and fault tolerance.
Fragmentation
 Relation is partitioned into several fragments stored in distinct sites
Replication and fragmentation can be combined
 Relation is partitioned into several fragments: system maintains several identical
replicas of each such fragment.
Data Replication
A relation or fragment of a relation is replicated if it is stored redundantly in two or more
sites.
Full replication of a relation is the case where the relation is stored at all sites.
Fully redundant databases are those in which every site contains a copy of the entire
database.
Advantages of Replication

Availability: failure of a site containing relation r does not result in unavailability
of r if replicas exist.
Parallelism: queries on r may be processed by several nodes in parallel.
Reduced data transfer: relation r is available locally at each site containing a
replica of r.
Disadvantages of Replication
-Increased cost of updates: each replica of relation r must be updated.
-Increased complexity of concurrency control: concurrent updates to distinct
replicas may lead to inconsistent data unless special concurrency control
mechanisms are implemented.
-One solution: choose one copy as primary copy and apply concurrency
control operations on primary copy
Data Fragmentation
Division of relation r into fragments r1, r2, …, rn which contain sufficient information to
reconstruct relation r.
Horizontal fragmentation: each tuple of r is assigned to one or more fragments
Vertical fragmentation: the schema for relation r is split into several smaller schemas
 All schemas must contain a common candidate key (or superkey) to ensure lossless
join property.
 A special attribute, the tuple-id attribute may be added to each schema to serve as
a candidate key.
Example : relation account with following schema
Account-schema = (branch-name, account-number, balance)

Horizontal Fragmentation of account Relation


account1 = σ branch-name = “Hillside” (account)

account-number   branch-name   balance
A-305            Hillside          500
A-226            Hillside          336
A-155            Hillside           62

account2 = σ branch-name = “Valleyview” (account)

account-number   branch-name   balance
A-177            Valleyview        205
A-402            Valleyview      10000
A-408            Valleyview       1123
A-639            Valleyview        750

Vertical Fragmentation of employee-info Relation


deposit1 = Π branch-name, customer-name, tuple-id (employee-info)

branch-name   customer-name   tuple-id
Hillside      Lowman          1
Hillside      Camp            2
Valleyview    Camp            3
Valleyview    Kahn            4
Hillside      Kahn            5
Valleyview    Kahn            6
Valleyview    Green           7

deposit2 = Π account-number, balance, tuple-id (employee-info)

account-number   balance   tuple-id
A-305              500     1
A-226              336     2
A-177              205     3
A-402            10000     4
A-155               62     5
A-408             1123     6
A-639              750     7
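
To make the two kinds of fragmentation concrete, the following Python sketch (illustrative only; it is not part of the text) models the example relations as lists of dictionaries, builds the fragments with ordinary filters and projections, and checks that the originals can be reconstructed by union and by a join on tuple-id. The attribute spellings (account_number, tuple_id, etc.) are assumptions made for the sketch.

# Illustrative sketch only: relations as lists of dicts, fragments as filters/projections.

account = [
    {"account_number": "A-305", "branch_name": "Hillside",   "balance": 500},
    {"account_number": "A-226", "branch_name": "Hillside",   "balance": 336},
    {"account_number": "A-177", "branch_name": "Valleyview", "balance": 205},
]

# Horizontal fragmentation: one selection per branch; their union is the whole relation.
account1 = [t for t in account if t["branch_name"] == "Hillside"]
account2 = [t for t in account if t["branch_name"] == "Valleyview"]
assert {t["account_number"] for t in account1 + account2} == {t["account_number"] for t in account}

# Vertical fragmentation: projections that both keep the tuple-id, so the join is lossless.
employee_info = [
    {"tuple_id": 1, "branch_name": "Hillside", "customer_name": "Lowman",
     "account_number": "A-305", "balance": 500},
    {"tuple_id": 2, "branch_name": "Hillside", "customer_name": "Camp",
     "account_number": "A-226", "balance": 336},
]
deposit1 = [{k: t[k] for k in ("branch_name", "customer_name", "tuple_id")} for t in employee_info]
deposit2 = [{k: t[k] for k in ("account_number", "balance", "tuple_id")} for t in employee_info]

# Reconstruct employee-info by joining the vertical fragments on tuple_id.
rejoined = [{**a, **b} for a in deposit1 for b in deposit2 if a["tuple_id"] == b["tuple_id"]]
assert len(rejoined) == len(employee_info) and all(t in employee_info for t in rejoined)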

Advantages of Fragmentation
Horizontal:
 allows parallel processing on fragments of a relation
 allows a relation to be split so that tuples are located where they are most
frequently accessed
Vertical:
 allows tuples to be split so that each part of the tuple is stored where it is most
frequently accessed
 tuple-id attribute allows efficient joining of vertical fragments
 allows parallel processing on a relation
Vertical and horizontal fragmentation can be mixed.
 Fragments may be successively fragmented to an arbitrary depth.

Data Transparency:
Data transparency: Degree to which system user may remain unaware of the details of how
and where the data items are stored in a distributed system
Consider transparency issues in relation to:
 Fragmentation transparency
 Replication transparency
 Location transparency
Distributed Query Processing
For centralized systems, the primary criterion for measuring the cost of a particular strategy is
the number of disk accesses.
In a distributed system, other issues must be taken into account:
 The cost of data transmission over the network.
 The potential gain in performance from having several sites process parts of the
query in parallel.
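
As a rough illustration of this trade-off (the cost weights below are invented for the sketch, not taken from the text), a distributed optimizer might compare candidate strategies with a simple cost function that charges for disk I/O, divides it by the number of sites working in parallel, and adds a per-byte charge for data shipped over the network:

def strategy_cost(disk_ios, bytes_shipped, parallel_sites=1,
                  cost_per_io=0.01, cost_per_mb_shipped=0.05):
    """Toy cost model: I/O work is shared by sites running in parallel,
    and every byte shipped between sites is charged a network cost."""
    io_cost = disk_ios * cost_per_io / max(parallel_sites, 1)
    net_cost = (bytes_shipped / 1_000_000) * cost_per_mb_shipped
    return io_cost + net_cost

# Ship the whole relation to one site vs. process fragments in parallel and ship small results.
print(strategy_cost(disk_ios=10_000, bytes_shipped=50_000_000, parallel_sites=1))
print(strategy_cost(disk_ios=10_000, bytes_shipped=2_000_000, parallel_sites=4))
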
Query Transformation
Translating algebraic queries on fragments:
 It must be possible to construct relation r from its fragments
 Replace relation r by the expression that constructs it from its fragments
Consider the horizontal fragmentation of the account relation into
account1 = σ branch-name = “Hillside” (account)
account2 = σ branch-name = “Valleyview” (account)
The query σ branch-name = “Hillside” (account) becomes σ branch-name = “Hillside” (account1
∪ account2) which is optimized into
σ branch-name = “Hillside” (account1) ∪ σ branch-name = “Hillside” (account2)

Example:
Since account1 has only tuples pertaining to the Hillside branch, we can eliminate the selection
operation.
-Apply the definition of account2 to obtain
σ branch-name = “Hillside” (σ branch-name = “Valleyview” (account))
-This expression is the empty set regardless of the contents of the account relation.
-Final strategy is for the Hillside site to return account1 as the result of the query.
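
The pruning step above can be sketched in a few lines of Python (an illustration under assumed names, not part of any real optimizer): each horizontal fragment carries its defining predicate, and a selection on branch-name is answered only by the fragments whose defining predicate does not contradict it.

# Each fragment is described by the branch-name value in its defining selection.
fragments = {
    "account1": "Hillside",     # account1 = sigma branch-name="Hillside"(account)
    "account2": "Valleyview",   # account2 = sigma branch-name="Valleyview"(account)
}

def plan_selection(query_branch):
    """Fragments that can contribute tuples to sigma branch-name=query_branch(account)."""
    plan = []
    for name, fragment_branch in fragments.items():
        if fragment_branch == query_branch:
            plan.append(name)   # selection is implied by the fragment definition
        # otherwise the selection over this fragment is empty and the fragment is pruned
    return plan

print(plan_selection("Hillside"))   # ['account1'] - only the Hillside site answers the query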

Simple Join Processing


-Consider the following relational algebra expression in which the three relations are neither
replicated nor fragmented
account ⋈ depositor ⋈ branch
-account is stored at site S1
-depositor at S2
-branch at S3
-For a query issued at site SI, the system needs to produce the result at site SI

Possible Query Processing Strategies


Ship copies of all three relations to site SI and choose a strategy for processing the entire query
locally at site SI.
Ship a copy of the account relation to site S2 and compute temp1 = account ⋈ depositor at S2.
Ship temp1 from S2 to S3, and compute temp2 = temp1 ⋈ branch at S3. Ship the result temp2 to SI.
Devise similar strategies, exchanging the roles of S1, S2, S3.
Must consider following factors:
 amount of data being shipped
 cost of transmitting a data block between sites
 relative processing speed at each site

Semijoin Strategy
Let r1 be a relation with schema R1 stored at site S1
Let r2 be a relation with schema R2 stored at site S2
Evaluate the expression r1 ⋈ r2 and obtain the result at S1.
1. Compute temp1 ← ∏R1 ∩ R2 (r1) at S1.

2. Ship temp1 from S1 to S2.
3. Compute temp2 ← r2 ⋈ temp1 at S2
4. Ship temp2 from S2 to S1.
5. Compute r1 ⋈ temp2 at S1. This is the same as r1 ⋈ r2.

Formal Definition:

The semijoin of r1 with r2 is denoted by:

r1 ⋉ r2

It is defined by:

∏R1 (r1 ⋈ r2)

Thus, r1 ⋉ r2 selects those tuples of r1 that contributed to r1 ⋈ r2.

In step 3 above, temp2 = r2 ⋉ r1.

For joins of several relations, the above strategy can be extended to a series of semijoin steps.
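
The five steps can be traced with a small Python sketch (illustrative only; the relations, helper functions, and attribute names are assumptions made for the example):

# Sketch of the five semijoin steps, with relations as lists of dicts and
# join attributes computed as the intersection of the two schemas.

def project(rel, attrs):
    seen, out = set(), []
    for t in rel:
        key = tuple(t[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append({a: t[a] for a in attrs})
    return out

def natural_join(r, s):
    common = sorted(set(r[0]) & set(s[0])) if r and s else []
    return [{**a, **b} for a in r for b in s
            if all(a[c] == b[c] for c in common)]

r1 = [{"account_number": "A-305", "branch_name": "Hillside"},     # stored at site S1
      {"account_number": "A-177", "branch_name": "Valleyview"}]
r2 = [{"account_number": "A-305", "customer_name": "Lowman"}]     # stored at site S2

join_attrs = sorted(set(r1[0]) & set(r2[0]))       # R1 ∩ R2
temp1 = project(r1, join_attrs)                    # step 1, computed at S1
#            step 2: ship temp1 (usually much smaller than r1) to S2
temp2 = natural_join(r2, temp1)                    # step 3: r2 ⋉ r1, computed at S2
#            step 4: ship temp2 back to S1
result = natural_join(r1, temp2)                   # step 5: equals r1 ⋈ r2
print(result)  # [{'account_number': 'A-305', 'branch_name': 'Hillside', 'customer_name': 'Lowman'}]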

Join Strategies that Exploit Parallelism:


Consider r1 ⋈ r2 ⋈ r3 ⋈ r4, where relation ri is stored at site Si. The result must be presented
at site S1.
r1 is shipped to S2 and r1 ⋈ r2 is computed at S2; simultaneously r3 is shipped to S4 and
r3 ⋈ r4 is computed at S4.
S2 ships tuples of (r1 ⋈ r2) to S1 as they are produced;
S4 ships tuples of (r3 ⋈ r4) to S1.
Once tuples of (r1 ⋈ r2) and (r3 ⋈ r4) arrive at S1, (r1 ⋈ r2) ⋈ (r3 ⋈ r4) is computed in parallel
with the computation of (r1 ⋈ r2) at S2 and the computation of (r3 ⋈ r4) at S4.
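
A minimal sketch of this pipelined, parallel evaluation in Python (the thread pool merely stands in for the two remote sites S2 and S4; the relations and attributes are invented for the example):

# Compute (r1 ⋈ r2) and (r3 ⋈ r4) concurrently, then join the two intermediate results at "S1".
from concurrent.futures import ThreadPoolExecutor

def natural_join(r, s):
    common = sorted(set(r[0]) & set(s[0])) if r and s else []
    return [{**a, **b} for a in r for b in s if all(a[c] == b[c] for c in common)]

r1 = [{"a": 1, "b": 10}]
r2 = [{"b": 10, "c": 100}]
r3 = [{"c": 100, "d": 7}]
r4 = [{"d": 7, "e": 42}]

with ThreadPoolExecutor(max_workers=2) as pool:
    f12 = pool.submit(natural_join, r1, r2)              # "at S2"
    f34 = pool.submit(natural_join, r3, r4)              # "at S4"
    result = natural_join(f12.result(), f34.result())    # "at S1"

print(result)  # [{'a': 1, 'b': 10, 'c': 100, 'd': 7, 'e': 42}]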

Heterogeneous Distributed Databases


Many database applications require data from a variety of preexisting databases located in a
heterogeneous collection of hardware and software platforms
Data models may differ (hierarchical, relational, etc.)
Transaction commit protocols may be incompatible
Concurrency control may be based on different techniques (locking, timestamping, etc.)
System-level details almost certainly are totally incompatible.
A multidatabase system is a software layer on top of existing database systems, which is
designed to manipulate information in heterogeneous databases
 Creates an illusion of logical database integration without any physical database
integration

Advantages
Preservation of investment in existing
 hardware
 system software
 Applications
Local autonomy and administrative control

Allows use of special-purpose DBMSs
Step towards a unified homogeneous DBMS
 Full integration into a homogeneous DBMS faces
 Technical difficulties and cost of conversion
 Organizational/political difficulties
– Organizations do not want to give up control on their data
– Local databases wish to retain a great deal of autonomy
Unified View of Data
Agreement on a common data model
 Typically the relational model
Agreement on a common conceptual schema
 Different names for same relation/attribute
 Same relation/attribute name means different things
Agreement on a single representation of shared data
 E.g. data types, precision,
 Character sets
 ASCII vs EBCDIC
 Sort order variations
Agreement on units of measure
Variations in names
 E.g. Köln vs Cologne, Mumbai vs Bombay
Query Processing
Several issues in query processing in a heterogeneous database
Schema translation
 Write a wrapper for each data source to translate data to a global schema
 Wrappers must also translate updates on global schema to updates on local
schema
Limited query capabilities
 Some data sources allow only restricted forms of selections
 E.g. web forms, flat file data sources
 Queries have to be broken up and processed partly at the source and partly at a
different site
Removal of duplicate information when sites have overlapping information
 Decide at which sites to execute the query
Global query optimization
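
The wrapper idea can be sketched as follows (an illustration under assumed names, not a real multidatabase API): each wrapper maps its source's local attribute names onto a common global schema, and a mediator sends the same global query to every wrapper and unions the answers.

# Sketch: one wrapper per data source translating local attribute names to a global schema.

class Wrapper:
    def __init__(self, local_rows, mapping):
        self.local_rows = local_rows
        self.mapping = mapping          # global attribute name -> local attribute name

    def select(self, attrs, predicate=lambda row: True):
        out = []
        for local in self.local_rows:
            global_row = {g: local[l] for g, l in self.mapping.items()}
            if predicate(global_row):
                out.append({a: global_row[a] for a in attrs})
        return out

# Two heterogeneous sources exposing the same information under different names.
src_a = Wrapper([{"cust": "Kahn", "city": "Mumbai"}],
                {"customer_name": "cust", "city": "city"})
src_b = Wrapper([{"name": "Green", "town": "Cologne"}],
                {"customer_name": "name", "city": "town"})

# Mediator: send the global query to every wrapper and union the answers.
query = lambda w: w.select(["customer_name", "city"])
print(query(src_a) + query(src_b))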

13.3. Distributed Transactions

Transaction may access data at several sites.


Each site has a local transaction manager responsible for:
 Maintaining a log for recovery purposes
 Participating in coordinating the concurrent execution of the transactions
executing at that site.
Each site has a transaction coordinator, which is responsible for:
 Starting the execution of transactions that originate at the site.
 Distributing subtransactions at appropriate sites for execution.
 Coordinating the termination of each transaction that originates at the site, which
may result in the transaction being committed at all sites or aborted at all sites.

Transaction System Architecture


System Failure Modes


Failures unique to distributed systems:
 Failure of a site.
 Loss of messages
 Handled by network transmission control protocols such as TCP/IP
 Failure of a communication link
 Handled by network protocols, by routing messages via alternative links
 Network partition
 A network is said to be partitioned when it has been split into two or more
subsystems that lack any connection between them
Note: a subsystem may consist of a single node
Network partitioning and site failures are generally indistinguishable.

13.4. Commit Protocols

Commit protocols are used to ensure atomicity across sites


 a transaction which executes at multiple sites must either be committed at all the
sites, or aborted at all the sites.
 not acceptable to have a transaction committed at one site and aborted at another
The two-phase commit (2PC) protocol is widely used.
The three-phase commit (3PC) protocol is more complicated and more expensive, but avoids
some drawbacks of the two-phase commit protocol.

Two Phase Commit Protocol (2PC)

Assumes fail-stop model – failed sites simply stop working, and do not cause any other harm,
such as sending incorrect messages to other sites.
Execution of the protocol is initiated by the coordinator after the last step of the transaction
has been reached.
The protocol involves all the local sites at which the transaction executed
Let T be a transaction initiated at site Si, and let the transaction coordinator at Si be Ci

Phase 1: Obtaining a Decision

Coordinator asks all participants to prepare to commit transaction T.


 Ci adds the records <prepare T> to the log and forces log to stable storage

 sends prepare T messages to all sites at which T executed
Upon receiving message, transaction manager at site determines if it can commit the
transaction
 if not, add a record <no T> to the log and send abort T message to Ci
 if the transaction can be committed, then:
 add the record <ready T> to the log
 force all records for T to stable storage
 send ready T message to Ci
Phase 2: Recording the Decision
T can be committed if Ci received a ready T message from all the participating sites; otherwise
T must be aborted.
Coordinator adds a decision record, <commit T> or <abort T>, to the log and forces the record
onto stable storage. Once the record reaches stable storage it is irrevocable (even if failures occur).
Coordinator sends a message to each participant informing it of the decision (commit or abort)
Participants take appropriate action locally.
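
The two rounds of messages can be summarized in a short Python sketch (illustrative only: logging, message delivery, and failures are simulated with in-memory structures, and the can_commit flag stands in for each participant's local decision):

def two_phase_commit(participants):
    """participants: dict site -> can_commit flag. Returns the decision and the logs."""
    logs = {"coordinator": ["<prepare T>"]}          # forced to stable storage first

    # Phase 1: coordinator sends <prepare T>; each site forces a log record and votes.
    votes = {}
    for site, can_commit in participants.items():
        logs[site] = ["<ready T>" if can_commit else "<no T>"]
        votes[site] = "ready T" if can_commit else "abort T"

    # Phase 2: commit only if every participant voted ready; the decision record
    # is forced to the coordinator's log and is irrevocable from then on.
    decision = "commit" if all(v == "ready T" for v in votes.values()) else "abort"
    logs["coordinator"].append(f"<{decision} T>")
    for site in participants:
        logs[site].append(f"<{decision} T>")         # participants act on the decision
    return decision, logs

print(two_phase_commit({"S1": True, "S2": True})[0])    # commit
print(two_phase_commit({"S1": True, "S2": False})[0])   # abort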

Handling of Failures - Site Failure


When site Si recovers, it examines its log to determine the fate of transactions active at the time of
the failure.
-Log contain <commit T> record: site executes redo (T)
-Log contains <abort T> record: site executes undo (T)
-Log contains <ready T> record: site must consult Ci to determine the fate of T.
 If T committed, redo (T)
 If T aborted, undo (T)
-The log contains no control records concerning T: this implies that Si failed before responding
to the prepare T message from Ci
 since the failure of Si precludes the sending of such a response, Ci must abort T
 Si must execute undo (T)
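
The recovery cases above translate almost directly into code. The sketch below is illustrative only; consult_coordinator is a hypothetical stand-in for asking Ci (or another active site) about T's fate when only a <ready T> record is found.

def recover(site_log, consult_coordinator):
    """Decide what a recovering site does for transaction T based on its own log."""
    if "<commit T>" in site_log:
        return "redo(T)"
    if "<abort T>" in site_log:
        return "undo(T)"
    if "<ready T>" in site_log:
        # Fate unknown locally: ask the coordinator (or another active site).
        return "redo(T)" if consult_coordinator() == "commit" else "undo(T)"
    # No control records: the site failed before replying to <prepare T>,
    # so the coordinator must have aborted T.
    return "undo(T)"

print(recover(["<ready T>"], consult_coordinator=lambda: "commit"))   # redo(T)
print(recover([], consult_coordinator=lambda: "abort"))               # undo(T)
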
Handling of Failures - Coordinator Failure
-If coordinator fails while the commit protocol for T is executing then participating sites must
decide on T’s fate:
1. If an active site contains a <commit T> record in its log, then T must be
committed.
2. If an active site contains an <abort T> record in its log, then T must be aborted.
3. If some active participating site does not contain a <ready T> record in its log, then
the failed coordinator Ci cannot have decided to commit T. Can therefore abort T.
4. If none of the above cases holds, then all active sites must have a <ready T> record
in their logs, but no additional control records (such as <abort T> or <commit T>).
In this case active sites must wait for Ci to recover, to find the decision.
-Blocking problem : active sites may have to wait for failed coordinator to recover.

13.5. Concurrency Control in Distributed Databases

-Modify concurrency control schemes for use in distributed environment.


-We assume that each site participates in the execution of a commit protocol to ensure global
transaction atomicity.
-We assume all replicas of any item are updated

Single-Lock-Manager Approach
-System maintains a single lock manager that resides in a single chosen site, say Si
-When a transaction needs to lock a data item, it sends a lock request to Si and lock manager
determines whether the lock can be granted immediately
 If yes, lock manager sends a message to the site which initiated the request

 If no, request is delayed until it can be granted, at which time a message is sent to
the initiating site

-The transaction can read the data item from any one of the sites at which a replica of the
data item resides.
-Writes must be performed on all replicas of a data item
-Advantages of scheme:
 Simple implementation
 Simple deadlock handling
-Disadvantages of scheme are:
 Bottleneck: lock manager site becomes a bottleneck
 Vulnerability: system is vulnerable to lock manager site failure.
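
A minimal sketch of the single-lock-manager idea (exclusive locks only, no lock modes; the class and message strings are assumptions made for the illustration):

# All sites send lock requests to one chosen site; requests that cannot be granted are queued.

class SingleLockManager:
    def __init__(self):
        self.holders = {}    # data item -> site currently holding the lock
        self.waiting = {}    # data item -> list of sites waiting for it

    def lock(self, item, site):
        if item not in self.holders:
            self.holders[item] = site
            return "granted"                       # message sent back to the requesting site
        self.waiting.setdefault(item, []).append(site)
        return "delayed"                           # granted later, when the holder unlocks

    def unlock(self, item, site):
        assert self.holders.get(item) == site
        queue = self.waiting.get(item, [])
        self.holders[item] = queue.pop(0) if queue else None
        if self.holders[item] is None:
            del self.holders[item]

mgr = SingleLockManager()
print(mgr.lock("Q", "S2"))   # granted
print(mgr.lock("Q", "S3"))   # delayed
mgr.unlock("Q", "S2")        # lock passes to the first waiter
print(mgr.holders["Q"])      # S3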

Distributed Lock Manager

-In this approach, functionality of locking is implemented by lock managers at each site
 Lock managers control access to local data items
 But special protocols may be used for replicas
-Advantage: work is distributed and can be made robust to failures
-Disadvantage: deadlock detection is more complicated
 Lock managers cooperate for deadlock detection
 More on this later
-Several variants of this approach
 Primary copy
 Majority protocol
 Biased protocol
 Quorum consensus

Primary Copy
-Choose one replica of data item to be the primary copy.
 Site containing the replica is called the primary site for that data item
 Different data items can have different primary sites
-When a transaction needs to lock a data item Q, it requests a lock at the primary site of Q.
 Implicitly gets lock on all replicas of the data item
- Benefit
 Concurrency control for replicated data handled similarly to unreplicated data -
simple implementation.
-Drawback
 If the primary site of Q fails, Q is inaccessible even though other sites containing a
replica may be accessible.

Majority Protocol

-Local lock manager at each site administers lock and unlock requests for data items stored
at that site.
-When a transaction wishes to lock an unreplicated data item Q residing at site Si, a message
is sent to Si ‘s lock manager.
 If Q is locked in an incompatible mode, then the request is delayed until it can be
granted.
 When the lock request can be granted, the lock manager sends a message back to
the initiator indicating that the lock request has been granted.
-In case of replicated data
 If Q is replicated at n sites, then a lock request message must be sent to more than
half of the n sites at which Q is stored.
 The transaction does not operate on Q until it has obtained a lock on a majority of
the replicas of Q.
 When writing the data item, transaction performs writes on all replicas.

-Benefit
 Can be used even when some sites are unavailable
-Drawback
 Requires 2(n/2 + 1) messages for handling lock requests, and (n/2 + 1) messages
for handling unlock requests.
 Potential for deadlock even with single item - e.g., each of 3 transactions may have
locks on 1/3rd of the replicas of a data.

Biased Protocol
-Local lock manager at each site as in majority protocol, however, requests for shared locks
are handled differently than requests for exclusive locks.
-Shared locks. When a transaction needs to lock data item Q, it simply requests a lock on Q
from the lock manager at one site containing a replica of Q.
-Exclusive locks. When transaction needs to lock data item Q, it requests a lock on Q from
the lock manager at all sites containing a replica of Q.
-Advantage - imposes less overhead on read operations.
-Disadvantage - additional overhead on writes
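
The difference between the two protocols is easiest to see in the number of replica sites that must grant a lock. The sketch below (illustrative, with n denoting the number of sites holding a replica of Q) computes the read and write quorums for each protocol:

def majority_quorum(n):
    # Majority protocol: both reads and writes must lock more than half of the n replicas.
    need = n // 2 + 1
    return {"read": need, "write": need}

def biased_quorum(n):
    # Biased protocol: a shared lock needs one replica, an exclusive lock needs all replicas.
    return {"read": 1, "write": n}

for n in (3, 5):
    print(n, "replicas  majority:", majority_quorum(n), " biased:", biased_quorum(n))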

Deadlock Handling
Consider the following two transactions and history, with item X and transaction T1 at site 1, and
item Y and transaction T2 at site 2:

T1: write(X); write(Y)                 T2: write(Y); write(X)

At site 1: T1 obtains an X-lock on X and writes X; T2 then waits for an X-lock on X.
At site 2: T2 obtains an X-lock on Y and writes Y; T1 then waits for an X-lock on Y.

Result: deadlock which cannot be detected locally at either site

Centralized Approach
-A global wait-for graph is constructed and maintained in a single site; the deadlock-detection
coordinator
 Real graph: Real, but unknown, state of the system.
 Constructed graph: Approximation generated by the controller during the execution
of its algorithm .

-the global wait-for graph can be constructed when:
 a new edge is inserted in or removed from one of the local wait-for graphs.
 a number of changes have occurred in a local wait-for graph.
 the coordinator needs to invoke cycle-detection.
-If the coordinator finds a cycle, it selects a victim and notifies all sites. The sites roll back the
victim transaction.
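
A sketch of the coordinator's job in Python (illustrative only): union the edges reported by the local wait-for graphs and run an ordinary depth-first cycle search on the result. The transaction names follow the deadlock example above.

# Edge Ti -> Tj means Ti is waiting for a lock held by Tj.

def find_cycle(edges):
    graph = {}
    for waiter, holder in edges:
        graph.setdefault(waiter, set()).add(holder)
        graph.setdefault(holder, set())
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node, path):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:
                return path[path.index(nxt):] + [nxt]    # cycle found
            if color[nxt] == WHITE:
                cycle = visit(nxt, path + [nxt])
                if cycle:
                    return cycle
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node, [node])
            if cycle:
                return cycle
    return None

# Local graphs from the example: at site 1 T2 waits for T1 (item X), at site 2 T1 waits for T2
# (item Y). Neither local graph has a cycle on its own; their union does.
local_site1 = [("T2", "T1")]
local_site2 = [("T1", "T2")]
print(find_cycle(local_site1))                  # None
print(find_cycle(local_site1 + local_site2))    # ['T2', 'T1', 'T2'] -> select a victim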

Local and Global Wait-For Graphs
(figure: the local wait-for graph at each site, and the global wait-for graph formed by their union)

13.6. Availability


-High availability: Time for which system is not fully usable should be extremely low (e.g.
99.99% availability)
-Robustness: ability of system to function despite failures of components
-Failures are more likely in large distributed systems
-To be robust, a distributed system must
 Detect failures
 Reconfigure the system so computation may continue
 Recovery/reintegration when a site or link is repaired
-Failure detection: distinguishing link failure from site failure is hard
 (partial) solution: have multiple links; if several links fail simultaneously, a site failure is the likely cause
