DBMS by Korth
www.arihantinfo.com
1
Unit 3:
Unit 4:
Unit 5:
Unit 6:
7.1 Introduction of SQL
7.2 DDL statements
7.3 DML statements
7.4 View definitions
7.5 Constraints and triggers
7.6 Keys and foreign keys
7.7 Constraints on attributes and tuples
7.8 Modification of constraints
7.9 Cursors
7.10 Dynamic SQL
Unit 8: Relational Algebra
8.1 Basics of relational algebra
8.2 Set operations on relations
8.3 Extended operators of relational algebra
8.4 Constraints on relations
8.5 Self test
8.6 Summary
10.9 Summary
Unit 11: Transaction Management
11.1 Introduction of transaction management
11.2 Serializability and recoverability
11.3 View serializability
11.4 Resolving deadlocks
11.5 Distributed databases
11.6 Distributed commit
11.7 Distributed locking
11.8 Summary
UNIT 1
INTRODUCTION TO DATABASE SYSTEM
1.1 INTRODUCTION
Database management is an important aspect of data processing. It involves several data models
that have evolved into different DBMS software packages. These packages demand knowledge of
certain disciplines and procedures to use them effectively in data processing applications.
We need to understand the relevance and scope of Database in the Data processing area. This we do
by first understanding the properties and characteristics of data and the nature of data organization.
Data structure can be defined as specification of data. Different data structures like array, stack,
queue, tree and graph are used to implement data organization in main memory. Several strategies
are used to support the organization of data in secondary memory.
A database is a collection of related information stored in such a manner that it is available to many
users for different purposes. The content of a database is obtained by combining data from all the
different sources in an organization, so that data are available to all users and redundant data can
be eliminated or at least minimized. The DBMS helps create an environment in which end users have
better access to more and better-managed data than they did before the DBMS became the data
management standard.
A database can handle business inventory and accounting information in its files to prepare summaries,
estimates, and other reports. There can be a database which stores newspaper articles, magazines,
books, and comics. There is already a well-defined market for specific information for highly selected
groups of users on almost all subjects. The database management system is the major software
component of a database system. Some commercially available DBMS are INGRES, ORACLE, and
Sybase. A database management system, therefore, is a combination of hardware and software that
can be used to set up and monitor a database, and can manage the updating and retrieval of data
that has been stored in it. Most database management systems have the following
facilities/capabilities:
- Creation of a file, addition of data, deletion of data, modification of data; creation, addition
and deletion of entire files.
- The data stored can be sorted or indexed at the user's discretion and direction.
- Various reports can be produced from the system. These may be either standardized reports or
reports generated according to a specific user definition.
- Mathematical functions can be performed, and the data stored in the database can be manipulated
with these functions to perform the desired calculations.
The DBMS interprets and processes users' requests to retrieve information from a database. The
following figure shows that a DBMS serves as an interface for requests in several forms: they may be
keyed directly from a terminal, or coded as high-level language programs to be submitted for
interactive or batch processing. In most cases, a query request will have to penetrate several layers of
software in the DBMS and operating system before the physical database can be accessed.
1.2 OBJECTIVES
After going through this unit, you should be able to: appreciate the limitations of the traditional
approach to application system development; give reasons why the database approach is now being
increasingly adopted; discuss different views of data; list the components of a database
management system; enumerate the features/capabilities of a database management system; and list
several advantages and disadvantages of DBMS.
1.3 TRADITIONAL FILE ORIENTED APPROACH
The traditional file-oriented approach to information processing gives each application a separate
master file and its own set of personal files. An organization also needs a flow of information across
these applications, and this requires sharing of data, which is significantly lacking in the traditional
approach. One major limitation of such a file-based approach is that the programs become
dependent on the files and the files become dependent upon the programs.
Disadvantages
Data Redundancy: The same piece of information may be stored in two or more files. For
example, the particulars of an individual who may be a customer or client may be stored in
two or more files. Some of this information may be changing, such as the address, the
payment made, etc. It is therefore quite possible that while the address in the master file for
one application has been updated, the address in the master file for another application may
not have been. It may not even be easy to find out in how many files a repeating item
such as the name occurs.
Lack of Flexibility: In view of the strong coupling between the program and the data, most
information retrieval possibilities would be limited to well-anticipated and pre-determined
requests for data; the system would normally be capable of producing only scheduled reports and
queries which it has been programmed to create.
1.4. MOTIVATION FOR DATABASE APPROACH
Having pointed out some difficulties that arise in a straightforward file-oriented approach towards
information system development. The work in the organization may not require significant sharing of
data or complex access. In other words the data and the way it is used in the functioning of the
organization are not appropriate to database processing. Apart from needing a more powerful
hardware platform, the software for database management systems is also quite expensive. This
means that a significant extra cost has to be incurred by an organization if it wants to adopt this
approach.
The advantage gained by the possibility of sharing data with others also carries with it the risk of
unauthorized access to data. This may range from violation of office procedures, to violation of
privacy rights of information, to downright theft. Organizations, therefore, have to be ready to cope
with additional managerial problems.
A database management processing system is complex and it could lead to a more inefficient system
than the equivalent file-based one.
The use of the database, and the possibility of its being shared, will therefore affect many departments
within the organization. If the integrity of the data is not maintained, it is possible that one relevant
piece of data could have been used by many programs in different applications by different users
without their being aware of it. The impact of this may therefore be very widespread. Since data
can be input from a variety of sources, control over the quality of data becomes very difficult to
implement.
However, for most large organizations, the difficulties in moving over to a database approach are still
worth getting over in view of the advantages that are gained, namely: avoidance of data duplication,
sharing of data by different programs, greater flexibility, and data independence.
1.5 DATABASE BASICS
Since the DBMS of an organization will in some sense reflect the nature of activities in the
organization, some familiarity with the basic concepts, principles and terms used in the field are
important.
Data-items: The term data item is the word for what has traditionally been called the field in
data processing, and is the smallest unit of data that has meaning to its users. The phrase
data element or elementary item is also sometimes used. Although the data item may be
treated as a molecule of the database, data items are grouped together to form aggregates
described by various names. For example, the data record is used to refer to a group of data
items, and a program usually reads or writes whole records. The data items could
occasionally be further broken down into what may be called an atomic level for processing
purposes.
Entities and Attributes: An entity in the real world may be a tangible object such as an
employee, a component in an inventory, or a place; or it may be intangible, such as an
event, a job description, an identification number, or an abstract construct. All such items
about which relevant information is stored in the database are called entities. The qualities of
the entity that we store as information are called its attributes. An attribute may be
expressed as a number or as text. It may even be a scanned picture, a sound sequence, or
a moving picture, as is now possible in some visual and multimedia databases.
Data processing normally concerns itself with a collection of similar entities and records
information about the same attributes of each of them. In the traditional approach, a
programmer usually maintains a record about each entity and a data item in each record
relates to each attribute. Similar records are grouped into files and such a 2-dimensional
array is sometimes referred to as a flat file.
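The idea of a flat file — similar records, each carrying the same attributes — can be sketched in a few lines. The employee names and values below are purely illustrative:

```python
# A "flat file": a collection of similar records, one data item per attribute.
# The records and values are invented for illustration.
employees = [
    {"name": "A. Rao",  "age": 34, "dept": "Sales"},
    {"name": "B. Khan", "age": 41, "dept": "Accounts"},
    {"name": "C. Das",  "age": 29, "dept": "Sales"},
]

# Every record carries the same attributes, so the file is effectively a
# 2-dimensional array: one row per entity, one column per attribute.
assert all(rec.keys() == employees[0].keys() for rec in employees)
```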
Logical and Physical Data: One of the key features of the database approach is that it brings about
a distinction between the logical and the physical structures of the data. The term logical
structure refers to the way the programmers see the data, and the physical structure refers to the
way data are actually recorded on the storage medium. Even in the early stages of records
stored on tape, the length of the inter-record gap required that many logical records be
grouped into one physical record to save storage space on tape. It was the software which
separated them when used in an application program, and combined them again before
writing back on tape. In today's systems the complexities are even greater, and as will be seen
when one refers to distributed databases, some records may physically be located at
significantly remote places.
Schema and Subschema: Having seen that the database approach focuses on the logical
organization of data and decouples it from the physical representation, it is useful to have a
term to describe the logical database description. A schema is a logical database description
and is drawn as a chart of the types of data that are used. It gives the names of the entities
and attributes, and specifies the relationships between them. It is a framework into which the
values of the data items can be fitted. Like an information display system, such as one giving
arrival and departure times at airports and railway stations, the schema will remain the same
though the values displayed in the system will change from time to time. The relationships
that are specified between the different entities occurring in the schema may be one-to-one,
one-to-many, many-to-many, or conditional.
The term schema is used to mean an overall chart of all the data-item types and record types
stored in a database. The term subschema refers to the same kind of view but for the data-item
types and record types which a particular user uses in a particular application. Therefore, many
different subschemas can be derived from one schema.
Data Dictionary: It holds detailed information about the different structures and data types:
the details of the logical structure that are mapped into the storage structures, details of
relationships between data items, details of all users' privileges and access rights, and
performance and resource details.
A DBMS is a collection of interrelated files and a set of programs that allow several users to access
and modify these files. A major purpose of a database system is to provide users with an abstract
view of the data. However, in order for the system to be usable, data must be retrieved efficiently.
The concern for efficiency leads to the design of complex data structures for the representation of
data in the database. Levels of abstraction at which the database may be viewed are defined: the
external (logical) view, the conceptual view, and the internal (physical) view.
External view: This is the highest level of abstraction, as seen by a user. This level of
abstraction describes only a part of the entire database.
Conceptual view: This is the next level of abstraction, and is the sum total of the users'
views. At this level we consider what data are actually stored in the database. The conceptual
level contains information about the entire database in terms of a small number of relatively
simple structures.
Internal level: The lowest level of abstraction, at which one describes how the data are
physically stored. The interrelationship of the three levels of abstraction is illustrated in figure
2.
Most high-level programming languages support the notion of a record or structure type. For
example, in the C language we declare a structure (record) as follows:
struct Emp {
    char name[30];      /* employee name  */
    char address[100];  /* postal address */
};
This defines a new record called Emp with two fields. Each field has a name and data type associated
with it.
In an Insurance organization, we may have several such record types, including among others:
-Customer with fields name and Salary
-Premium paid and Due amount at what date
-Insurance agent name and salary + Commission
At the internal level, a customer, premium account, or employee (insurance agent) record can be
described as a sequence of consecutive bytes. At the conceptual level each such record is described
by a type definition, as illustrated above; the interrelation among these record types is also defined,
along with the rights or privileges assigned to individual customers or end-users. Finally, at the
external level, we define several views of the database. For example, for preparing insurance
cheques only the Customer_details information is required; one does not need to access information
about customer accounts. Similarly, tellers can access only account information. They cannot access
information concerning the premium paid or amount received.
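Such external-level views can be sketched with SQL view definitions. The sketch below uses Python's built-in sqlite3 module as a convenient SQL engine; the table and column names are invented for illustration, loosely following the insurance example above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Conceptual level: one record type holding everything (columns illustrative).
CREATE TABLE customer (
    id INTEGER PRIMARY KEY, name TEXT, address TEXT,
    premium_paid REAL, due_amount REAL
);

-- External level: each view exposes only the part of the schema a user needs.
CREATE VIEW customer_details AS          -- for preparing cheques
    SELECT id, name, address FROM customer;
CREATE VIEW account_info AS              -- for tellers
    SELECT id, premium_paid, due_amount FROM customer;
""")
con.execute("INSERT INTO customer VALUES (1, 'A. Rao', 'Delhi', 500.0, 120.0)")

# A user working through customer_details never sees premium information.
cols = [d[0] for d in con.execute("SELECT * FROM customer_details").description]
print(cols)  # ['id', 'name', 'address']
```

Each view is an external schema derived from the one conceptual schema, exactly as the text describes.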
1.7 THE THREE LEVEL ARCHITECTURE OF DBMS
A database management system that provides these three levels of data is said to follow a three-level
architecture, as shown in the figure. These three levels are the external level, the conceptual level,
and the internal level.
A schema describes the view at each of these levels. A schema as mentioned earlier is an outline or a
plan that describes the records and relationships existing in the view. The schema also describes the
way in which entities at one level of abstraction can be mapped to the next level. The overall design of
the database is called the database schema. A database schema includes such information as:
- Characteristics of data items such as entities and attributes
- Format for storage representation
- Integrity parameters such as physical authorization and backup policies
- Logical structure and relationships among those data items
The concept of a database schema corresponds to programming language notion of type definition. A
variable of a given type has a particular value at a given instant in time. The concept of the value of a
variable in programming languages corresponds to the concept of an instance of a database schema.
Since each view is defined by a schema, there exist several schemas in the database, and these
schemas are partitioned following the three levels of data abstraction or views. At the lowest level we
have the physical schema; at the intermediate level we have the conceptual schema; while at the
highest level we have a subschema. In general, a database system supports one physical schema,
one conceptual schema, and several subschemas.
1.7.1 External Level or Subschema
The external level is at the highest level of database abstraction where only those portions of the
database of concern to a user or application program are included. Any number of user views may
exist for a given global or conceptual view.
Each external view is described by means of a schema called an external schema or subschema. The
external schema consists of the definition of the logical records and the relationships in the external
view. The external schema also contains the method of deriving the objects in the external view from
the objects in the conceptual view. The objects include entities, attributes, and relationships.
1.7.2 Conceptual Level or Conceptual Schema
One conceptual view represents the entire database. The conceptual schema defines this conceptual
view. It describes all the records and relationships included in the conceptual view and, therefore, in
the database. There is only one conceptual schema per database. This schema also contains the
method of deriving the objects in the conceptual view from the objects in the internal view.
The description of data at this level is in a format independent of its physical representation. It also
includes features that specify the checks to retain data consistency and integrity.
1.7.3 Internal Level or Physical Schema
It indicates how the data will be stored and describes the data structures and access methods to be
used by the database. The internal schema, which contains the definition of the stored record, the
method of representing the data fields, expresses the internal view and the access aids used.
1.7.4 Mapping between different Levels
www.arihantinfo.com
12
Two mappings are required in a database system with three different views. A mapping between the
external and conceptual level gives the correspondence among the records and the relationships of
the external and conceptual levels.
a) EXTERNAL to CONCEPTUAL: Determines how the conceptual record is viewed by the user.
b) INTERNAL to CONCEPTUAL: Enables the correspondence between the conceptual and internal
levels. It represents how the conceptual record is represented in storage.
An internal record is a record at the internal level, not necessarily a stored record on a physical
storage device. The internal record of figure 3 may be split up into two or more physical records. The
physical database is the data that is stored on secondary storage devices. It is made up of records
with certain data structures and organized in files. Consequently, there is an additional mapping
from the internal record to one or more stored records on secondary storage devices.
1.8 DATABASE MANAGEMENT SYSTEM FACILITIES
The main facilities supported by the DBMS are:
- The data definition facility or data definition language (DDL).
- The data manipulation facility or data manipulation language (DML).
- The data query facility or data query language (DQL).
- The data control facility or data control language (DCL).
- The transaction control facility or transaction control language (TCL).
1.8.1 Data Definition Language
Data Definition Language is a set of SQL commands used to create, modify and delete database
structures (not data). These commands wouldn't normally be used by a general user, who should be
accessing the database via an application. They are normally used by the DBA (to a limited extent), a
database designer or application developer. These statements are immediate; they are not susceptible
to ROLLBACK commands. You should also note that if you have executed several DML updates then
issuing any DDL command will COMMIT all the updates as every DDL command implicitly issues a
COMMIT command to the database. Anybody using DDL must have the CREATE object privilege and
a Tablespace area in which to create objects.
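As a concrete sketch, the statements below exercise the three basic DDL operations. Python's sqlite3 module is used only as a convenient SQL engine, and the table and column names are invented; note that the implicit COMMIT on DDL described above is Oracle behavior, not something SQLite does:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# DDL manipulates structures, not data. All names here are illustrative.
cur.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("ALTER TABLE employee ADD COLUMN salary REAL")
cur.execute("DROP TABLE IF EXISTS obsolete_log")

# The new structure (not any data) is now recorded in the system catalogue:
ddl = cur.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'employee'").fetchone()[0]
print(ddl)
```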
1.8.2 Data Manipulation Language
DML is a language that enables users to access or manipulate data as organized by the appropriate
data model. Data manipulation involves retrieval of data from the database, insertion of new data
into the database, and deletion or modification of existing data. The first of these data manipulation
operations is called a query. A query is a statement in the DML that requests the retrieval of data
from the database. The DML provides commands to select and retrieve data from the database.
Commands are also provided to insert, update, and delete records.
There are basically two types of DML:
Procedural: which requires a user to specify what data is needed and how to get it
Nonprocedural: which requires a user to specify what data is needed without specifying how to get it
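The distinction can be sketched by computing the same answer both ways (the sample data is invented): a nonprocedural SQL query states only what is wanted, while the procedural loop spells out how to scan, test, and collect.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (name TEXT, salary REAL)")
con.executemany("INSERT INTO emp VALUES (?, ?)",
                [("Asha", 900.0), ("Ben", 400.0), ("Chitra", 1200.0)])

# Nonprocedural: say WHAT is wanted; the DBMS decides how to get it.
declarative = {row[0] for row in
               con.execute("SELECT name FROM emp WHERE salary > 500")}

# Procedural (simulated): spell out HOW -- scan every record, test, collect.
procedural = set()
for name, salary in con.execute("SELECT name, salary FROM emp"):
    if salary > 500:
        procedural.add(name)

assert declarative == procedural  # same result, different levels of detail
```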
The database user retrieves data by formulating a query in the data manipulation language provided
with the database. The query processor is used to interpret the online user's query and convert it into
an efficient series of operations in a form capable of being sent to the data manager for execution.
The query processor uses the data dictionary to find the structure of the relevant portion of the
database and uses this information in modifying the query and preparing an optimal plan to access
the database.
1.9.6 Database Administrator
Data administration is a high-level function that is responsible for the overall management of data
resources in an organization, including maintaining corporate-wide data definitions and standards.
Database administration is a technical function that is responsible for physical database design and
for dealing with technical issues such as security enforcement, database performance, backup, and
recovery. The person having such control over the system is called the database administrator (DBA).
The DBA administers the
three levels of the database and, in consultation with the overall user community, sets up the
definition of the global view or conceptual level of the database. The DBA further specifies the
external view of the various users and applications and is responsible for the definition and
implementation of the internal level, including the storage structure and access methods to be used
for the optimum performance of the DBMS. Changes to any of the three levels necessitated by
changes or growth in the organization and/or emerging technology are under the control of the DBA.
Mappings between the internal and the conceptual levels, as well as between the conceptual and
external levels, are also defined by the DBA. Ensuring that appropriate measures are in place to
maintain the integrity of the database and that the database is not accessible to unauthorized users
is another responsibility. The DBA is responsible for granting permission to the users of the database
and stores the profile of each user in the database. This profile describes the permissible activities of
a user on that portion of the database accessible to the user via one or more user views. The user
profile can be used by the database system to verify that a particular user can perform a given
operation on the database.
The DBA is also responsible for defining procedures to recover the database from failures due to
human, natural, or hardware causes with minimal loss of data. This recovery procedure should
enable the organization to continue to function and the intact portion of the database should
continue to be available.
Let us summarize the functions of the DBA:
Schema definition: The creation of the original database schema. This is accomplished by writing a
set of definitions, which are translated by the DDL compiler to a set of tables that are permanently
stored in the data dictionary.
Storage structure and access method definition: The creation of appropriate storage structures and
access methods. This is accomplished by writing a set of definitions, which are translated by the data
storage and definition language compiler.
Schema and physical organization modification: Either the modification of the database schema or
the description of the physical storage organization. These changes, although relatively rare, are
accomplished by writing a set of definitions which are used by either the DDL compiler or the data
storage and definition language compiler to generate modifications to the appropriate internal system
tables (for example, the data dictionary).
1.9.7 Data Dictionary
Keeping track of all the names that are used, and the purposes for which they were used, becomes
more and more difficult. Of course it is possible for the programmer who coined the names to bear
them in mind, but should the same author come back to his program after a significant time, or
should another programmer have to modify the program, it would be found extremely difficult to
make a reliable account of the purposes for which the data files were used.
The problem becomes even more difficult as the number of data types that an organization has in
its database increases. It is also now perceived that the data of an organization is a valuable
corporate resource, and therefore some kind of inventory and catalogue of it must be maintained
so as to assist in both the utilization and management of the resource.
It is for this purpose that the data dictionary, or dictionary/directory, is emerging as a major tool. A
dictionary provides definitions of things; a directory tells you where to find them. A data
dictionary/directory contains information (or data) about the data. A comprehensive data dictionary
would provide the definition of each data item, how it fits into the data structure, and how it relates
to other entities in the database.
The DBA uses the data dictionary in every phase of a database life cycle, starting from the embryonic
data-gathering phase through to the design, implementation, and maintenance phases.
Documentation provided by a data dictionary is as valuable to end users and managers as it is
essential to the programmers. Users can plan their applications with the database only if they know
exactly what is stored in it. For example, the description of a data item in a data dictionary may
include its origin and other textual description in plain English, in addition to its data format. Thus
users and managers will be able to see exactly what is available in the database. You could consider
a data dictionary to be a road map which guides users to access information within a large database.
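SQLite's sqlite_master table is a minimal, concrete example of such a catalogue: it stores data about the data, not the data itself. The objects created below are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE INDEX idx_customer_name ON customer(name);
CREATE VIEW customer_names AS SELECT name FROM customer;
""")

# sqlite_master is the system catalogue: one row per table, index, and view,
# recording each object's type, name, and defining SQL.
catalogue = con.execute(
    "SELECT type, name FROM sqlite_master ORDER BY type, name").fetchall()
for obj_type, obj_name in catalogue:
    print(obj_type, obj_name)
```

A full commercial data dictionary also records ownership, privileges, and usage details, but the principle is the same: users consult the catalogue to learn exactly what is stored.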
A further advantage of avoiding duplication is the elimination of the inconsistencies that tend to be
present in redundant data files. Any redundancies that do exist in the DBMS are controlled, and the
system ensures that the multiple copies are consistent.
Sharing Data: A database allows the sharing of data under its control by any number of application
programs or users.
Data Integrity: Centralized control can also ensure that adequate checks are incorporated in the
DBMS to provide data integrity. Data integrity means that the data contained in the database is both
accurate and consistent. Therefore, data values being entered for storage can be checked to ensure
that they fall within a specified range and are of the correct format. For example, the value for the
age of an employee may be required to lie in the range of 16 to 75. Another integrity check that
should be incorporated in the database is to ensure that if there is a reference to a certain object,
that object must exist. In the case of an automatic teller machine, for example, a user is not allowed
to transfer funds from a nonexistent savings account to a checking account.
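Both checks mentioned above — a range check on an attribute, and a reference-must-exist check — can be sketched with SQL constraints. The sketch uses sqlite3 and invented table names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
con.executescript("""
CREATE TABLE employee (
    emp_id INTEGER PRIMARY KEY,
    age    INTEGER CHECK (age BETWEEN 16 AND 75)   -- range integrity check
);
CREATE TABLE account (
    acct_id INTEGER PRIMARY KEY,
    owner   INTEGER REFERENCES employee(emp_id)    -- referenced row must exist
);
""")

con.execute("INSERT INTO employee VALUES (1, 30)")       # accepted
try:
    con.execute("INSERT INTO employee VALUES (2, 99)")   # rejected: out of range
except sqlite3.IntegrityError as e:
    print("rejected:", e)
try:
    con.execute("INSERT INTO account VALUES (10, 999)")  # rejected: no such owner
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The DBMS, not the application programs, rejects the bad values, which is exactly the benefit of centralized integrity control.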
Data Security: Data is of vital importance to an organization and may be confidential. Unauthorized
persons must not access such confidential data. The DBA who has the ultimate responsibility for the
data in the DBMS can ensure that proper access procedures are followed, including proper
authentication schemas for access to the DBMS and additional checks before permitting access to
sensitive data. Different levels of security could be implemented for various types of data and
operations.
Conflict Resolution: The DBA chooses the best file structure and access method to get optimal
performance for the response-critical applications, while permitting less critical applications to
continue to use the database, albeit with a relatively slower response.
Data Independence: Data independence is usually considered from two points of view: physical data
independence and logical data independence. Physical data independence allows changes in the
physical storage devices or organization of the files to be made without requiring changes in the
conceptual view or any of the external views, and hence in the application programs using the
database. Logical data independence means that changes in the conceptual view can be made
without requiring changes in the existing external views or the application programs that use them.
Disadvantages
A significant disadvantage of a DBMS is cost. In addition to the cost of purchasing or
developing the software, the hardware has to be upgraded to allow for the extensive programs and
the workspaces required for their execution and storage. The processing overhead introduced by the
DBMS to implement security, integrity, and sharing of the data causes a degradation of the response
and throughput times. An additional cost is that of migration from a traditionally separate
application environment to an integrated one.
While centralization reduces duplication, the lack of duplication requires that the database be
adequately backed up so that in the case of failure the data can be recovered. Backup and recovery
operations are fairly complex in a DBMS environment, and this is exacerbated in a concurrent
multi-user database system. Further, a database system requires a certain amount of controlled
redundancy and duplication to enable access to related data items.
Centralization also means that the data is accessible from a single source namely the database. This
increases the potential severity of security breaches and disruption of the operation of the
organization because of downtimes and failures. The replacement of a monolithic centralized
database by a federation of independent and cooperating distributed databases resolves some of the
problems resulting from failures and downtimes.
1.11 Self Test
1) Define the term database.
2) Explain levels of database with the help of a suitable example.
3) Explain components of a database system.
4) Explain elements of a DBMS.
1.12 Summary
The traditional file-oriented approach to information processing gives each application a
separate master file and its own set of personal files. An organization also needs a flow of
information across these applications, and this requires sharing of data, which is
significantly lacking in the traditional approach.
A database management processing system is complex, and it could lead to a more inefficient
system than the equivalent file-based one. The use of the database, and the possibility of its
being shared, will therefore affect many departments within the organization. If the integrity
of the data is not maintained, it is possible that one relevant piece of data could have been
used by many programs in different applications by different users without their being aware
of it. The impact of this may therefore be very widespread. Since data can be input from a
variety of sources, control over the quality of data becomes very difficult to implement.
Data-items: The term data item is the word for what has traditionally been called the field in
data processing and is the smallest unit of data that has meaning to its users.
Entities and Attributes: An entity in the real world may be a tangible object such as an
employee, a component in an inventory, or a place; or it may be intangible, such as an
event, a job description, an identification number, or an abstract construct. All such items
about which relevant information is stored in the database are called entities.
Logical and Physical Data: One of the key features of the database approach is to bring about
a distinction between the logical and the physical structures of the data.
A schema is a logical database description and is drawn as a chart of the types of data that
are used. It gives the names of the entities and attributes, and specifies the relationships
between them. It is a framework into which the values of the data item can be fitted. Like an
information display system such as that giving arrival and departure time at airports and
railway stations, the schema will remain the same though the values displayed in the system
will change from time to time.
UNIT 2
DATABASE MODELS
2.1 Introduction
2.2 Objectives
2.3 File management system
2.4 Entity-relationship (e-r) diagram
2.4.1 Introduction of ERD
2.4.2 Entity-relationship diagram
2.4.3 Generalization and aggregation
2.4.3.1 Aggregation
2.5 The hierarchical model
2.6 The network model
2.7 The relational model
2.8 Advantages and disadvantages of relational approach
2.9 An example of a relational model
2.10 Self test
2.11 Summary
2.1 INTRODUCTION
Data modeling is the analysis of data objects that are used in a business or other context and the
identification of the relationships among these data objects. Data modeling is a first step in doing
object-oriented programming. As a result of data modeling, you can then define the classes that
provide the templates for program objects.
A simple approach to creating a data model that allows you to visualize the model is to draw a square
(or any other symbol) to represent each individual data item that you know about (for example, a
product or a product price) and then to express relationships between each of these data items with
words such as "is part of" or "is used by" or "uses" and so forth. From such a total description, you
can create a set of classes and subclasses that define all the general relationships. These then
become the templates for objects that, when executed as a program, handle the variables of new
transactions and other activities in a way that effectively represents the real world.
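A minimal sketch of this idea in Python (the class names and values here are invented for illustration, not taken from any particular system): each data item becomes a class, and an "is part of" relationship is expressed by one class holding an instance of the other.

```python
from dataclasses import dataclass

@dataclass
class Price:
    """A data item that 'is used by' a Product."""
    amount: float
    currency: str

@dataclass
class Product:
    """A data item whose price 'is part of' it."""
    name: str
    price: Price

# A new transaction is handled by instantiating the templates.
book = Product("DBMS by Korth", Price(450.0, "INR"))
```

From such templates, subclasses can later refine the general relationships.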
2.2 OBJECTIVES
After going through this unit, you should be able to:
define the data and develop applications. Therefore, the file management system may be
considered to be a level higher than mere high-level languages.
Some of the advantages of an FMS compared to a standard high-level language are:
Less software development cost - Even for experienced programmers, it takes months or
years to develop a good software system in a high-level language.
Support of an efficient query facility - On-line queries for multiple-key retrievals are tedious
to program.
Of course, one should bear in mind the limitations of an FMS: it cannot handle complicated
mathematical operations and array manipulations. To remedy the situation, some FMSs provide an
interface to call other programs written in a high-level language or an assembly language.
Another limitation of an FMS is that for data management and access it is restricted to basic access
methods. The physical and logical links required between different files to cope with complex
multiple-key queries on multiple files are not possible. Even though an FMS is a simple, powerful
tool, it cannot replace the high-level language, nor can it perform complex information retrieval like a
DBMS. It is in this context that reliance on a good database management system becomes essential.
Entity - An entity is a thing that exists and is distinguishable -- an object, something in the
environment.
Entity Instance - An instance is a particular occurrence of an entity. For example, each person
Multivalued (list) - multiple values e.g. degrees, courses, skills (not allowed in first normal form)
Relationships
A relationship is a connection between entity classes. For example, a relationship between
PERSONS and AUTOMOBILES could be an "OWNS" relationship. That is to say, automobiles
are owned by people.
The degree of a relationship indicates the number of entities involved.
The cardinality of a relationship indicates the number of instances in entity class E1 that can
be associated with an instance of entity class E2. For example:
ID            Student Name
524-77-9870   Tom Jones
543-87-4309   Suzy Smart
553-67-3965   Joe Blow
...           ...

ID            Major
524-77-9870   Accounting
543-87-4309   Accounting
553-67-3965   Finance
...           ...
Many-Many Relationships - There are no restrictions on how many entities in either class are
associated with a single entity in the other. An example of a many-to-many relationship would be
students taking classes. Each student takes many classes. Each class has many students.
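The many-to-many example above can be sketched as a set of (student, course) pairs; the names below are illustrative, not drawn from the tables in this unit.

```python
# A many-to-many relationship stored as a set of (student, course) pairs.
enrollment = {
    ("Tom Jones", "Accounting 101"),
    ("Tom Jones", "Finance 200"),
    ("Suzy Smart", "Accounting 101"),
}

def courses_of(student):
    """All courses a given student takes."""
    return {c for s, c in enrollment if s == student}

def students_in(course):
    """All students enrolled in a given course."""
    return {s for s, c in enrollment if c == course}
```

Neither side is restricted: a student may appear in many pairs, and so may a course.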
Participation in a relationship can be mandatory, optional or fixed.
o Isa Hierarchies - A special type of relationship that allows attribute inheritance. For
example, to say that a truck isa automobile and an automobile has a make, model and
serial number implies that a truck also has a make, model and serial number.
Keys - The Key uniquely differentiates one entity from all others in the entity class. A key is an
identifier.
o Primary Key - Identifier used to uniquely identify one particular instance of an entity.
o Concatenated Key - Key made up of parts which, when combined, become a unique
identifier. A key of a superclass is also a key to the subclass of items. For example, if serial
number is a key for automobiles it would also be a key for trucks.
Foreign Keys - Foreign keys reference a related table through the primary key of that
related table.
Referential Integrity Constraint - For every value of a foreign key there is a primary key
with that value in the referenced table; e.g. if an account number is to be used in a table for
journal entries, then that account number must exist in the chart of accounts table.
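The referential integrity rule above can be sketched as a simple check; the account numbers and entries below are invented examples.

```python
# Hypothetical tables: every account number used in journal_entries
# must exist as a primary key in chart_of_accounts.
chart_of_accounts = {"1000", "2000", "3000"}           # primary key values
journal_entries = [("1000", 250.0), ("2000", -250.0)]  # (account_no, amount)

def referential_integrity_ok(entries, accounts):
    """True if every foreign key value appears in the referenced table."""
    return all(acct in accounts for acct, _ in entries)
```

An entry referencing an account number absent from the chart of accounts would violate the constraint.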
Arcs - Arcs connect entities to relationships. Arcs are also used to connect attributes to
entities. Some styles of entity-relationship diagrams use arrows and double arrows to indicate
the one and the many in relationships. Some use forks etc.
table would be required for each lowest-level entity in the hierarchy. The number of different tables
required to represent these entities would be equal to the number of entities at the lowest level of the
generalization hierarchy.
2.4.3.1 Aggregation
Aggregation is the process of compiling information on an object, thereby abstracting a higher-level
object. In this manner, aggregating the characteristics name, address, and Social Security number
derives the entity person. Another form of aggregation is abstracting a relationship between
objects and viewing the relationship as an object. As such, the ENROLLMENT relationship between
entities student and course could be viewed as entity REGISTRATION. Examples of aggregation are
shown in fig.
Consider the relationship COMPUTING of fig. Here we have a relationship among the entities
STUDENT, COURSE, and COMPUTING SYSTEM.
its subsequent revisions have been subjected to much debate and criticism. Many of the current
database applications have been built on commercial DBMS systems using the DBTG model.
DBTG Set: The DBTG model uses two different data structures to represent the database entities and
relationships between the entities, namely record type and set type. A record type is used to
represent an entity type. It is made up of a number of data items that represent the attributes of the
entity.
A set is used to represent a directed relationship between two record types, the so-called owner
record type and the member record type. The set type, like the record type, is named and specifies
that there is a one-to-many relationship (1:M) between the owner and member record types. The set
type can have more than one record type as its member, but only one record type is allowed to be
the owner in a given set type. A database could have one or more occurrences of each of its record
and set types. An occurrence of a set type consists of an occurrence of the owner record type and any
number of occurrences of each of its member record types. A record type cannot be a member of two
distinct occurrences of the same set type.
Bachman introduced a graphical means called a data structure diagram to denote the logical
relationship implied by the set. Here a labelled rectangle represents the corresponding entity or
record type. An arrow that connects two labelled rectangles represents a set type; the arrow direction
is from the owner record type to the member record type. The figure shows two record types
(DEPARTMENT and EMPLOYEE) and the set type DEPT-EMP, with DEPARTMENT as the owner
record type and EMPLOYEE as the member record type.
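One way to picture a set occurrence (one owner record plus any number of member records) in code; the department and employee values below are illustrative, mirroring the DEPARTMENT/EMPLOYEE example.

```python
# A DBTG set occurrence: exactly one owner record, any number of members.
dept_emp = {
    "owner": {"dept_name": "Sales"},
    "members": [
        {"emp_name": "A. Kumar"},
        {"emp_name": "B. Singh"},
    ],
}

# The 1:M direction means each member record belongs to exactly one owner;
# a member cannot appear in two occurrences of the same set type.
def member_count(occurrence):
    return len(occurrence["members"])
```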
The differences that arise in the relational approach are in setting up relationships between different
tables. This actually makes use of certain mathematical operations on the relations, such as
projection, union, joins, etc. These operations, from relational algebra and relational calculus, are
discussed in some more detail in the second block of this course. Similarly, in order to achieve the
organization of the data in terms of tables in a satisfactory manner, a technique called normalization
is used.
A unit in the second block of this course describes in detail the process of normalization and its
various stages, including the first, second, and third normal forms. At the moment it is sufficient to
say that normalization is a technique which helps in determining the most appropriate grouping of
data items into records, segments or tuples. This is necessary because in the relational model the
data items are arranged in tables, which indicate the structure, relationship, and integrity in the
following manner:
1. In any given column of a table, all items are of the same kind.
2. Each item is a simple number or a character string.
3. All rows of a table are distinct.
4. Ordering of rows within a table is immaterial.
5. The columns of a table are assigned distinct names, and the ordering of these columns is
immaterial.
6. If a table has N columns, it is said to be of degree N (the number of rows in a table is referred
to as its cardinality). From a few base tables it is possible, by setting up relations, to create views,
which provide the necessary information to the different users of the same database.
tables themselves are voluminous, the performance in responding to queries is definitely degraded. It
must be appreciated that the simplicity of the relational database approach arises in the logical view.
With an interactive system, for example, an operation like a join would depend upon the physical
storage also. It is therefore common in relational databases to tune the database; in such a case the
physical data layout is chosen so as to give good performance for the most frequently run
operations, which naturally means that the less frequently run operations tend to become slower.
While the relational database approach is a logically attractive, commercially feasible approach, if
the data is, for example, naturally organized in a hierarchical manner and stored as such, the
hierarchical approach may give better results. It is helpful to have a summary view of the differences
between the relational and the non-relational approach in the following section.
2.9 An Example of a Relational Model
Let us see important features of an RDBMS through some examples, as shown in figure 24. A relation
has the following properties:
1. A table is perceived as a two-dimensional structure composed of rows and columns.
2. Each column has a distinct name (attribute name), and the order of columns is immaterial.
3. Each table row (tuple) represents a single data entity within the entity set.
4. Each row/column intersection represents a single data value.
5. Each row carries information describing one entity occurrence.
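Property 3 (all rows are distinct) can be checked on a toy relation built from the earlier student data; the representation as a list of dicts is a sketch, not a real DBMS structure.

```python
# A relation sketched as a list of rows, each row a dict keyed by attribute name.
students = [
    {"ID": "524-77-9870", "Name": "Tom Jones", "Major": "Accounting"},
    {"ID": "543-87-4309", "Name": "Suzy Smart", "Major": "Accounting"},
]

def rows_distinct(rel):
    """Property 3: no two rows of a relation may be identical."""
    seen = [tuple(sorted(r.items())) for r in rel]
    return len(seen) == len(set(seen))
```

Row order and column order carry no meaning (properties 2 and 4 of the list above), which is why a dict-per-row sketch is adequate.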
2.10 SELF TEST
1) What do you mean by the term data model?
2) Create an ERD of Hostel management.
3) Explain the difference between 1:1 and 1:M relationships.
2.11 SUMMARY
Data modeling is the analysis of data objects that are used in a business or other context and
the identification of the relationships among these data objects. Data modeling is a first step
in doing object-oriented programming.
One or more of these data items are designated as the key and are used for sequencing the
file and for locating and grouping records by sorting and indexing.
The entity-relationship model is a tool for analyzing the semantic features of an application
that are independent of events. Entity-relationship modeling helps reduce data redundancy.
UNIT 3
FILE ORGANISATION FOR DBMS
3.1 Introduction
3.2 Objectives
3.3 File organization
3.4 Sequential file organisation
3.4.1 Index-sequential file organization
3.4.1.1 Types of indexes
3.5 B-trees
3.5.1 Advantages of b-tree indexes
3.6 Direct file organization
3.7 Need for the multiple access path
3.8 Self test
3.9 Summary
3.1 INTRODUCTION
Just as arrays, lists, trees and other data structures are used to implement data organization in main
memory, a number of strategies are used to support the organization of data in secondary memory.
sequential file in a single pass. Such a file, containing the updates to be made to a sequential file, is
sometimes referred to as a transaction file.
In the batched mode of updating, a transaction file of update records is made and then sorted in the
sequence of the sequential file. The update process requires the examination of each individual record
in the original sequential file (the old master file). Records requiring no changes are copied directly to
a new file (the new master file); records requiring one or more changes are written into the new
master file only after all necessary changes have been made. Insertions of new records are made in
the proper sequence: they are written into the new master file at the appropriate place. Records to be
deleted are not copied to the new master file. A big advantage of this method of update is the creation
of an automatic backup copy: the new master file can always be recreated by processing the old
master file and the transaction file.
The idea behind an index access structure is similar to that behind the indexes used commonly in
textbooks. Along with each term, a list of page numbers where the term appears is given. We can
search the index to find a list of addresses - page numbers in this case - and use these addresses to
locate the term in the textbook by searching the specified pages. The alternative, if no other guidance
is given, is to sift slowly through the whole textbook word by word to find the term we are interested
in, which corresponds to doing a linear search on a file. Of course, most books do have additional
information, such as chapter and section titles, which can help us find a term without having to
search through the whole book. However, the index is the only exact indication of where each term
occurs in the book.
An index is usually defined on a single field of a file, called an indexing field. There are several types
of indexes. A primary index is an index specified on the ordering key field of an ordered file of records.
Recall that an ordering key field is used to physically order the file records on disk, and that every
record has a unique value for that field. If the ordering field is not a key field (that is, if several
records in the file can have the same value for the ordering field), another type of index, called a
clustering index, can be used. Notice that a file can have at most one physical ordering field, so it can
have at most one primary index or one clustering index, but not both. A third type of index, called a
secondary index, can be specified on any non-ordering field of a file. A file can have several secondary
indexes in addition to its primary access method. In the next three subsections we discuss these
three types of indexes.
Primary Indexes
A primary index is an ordered file whose records are of fixed length, with two fields. The first field is
of the same data type as the ordering key field of the data file, and the second field is a pointer to a
disk block (a block address). The ordering key field is called the primary key of the data file. There is
one index entry (or index record) in the index file for each block in the data file. Each index entry has
the value of the primary key field for the first record in a block and a pointer to that block as its two
field values. We will refer to the two field values of index entry i as K(i), P(i).
To create a primary index on the ordered file shown in figure 4, we use the Name field as the primary
key, because that is the ordering key field of the file (assuming that each value of NAME is unique).
Fig : Some blocks on an ordered (sequential) file of EMPLOYEE records with NAME as
the ordering field
Each entry in the index will have a NAME value and a pointer. The first three index entries would be:
< K(1) = (Aaron, Ed), P(1) = address of block 1 >
< K(2) = (Adams, John), P(2) = address of block 2 >
< K(3) = (Alexander, Ed), P(3) = address of block 3 >
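A sketch of how such a sparse index is searched: binary search over the block anchors finds the one block that could contain the record. The entries mirror the three above; everything else is illustrative.

```python
import bisect

# Sparse primary index: (anchor key of block, block number), ordered by key.
index = [("Aaron, Ed", 1), ("Adams, John", 2), ("Alexander, Ed", 3)]
anchors = [k for k, _ in index]

def block_for(name):
    """Return the block whose anchor is the largest key <= name."""
    i = bisect.bisect_right(anchors, name) - 1
    return index[max(i, 0)][1]
```

Only that one block then needs to be read and scanned for the record itself.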
Fig: Illustrating internal hashing data structures. (a) Array of M positions for use in
hashing. (b) Collision resolution by chaining of records.
The figure illustrates this primary index. The total number of entries in the index will be the same as
the number of disk blocks in the ordered data file. The first record in each block of the data file is
called the anchor record of the block, or simply the block anchor. (A scheme similar to the one
described here can be used with the last record in each block, rather than the first, as the block
anchor.) A primary index is an example of what is called a non-dense index, because it includes an
entry for each disk block of the data file rather than for every record in the data file. A dense index,
on the other hand, contains an entry for every record in the file.
The index file for a primary index needs substantially fewer blocks than the data file for two reasons.
First, there are fewer index entries than there are records in the data file because an entry exists for
each whole block of the data file rather than for each record. Second, each index entry is typically
smaller in size than a data record because it has only two fields, so more index entries than data
records will fit in one block. A binary search on the index file will hence require fewer block accesses
than a binary search on the data file.
blocks needed for the file is b = [(r/bfr)] = [(30,000/10)] = 3000 blocks. A binary search on the data
file would need approximately [(log2 b)] = [(log2 3000)] = 12 block accesses.
Now suppose the ordering key field of the file is V = 9 bytes long, a block pointer is P = 6 bytes long,
and we construct a primary index for the file. The size of each index entry is Ri = (9 + 6) = 15 bytes, so
the blocking factor for the index is bfri = [(B/Ri)] = [(1024/15)] = 68 entries per block.
The total number of index entries ri is equal to the number of blocks in the data file, which is 3000.
The number of blocks needed for the index is hence bi = [(ri/bfri)] = [(3000/68)] = 45 blocks. To
perform a binary search on the index file would need [(log2 bi)] = [(log2 45)] = 6 block accesses. To
search for a record using the index, we need one additional block access to the data file, for a total of
6 + 1 = 7 block accesses - an improvement over binary search on the data file, which required 12
block accesses.
A major problem with a primary index - as with any ordered file - is insertion and deletion of records.
With a primary index, the problem is compounded because if we attempt to insert a record in its
correct position in the data file, we not only have to move records to make space for the new record
but also have to change some index entries because moving records will change the anchor records of
some blocks. We can use an unordered overflow file. Another possibility is to use a linked list of
overflow records for each block in the data file. We can keep the records within each block and its
overflow-linked list sorted to improve retrieval time. Record deletion can be handled using deletion
markers.
Clustering Indexes
If records of a file are physically ordered on a non-key field that does not have a distinct value for
each record, that field is called the clustering field of the file. We can create a different type of index,
called a clustering index, to speed up retrieval of records that have the same value for the clustering
field.
Fig: Clustering index with separate blocks for each group of records with
the same value for the clustering field
Record insertion and deletion still cause considerable problems because the data records are
physically ordered. To alleviate the problem of insertion, it is common to reserve a whole block for
each value of the clustering field; all records with that value are placed in the block. If more than one
block is needed to store the records for a particular value, additional blocks are allocated and linked
together. This makes insertion and deletion relatively straightforward. Figure 8 shows this scheme.
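A rough sketch of this reserve-a-block-per-value scheme; the clustering-field values, record names, and tiny block capacity are invented for illustration.

```python
# One chain of whole blocks per clustering-field value; each block holds records.
blocks_by_dept = {
    "Sales": [["rec1", "rec2"]],   # list of blocks; each block is a list of records
    "HR": [["rec3"]],
}

def insert(dept, rec, capacity=2):
    """Append a record; allocate and link a new block when the last one is full."""
    chain = blocks_by_dept.setdefault(dept, [[]])
    if len(chain[-1]) >= capacity:
        chain.append([])           # new block linked onto the chain
    chain[-1].append(rec)
```

Insertion never shifts records of other values, which is the point of the scheme.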
A clustering index is another example of a non-dense index because it has an entry for every distinct
value of the indexing field rather than for every record in the file.
Secondary Indexes
A secondary index also is an ordered file with two fields, and, as in the other indexes, the second field
is a pointer to a disk block. The first field is of the same data type as some non-ordering field of the
data file. The field on which the secondary index is constructed is called an indexing field of the file,
whether its values are distinct for every record or not. There can be many secondary indexes, and
hence indexing fields, for the same file.
We again refer to the two field values of index entry i as K(i), P(i). The entries are ordered by value of
K(i), so we can use binary search on the index. Because the records of the data file are not physically
ordered by values of the secondary key field, we cannot use block anchors; an entry is needed for
every record, making the index dense.
A secondary index will usually need substantially more storage space than a primary index because
of its larger number of entries. However, the improvement in search time for an arbitrary record is
much greater for a secondary index than it is for a primary index, because we would have to do a
linear search on the data file if the secondary index did not exist. For a primary index, we could still
use binary search on the main file even if the index did not exist because the primary key field
physically orders the records. Example 2 illustrates the improvement in number of blocks accessed
when using a secondary index to search for a record.
Example 2: Consider the file of Example 1, with r = 30,000 fixed-length records of size R = 100 bytes
stored on a disk with block size B = 1024 bytes. The file has b = 3000 blocks, as calculated in
Example 1. To do a linear search on the file, we would require b/2 = 3000/2 = 1500 block accesses
on the average.
Suppose we construct a secondary index on a non-ordering key field of the file that is V = 9 bytes
long. As in Example 1, a block pointer is P = 6 bytes long, so each index entry is Ri = (9+6) = 15 bytes,
and the blocking factor for the index is bfri = [(B/Ri)] = [(1024/15)] = 68 entries per block. In a dense
secondary index such as this, the total number of index entries ri is equal to the number of records in
the data file, which is 30,000. The number of blocks needed for the index is hence
bi = [(ri/bfri)] = [(30,000/68)] = 442 blocks.
Compare this to the 45 blocks needed by the non-dense primary index in Example 1.
A binary search on this secondary index needs [(log2 bi)] = [(log2 442)] = 9 block accesses. To search
for a record using the index, we need an additional block access to the data file, for a total of
9 + 1 = 10 block accesses - a vast improvement over the 1500 block accesses needed on average for a
linear search.
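Example 2's comparison can likewise be reproduced in a few lines:

```python
from math import ceil, log2

r, bfri = 30_000, 68
b = 3000                        # data blocks, from Example 1
linear = b // 2                 # 1500: average accesses for a linear search
bi = ceil(r / bfri)             # 442 blocks for the dense secondary index
via_index = ceil(log2(bi)) + 1  # 9 index accesses + 1 data access = 10
```

The dense index is ten times the size of the sparse primary index (442 vs 45 blocks), but it turns an average 1500-access scan into 10 accesses.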
Fig: A secondary index on a non-key field, implemented using one level of
indirection so that index entries are fixed length and have unique field
values
3.5 B-TREES
The basic B-tree structure was discovered by R.Bayer and E. McCreight (1970) of Boeing Scientific
Research Labs and has grown to become one of the most popular techniques for organizing an index
structure. Many variations on the basic B-tree have been developed; we cover the basic structure first
and then introduce some of the variations.
The B-tree is known as a balanced sort tree, which is useful for external sorting. There are strong
uses of B-trees in a database system, as pointed out by D. Comer (1979): "While no single scheme can
be optimum for all applications, the technique of organizing a file and its index called the B-tree is,
de facto, the standard organization for indexes in a database system."
The file is a collection of records. The index refers to a unique key associated with each record. One
application of B-trees is found in IBM's Virtual Storage Access Method (VSAM) file organization.
Many data manipulation tasks assume that data is stored only in main memory. For applications
with a large database running on a system with limited main memory, the data must be stored as
records on secondary memory (disks) and be accessed in pieces. The size of a record can be quite
large, as shown below:
struct DATA
{
    int ssn;
    char name[80];
    char address[80];
    char school[76];
    struct DATA *left;   /* main memory address */
    struct DATA *right;  /* main memory address */
    d_block d_left;      /* disk block address */
    d_block d_right;     /* disk block address */
};
1. Use the low-order part of the key. For keys that are consecutive integers with few gaps, this
method can be used to map the keys to the available range.
2. End folding. For long keys, we identify start, middle, and end regions, such that the sum of the
lengths of the start and end regions equals the length of the middle region. The start and end digits
are concatenated, and the concatenated string of digits is added to the middle region digits.
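End folding might be sketched as follows; it assumes numeric keys whose length is divisible by 4, with the start and end regions each a quarter of the key (the text does not fix the exact split, so that choice is an assumption).

```python
def end_fold(key: str, addr_range: int) -> int:
    """End folding (sketch): start and end digits concatenated, added to the middle."""
    n = len(key) // 4                  # start and end regions of equal length n,
    start = key[:n]                    # so start + end together match the middle
    middle = key[n:-n]
    end = key[-n:]
    folded = int(start + end) + int(middle)
    return folded % addr_range         # map into the available address range
```

For the key "12345678": start "12" and end "78" concatenate to 1278, the middle is 3456, and the folded value is 4734 before reduction into the address range.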
3.7 Need for the Multiple Access Path
Many interactive information systems require the support of multi-key files. Consider a banking
system in which there are several types of users: tellers, loan officers, branch managers, bank
officers, account holders, and so forth. All have the need to access the same data, say records of the
format shown in the figure. Various types of users need to access these records in different ways. A
teller might identify an account record by its ID value. A loan officer might need to access all account
records with a given value for OVERDRAW-LIMIT, or all account records for a given value of SOCNO.
A branch manager might access records by the BRANCH and TYPE group code. A bank officer might
want periodic reports of all accounts data, sorted by ID.
External storage available, types of queries allowed, number of keys, mode of retrieval and
mode of update.
The technique used to represent and store the records on a file is called the file organization.
In a sequentially organized file, records are written consecutively when the file is created and
must be accessed consecutively when the file is later used for input.
The retrieval of a record from a sequential file, on average, requires access to half the records
in the file, making such inquiries not only inefficient but very time consuming for large files.
To improve the query response time of a sequential file, a type of indexing technique can be
added.
A primary index is an ordered file whose records are of fixed length with two fields. The first
field is of the same data types as the ordering key field of the data file, and the second field is
a pointer to a disk block - a block address.
UNIT 4
REPRESENTING DATA ELEMENTS
To accomplish these goals, the modeler must analyze narratives from users, notes from meetings,
policy and procedure documents, and, if lucky, design documents from the current information
system. Although it is easy to define the basic constructs of the ER model, it is not an easy task to
distinguish their roles in building the data model. What makes an object an entity or an attribute?
For example, given the statement "employees work on projects", should employees be classified as an
entity or an attribute? Very often, the correct answer depends upon the requirements of the database.
In some cases, employee would be an entity; in some, it would be an attribute.
While the definitions of the constructs in the ER Model are simple, the model does not address the
fundamental issue of how to identify them. Some commonly given guidelines are:
Entities
Attributes
Validating Attributes
Relationships
Object Definition
Entities
There are various definitions of an entity:
"Any distinguishable person, place, thing, event, or concept, about which information is kept"
"A thing which can be distinctly identified"
"Any distinguishable object that is to be represented in a database"
"...anything about which we store information (e.g. supplier, machine tool, employee, utility pole,
airline seat, etc.). For each entity type, certain attributes are stored".
These definitions contain common themes about entities.
Entities should not be used to distinguish between time periods. For example, the entities 1st
Quarter Profits, 2nd Quarter Profits, etc. should be collapsed into a single entity called Profits; an
attribute specifying the time period would be used to categorize by time. Not everything the users
want to collect information about will be an entity. A complex concept may require more than one
entity to represent it, while other "things" users think important may not be entities.
Attributes
Attributes are data objects that either identify or describe entities. Attributes that identify entities are
called key attributes. Attributes that describe an entity are called non-key attributes. Key attributes
will be discussed in detail in a later section.
The process for identifying attributes is similar except now you want to look for and extract those
names that appear to be descriptive noun phrases.
Validating Attributes
Attribute values should be atomic, that is, present a single fact. Having disaggregated data allows
simpler programming, greater reusability of data, and easier implementation of changes.
Normalization also depends upon the "single fact" rule being followed. Common types of violations
include:
Simple aggregation - a common example is Person Name which concatenates first name, middle
initial, and last name. Another is Address which concatenates, street address, city, and zip code.
When dealing with such attributes, you need to find out if there are good reasons for decomposing
them. For example, do the end-users want to use the person's first name in a form letter? Do they
want to sort by zip code?
Complex codes - these are attributes whose values are codes composed of concatenated pieces of
information. An example is the code attached to automobiles and trucks. The code represents over 10
different pieces of information about the vehicle. Unless part of an industry standard, these codes
have no meaning to the end user. They are very difficult to process and update.
Derived Attributes and Code Values
Two areas where data modeling experts disagree are whether derived attributes and attributes whose
values are codes should be permitted in the data model. Derived attributes are those created by a
formula or by a summary operation on other attributes. Arguments against including derived data
are based on the premise that derived data should not be stored in a database and therefore should
not be included in the data model. The arguments in favor are:
Derived data is often important to both managers and users and therefore should be included in the data model.
It is just as important, perhaps more so, to document derived attributes just as you would other attributes.
Including derived attributes in the data model does not imply how they will be implemented.
A coded value uses one or more letters or numbers to represent a fact. For example, the attribute
Gender might use the letters "M" and "F" as values rather than "Male" and "Female". Those
who are against this practice cite that codes have no intuitive meaning to the end-users and add
complexity to processing data. Those in favor argue that many organizations have a long history of
using coded attributes, that codes save space, and that codes improve flexibility in that values can be easily
added or modified by means of look-up tables.
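The look-up-table mechanism mentioned above can be sketched in C. The table below uses the Gender example from the text; the struct and function names are illustrative, not from any particular product:

```c
#include <string.h>
#include <stddef.h>

/* A look-up table mapping coded values to their descriptions.
   Adding a row is all that is needed to support a new code, which is
   the flexibility argument made in favor of coded attributes. */
struct code_entry {
    const char *code;
    const char *description;
};

static const struct code_entry gender_codes[] = {
    { "M", "Male" },
    { "F", "Female" },
};

/* Return the description for a code, or NULL if the code is unknown. */
const char *decode(const struct code_entry *table, size_t n, const char *code)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (strcmp(table[i].code, code) == 0)
            return table[i].description;
    return NULL;
}
```

A report program would call decode("F") to print "Female" while the database itself stores only the one-character code.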
Relationships
Relationships are associations between entities. Typically, a relationship is indicated by a verb
connecting two or more entities. For example: employees are assigned to projects. As relationships
are identified they should be classified in terms of cardinality, optionality, direction, and dependence.
As a result of defining the relationships, some relationships may be dropped and new relationships
added. Cardinality quantifies the relationships between entities by measuring how many instances of
one entity are related to a single instance of another. To determine the cardinality, assume the
existence of an instance of one of the entities. Then determine how many specific instances of the
second entity could be related to the first. Repeat this analysis reversing the entities. For example:
employees may be assigned to no more than three projects at a time; every project has at least two
employees assigned to it.
If a relationship can have a cardinality of zero, it is an optional relationship. If it must have a
cardinality of at least one, the relationship is mandatory. Optional relationships are typically
indicated by the conditional tense. For example: an employee may be assigned to a project.
Mandatory relationships, on the other hand, are indicated by words such as must have. For example:
a student must register for at least three courses each semester.
In the case of the specific relationship form (1:1 and 1:M), there is always a parent entity and a child
entity. In one-to-many relationships, the parent is always the entity with the cardinality of one. In
one-to-one relationships, the choice of the parent entity must be made in the context of the business
being modeled. If a decision cannot be made, the choice is arbitrary.
Entity and attribute names should be unique and should contain the minimum number of words
needed to uniquely and accurately describe the object.
Some authors advise against using abbreviations or acronyms because they might lead to confusion
about what they mean. Others believe using abbreviations or acronyms is acceptable provided that
they are universally used and understood within the organization.
You should also take care to identify and resolve synonyms for entities and attributes. This can
happen in large projects where different departments use different terms for the same thing.
Recording Information in Design Document
The design document records detailed information about each object used in the model. As you
name, define, and describe objects, this information should be placed in this document. If you are
not using an automated design tool, the document can be done on paper or with a word processor.
There is no standard for the organization of this document but the document should include
information about names, definitions, and, for attributes, domains.
Two documents used in the IDEF1X method of modeling are useful for keeping track of objects. These
are the ENTITY-ENTITY matrix and the ENTITY-ATTRIBUTE matrix.
The ENTITY-ENTITY matrix is a two-dimensional array for indicating relationships between entities.
The names of all identified entities are listed along both axes. As relationships are first identified, an
"X" is placed in the cell where the row of one entity intersects the column of the other to indicate a possible
relationship between the entities involved. As the relationship is further classified, the "X" is replaced
with the notation indicating cardinality.
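As a sketch, the ENTITY-ENTITY matrix can be held as a small two-dimensional array of short strings. The entity names and the cell notation below are purely illustrative:

```c
#include <string.h>

/* Hypothetical entities for illustration. */
enum { EMPLOYEE, PROJECT, DEPARTMENT, N_ENTITIES };

/* Each cell holds "" (no relationship), "X" (possible relationship),
   or, once classified, a cardinality notation such as "1:M". */
static char matrix[N_ENTITIES][N_ENTITIES][4];

/* Record the relationship in both intersecting cells, mirroring
   how the matrix is symmetric about its diagonal. */
void mark_relationship(int a, int b, const char *note)
{
    strncpy(matrix[a][b], note, 3);
    strncpy(matrix[b][a], note, 3);
}
```

During analysis one would first call mark_relationship with "X" and later overwrite the cell with the cardinality notation once the relationship has been classified.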
4.3 Records
Data is usually stored in the form of records. Each record consists of a collection of related data
values or items where each value is formed of one or more bytes and corresponds to a particular field
of the record. Records usually describe entities and their attributes. For example, an EMPLOYEE
record represents an employee entity, and each field value in the record specifies some attribute of
that employee, such as NAME, BIRTHDATE, SALARY, or SUPERVISOR. A collection of field names
and their corresponding data types constitutes a record type or record format definition. A data type,
associated with each field, specifies the type of values a field can take.
The data type of a field is usually one of the standard data types used in programming. These include
numeric (integer, long integer, or floating point), string of characters (fixed-length or varying), Boolean
(having 0 and 1 or TRUE and FALSE values only), and sometimes specially coded date and time data
types. The number of bytes required for each data type is fixed for a given computer system. An
integer may require 4 bytes, a long integer 8 bytes, a real number 4 bytes, a Boolean 1 byte, a
date 10 bytes (assuming a format of YYYY-MM-DD), and a fixed-length string of k
characters k bytes. Variable-length strings may require as many bytes as there are characters in
each field value. For example, an EMPLOYEE record type may be defined, using the C programming
language notation, as the following structure:
struct employee {
char name [30];
char ssn[9];
int salary;
int jobcode;
char department[20];
};
In recent database applications, the need may arise for storing data items that consist of large
unstructured objects, which represent images, digitized video or audio streams, or free text. These
are referred to as BLOBs (Binary Large Objects). A BLOB data item is typically stored separately from
its record in a pool of disk blocks, and a pointer to the BLOB is included in the record.
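Under the byte counts given above, the fixed part of the employee record can be sized by summing the field sizes. This is only a sketch: the compiler may add padding, so sizeof(struct employee) can be larger than this sum.

```c
#include <stddef.h>

struct employee {
    char name[30];
    char ssn[9];
    int  salary;
    int  jobcode;
    char department[20];
};

/* Sum of the individual field sizes, i.e. the logical record length
   in the fixed-length layout described in the text. */
size_t employee_record_size(void)
{
    return sizeof(((struct employee *)0)->name)
         + sizeof(((struct employee *)0)->ssn)
         + sizeof(((struct employee *)0)->salary)
         + sizeof(((struct employee *)0)->jobcode)
         + sizeof(((struct employee *)0)->department);
}
```

On a system with 4-byte integers this yields 30 + 9 + 4 + 4 + 20 = 67 bytes per record.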
4.4 Representing Block and Record Addresses
Fields (attributes) need to be represented by fixed- or variable-length sequences of bytes. Fields are
put together in fixed- or variable-length collections called records (tuples, or objects). A collection of
records that forms a relation or the extent of a class is stored as a collection of blocks, called a file.
A tuple is stored in a record. The record size is the sum of the sizes of the fields in the record. The schema is
stored together with the records (or a pointer to it). It contains:
The types of the fields
The order of the fields within the record
The size of the record
Timestamps (last read, last updated)
The schema is used to access specific fields in the record. Records are stored in blocks. Typically a
block contains only one kind of record.
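One way to picture how a schema is used to access specific fields: the stored field descriptions let a generic routine pull a field out of a raw record buffer without compiled-in knowledge of the struct. A minimal sketch, reusing the employee record from earlier; the schema table and get_field helper are illustrative:

```c
#include <stddef.h>
#include <string.h>

struct employee {
    char name[30];
    char ssn[9];
    int  salary;
    int  jobcode;
    char department[20];
};

/* One schema entry per field: its name, its offset into the record,
   and its size in bytes. */
struct field_desc {
    const char *name;
    size_t offset;
    size_t size;
};

static const struct field_desc employee_schema[] = {
    { "name",       offsetof(struct employee, name),       30 },
    { "ssn",        offsetof(struct employee, ssn),         9 },
    { "salary",     offsetof(struct employee, salary),     sizeof(int) },
    { "jobcode",    offsetof(struct employee, jobcode),    sizeof(int) },
    { "department", offsetof(struct employee, department), 20 },
};

/* Copy the named field out of a raw record buffer; returns 0 if found,
   -1 if the field is not in the schema. */
int get_field(const void *record, const struct field_desc *schema,
              size_t nfields, const char *field, void *out)
{
    size_t i;
    for (i = 0; i < nfields; i++)
        if (strcmp(schema[i].name, field) == 0) {
            memcpy(out, (const char *)record + schema[i].offset,
                   schema[i].size);
            return 0;
        }
    return -1;
}
```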
The client-based process is the application that the user interacts with. It contains solution-specific
logic and provides the interface between the user and the rest of the application system. In this sense
the graphical user interface (GUI) is one characteristic of the client system. It uses several tools, such as:
Administration Tool: for specifying the relevant server information, creation of users, roles and
privileges, definition of file formats, document type definitions (DTD) and document status
information.
Template Editor: for creating and modifying templates of documents
Document Editor: for editing instances of documents and for accessing component
information.
Document Browser: for retrieval of documents from a Document Server
Access Tool: provides the basic methods for information access.
WHAT DOES THE SERVER DO?
Its purpose is to fulfil clients' requests. What servers do in general is:
Receive requests
Execute database retrievals and updates
Manage data integrity
Send the results back to the client
Act as a software engine that manages shared resources like databases, printers,
communication links, etc.
When the server does these things, it can use common or complex logic. It can supply the information
using only its own resources, or it can employ some other machine on the network, in a master/slave
arrangement.
In other words, there can be:
Single client, single server
Multiple clients, single server
Single client, multiple servers
commands in the form of Structured Query Language (SQL) queries and return answers in the
form of result sets.
Network Connectivity
Generally, a server is connected to its clients by way of a network. This could be a LAN, WAN, or the
Internet. However, most correctly designed client/server designs do not consume significant
resources on the wire. In general, short SQL queries are sent from a client and the server returns brief
sets of specifically focused data.
Operationally Challenged Workstations
A client computer in a client/server system need not run a particularly complex application. It must
simply be able to display results from a query and capture criteria from a user for subsequent
requests. While a client computer is often more capable than a dumb terminal, it need not house a
particularly powerful processor, unless it must support other, more hardware-hungry applications.
Applications Leverage Server Intelligence
Client/server applications are designed with server intelligence in mind. They assume that the server
will perform the heavy lifting when it comes to physical data manipulation. Then, all that is required
of the client application is to ask an intelligent question, expecting an intelligent answer from the
server. The client application's job is to display the results intelligently.
Shared Data, Procedures, Support, Hardware
A client/server system depends on the behavior of the clients (all of them) to properly share the
server's and the clients' resources. This includes data pages, stored procedures, and RAM as well
as network capacity.
System Resource Use Minimized
All client/server systems are designed to minimize the load on the overall system. This permits an
application to be scaled more easily, or at all.
Maximize Design Tradeoffs
A well-designed client/server application should be designed to minimize overall system load, or
designed for maximum throughput, or designed to avoid really bad response times, or maybe a good
design can balance all of these.
At the logical level, a structured document is made up of a number of different parts. Some parts are
optional, others are compulsory. Many of these document structures have a required order--they
cannot be inserted at arbitrary points in the document.
For example, a document must have a title and it must be the first element in the document. The
programming example is very good because it is very simple. At the logical level, sections are used to
break up the document into parts and sub-parts that help to assist the reader to follow the structure
of the document and to navigate their way through it.
Sections are made up of a section title, followed by one or more text blocks and then, optionally, one
or more sub-sections. Sections are allowed in either a Chapter document (within chapters) or a
Simple document, and in Appendices.
Sections can contain other sections, i.e. they may be nested. Only 4 levels of section nesting are
recommended. Note that once you start entering nested (sub-)sections, you cannot enter any text blocks after the sub-sections, i.e. all of the section's text blocks must come before any sub-sections.
Sections are automatically numbered. A level 1 section has one number, a level 2 section is
numbered N.n, a level 3 section N.n.n, and so on. If the section is contained within a chapter or
appendix, the section number is prefixed with the chapter number or appendix number.
4.7 Record Modifications
The domain record modification/transfer process is the complete responsibility of the domain owner.
If you need assistance modifying your domain record, please contact your domain registrar for
technical support. ColossalHost.com will not provide excessive support resources assisting customers
in the domain record modification process. It is important to understand that ColossalHost.com has
no more power to make modifications to a Subscriber's domain record than a complete stranger
would. The domain name owner is completely responsible for the information (including the name
servers) that is contained within the domain record.
4.8 Index Structures
Index structures in object-oriented database management systems should support selections not only
with respect to physical object attributes, but also with respect to derived attributes. A simple
example arises, if we assume the object types Company , Division, and Employee, with the
relationships has division from Company to Division, and employs from Division to Employee. In this
case the index structure should support queries for companies that specify the number of
employees of the company.
4.10 Secondary Indexes
Secondary indexes provide access to subsets of records. Both databases and tables provide
automatic secondary indexes. All secondary indexes are held on a separate database file (.sr6). This is
created when the first index is created and deleted if the last index is deleted. Each secondary index
is physically very similar to a standard database. It contains index blocks and data blocks. The sizes
of these blocks are calculated in a similar way to the block size calculations for standard database
blocks to ensure reasonably efficient processing given the size of the secondary index key and the
maximum number of records of that type. Each index potentially has different block sizes. Each
record in the data block in a secondary index has the secondary key as the key and contains the
standard database key as the data. Thus the size of these data blocks is affected by the size of both
keys.
All these index files (i.e., primary, secondary, and clustering indexes) are ordered files, and have two
fields of fixed length. One field contains data (its value is the same as a field from the data file)
and the other field is a pointer to the data file, but:
In primary indexes, the data item of the index has the value of the primary key of the first record (the
anchor record) of the block to which the pointer points.
In secondary indexes, the data item of the index has a value of a secondary key and the pointer
points to a block in which a record with that secondary key is stored.
In clustering indexes, the data item of the index has a value of a non-key field, and the pointer
points to the block in which the first record with that non-key field value is stored.
In primary index files, for every block in a data file, one record exists in the primary index file. The
number of records in the index file is equal to the number of blocks in the data file. Hence, the
primary indexes are non-dense.
In secondary index files, for every record in the data file, one record exists in the secondary index file.
The number of records in the secondary file is equal to the number of records in the data file. Hence,
the secondary indexes are dense.
In clustering index files, for every distinct clustering field, one record exists in the clustering index
file. The number of records in the clustering index is equal to the distinct numbers for the clustering
field in the data file. Hence, the clustering indexes are non-dense.
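The dense/non-dense distinction can be made concrete with a small calculation: for a data file of r records stored b records per block, a primary index holds one entry per block while a secondary index holds one entry per record. A sketch (the function names are illustrative):

```c
/* Number of entries in a non-dense (primary) index:
   one per data block, i.e. ceil(r / b) computed with integer arithmetic. */
long primary_index_entries(long r, long b)
{
    return (r + b - 1) / b;
}

/* Number of entries in a dense (secondary) index: one per record. */
long secondary_index_entries(long r)
{
    return r;
}
```

For example, with 10,000 records packed 40 per block, the primary index needs only 250 entries while a secondary index needs all 10,000, which is why secondary indexes are larger and are called dense.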
The Customers Table holds information on customers, such as their customer number, name and
address. Run the Database Desktop program from the Start Menu or select Tools-> Database
Desktop in Delphi. Open the Customers table copied in the previous step. By default the data in the
table is displayed as read only. Familiarise yourself with the data. Change to edit mode (Table->Edit
Data) and add a new record. View the structure of the table (Table->Info Structure). Select Table->Restructure to restructure the table. Add a secondary index to the table by selecting Secondary
Indexes from the Table properties combo box. The secondary index is composed of LastName and
FirstName, in that order. Call the index CustomersNameIndex. The index will be used to access the
Customers table on customer name.
4.11 B-trees
A B-tree is a specialized multiway tree designed especially for use on disk. In a B-tree each node may
contain a large number of keys. The number of subtrees of each node, then, may also be large. A B-tree is designed to branch out in this large number of directions and to contain a lot of keys in each
node so that the height of the tree is relatively small. This means that only a small number of nodes
must be read from disk to retrieve an item. The goal is to get fast access to the data, and with disk
drives this means reading a very small number of records. Note that a large node size (with lots of
keys in the node) also fits with the fact that with a disk drive one can usually read a fair amount of
data at once.
A multiway tree of order m is an ordered tree where each node has at most m children. For each node,
if k is the actual number of children in the node, then k - 1 is the number of keys in the node. If the
keys and subtrees are arranged in the fashion of a search tree, then this is called a multiway search
tree of order m. For example, the following is a multiway search tree of order 4. Note that the first row
in each node shows the keys, while the second row shows the pointers to the child nodes. Of course,
in any useful application there would be a record of data associated with each key, so that the first
row in each node might be an array of records where each record contains a key and its associated
data. Another approach would be to have the first row of each node contain an array of records where
each record contains a key and a record number for the associated data record, which is found in
another file. This last method is often used when the data records are large. The example software
will use the first method.
What does it mean to say that the keys and subtrees are "arranged in the fashion of a search tree"?
Suppose that we define our nodes as follows:
typedef struct
{
    int Count;            // number of keys stored in the current node
    ItemType Key[3];      // array to hold the 3 keys
    long Branch[4];       // array of fake pointers (record numbers)
} NodeType;
Then a multiway search tree of order 4 has to fulfill the following conditions related to the ordering of
the keys (the keys within a node are kept in ascending order):
The subtree starting at record Node.Branch[0] has only keys that are less than Node.Key[0].
The subtree starting at record Node.Branch[1] has only keys that are greater than Node.Key[0]
and less than Node.Key[1].
The subtree starting at record Node.Branch[2] has only keys that are greater than Node.Key[1]
and less than Node.Key[2].
The subtree starting at record Node.Branch[3] has only keys that are greater than Node.Key[2].
Note that if fewer than the full number of keys are in the node, these 4 conditions are
truncated so that they speak of the appropriate number of keys and branches.
This generalizes in the obvious way to multiway search trees with other orders.
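These ordering conditions are what make search possible: within a node you scan past the keys smaller than the target until you either find it or identify the branch whose subtree must contain it. A sketch using the NodeType above (SearchNode is an illustrative helper, not from the text; ItemType is assumed to be a numeric key type):

```c
typedef long ItemType;    /* assumed key type for this sketch */

typedef struct
{
    int Count;            /* number of keys stored in the current node */
    ItemType Key[3];      /* array to hold the 3 keys */
    long Branch[4];       /* array of fake pointers (record numbers) */
} NodeType;

/* If target is in the node, set *pos to its key index and return 1.
   Otherwise set *pos to the branch index whose subtree would contain
   target, and return 0 (the caller would then descend via Branch[*pos]). */
int SearchNode(ItemType target, const NodeType *node, int *pos)
{
    int i = 0;
    while (i < node->Count && target > node->Key[i])
        i++;                          /* skip keys smaller than target */
    if (i < node->Count && target == node->Key[i]) {
        *pos = i;
        return 1;                     /* found in this node */
    }
    *pos = i;                         /* not here: descend via Branch[i] */
    return 0;
}
```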
A B-tree of order m is a multiway search tree of order m such that:
1. All leaf nodes are at the bottom level.
2. All internal nodes (except the root node) have at least ceil(m / 2) (nonempty) children.
3. The root node can have as few as 2 children if it is an internal node, and can obviously have
no children if the root node is a leaf (that is, the whole tree consists only of the root node).
4. Each leaf node must contain at least ceil(m / 2) - 1 keys.
Note that ceil(x) is the so-called ceiling function. Its value is the smallest integer that is greater than
or equal to x. Thus ceil(3) = 3, ceil(3.35) = 4, ceil(1.98) = 2, ceil(5.01) = 6, ceil(7) = 7, etc.
A B-tree is a fairly well-balanced tree by virtue of the fact that all leaf nodes must be at the bottom.
Condition (2) tries to keep the tree fairly bushy by insisting that each node have at least half the
maximum number of children. This causes the tree to "fan out" so that the path from root to leaf is
very short even in a tree that contains a lot of data.
Example B-Tree
The following is an example of a B-tree of order 5. This means that (other than the root node) all
internal nodes have at least ceil(5 / 2) = ceil(2.5) = 3 children (and hence at least 2 keys). Of course,
the maximum number of children that a node can have is 5 (so that 4 is the maximum number of
keys). According to condition 4, each leaf node must contain at least 2 keys. In practice B-trees
usually have orders a lot bigger than 5.
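The minimum-children and minimum-keys figures quoted above can be computed directly without floating point, since ceil(m / 2) for positive integers is just (m + 1) / 2 in integer arithmetic. A small sketch:

```c
/* ceil(m / 2) for positive integer m, without floating point. */
int min_children(int m)
{
    return (m + 1) / 2;
}

/* Minimum number of keys in a non-root leaf node: ceil(m / 2) - 1. */
int min_keys(int m)
{
    return (m + 1) / 2 - 1;
}
```

For the order-5 example this gives min_children(5) = 3 and min_keys(5) = 2, matching the figures in the text.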
4.12 Hash Tables
Linked lists are handy ways of tying data structures together, but navigating linked lists can be
inefficient. If you were searching for a particular element, you might easily have to look at the whole
list before you find the one that you need. Linux uses another technique, hashing, to get around this
restriction. A hash table is an array or vector of pointers. An array, or vector, is simply a set of things
coming one after another in memory. A bookshelf could be said to be an array of books. Arrays are
accessed by an index, the index is an offset into the array. Taking the bookshelf analogy a little
further, you could describe each book by its position on the shelf; you might ask for the 5th book.
A hash table is an array of pointers to data structures and its index is derived from information in
those data structures. If you had data structures describing the population of a village then you
could use a person's age as an index. To find a particular person's data you could use their age as an
index into the population hash table and then follow the pointer to the data structure containing the
person's details. Unfortunately many people in the village are likely to have the same age and so the
hash table pointer becomes a pointer to a chain or list of data structures each describing people of
the same age. However, searching these shorter chains is still faster than searching all of the data
structures.
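The age-indexed village example can be sketched as a hash table of pointers whose chains hold everyone mapping to the same slot. All names and sizes here are illustrative:

```c
#include <string.h>
#include <stddef.h>

#define TABLE_SIZE 128    /* enough slots for any plausible age, for this sketch */

struct person {
    char name[32];
    int age;
    struct person *next;  /* chain of people hashing to the same slot */
};

/* The hash table: an array of pointers to chains of person records. */
static struct person *population[TABLE_SIZE];

/* Insert at the head of the chain for the person's age. */
void add_person(struct person *p)
{
    int slot = p->age % TABLE_SIZE;
    p->next = population[slot];
    population[slot] = p;
}

/* Walk the (short) chain for one age looking for a name.
   Searching this chain is far cheaper than scanning everyone. */
struct person *find_person(int age, const char *name)
{
    struct person *p;
    for (p = population[age % TABLE_SIZE]; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}
```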
As a hash table speeds up access to commonly used data structures, Linux often uses hash tables to
implement caches. Caches hold handy information that needs to be accessed quickly and are usually a
subset of the full set of information available. Data structures are put into a cache and kept there
because the kernel often accesses them. There is a drawback to caches in that they are more complex
to use and maintain than simple linked lists or hash tables. If the data structure can be found in the
cache (this is known as a cache hit), then all well and good. If it cannot, then all of the relevant data
structures must be searched and, if the data structure exists at all, it must be added into the cache.
4.13 Self Test
1. Hash Tables (G&T Section 8.3, page 341)
R-8.4 Draw the 11-item hash table that results from using the hash function
h(i) = (2i+5) mod 11
to hash the keys 12, 44, 13, 88, 23, 94, 11, 39, 20, 16 and 5, assuming collisions are handled
by chaining.
R-8.5 What is the result of the previous exercise, assuming collisions are handled by linear
probing?
R-8.7 What is the result of Exercise R-8.4 assuming collisions are handled by double hashing
using a secondary hash function h'(k) = 7 - (k mod 7)?
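A small helper for checking your answers to R-8.4 and R-8.7: it simply evaluates the two hash functions from the exercise (it does not build the table for you).

```c
/* Primary hash function from R-8.4: h(i) = (2i + 5) mod 11. */
int h(int i)
{
    return (2 * i + 5) % 11;
}

/* Secondary hash function from R-8.7, used as the step size in
   double hashing: h'(k) = 7 - (k mod 7). */
int h2(int k)
{
    return 7 - (k % 7);
}
```

For instance, the first key, 12, hashes to h(12) = 29 mod 11 = 7.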
2.
"B-Trees are the bread and butter of the database world: virtually every database is
implemented internally as some variant of B-Trees." Comment on this statement.
3.
What are the differences among primary, secondary, and clustering indexes? How do these
differences affect the ways in which these indexes are implemented? Which of these indexes
are dense, and which are not?
4.
How does multi-level indexing improve the efficiency of searching an index file?
5.
Do detailed procedures exist for at least the following operations:
Record creation?
Record modification?
Record duplication?
Record destruction?
Consistent quality control?
Problem resolution?
6.
What is a client/server system? What do the client and the server do? Explain client/server
architecture with an example (the World Wide Web). What is client/server computing?
UNIT 5
RELATIONAL MODEL
5.1 Introduction
5.2 Objectives
5.3 Concepts of a relational model
5.4 Formal definition of a relation
5.5 The Codd commandments
5.6 Summary
5.1 INTRODUCTION
One of the main advantages of the relational model is that it is conceptually simple and, more
importantly, based on the mathematical theory of relations. It also frees the users from details of storage
structure and access methods.
The relational model, like all other models, consists of three basic components:
A set of domains and a set of relations
Operations on relations
Integrity rules
5.2 OBJECTIVES
After completing this unit, you will be able to:
Define the concepts of relational model
Discuss the basic operations of the relational algebra
State the integrity rules
5.3 CONCEPTS OF A RELATIONAL MODEL
The relational model was formally introduced by Dr. E. F. Codd in 1970 and has evolved since then,
through a series of writings. The model provides a simple, yet rigorously defined, concept of how
users perceive data. The relational model represents data in the form of two-dimension tables. Each
table represents some real-world person, place, thing, or event about which information is collected.
A relational database is a collection of two-dimensional tables. The organization of data into relational
tables is known as the logical view of the database. That is, the form in which a relational database
presents data to the user and the programmer. The way the database software physically stores the
data on a computer disk system is called the internal view. The internal view differs from product to
product and does not concern us here.
A basic understanding of the relational model is necessary to effectively use relational database
software such as Oracle, Microsoft SQL Server, or even personal database systems such as Access or
Fox, which are based on the relational model.
E.F. Codd of IBM propounded the relational model in 1970. The basic concept in the relational
model is that of a relation.
A relation can be viewed as a table, which has the following properties:
Property 1: It is column homogeneous. In other words, in any given column of a table, all items are of
the same kind.
Property 2: Each item is a simple number or a character string. That is, a table must be in 1NF
(First Normal Form), which will be introduced in the second unit.
Property 3: All rows of a table are distinct.
Property 4: The ordering of rows within a table is immaterial.
Property 5: The columns of a table are assigned distinct names and the ordering of these columns is
immaterial.
Example of a valid relation:

S#      P#      SCITY
10              BANGALORE
10              BANGALORE
11              BANGALORE
11              BANGALORE

5.4 FORMAL DEFINITION OF A RELATION
There are alternate names used to describe relational tables. Some manuals use the terms tables,
fields, and records to describe relational tables, columns, and rows, respectively. The formal
literature tends to use the mathematical terms, relations, attributes, and tuples. Figure 2
summarizes these naming conventions.
In This Document     Formal Terms     Some Manuals
Relational Table     Relation         Table
Column               Attribute        Field
Row                  Tuple            Record
Relational tables can be expressed concisely by eliminating the sample data and showing just the
table name and the column names. For example,
AUTHOR
TITLE
PUBLISHER
AUTHOR_TITLE
Each Row is Unique: This property ensures that no two rows in a relational table are identical;
there is at least one column, or set of columns, the values of which uniquely identify each row in
the table. Such columns are called primary keys and are discussed in more detail in
Relationships and Keys.
The Sequence of Columns is Insignificant: This property states that the ordering of the columns
in the relational table has no meaning. Columns can be retrieved in any order and in various
sequences. The benefit of this property is that it enables many users to share the same table
without concern of how the table is organized. It also permits the physical structure of the
database to change without affecting the relational tables.
The Sequence of Rows is Insignificant: This property is analogous to the one above but applies to
rows instead of columns. The main benefit is that the rows of a relational table can be retrieved in
different order and sequences. Adding information to a relational table is simplified and does not
affect existing queries.
Each Column Has a Unique Name: Because the sequence of columns is insignificant, columns
must be referenced by name and not by position. In general, a column name need not be unique
within an entire database but only within the table to which it belongs.
Such a solution requires careful design, and will decrease productivity at the very least. This
situation is particularly undesirable when very high-level languages such as SQL are used to
manipulate such data, and if such a solution is used for numeric columns all sorts of problems can
arise with aggregate functions such as SUM and AVERAGE.
Rule 4: The database description rule
A description of the database is held and maintained using the same logical structures used to define
the data, thus allowing users with appropriate authority to query such information in the same ways
and using the same languages as they would any other data in the database.
Put into easy terms, Rule 4 means that there must be a data dictionary within the RDBMS that is
constructed of tables and/or views that can be examined using SQL. This rule states therefore that a
dictionary is mandatory, and if taken in conjunction with Rule 1, there can be no doubt that the
dictionary must also consist of combinations of tables and views.
Rule 5: The comprehensive sub-language rule
There must be at least one language whose statements can be expressed as character strings
conforming to some well-defined syntax, that is comprehensive in supporting the following:
Data definition
View definition
Data manipulation
Integrity constraints
Authorization
Transaction boundaries
Again in real terms, this means that the RDBMS must be completely manageable through its own
dialect of SQL, although some products still support SQL-like languages (Ingres's support of Quel, for
example). This rule also sets out to scope the functionality of SQL: you will detect an implicit
requirement to support access control, integrity constraints and transaction management facilities,
for example.
Rule 6: The view-updating rule
All views that can be updated in theory can also be updated by the system.
This is quite a difficult rule to interpret, and so a word of explanation is required. Whilst it is possible
to create views in all sorts of illogical ways, and with all sorts of aggregates and virtual columns, it is
obviously not possible to update through some of them. As a very simple example, if you define a
virtual column in a view as A*B, where A and B are columns in a base table, then how can you
perform an update on that virtual column directly? The database cannot possibly break down any
number supplied into its two component parts without more information being supplied. To delve a
little deeper, we should consider that views can be defined in terms of both tables and other views.
Particular vendors restrict the complexity of their own implementations, in some cases quite
drastically.
Even in logical terms it is often incredibly difficult to tell whether a view is theoretically updateable,
let alone delve into the practicalities of actually doing so. In fact there exists another set of rules that,
when applied to a view, can be used to determine its level of logical complexity, and it is only realistic
to apply Rule 6 to those views that are defined as simple by such criteria.
Rule 7: The insert, update and delete rule
An RDBMS must do more than just be able to retrieve relational data sets. It has to be capable of
inserting, updating and deleting data as a relational set. Many RDBMSs that fail the grade fall back
on a single-record-at-a-time procedural technique when it comes time to manipulate data.
Rule 8: The physical independence rule
User access to the database, via monitors or application programs, must remain logically consistent
whenever the storage representation of, or access methods to, the data are changed.
Therefore, and by way of an example, if an index is built or destroyed by a DBA on a table, any user
should retrieve the same data from that table, albeit a little more slowly. It is this rule that demands
the clear distinction between the logical and physical layers of the database. Applications must be
limited to interfacing with the logical layer to enable the enforcement of this rule, and it is this rule
that sorts out the men from the boys in the relational marketplace. Looking at the other architectures
already discussed, one can imagine the consequences of changing the physical structure of a network
or hierarchical system.
However, there are plenty of traps waiting even in the relational world. Consider the application
designer who depends on the presence of a B-tree index to ensure that data is retrieved in a
predefined order, only to find that the DBA dynamically drops the index. The removal of such an
index might be catastrophic. I point out these issues because, although they are serious factors, I
am not convinced that they constitute the breaking of this rule; it is for the individual to make up his
own mind.
Rule 9: The logical data independence rule
Application programs must be independent of changes made to the base tables.
This rule allows many types of database design change to be made dynamically, without users being
aware of them. To illustrate the meaning of the rule, the examples below show two types
of activity, described in more detail later, that should be possible if this rule is enforced.
P (a):
Id    Name
101   Jones
103   Smith
104   Lalonde
107   Evan
110   Drew
112   Smith

P (b):
Id    Name
101   Jones
null  Smith
104   Lalonde
107   Evan
110   Drew
null  Lalonde
null  Smith

Fig: (a) Relation without null values and (b) relation with null values
Integrity rule 1 specifies that instances of the entities are distinguishable and thus no prime attribute
(component of a primary key) value may be null. This rule is also referred to as the entity rule. We
could state this rule formally as:
Definition: Integrity Rule 1 (Entity Integrity):
If the attribute A of relation R is a prime attribute of R, then A cannot accept null values.
Integrity Rule 2(Referential Integrity):
Integrity rule 2 is concerned with foreign keys, i.e., with attributes of a relation having domains that
are those of the primary key of another relation.
A relation R may contain references to another relation S. Relations R and S need not be distinct.
Suppose the reference in R is via a set of attributes that corresponds to the primary key of the relation S. This set
of attributes in R is a foreign key. A valid relationship between a tuple in R and one in S requires that
the values of the attributes in the foreign key of R correspond to the primary key of a tuple in S. This
ensures that the reference from the tuple of the relation R is made unambiguously to an existing
tuple in the S relation. The referencing attribute(s) in the R relation can have null value(s); in this
case, it is not referencing any tuple in the S relation. However, if the value is not null, it must exist as
the primary key value of a tuple of the S relation. If the referencing attribute in R has a value that is
nonexistent in S, R is attempting to refer to a nonexistent tuple and hence a nonexistent instance of the
corresponding entity. This cannot be allowed. We illustrate this point in the following example.
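Referential integrity can be sketched with Python's sqlite3; the relation and attribute names here are illustrative, and note that SQLite enforces foreign keys only when the pragma below is enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.execute("CREATE TABLE s (s_no TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE sp (s_no TEXT REFERENCES s(s_no), p_no TEXT)")
conn.execute("INSERT INTO s VALUES ('S1')")

conn.execute("INSERT INTO sp VALUES ('S1', 'P1')")  # matches a tuple in s: accepted
conn.execute("INSERT INTO sp VALUES (NULL, 'P2')")  # null foreign key: accepted, references nothing
rejected = False
try:
    conn.execute("INSERT INTO sp VALUES ('S9', 'P3')")  # no S9 in s: dangling reference
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True: the dangling reference was rejected
```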
Rule 11: Distribution independence rule:
An RDBMS must have distribution independence.
This is one of the more attractive aspects of RDBMS. Database systems built on the relational
framework are well suited to today's client/server database design.
Self Test
1. Define a relational database system. Where do we use an RDBMS?
2. Explain the different types of RDBMS.
3. Explain Codd's rules with the help of suitable examples.
5.7 SUMMARY
One of the main advantages of the relational model is that it is conceptually simple and, more
importantly, based on the mathematical theory of relations. It also frees users from the details of
storage structure and access methods.
The model provides a simple, yet rigorously defined, concept of how users perceive data. The
relational model represents data in the form of two-dimensional tables. Each table represents
some real-world person, place, thing, or event about which information is collected. A
relational database is a collection of two-dimensional tables.
UNIT 6
NORMALIZATION
fully functionally dependent on A if B is not functionally dependent on any subset of A. The above
example of students and subjects shows full functional dependence if mark and date are not
functionally dependent on either the student number (sno) or the subject number (cno) alone. This implies
that we are assuming that a student may take more than one subject and that a subject may be taken
by many different students. Furthermore, it has been assumed that there is at most one enrolment of
each student in the same subject.
The above example illustrates full functional dependence. However, the dependency
(sno, cno) → instructor is not a full functional dependency because cno → instructor holds.
As noted earlier, the concept of functional dependency is related to the concept of a candidate key of a
relation, since a candidate key of a relation is an identifier which uniquely identifies a tuple and
therefore determines the values of all other attributes in the relation. Therefore, if a subset X of the
attributes of a relation R satisfies the property that all remaining attributes of the relation are
functionally dependent on it (that is, on X), then X is a candidate key as long as no attribute can be
removed from X while still satisfying this property. In the example above, the
attributes (sno, cno) form a candidate key (and the only one) since they functionally determine all the
remaining attributes.
Functional dependence is an important concept and a large body of formal theory has been developed
about it. We discuss the concept of closure that helps us derive all functional dependencies that are
implied by a given set of dependencies. Once a complete set of functional dependencies has been
obtained, we will study how these may be used to build normalised relations.
Rules About Functional Dependencies
A schema designer usually states explicitly only those FDs that are obvious.
Without knowing exactly what all the tuples are, we must be able to deduce the other FDs
that hold on R.
This is essential when we discuss the design of good relational schemas.
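The standard tool for deducing the other FDs that hold is the attribute-closure algorithm: repeatedly apply every FD whose left-hand side is already covered. A minimal sketch in Python, using the student/subject FDs from the example above (the helper name `closure` is our own):

```python
def closure(attrs, fds):
    """Compute the closure X+ of a set of attributes under a list of FDs.

    fds is a list of (lhs, rhs) pairs, each side a set of attribute names.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the whole left-hand side is already in the closure,
            # the right-hand side is functionally determined too.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

# FDs of the example: (sno, cno) -> mark and cno -> instructor.
fds = [({"sno", "cno"}, {"mark"}), ({"cno"}, {"instructor"})]
print(closure({"sno", "cno"}, fds))
# contains every attribute, confirming {sno, cno} is a candidate key
```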
6.2 Normalization
When designing a database, a data model is usually translated into a relational schema. The important
question is whether there is a design methodology or whether the process is arbitrary. A simple answer to this
question is affirmative. There are certain properties that a good database design must possess, as
dictated by Codd's rules. There are many different ways of designing a good database. One such
methodology is the method involving normalization. Normalization theory is built around the
concept of normal forms. Normalization reduces redundancy. Redundancy is unnecessary repetition
of data; it can cause problems with storage and retrieval of data. During the process of normalization,
dependencies can be identified which can cause problems during deletion and update.
Normalization theory is based on the fundamental notion of dependency. Normalization helps in
simplifying the structure of schemas and tables.
To illustrate the normal forms, we will take an example database with the following logical design:
Relation S { S#, SUPPLIERNAME, SUPPLYSTATUS, SUPPLYCITY}, Primary Key{S#}
Relation P { P#, PARTNAME, PARTCOLOR, PARTWEIGHT, SUPPLYCITY}, Primary Key{P#}
Relation SP { S#, SUPPLYCITY, P#, PARTQTY},
Primary Key{S#, P#}
Foreign Key{S#} Reference S
Foreign Key{P#} Reference P
SP
S#   SUPPLYCITY   P#   PARTQTY
S1   Bombay       P1   3000
S1   Bombay       P2   2000
S1   Bombay       P3   4000
S1   Bombay       P4   2000
S1   Bombay       P5   1000
S1   Bombay       P6   1000
S2   Mumbai       P1   3000
S2   Mumbai       P2   4000
S3   Mumbai       P2   2000
S4   Madras       P2   2000
S4   Madras       P4   3000
S4   Madras       P5   4000
Let us examine the table above to find any design discrepancy. A quick glance reveals that some of
the data are being repeated. That is data redundancy, which is of course undesirable. The fact
that a particular supplier is located in a city has been repeated many times. This redundancy causes
many other related problems. For instance, after an update a supplier may be shown as being from
Madras in one entry while from Mumbai in another. This further gives rise to many other problems.
Therefore, for the above reasons, the tables need to be refined. This process of refinement of a given
schema into another schema or a set of schemas possessing the qualities of a good database is known as
normalization.
Database experts have defined a series of normal forms, each conforming to some specified design
quality condition(s). We shall restrict ourselves to the first five normal forms for the simple reason of
simplicity. Each next level of normal form adds another condition. It is interesting to note that the
process of normalization is reversible. The normal forms are nested:
5NF within 4NF within 3NF within 2NF within 1NF
This nesting implies that a relation in 5th normal form is also in 4th normal form, which is itself in
3rd normal form, and so on. These normal forms are not the only ones. There may be 6th, 7th and
nth normal forms, but this is not of our concern at this stage.
Before we embark on normalization, however, there are a few more concepts that should be
understood.
Decomposition. Decomposition is the process of splitting a relation into two or more relations. This is
nothing but the projection process. Decompositions may or may not lose information. As you will
learn shortly, the normalization process involves breaking a given relation into one or more relations,
and these decompositions should be reversible as well, so that no information is lost in the
process. Thus, we will be interested more in decompositions that incur no loss of information
than in ones where information is lost.
Lossless decomposition: The decomposition which results in relations without losing any
information is known as lossless decomposition or nonloss decomposition. A decomposition that
results in loss of information is known as lossy decomposition.
Consider the relation S{S#, SUPPLYSTATUS, SUPPLYCITY} with some instances of the entries as
shown below.

S
S#   SUPPLYSTATUS   SUPPLYCITY
S3   100            Madras
S5   100            Mumbai

Let us decompose this table into two as shown below:

(1)
SX
S#   SUPPLYSTATUS
S3   100
S5   100

SY
S#   SUPPLYCITY
S3   Madras
S5   Mumbai

(2)
SX
S#   SUPPLYSTATUS
S3   100
S5   100

SY
SUPPLYSTATUS   SUPPLYCITY
100            Madras
100            Mumbai
Let us examine these decompositions. In decomposition (1) no information is lost. We can still say
that S3's status is 100 and its location is Madras, and also that supplier S5 has status 100 and
location Mumbai. This decomposition is therefore lossless.
In decomposition (2), however, while we can still say that the status of both S3 and S5 is 100, the location
of the suppliers cannot be determined from these two tables. The information regarding the location of the
suppliers has been lost in this case. This is a lossy decomposition. Certainly, lossless decomposition
is more desirable, because otherwise the decomposition is irreversible. The decomposition
process is in fact projection, where some attributes are selected from a table. A natural question
arises here: why is the first decomposition lossless while the second one is lossy? How should a
given relation be decomposed so that the resulting projections are nonlossy? The answer to these
questions lies in functional dependencies and may be given by the following theorem.
Heath's theorem: Let R{A, B, C} be a relation, where A, B and C are sets of attributes. If R satisfies the
FD A → B, then R is equal to the join of its projections on {A, B} and {A, C}.
Let us apply this theorem to the decompositions described above. We observe that relation S satisfies
two irreducible FDs:
S# → SUPPLYSTATUS
S# → SUPPLYCITY
Now taking A as S#, B as SUPPLYSTATUS, and C as SUPPLYCITY, the theorem confirms that
relation S can be nonloss-decomposed into its projections on {S#, SUPPLYSTATUS} and {S#,
SUPPLYCITY}. Note, however, that the theorem does not say why the projections {S#, SUPPLYSTATUS}
and {SUPPLYSTATUS, SUPPLYCITY} should be lossy. Yet we can see that one of the FDs is lost in
that decomposition: while the FD S# → SUPPLYSTATUS is still represented by the projection on {S#,
SUPPLYSTATUS}, the FD S# → SUPPLYCITY has been lost.
An alternative criterion for lossless decomposition is as follows. Let R be a relation schema, and let F
be a set of functional dependencies on R. Let R1 and R2 form a decomposition of R. This decomposition
is a lossless-join decomposition of R if at least one of the following functional dependencies is in F+:
R1 ∩ R2 → R1
R1 ∩ R2 → R2
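This criterion can be checked mechanically with the attribute-closure algorithm: a binary decomposition is lossless-join exactly when the closure of the common attributes covers one of the two projections. A sketch in Python, applied to the two decompositions of S above (the helper names are our own):

```python
def closure(attrs, fds):
    """Closure X+ of a set of attributes under a list of (lhs, rhs) FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def lossless_binary(r1, r2, fds):
    """A decomposition of R into r1, r2 is lossless-join if the common
    attributes functionally determine all of r1 or all of r2 (the FD is in F+)."""
    common_closure = closure(set(r1) & set(r2), fds)
    return set(r1) <= common_closure or set(r2) <= common_closure

fds = [({"S#"}, {"SUPPLYSTATUS"}), ({"S#"}, {"SUPPLYCITY"})]
# Decomposition (1): the common attribute S# determines both sides -> lossless.
print(lossless_binary({"S#", "SUPPLYSTATUS"}, {"S#", "SUPPLYCITY"}, fds))            # True
# Decomposition (2): the common attribute SUPPLYSTATUS determines nothing -> lossy.
print(lossless_binary({"S#", "SUPPLYSTATUS"}, {"SUPPLYSTATUS", "SUPPLYCITY"}, fds))  # False
```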
Functional Dependency Diagrams: This is a handy tool for representing function dependencies
existing in a relation.
For the suppliers-and-parts database the dependencies are:
S# → SUPPLIERNAME, SUPPLYSTATUS, SUPPLYCITY
P# → PARTNAME, PARTCOLOR, PARTWEIGHT, SUPPLYCITY
{S#, P#} → PARTQTY, with S# → SUPPLYCITY also holding in SP
The diagram is very useful for its eloquence and in visualizing the FDs in a relation. Later in the Unit
you will learn how to use this diagram for normalization purposes.
6.2.1 First Normal Form
A relation is in 1st Normal form (1NF) if and only if, in every legal value of that relation, every tuple
contains exactly one value for each attribute.
Although simplest, 1NF relations have a number of discrepancies, and therefore 1NF is not the most
desirable form of a relation.
Let us take a relation (modified to illustrate the point in discussion):
REL1{S#, SUPPLYSTATUS, SUPPLYCITY, P#, PARTQTY}
For a good design the FD diagram should have arrows out of candidate keys only. The additional arrows
cause trouble.
Let us discuss some of the problems with this 1NF relation. For the purpose of illustration, let us
insert some sample tuples into this relation.
REL1
S#   SUPPLYSTATUS   SUPPLYCITY   P#   PARTQTY
S1   200            Madras       P1   3000
S1   200            Madras       P2   2000
S1   200            Madras       P3   4000
S1   200            Madras       P4   2000
S1   200            Madras       P5   1000
S1   200            Madras       P6   1000
S2   100            Mumbai       P1   3000
S2   100            Mumbai       P2   4000
S3   100            Mumbai       P2   2000
S4   200            Madras       P2   2000
S4   200            Madras       P4   3000
S4   200            Madras       P5   4000
The redundancies in the above relation cause many problems, usually known as update anomalies,
that is, problems in INSERT, DELETE and UPDATE operations. Let us see these problems due to the
supplier-city redundancy corresponding to the FD S# → SUPPLYCITY.
INSERT: In this relation, unless a supplier supplies at least one part, we cannot insert the
information regarding that supplier. Thus, a supplier located in Kolkata is missing from the relation
because he has not supplied any part so far.
DELETE: Let us see what problem we may face during deletion of a tuple. If we delete the tuple of a
supplier (when there is a single entry for that supplier), we not only delete the fact that the supplier
supplied a particular part but also the fact that the supplier is located in a particular city. In our
case, if we delete the entries corresponding to S# = S2, we lose the information that the supplier is
located at Mumbai. This is definitely undesirable. The problem here is that too much information is
attached to each tuple; deletion therefore forces us to lose too much information.
UPDATE: If we modify the city of supplier S1 from Madras to Mumbai, we have to make sure that
all the entries corresponding to S# = S1 are updated, otherwise inconsistency will be introduced. As a
result some entries will suggest that the supplier is located at Madras while others will contradict this
fact.
6.2.2 Second Normal Form
A relation is in 2NF if and only if it is in 1NF and every nonkey attribute is fully functionally
dependent on the primary key. Here it has been assumed that there is only one candidate key, which
is of course primary key.
A relation in 1NF can always be decomposed into an equivalent set of 2NF relations. The reduction
process consists of replacing the 1NF relation by suitable projections.
We have seen the problems arising due to the under-normalization (1NF) of the relation. The remedy is
to break the relation into two simpler relations:
REL2{S#, SUPPLYSTATUS, SUPPLYCITY} and
REL3{S#, P#, PARTQTY}
The FDs and sample relations are shown below. In REL2, S# → SUPPLYSTATUS and
S# → SUPPLYCITY; in REL3, {S#, P#} → PARTQTY.

REL2
S#   SUPPLYSTATUS   SUPPLYCITY
S1   200            Madras
S2   100            Mumbai
S3   100            Mumbai
S4   200            Madras
S5   300            Kolkata

REL3
S#   P#   PARTQTY
S1   P1   3000
S1   P2   2000
S1   P3   4000
S1   P4   2000
S1   P5   1000
S1   P6   1000
S2   P1   3000
S2   P2   4000
S3   P2   2000
S4   P2   2000
S4   P4   3000
S4   P5   4000
REL2 and REL3 are in 2NF, with primary keys {S#} and {S#, P#} respectively. This is because each
nonkey attribute of REL2 (SUPPLYSTATUS and SUPPLYCITY) is fully functionally dependent on the
primary key, S#. By a similar argument, REL3 is also in 2NF. Evidently, these two relations have
overcome the update anomalies stated earlier. It is now possible to record the facts regarding supplier
S5 even though he has not supplied any part, which was earlier not possible. This solves the insert
problem. Similarly, the delete and update problems are also resolved.
These relations in 2NF are still not free from all anomalies. REL3 is free from most of the
problems we are going to discuss here; REL2, however, still carries some problems. The reason is that
the dependency of SUPPLYSTATUS on S#, though functional, is transitive via SUPPLYCITY. Thus
we see that there are two dependencies, S# → SUPPLYCITY and SUPPLYCITY → SUPPLYSTATUS,
which together imply S# → SUPPLYSTATUS. This relation has a transitive dependency, and we will
see that this transitive dependency gives rise to another set of anomalies.
INSERT: We are unable to insert the fact that a particular city has a particular status until we have
some supplier actually located in that city.
DELETE: If we delete the sole REL2 tuple for a particular city, we delete the information that that city
has that particular status.
UPDATE: The status for a given city is still redundant. This causes the usual redundancy problems
related to update.
6.2.3 Third Normal Form
A relation is in 3NF if and only if it is in 2NF and every non-key attribute is non-transitively dependent on
the primary key.
To convert the 2NF relation into 3NF, once again REL2 is split into two simpler relations, REL4
and REL5, as shown below.
REL4{S#, SUPPLYCITY} and
REL5{SUPPLYCITY, SUPPLYSTATUS}
The FDs and sample relations are shown below.
The FDs are S# → SUPPLYCITY in REL4, and SUPPLYCITY → SUPPLYSTATUS in REL5.

REL4
S#   SUPPLYCITY
S1   Madras
S2   Mumbai
S3   Mumbai
S4   Madras
S5   Kolkata

REL5
SUPPLYCITY   SUPPLYSTATUS
Madras       200
Mumbai       100
Kolkata      300
Evidently, the above relations REL4 and REL5 are in 3NF, because there are no transitive
dependencies. Every 2NF relation can be reduced to 3NF by decomposing it further and removing any
transitive dependency.
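As a quick sanity check, a few lines of Python can verify that joining the two projections back together on SUPPLYCITY reproduces REL2 exactly, i.e. that this 3NF decomposition is lossless. The tuples mirror the sample tables above.

```python
# REL2 as a set of (S#, SUPPLYSTATUS, SUPPLYCITY) tuples.
rel2 = {("S1", 200, "Madras"), ("S2", 100, "Mumbai"), ("S3", 100, "Mumbai"),
        ("S4", 200, "Madras"), ("S5", 300, "Kolkata")}

rel4 = {(s, city) for s, _, city in rel2}    # projection on {S#, SUPPLYCITY}
rel5 = {(city, st) for _, st, city in rel2}  # projection on {SUPPLYCITY, SUPPLYSTATUS}

# Natural join of the two projections on SUPPLYCITY.
joined = {(s, st, city) for s, city in rel4 for c2, st in rel5 if city == c2}
print(joined == rel2)  # True: the decomposition loses no information
```

The same check applied to decomposition-2 of the next paragraph would also succeed; the difference between the two decompositions lies in dependency preservation, not in losslessness.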
Dependency Preservation
The reduction process may suggest a variety of ways in which a relation may be decomposed
losslessly. Recall REL2, in which there was a transitive dependency and which we therefore split into
two 3NF projections, i.e.
REL4{S#, SUPPLYCITY} and
REL5{SUPPLYCITY, SUPPLYSTATUS}
Let us call this decomposition-1. An alternative decomposition would be:
REL4{S#, SUPPLYCITY} and
REL5{S#, SUPPLYSTATUS}
which we will call decomposition-2.
Both decomposition-1 and decomposition-2 are 3NF and lossless. However,
decomposition-2 is less satisfactory than decomposition-1. For example, it is still not possible to
insert the information that a particular city has a particular status unless some supplier is located in
the city.
In decomposition-1 the two projections are independent of each other, but the same is not true of
the second decomposition. Independence here means that updates can be made to either relation
without regard to the other, provided the insertion is legal. Independent decompositions also
preserve the dependencies of the database: no dependency is lost in the decomposition process.
The concept of independent projections thus provides a basis for choosing a particular decomposition
when there is more than one choice.
6.2.4 Boyce-Codd Normal Form
The previous normal forms assumed that there was just one candidate key in the relation and that
this key was also the primary key. Another class of problems arises when this is not the case; in
practical database design there will very often be more than one candidate key. To be precise,
1NF, 2NF and 3NF did not deal adequately with relations that had two or more candidate keys,
where those candidate keys were composite and overlapped (i.e. had at least one attribute in common).
A relation is in BCNF (Boyce-Codd Normal Form) if and only if every nontrivial, left-irreducible FD has
a candidate key as its determinant.
Or
A relation is in BCNF if and only if all the determinants are candidate keys.
In other words, the only arrows in the FD diagram are arrows out of candidate keys. It has already
been explained that there will always be arrows out of candidate keys; the BCNF definition says there
are no others, meaning there are no arrows that can be eliminated by the normalization procedure.
These two definitions are apparently different from each other. The difference between the two BCNF
definitions is that in the latter, simpler case we tacitly assume that determinants are "not too big" and
that all FDs are nontrivial.
It should be noted that the BCNF definition is conceptually simpler than the old 3NF definition, in
that it makes no explicit reference to first and second normal forms as such, nor to the concept of
transitive dependence. Furthermore, although BCNF is strictly stronger than 3NF, it is still the case
that any given relation can be nonloss decomposed into an equivalent collection of BCNF relations.
Thus, relations REL1 and REL2 which were not in 3NF, are not in BCNF either; also that relations
REL3, REL4, and REL5, which were in 3NF, are also in BCNF. Relation REL1 contains three
determinants, namely {S#}, {SUPPLYCITY}, and {S#, P#}; of these, only {S#, P#} is a candidate key, so
REL1 is not in BCNF. Similarly, REL2 is not in BCNF either, because the determinant {SUPPLYCITY}
is not a candidate key. Relations REL3, REL4, and REL5, on the other hand, are each in BCNF,
because in each case the sole candidate key is the only determinant in the respective relations.
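The determinant test can be made mechanical: a relation is in BCNF exactly when the closure of every nontrivial FD's left-hand side contains all the attributes of the relation (i.e. every determinant is a superkey). A minimal Python illustration against REL2 and REL4 (the helper names are our own):

```python
def closure(attrs, fds):
    """Closure X+ of a set of attributes under a list of (lhs, rhs) FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_bcnf(rel, fds):
    """Every nontrivial FD's determinant must be a superkey of rel."""
    for lhs, rhs in fds:
        if set(rhs) <= set(lhs):                 # trivial FD: ignore
            continue
        if not set(rel) <= closure(lhs, fds):    # determinant is not a superkey
            return False
    return True

# REL2{S#, SUPPLYSTATUS, SUPPLYCITY}: SUPPLYCITY is a determinant but not a key.
fds = [({"S#"}, {"SUPPLYCITY"}), ({"SUPPLYCITY"}, {"SUPPLYSTATUS"})]
print(is_bcnf({"S#", "SUPPLYSTATUS", "SUPPLYCITY"}, fds))        # False
# REL4{S#, SUPPLYCITY}: the only determinant, S#, is a key.
print(is_bcnf({"S#", "SUPPLYCITY"}, [({"S#"}, {"SUPPLYCITY"})]))  # True
```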
We now consider an example involving two disjoint - i.e., nonoverlapping - candidate keys. Suppose
that in the usual suppliers relation REL1{S#, SUPPLIERNAME, SUPPLYSTATUS, SUPPLYCITY}, {S#}
and {SUPPLIERNAME} are both candidate keys (i.e., for all time, it is the case that every supplier has
a unique supplier number and also a unique supplier name). Assume, however, that attributes
SUPPLYSTATUS and SUPPLYCITY are mutually independent - i.e., the FD SUPPLYCITY →
SUPPLYSTATUS no longer holds. The FDs that then hold are:
S# → SUPPLIERNAME, SUPPLIERNAME → S#
S# → SUPPLYSTATUS, S# → SUPPLYCITY
SUPPLIERNAME → SUPPLYSTATUS, SUPPLIERNAME → SUPPLYCITY
Relation REL1 is in BCNF. Although the FD diagram does look "more complex" than a 3NF diagram,
it is nevertheless still the case that the only determinants are candidate keys; i.e., the only arrows are
arrows out of candidate keys. So the message of this example is just that having more than one
candidate key is not necessarily bad.
For illustration we will assume that in our relations supplier names are unique. Consider
REL6{S#, SUPPLIERNAME, P#, PARTQTY}.
Since it contains two determinants, S# and SUPPLIERNAME, that are not candidate keys for the
relation, this relation is not in BCNF. A sample snapshot of this relation is shown below:
REL6
S#   SUPPLIERNAME   P#   PARTQTY
S1   Pooran         P1   3000
S1   Anupam         P2   2000
S1   Vishal         P3   4000
S1   Vinod          P4   2000
As is evident from the figure above, relation REL6 involves the same kind of redundancies as did
relations REL1 and REL2, and hence is subject to the same kind of update anomalies. For example,
changing the name of suppliers from Vinod to Rahul leads, once again, either to search problems or
to possibly inconsistent results. Yet REL6 is in 3NF by the old definition, because that definition did
not require an attribute to be irreducibly dependent on each candidate key if it was itself a
component of some candidate key of the relation, and so the fact that SUPPLIERNAME is not
irreducibly dependent on {S#, P#} was ignored.
The solution to the REL6 problems is, of course, to break the relation down into two projections, in
this case the projections are:
REL7{S#, SUPPLIERNAME} and
REL8{S#, P#, PARTQTY}
Or
REL7{S#, SUPPLIERNAME} and
REL8{SUPPLIERNAME, P#, PARTQTY}
Both of these projections are in BCNF. The original design, consisting of the single relation REL6, is
clearly bad; the problems with it are intuitively obvious, and it is unlikely that any competent
database designer would ever seriously propose it, even if he or she had no exposure to the ideas of
BCNF etc. at all.
Comparison of BCNF and 3NF
We have seen two normal forms for relational-database schemas: 3NF and BCNF. There is an
advantage to 3NF in that we know that it is always possible to obtain a 3NF design without sacrificing
a lossless join or dependency preservation. Nevertheless, there is a disadvantage to 3NF: if we do not
eliminate all transitive dependencies, we may have to use null values to represent some of the
possible meaningful relationships among data items, and there is the problem of repetition of
information.
If we are forced to choose between BCNF and dependency preservation with 3NF, it is generally
preferable to opt for 3NF. If we cannot test for dependency preservation efficiently, we either pay a
high penalty in system performance or risk the integrity of the data in our database. Neither of these
alternatives is attractive. With such alternatives, the limited amount of redundancy imposed by
transitive dependencies allowed under 3NF is the lesser evil. Thus, we normally choose to retain
dependency preservation and to sacrifice BCNF.
6.2.5 Multi-valued dependency
Multi-valued dependency may be formally defined as follows.
Let R be a relation, and let A, B, and C be subsets of the attributes of R. Then we say that B is
multi-dependent on A - in symbols,
A →→ B
(read "A multi-determines B," or simply "A double-arrow B") - if and only if, in every possible legal
value of R, the set of B values matching a given (A value, C value) pair depends only on the A value
and is independent of the C value.
To elucidate the meaning of the above statement, let us take the example relation REL8 shown
below:
REL8 (TEACHERS and BOOKS are relation-valued attributes)
COURSE        TEACHERS                         BOOKS
Computer      {Dr. Wadhwa, Prof. Mittal}       {Graphics, UNIX}
Mathematics   {Prof. Saxena, Prof. Karmeshu}   {Relational Algebra, Discrete Maths}
Assume that for a given course there can exist any number of corresponding teachers and any
number of corresponding books. Moreover, let us also assume that teachers and books are quite
independent of one another; that is, no matter who actually teaches any particular course, the same
books are used. Finally, also assume that a given teacher or a given book can be associated with any
number of courses.
Let us try to eliminate the relation-valued attributes. One way to do this is simply to replace relation
REL8 by a relation REL9 with three scalar attributes COURSE, TEACHER, and BOOK as indicated
below.
REL9
COURSE        TEACHER          BOOK
Computer      Dr. Wadhwa       Graphics
Computer      Dr. Wadhwa       UNIX
Computer      Prof. Mittal     Graphics
Computer      Prof. Mittal     UNIX
Mathematics   Prof. Saxena     Relational Algebra
Mathematics   Prof. Saxena     Discrete Maths
Mathematics   Prof. Karmeshu   Relational Algebra
Mathematics   Prof. Karmeshu   Discrete Maths
As you can see from the relation, each tuple of REL8 gives rise to m * n tuples in REL9, where m and
n are the cardinalities of the TEACHERS and BOOKS relations in that REL8 tuple. Note that the
resulting relation REL9 is "all key".
The meaning of relation REL9 is basically as follows: A tuple {COURSE:c, TEACHER:t, BOOK:x}
appears in REL9 if and only if course c can be taught by teacher t and uses book x as a reference.
Observe that, for a given course, all possible combinations of teacher and book appear: that is, REL9
satisfies the (relation) constraint
if
tuples (c, t1, x1), (c, t2, x2) both appear
then tuples (c, t1, x2), (c, t2, x1) both appear also
Now, it should be apparent that relation REL9 involves a good deal of redundancy, leading as usual
to certain update anomalies. For example, to add the information that the Computer course can be
taught by a new teacher, it is necessary to insert two new tuples, one for each of the two books. Can
we avoid such problems? Well, it is easy to see that:
1. The problems in question are caused by the fact that teachers and books are completely
independent of one another;
2. Matters would be much improved if REL9 were decomposed into its two projections call them
REL10 and REL11 - on {COURSE, TEACHER} and {COURSE, BOOK}, respectively.
To add the information that the Computer course can be taught by a new teacher, all we have to do
now is insert a single tuple into relation REL10. Thus, it does seem reasonable to suggest that there
should be a way of "further normalizing" a relation like REL9.
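The claim that REL9 decomposes losslessly into REL10 and REL11 can be checked directly. A sketch in Python, using the course/teacher/book tuples above and assuming, per the stated constraint, that all teacher/book combinations appear for each course:

```python
# REL9 as a set of (COURSE, TEACHER, BOOK) tuples.
rel9 = {("Computer", "Dr. Wadhwa", "Graphics"), ("Computer", "Dr. Wadhwa", "UNIX"),
        ("Computer", "Prof. Mittal", "Graphics"), ("Computer", "Prof. Mittal", "UNIX"),
        ("Mathematics", "Prof. Saxena", "Relational Algebra"),
        ("Mathematics", "Prof. Saxena", "Discrete Maths"),
        ("Mathematics", "Prof. Karmeshu", "Relational Algebra"),
        ("Mathematics", "Prof. Karmeshu", "Discrete Maths")}

rel10 = {(c, t) for c, t, _ in rel9}  # projection on {COURSE, TEACHER}
rel11 = {(c, b) for c, _, b in rel9}  # projection on {COURSE, BOOK}

# Natural join of the two projections on COURSE.
rejoined = {(c, t, b) for c, t in rel10 for c2, b in rel11 if c == c2}
print(rejoined == rel9)  # True: the MVD pair COURSE ->-> TEACHER | BOOK holds
```

Adding a new Computer teacher now means inserting one tuple into rel10 rather than one tuple per book into rel9.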
It is obvious that the design of REL9 is bad and that the decomposition into REL10 and REL11 is better.
The trouble is, however, that these facts are not formally obvious. Note in particular that REL9 satisfies no
functional dependencies at all (apart from trivial ones such as COURSE → COURSE); in fact, REL9 is
in BCNF, since as already noted it is all key, and any "all key" relation must necessarily be in BCNF. (Note
that the two projections REL10 and REL11 are also all key and hence in BCNF.) The ideas of the
previous normalization are therefore of no help with the problem at hand.
The existence of "problem" BCNF relations like REL9 was recognized very early on, and the way to deal
with them was also soon understood, at least intuitively. However, it was not until 1977 that these
intuitive ideas were put on a sound theoretical footing by Fagin's introduction of the notion of multi-valued dependencies, MVDs. Multi-valued dependencies are a generalization of functional
dependencies, in the sense that every FD is an MVD, but the converse is not true (i.e., there exist
MVDs that are not FDs). In the case of relation REL9 there are two MVDs that hold:
COURSE →→ TEACHER
COURSE →→ BOOK
Note the double arrows; the MVD A →→ B is read as "B is multi-dependent on A" or, equivalently, "A
multi-determines B." Let us concentrate on the first MVD, COURSE →→ TEACHER. Intuitively, what
this MVD means is that, although a course does not have a single corresponding teacher - i.e., the
functional dependency COURSE → TEACHER does not hold - nevertheless, each course does have a
well-defined set of corresponding teachers. By "well-defined" here we mean, more precisely, that for a
given course c and a given book x, the set of teachers t matching the pair (c, x) in REL9 depends on
the value c alone - it makes no difference which particular value of x we choose. The second MVD,
COURSE →→ BOOK, is interpreted analogously.
It is easy to show that, given the relation R{A, B, C}, the MVD A →→ B holds if and only if the MVD
A →→ C also holds. MVDs always go together in pairs in this way. For this reason it is common to
represent them both in one statement, thus:
COURSE →→ TEACHER | BOOK
Now, we stated above that multi-valued dependencies are a generalization of functional
dependencies, in the sense that every FD is an MVD. More precisely, an FD is an MVD in which the
set of dependent (right-hand side) values matching a given determinant (left-hand side) value is
always a singleton set. Thus, if A → B, then certainly A →→ B.
Returning to our original REL9 problem, we can now see that the trouble with relations such as REL9
is that they involve MVDs that are not also FDs. (In case it is not obvious, we point out that it is
precisely the existence of those MVDs that leads to the necessity of, for example, inserting two
tuples to add another Computer teacher. Those two tuples are needed in order to maintain the
integrity constraint that is represented by the MVD.) The two projections REL10 and REL11 do not
involve any such MVDs, which is why they represent an improvement over the original design. We
would therefore like to replace REL9 by those two projections, and an important theorem proved by
Fagin in reference allows us to make exactly that replacement:
Theorem (Fagin): Let R{A, B, C} be a relation, where A, B, and C are sets of attributes. Then R is equal
to the join of its projections on {A, B} and {A, C} if and only if R satisfies the MVDs A →→ B | C.
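Fagin's theorem can be demonstrated directly in SQL. The sketch below is an added illustration, assuming a hypothetical base table rel9(course, teacher, book); it builds the two projections and reassembles them by join:

```sql
-- Project REL9 onto {COURSE, TEACHER} and {COURSE, BOOK} (REL10 and REL11)
CREATE VIEW rel10 AS SELECT DISTINCT course, teacher FROM rel9;
CREATE VIEW rel11 AS SELECT DISTINCT course, book    FROM rel9;

-- When the MVDs COURSE ->> TEACHER | BOOK hold, this join returns
-- exactly the rows of REL9: the decomposition is nonloss
SELECT r10.course, r10.teacher, r11.book
FROM   rel10 r10
JOIN   rel11 r11 ON r10.course = r11.course;
```

If the MVDs did not hold, the join would contain spurious rows not present in rel9.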
At this stage we are equipped to define fourth normal form:
Fourth normal form: Relation R is in 4NF if and only if, whenever there exist subsets A and B of the
attributes of R such that the nontrivial MVD A →→ B is satisfied (an MVD A →→ B is trivial if either
A is a superset of B or the union of A and B is the entire heading), then all attributes of R are also
functionally dependent on A.
In other words, the only nontrivial dependencies (FDs or MVDs) in R are of the form Y → X (i.e., a
functional dependency from a superkey Y to some other attribute X). Equivalently: R is in 4NF if it is
in BCNF and all MVDs in R are in fact "FDs out of keys." It follows that 4NF implies BCNF.
Relation REL9 is not in 4NF, since it involves an MVD that is not an FD at all, let alone an FD "out of
a key." The two projections REL10 and REL11 are both in 4NF, however. Thus 4NF is an
improvement over BCNF, in that it eliminates another form of undesirable dependency. What is more,
4NF is always achievable; that is, any relation can be nonloss decomposed into an equivalent
collection of 4NF relations.
You may recall that a relation R{A, B, C} satisfying the FDs A → B and B → C is better decomposed into
its projections on {A, B} and {B, C} rather than into those on {A, B} and {A, C}. The same holds true if
we replace the FDs by the MVDs A →→ B and B →→ C.
its JD) suffers from a number of problems over update operations, problems that are removed when it
is 3-decomposed.
Fagin's theorem, to the effect that R{A, B, C} can be non-loss-decomposed into its projections on {A,
B} and {A, C} if and only if the MVDs A →→ B and A →→ C hold in R, can now be restated as follows:
R{A, B, C} satisfies the JD * { {A, B}, {A, C} } if and only if it satisfies the MVDs A →→ B | C.
Since this theorem can be taken as a definition of multi-valued dependency, it follows that an MVD is
just a special case of a JD, or (equivalently) that JDs are a generalization of MVDs.
Thus, to put it formally, we have
A →→ B | C  ≡  * { {A, B}, {A, C} }
Note that join dependencies are the most general form of dependency possible (using, of course, the
term "dependency" in a very special sense). That is, there does not exist a still higher form of
dependency such that JDs are merely a special case of that higher form - so long as we restrict our
attention to dependencies that deal with a relation being decomposed via projection and recomposed
via join.
Coming back to the running example, we can see that the problem with relation REL12 is that it
involves a JD that is not an MVD, and hence not an FD either. We have also seen that it is possible,
and probably desirable, to decompose such a relation into smaller components - namely, into the
projections specified by the join dependency. That decomposition process can be repeated until all
resulting relations are in fifth normal form, which we now define:
Fifth normal form: A relation R is in 5NF - also called projection-join normal form (PJ/NF) - if and
only if every nontrivial join dependency that holds for R is implied by the candidate keys of R.
Let us understand what it means for a JD to be "implied by candidate keys."
Relation REL12 is not in 5NF: it satisfies a certain join dependency, namely Constraint 3D, that is
certainly not implied by its sole candidate key (that key being the combination of all of its attributes).
Stated differently, relation REL12 is not in 5NF, because (a) it can be 3-decomposed and (b) that
3-decomposability is not implied by the fact that the combination {S#, P#, J#} is a candidate key. By
contrast, after 3-decomposition, the three projections SP, PJ, and JS are each in 5NF, since they do
not involve any (nontrivial) JDs at all.
Now let us understand through an example, what it means for a JD to be implied by candidate keys.
Suppose that the familiar suppliers relation REL1 has two candidate keys, {S#} and
{SUPPLIERNAME}. Then that relation satisfies several join dependencies - for example, it satisfies the
JD
* { { S#, SUPPLIERNAME, SUPPLYSTATUS }, { S#, SUPPLYCITY } }
That is, relation REL1 is equal to the join of its projections on {S#, SUPPLIERNAME, SUPPLYSTATUS}
and {S#, SUPPLYCITY}, and hence can be nonloss-decomposed into those projections. (This fact does
not mean that it should be so decomposed, of course, only that it could be.) This JD is implied by the
fact that {S#} is a candidate key (in fact it is implied by Heath's theorem). Likewise, relation REL1 also
satisfies the JD
* { {S#, SUPPLIERNAME}, {S#, SUPPLYSTATUS}, {SUPPLIERNAME, SUPPLYCITY} }
This JD is implied by the fact that {S#} and {SUPPLIERNAME} are both candidate keys.
To conclude, we note that it follows from the definition that 5NF is the ultimate normal form with
respect to projection and join (which accounts for its alternative name, projection-join normal form).
That is, a relation in 5NF is guaranteed to be free of anomalies that can be eliminated by taking
projections. For a relation in 5NF, the only join dependencies are those that are implied by
candidate keys, and so the only valid decompositions are ones that are based on those candidate
keys. (Each projection in such a decomposition will consist of one or more of those candidate keys,
plus zero or more additional attributes.) For example, the suppliers relation REL1 is in 5NF. It can
be further decomposed in several nonloss ways, as we saw earlier, but every projection in any such
decomposition will still include one of the original candidate keys, and hence there does not seem to
be any particular advantage in that further reduction.
6.3 Self Test
Exercise 1
True-False Questions:
1. Normalisation is the process of splitting a relation into two or more relations. (T)
2. A functional dependency is a relationship between or among attributes. (T)
3. The known or given attribute is called the determinant in a functional dependency. (T)
4. The relationship in a functional dependency is one-to-one (1:1). (F)
5. A key is a group of one or more attributes that uniquely identifies a row. (T)
6. The selection of the attributes to use for the key is determined by the database programmers. (F)
7. A deletion anomaly occurs when deleting one entity results in deleting facts about another
entity. (T)
8. An insertion anomaly occurs when we cannot insert some data into the database without
inserting another entity first. (T)
9. Modification anomalies do not occur in tables that meet the definition of a relation. (F)
10. A table of data that meets the minimum definition of a relation is automatically in first
normal form. (T)
11. A relation is in first normal form if all of its non-key attributes are dependent on part of the
key. (F)
12. A relation is in second normal form if all of its non-key attributes are dependent on all of the
key. (T)
13. A relation can be in third normal form without being in second normal form. (F)
14. Fifth normal form is the highest normal form. (F)
15. A relation in domain/key normal form is guaranteed to have no modification anomalies. (T)
Exercise 2
1. file
2. field
3. record
4. row
5. column
6.4 Summary
Functional dependency: describes the relationship between attributes in a relation. For example, if A and B are attributes
of a relation R, B is functionally dependent on A (denoted A → B) if each value of A is associated
with exactly one value of B. (A and B may each consist of one or more attributes.)
The concept of database normalization is not unique to any particular Relational Database
Management System. It can be applied to any of several implementations of relational databases,
including Microsoft Access, dBase, Oracle, etc.
Normalization may have the effect of duplicating data within the database and often results in the
creation of additional tables. (While normalization tends to increase the duplication of data, it
does not introduce redundancy, which is unnecessary duplication.) Normalization is typically a
refinement process after the initial exercise of identifying the data objects that should be in the
database, identifying their relationships, and defining the tables required and the columns within
each table.
UNIT 7
STRUCTURED QUERY LANGUAGE
7.1 Introduction of SQL
7.2 DDL Statements
7.3 DML Statements
7.4 View Definitions
7.5 Constraints and Triggers
7.6 Keys and Foreign Keys
7.7 Constraints on Attributes and Tuples
7.8 Modification of Constraints
7.9 Cursors
7.10 Dynamic SQL
7.1 Introduction Of SQL
At the heart of every DBMS is a language that is similar to a programming language, but different
in that it is designed specifically for communicating with a database. One powerful language is
SQL. IBM developed SQL in the late 1970s and early 1980s as a way to standardize the query
language across the many mainframe and microcomputer platforms that the company produced.
SQL differs significantly from programming languages. Most programming languages are still
procedural. A procedural language consists of commands that tell the computer what to do,
instruction by instruction, step by step. SQL is not a programming language itself; it is a data
access language. SQL may be embedded in traditional procedural programming languages (like
COBOL). A SQL statement is not really a command to the computer. Rather, it is a description of
some of the data contained in a database. SQL is nonprocedural because it does not give step-by-step
commands to the computer or database. SQL describes data, and instructs the database to
do something with the data.
For example
SELECT [Name], [Company_Name]
FROM Contacts
WHERE City = 'Kansas City' AND [Name] LIKE 'R%';
7.2 DDL Statements

Data Definition Language (DDL) is a set of SQL commands used to create, modify and delete database
structures (not data). These commands wouldn't normally be used by a general user, who should be
accessing the database via an application. They are normally used by the DBA (to a limited extent), a
database designer or application developer. These statements are immediate; they are not susceptible
to ROLLBACK commands. You should also note that if you have executed several DML updates, then
issuing any DDL command will COMMIT all the updates, as every DDL command implicitly issues a
COMMIT command to the database. Anybody using DDL must have the CREATE object privilege and
a Tablespace area in which to create objects.
In an Oracle database objects can be created at any time, whether users are on-line or not. Table
space need not be specified as Oracle will pick up the user defaults (defined by the DBA) or the
system defaults. Tables will expand automatically to fill disk partitions (provided this has been set up
in advance by the DBA). Table structures may be modified on-line although this can have dire effects
on an application so be careful.
Creating our two example tables
CREATE TABLE BOOK (
ISBN NUMBER(10),
TITLE VARCHAR2(200),
AUTHOR VARCHAR2(50),
COST NUMBER(8,2),
LENT_DATE DATE,
RETURNED_DATE DATE,
TIMES_LENT NUMBER(6),
SECTION_ID NUMBER(3)
)
CREATE TABLE SECTION (
SECTION_ID NUMBER(3),
SECTION_NAME CHAR(30),
BOOK_COUNT NUMBER(6)
)
The two commands above create our two sample tables and demonstrate the basic table creation
command. The CREATE keyword is followed by the type of object that we want created (TABLE,
VIEW, INDEX etc.), and that is followed by the name we want the object to be known by. Between the
outer brackets lie the parameters for the creation, in this case the names, datatypes and sizes of each
field.
A NUMBER is a numeric field; the size is not the maximum externally displayed number but the size
of the internal binary field set aside for the field (10 can hold a very large number). A number size
split with a comma denotes the field size followed by the number of digits following the decimal point
(in this case a currency field with two digits after the decimal point).
A VARCHAR2 is a variable length string field from 0-n, where n is the specified size. Oracle only takes
up the space required to hold any value in the field; it doesn't allocate the entire storage space unless
required to by a maximum sized field value (max size 2000).
A CHAR is a fixed length string field (Max size 255).
A DATE is an internal date/time field (normally 7 bytes long).
A LONG or LONG RAW field (not shown) is used to hold large binary objects (Word documents, AVI
files etc.). No size is specified for these field types. (Max size 2Gb).
Creating our two example tables with constraints
Constraints are used to enforce table rules and prevent data dependent deletion (enforce database
integrity). You may also use them to enforce business rules (with some imagination).
Our two example tables do have some rules which need enforcing, specifically both tables need to
have a prime key (so that the database doesn't allow replication of data). And the Section ID needs to
be linked to each book to identify which library section it belongs to (the foreign key). We also want to
specify which columns must be filled in and possibly some default values for other columns.
Constraints can be at the column or table level.
Constraint     Description
NOT NULL       Specifies that a column must have some value. NULL (the default) allows NULL
               values in the column.
DEFAULT        Specifies a default value for the column, used when an insert supplies no value.
UNIQUE         Specifies that column(s) must have unique values.
PRIMARY KEY    Specifies that column(s) are the table prime key and must have unique values. An index is
               automatically generated for the column.
FOREIGN KEY    Specifies that column(s) are a table foreign key and will use the referential uniqueness of the
               parent table. Foreign keys allow deletion cascades and table / business rule validation.
CHECK          Specifies a condition that every row's value(s) must satisfy.
DISABLE        You may suffix DISABLE to any other constraint to make Oracle ignore the constraint;
               the constraint will still be available to applications / tools and you can enable the
               constraint later if required.
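As a minimal sketch of these column-level constraints in use, the SECTION table could be recreated as below, assuming we want SECTION_ID as the prime key, SECTION_NAME mandatory, and BOOK_COUNT defaulting to zero (the constraint name SECT_PRIME is hypothetical):

```sql
CREATE TABLE SECTION (
  SECTION_ID   NUMBER(3) CONSTRAINT SECT_PRIME PRIMARY KEY,  -- unique, indexed
  SECTION_NAME CHAR(30)  NOT NULL,                           -- must be supplied
  BOOK_COUNT   NUMBER(6) DEFAULT 0                           -- 0 when omitted
)
```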
CREATE TABLE BOOK (
ISBN NUMBER(10),
TITLE VARCHAR2(200),
AUTHOR VARCHAR2(50),
COST NUMBER(8,2),
LENT_DATE DATE,
RETURNED_DATE DATE,
TIMES_LENT NUMBER(6),
SECTION_ID NUMBER(3),
CONSTRAINT BOOK_PRIME PRIMARY KEY (ISBN),
CONSTRAINT BOOK_SECT FOREIGN KEY (SECTION_ID) REFERENCES SECTION(SECTION_ID) ON
DELETE CASCADE)
Oracle (and any other decent RDBMS) would not allow us to delete a section which had books
assigned to it as this breaks integrity rules. If we wanted to get rid of all the book records assigned to
a particular section when that section was deleted we could implement a DELETE CASCADE. The
delete cascade operates across a foreign key link and removes all child records associated with a
parent record (we would probably want to reassign the books rather than delete them in the real
world).
To log constraint violations I have created a new table (AUDIT) and stated that all exceptions on the
SECTION table should be logged in this table, you can then view the contents of this table with
standard SELECT statements. The AUDIT table must have the shown structure but can be called
anything.
It is possible to record a description or comment against a newly created or existing table or
individual column by using the COMMENT command. The comment command writes your table /
column description into the data dictionary. You can query column comments by selecting against
dictionary views ALL_COL_COMMENTS and USER_COL_COMMENTS.
You can query table comments by selecting against dictionary views ALL_TAB_COMMENTS and
USER_TAB_COMMENTS. Comments can be up to 255 characters long.
Altering tables and constraints
Modification of database object structure is executed with the ALTER statement.
You can modify a constraint as follows:
Add a new constraint to a column or table.
Remove a constraint.
Enable / disable a constraint.
You cannot change a constraint definition.
You can modify a table as follows:
Add new columns.
Modify existing columns.
You cannot delete an existing column.
An example of adding a column to a table is given below:
ALTER TABLE JD11.BOOK ADD (REVIEW VARCHAR2(200))
This statement adds a new column (REVIEW) to our book table, to enable library members to browse
the database and read short reviews of the books.
If we want to add a constraint to our new column we can use the following ALTER statement:
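One way to constrain the new REVIEW column is sketched below, under the assumption that we want to cap review length; the constraint name REVIEW_CHECK is hypothetical:

```sql
-- add a CHECK constraint to the existing REVIEW column
ALTER TABLE JD11.BOOK ADD CONSTRAINT REVIEW_CHECK
  CHECK (LENGTH(REVIEW) <= 200)
```

The same ALTER TABLE ... ADD CONSTRAINT form works for UNIQUE, PRIMARY KEY and FOREIGN KEY constraints.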
www.arihantinfo.com
107
7.3 DML Statements

Data Manipulation Language (DML) is the area of SQL that allows you to change data within the database. It
consists of only three command statement groups: INSERT, UPDATE and DELETE.
Inserting new rows into a table
We insert new rows into a table with the INSERT INTO command. A simple example is given below.
INSERT INTO JD11.SECTION VALUES (SECIDNUM.NEXTVAL, 'Computing', 0)
The INSERT INTO command is followed by the name of the table (and owning schema if required),
this in turn is followed by the VALUES keyword which denotes the start of the value list. The value
list comprises all the values to insert into the specified columns. We have not specified the columns
we want to insert into in this example so we must provide a value for each and every column in the
correct order. The correct order of values can be determined by doing a SELECT * or DESCRIBE
against the required table, the order that the columns are displayed is the order of the values that
you specify in the value list. If we want to specify columns individually (when not filling all values in a
new row) we can do this with a column list specified before the VALUES keyword. Our example is
reworked below, note that we can specify the columns in any order - our values are now in the order
that we specified for the column list.
INSERT INTO JD11.SECTION
(SECTION_NAME, SECTION_ID)
VALUES
('Computing', SECIDNUM.NEXTVAL)
In the above example we haven't specified the BOOK_COUNT column so we don't provide a value for
it, this column will be set to NULL which is acceptable since we don't have any constraint on the
column that would prevent our new row from being inserted.
The SQL required to generate the data in the two test tables is given below.
INSERT INTO JD11.SECTION
(SECTION_NAME, SECTION_ID)
VALUES
('Fiction', 10);
INSERT INTO JD11.SECTION
(SECTION_NAME, SECTION_ID)
VALUES
('Romance', 5);
INSERT INTO JD11.SECTION
(SECTION_NAME, SECTION_ID)
VALUES
('Science Fiction', 6);
INSERT INTO JD11.SECTION
(SECTION_NAME, SECTION_ID)
VALUES
('Science', 7);
INSERT INTO JD11.SECTION
(SECTION_NAME, SECTION_ID)
VALUES
('Reference', 9);
INSERT INTO JD11.SECTION
(SECTION_NAME, SECTION_ID)
VALUES
('Law', 11);
INSERT INTO JD11.BOOK
(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT, SECTION_ID)
VALUES
(21, 'HELP', 'B.Baker', 20.90, '20-AUG-97', NULL, 10, 9);
INSERT INTO JD11.BOOK
(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT, SECTION_ID)
VALUES
(87, 'Killer Bees', 'E.F.Hammond', 29.90, NULL, NULL, NULL, 9);
INSERT INTO JD11.BOOK
(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT, SECTION_ID)
VALUES
(90, 'Up the creek', 'K.Klydsy', 15.95, '15-JAN-97', '21-JAN-97', 1, 10);
INSERT INTO JD11.BOOK
(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT, SECTION_ID)
VALUES
(22, 'Seven seas', 'J.J.Jacobs', 16.00, '21-DEC-97', NULL, 19, 5);
INSERT INTO JD11.BOOK
(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT, SECTION_ID)
VALUES
(91, 'Dirty steam trains', 'J.SP.Smith', 8.25, '14-JAN-98', NULL, 98, 9);
INSERT INTO JD11.BOOK
(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT, SECTION_ID)
VALUES
(101, 'The story of trent', 'T.Wilbury', 17.89, '10-JAN-98', '16-JAN-98', 12, 6);
INSERT INTO JD11.BOOK
(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT, SECTION_ID)
VALUES
(8, 'Over the past again', 'K.Jenkins', 19.87, NULL, NULL, NULL, 10);
INSERT INTO JD11.BOOK
(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT, SECTION_ID)
VALUES
(79, 'Courses for horses', 'H.Harriot', 10.34, '17-JAN-98', NULL, 12, 9);
INSERT INTO JD11.BOOK
(ISBN, TITLE, AUTHOR, COST, LENT_DATE, RETURNED_DATE, TIMES_LENT, SECTION_ID)
VALUES
(989, 'Leaning on a tree', 'M.Kilner', 19.41, '12-NOV-97', '22-NOV-97', 56, 11);
Updating rows with UPDATE
The UPDATE command allows you to change the values of rows in a table; you can include a WHERE
clause in the same fashion as the SELECT statement to indicate which row(s) you want values
changed in. In much the same way as the INSERT statement you specify the columns you want to
update and the new values for those specified columns. The combination of WHERE clause (row
selection) and column specification (column selection) allows you to pinpoint exactly the value(s) you
want changed. Unlike the INSERT command the UPDATE command can change multiple rows so you
should take care that you are updating only the values you want changed (see the transactions
discussion for methods of limiting damage from accidental updates).
An example is given below; this example will update a single row in our BOOK table:
UPDATE JD11.BOOK SET TITLE = 'Leaning on a wall', AUTHOR = 'J.Killner', TIMES_LENT = 0,
LENT_DATE = NULL, RETURNED_DATE = NULL WHERE ISBN = 989
We specify the table to be updated after the UPDATE keyword. Following the SET keyword we specify
a comma delimited list of column names / new values, each column to be updated must be specified
here (note that you can set columns to NULL by using the NULL keyword instead of a new value). The
WHERE clause follows the last column / new value specification and is constructed in the same way
as for the SELECT statement, use the WHERE clause to pinpoint which rows to be updated. If you
don't specify a WHERE clause on an UPDATE command all rows will be updated (this may or may
not be the desired result).
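To illustrate a multi-row update, the following added sketch (not from the original examples) would bump the lending counter for every book in section 9 at once:

```sql
-- the WHERE clause selects several rows; SET is applied to each of them
UPDATE JD11.BOOK SET TIMES_LENT = TIMES_LENT + 1 WHERE SECTION_ID = 9
```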
Deleting rows with DELETE
The DELETE command allows you to remove rows from a table, you can include a WHERE clause in
the same fashion as the SELECT statement to indicate which row(s) you want deleted - in nearly all
cases you should specify a WHERE clause, running a DELETE without a WHERE clause deletes ALL
rows from the table. Unlike the INSERT command, the DELETE command can affect multiple rows,
so you should take great care that you are deleting only the rows you want removed (see the
transactions discussion for methods of limiting damage from accidental deletions).
An example is given below; this example will delete a single row in our BOOK table:
DELETE FROM JD11.BOOK WHERE ISBN = 989
The DELETE FROM command is followed by the name of the table from which a row will be deleted,
followed by a WHERE clause specifying the column / condition values for the deletion.
DELETE FROM JD11.BOOK WHERE ISBN <> 989
This delete removes all records from the BOOK table except the one specified. Remember that if you
omit the WHERE clause all rows will be deleted.
7.4 View Definitions
DBMaker provides several convenient methods of customizing and speeding up access to your data.
Views and synonyms are supported to allow user-defined views and names for database objects.
Indexes provide a much faster method of retrieving data from a table when you use a column with an
index in a query.
Managing Views
DBMaker provides the ability to define a virtual table, called a view, which is based on existing tables
and is stored in the database as a definition and a user-defined view name. The view definition is
stored persistently in the database, but the actual data that you will see in the view is not physically
stored anywhere. Rather, the data is stored in the base tables from which the view's rows are derived.
A view is defined by a query which references one or more tables (or other views).
Views are a very helpful mechanism for using a database. For example, you can define complex
queries once and use them repeatedly without having to re-invent them over and over. Furthermore,
views can be used to enhance the security of your database by restricting access to a predetermined
set of rows and/or columns of a table.
Since views are derived by querying tables, it is not always possible to determine which rows of the underlying tables should be updated.
Due to this limitation, views can only be queried. Users cannot update, insert into, or delete from
views.
Creating Views
Each view is defined by a name together with a query that references tables or other views. You can
specify a list of column names for the view different from those in the original table when creating a
view. If you do not specify any new column names, the view will use the column names from the
underlying tables.
For example, if you want users to see only three columns of the table Employees, you can create a
view with the SQL command shown below. Users can then view only the FirstName, LastName and
Telephone columns of the table Employees through the view empView.
dmSQL> create view empView (FirstName, LastName, Telephone) as
select FirstName, LastName, Phone from Employees;
The query that defines a view cannot contain the ORDER BY clause or UNION operator.
Dropping Views
You can drop a view when it is no longer required. When you drop a view, only the definition stored in
system catalog is removed. There is no effect on the base tables that the view was derived from. To
drop a view, execute the following command:
dmSQL> DROP VIEW empView;
Managing Synonyms
A synonym is an alias, or alternate name, for any table or view. Since a synonym is simply an alias, it
requires no storage other than its definition in the system catalog.
Synonyms are useful for simplifying a fully qualified table or view name. DBMaker normally identifies
tables and views with fully qualified names that are composites of the owner and object names. By
using a synonym anyone can access a table or view through the corresponding synonym without
having to use the fully qualified name. Because a synonym has no owner name, all synonyms in the
database must be unique so DBMaker can identify them.
Creating Synonyms
You can create a synonym with the following SQL command:
dmSQL> create synonym Employees for SYSADM.Employees;
If the owner of the table Employees is the SYSADM, this command creates an alias named Employees
for the table SYSADM.Employees. All database users can directly reference the table
SYSADM.Employees through the synonym Employees.
Dropping Synonyms
You can drop a synonym that is no longer required. When you drop a synonym, only its definition is
removed from the system catalog.
The following SQL command drops the synonym Employees:
dmSQL> drop synonym Employees;
Managing Indexes
An index provides support for fast random access to a row. You can build indexes on a table to speed
up searching. For example, when you execute the query SELECT NAME FROM EMPLOYEES WHERE
NUMBER = 10005, it is possible to retrieve the data in a much shorter time if there is an index
created on the NUMBER column.
An index can be composed of more than one column, up to a maximum of 16 columns. Although a
table can have up to 252 columns, only the first 127 columns can be used in an index.
An index can be unique or non-unique. In a unique index, no more than one row can have the same
key value, with the exception that any number of rows may have NULL values. If you create a unique
index on a non-empty table, DBMaker will check whether all existing keys are distinct or not. If there
are duplicate keys, DBMaker will return an error message. After creating a unique index on a table,
you can insert a row in this table and DBMaker will certify that there is no existing row that already
has the same key as the new row.
When creating an index, you can specify the sort order of each index column as ascending or
descending. For example, suppose there are five keys in a table with the values 1, 3, 9, 2, and 6. In
ascending order the sequence of keys in the index is 1, 2, 3, 6, and 9, and in descending order the
sequence of keys in the index is 9, 6, 3, 2, and 1.
When you execute a query, the index order will occasionally affect the order of the data output.
For example, if you have a table named FRIEND_TABLE with NAME and AGE columns, the output will appear
as below when you execute the query SELECT NAME, AGE FROM FRIEND_TABLE WHERE AGE > 20
using a descending index on the AGE column.
name             age
---------------- ----------------
Jeff             49
Kevin            40
Jerry            38
Hughes           30
Cathy            22
As with tables, when you create an index you can specify the fill factor for it. The fill factor denotes how
dense the keys will be in the index pages. The legal fill factor values are in the range from 1% to
100%, and the default is 100%. If you often update data after creating the index, you can set a loose
fill factor in the index, for example 60%. If you never update the data in this table, you can leave the
fill factor at the default value of 100%.
Creating Indexes
To create an index on a table, you must specify the index name and index columns. You can specify
the sort order of each column as ascending (ASC) or descending (DESC). The default sort order is
ascending.
For example, the following SQL command creates an index IDX1 on the column NUMBER of table
EMPLOYEES in descending order.
dmSQL> create index idx1 on Employees (Number desc);
Also, if you want to create a unique index you have to explicitly specify it. Otherwise DBMaker
implicitly creates non-unique indexes. The following example shows you how to create a unique index
idx1 on the column Number of the table Employees:
dmSQL> create unique index idx1 on Employees (Number);
The next example shows you how to create an index with a specified fill factor:
dmSQL> create index idx2 on Employees(Number, LastName DESC) fillfactor 60;
Dropping Indexes
You can drop indexes using the DROP INDEX statement. In general, you might need to drop an index
if it becomes fragmented, which reduces its efficiency. Rebuilding the index will create a denser,
unfragmented index.
If the index is a primary key and is referred to by other tables, it cannot be dropped.
The following SQL command drops the index idx1 from the table Employees.
dmSQL> drop index idx1 from Employees;
Information on SQL constraints can be found in the textbook. The Oracle implementation of
constraints differs from the SQL standard, as documented in Oracle 9i SQL versus Standard SQL.
Triggers are a special PL/SQL construct similar to procedures. However, a procedure is executed
explicitly from another block via a procedure call, while a trigger is executed implicitly whenever the
triggering event happens. The triggering event is an INSERT, DELETE, or UPDATE command.
The timing can be either BEFORE or AFTER. A trigger can be either row-level or statement-level,
where the former fires once for each row affected by the triggering statement and the latter fires once
for the whole statement.
Deferring Constraint Checking
Sometimes it is necessary to defer the checking of certain constraints, most commonly in the
"chicken-and-egg" problem. Suppose we want to say:
CREATE TABLE chicken (cID INT PRIMARY KEY,
eID INT REFERENCES egg(eID));
CREATE TABLE egg(eID INT PRIMARY KEY,
cID INT REFERENCES chicken(cID));
But if we simply type the above statements into Oracle, we'll get an error: the CREATE TABLE
statement for chicken refers to table egg, which hasn't been created yet. Creating egg first won't
help either, because egg in turn refers to chicken.
To work around this problem, we need SQL schema modification commands. First, create chicken
and egg without foreign key declarations:
CREATE TABLE chicken(cID INT PRIMARY KEY,
eID INT);
CREATE TABLE egg(eID INT PRIMARY KEY,
cID INT);
Then, we add foreign key constraints:
ALTER TABLE chicken ADD CONSTRAINT chickenREFegg
FOREIGN KEY (eID) REFERENCES egg(eID)
INITIALLY DEFERRED DEFERRABLE;
ALTER TABLE egg ADD CONSTRAINT eggREFchicken
FOREIGN KEY (cID) REFERENCES chicken(cID)
INITIALLY DEFERRED DEFERRABLE;
INITIALLY DEFERRED DEFERRABLE tells Oracle to do deferred constraint checking. For example, to
insert (1, 2) into chicken and (2, 1) into egg, we use:
INSERT INTO chicken VALUES(1, 2);
INSERT INTO egg VALUES(2, 1);
COMMIT;
Because we've declared the foreign key constraints as "deferred", they are only checked at the commit
point. (Without deferred constraint checking, we cannot insert anything into chicken and egg,
because the first INSERT would always be a constraint violation.)
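The same deferred-checking behaviour can be observed outside Oracle. The sketch below uses SQLite
through Python's sqlite3 module (an assumption of this illustration, not the text's DBMS). SQLite
happens to accept a forward reference in the DDL, so only the mutually-referencing INSERTs need the
deferral here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default

# SQLite lets a table reference one that does not exist yet, so the
# chicken-and-egg DDL problem does not arise; deferred checking still
# matters for the mutually-referencing INSERTs below.
conn.execute("""CREATE TABLE chicken (
    cID INTEGER PRIMARY KEY,
    eID INTEGER REFERENCES egg(eID) DEFERRABLE INITIALLY DEFERRED)""")
conn.execute("""CREATE TABLE egg (
    eID INTEGER PRIMARY KEY,
    cID INTEGER REFERENCES chicken(cID) DEFERRABLE INITIALLY DEFERRED)""")

conn.execute("INSERT INTO chicken VALUES (1, 2)")  # egg 2 does not exist yet
conn.execute("INSERT INTO egg VALUES (2, 1)")      # now both rows are consistent
conn.commit()                                      # constraints checked here

rows = conn.execute("SELECT cID, eID FROM chicken").fetchall()
print(rows)  # [(1, 2)]
```

Without DEFERRABLE INITIALLY DEFERRED, the first INSERT would fail immediately, exactly as the
text describes for Oracle.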
Finally, to get rid of the tables, we have to drop the constraints first, because Oracle won't allow us to
drop a table that's referenced by another table.
ALTER TABLE egg DROP CONSTRAINT eggREFchicken;
ALTER TABLE chicken DROP CONSTRAINT chickenREFegg;
DROP TABLE egg;
DROP TABLE chicken;
Basic Trigger Syntax
Below is the syntax for creating a trigger in Oracle (which differs slightly from standard SQL syntax):
CREATE [OR REPLACE] TRIGGER <trigger_name>
{BEFORE|AFTER} {INSERT|DELETE|UPDATE} ON <table_name>
[REFERENCING [NEW AS <new_row_name>] [OLD AS <old_row_name>]]
[FOR EACH ROW [WHEN (<trigger_condition>)]]
<trigger_body>
Some important points to note:
You can create only BEFORE and AFTER triggers for tables. (INSTEAD OF triggers are only
available for views; typically they are used to implement view updates.)
You may specify up to three triggering events using the keyword OR. Furthermore, UPDATE
can optionally be followed by the keyword OF and a list of attributes in <table_name>. If
present, the OF clause restricts the event to updates of the listed attributes only, for
example AFTER UPDATE OF age ON Person.
If the FOR EACH ROW option is specified, the trigger is row-level; otherwise, the trigger is
statement-level.
In a row-level trigger, the new and old rows can be referenced as NEW and OLD,
respectively. Note: in the trigger body, NEW and OLD must be preceded by a colon
(":"), but in the WHEN clause they have no preceding colon! See the example below.
The REFERENCING clause can be used to assign aliases to the variables NEW and
OLD.
A trigger restriction can be specified in the WHEN clause, enclosed by parentheses.
The trigger restriction is a SQL condition that must be satisfied in order for Oracle to
fire the trigger. This condition cannot contain subqueries. Without the WHEN clause,
the trigger is fired for each row.
<trigger_body> is a PL/SQL block, rather than sequence of SQL statements. Oracle has placed
certain restrictions on what you can do in <trigger_body>, in order to avoid situations where
one trigger performs an action that triggers a second trigger, which then triggers a third, and
so on, which could potentially create an infinite loop. The restrictions on <trigger_body>
include:
o You cannot modify the same relation whose modification is the event triggering the
trigger.
o You cannot modify a relation connected to the triggering relation by another constraint,
such as a foreign-key constraint.
Trigger Example
We illustrate Oracle's syntax for creating a trigger through an example based on the following two
tables:
CREATE TABLE T4 (a INTEGER, b CHAR(10));
CREATE TABLE T5 (c CHAR(10), d INTEGER);
We create a trigger that may insert a tuple into T5 when a tuple is inserted into T4. Specifically, the
trigger checks whether the new tuple has a first component 10 or less, and if so inserts the reverse
tuple into T5:
CREATE TRIGGER trig1
AFTER INSERT ON T4
REFERENCING NEW AS newRow
FOR EACH ROW
WHEN (newRow.a <= 10)
BEGIN
INSERT INTO T5 VALUES(:newRow.b, :newRow.a);
END trig1;
.
run;
Notice that we end the CREATE TRIGGER statement with a dot and run, as for all PL/SQL
statements in general. Running the CREATE TRIGGER statement only creates the trigger; it does not
execute the trigger. Only a triggering event, such as an insertion into T4 in this example, causes the
trigger to execute.
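The same row-level AFTER INSERT logic can be sketched in SQLite via Python's sqlite3 module (an
assumption of this illustration; SQLite uses NEW without a colon, has no REFERENCING clause, and
triggers are row-level by default):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T4 (a INTEGER, b TEXT);
CREATE TABLE T5 (c TEXT, d INTEGER);

-- Same logic as the Oracle trigger trig1: when the inserted row has
-- a <= 10, insert the reversed tuple into T5.
CREATE TRIGGER trig1 AFTER INSERT ON T4
WHEN NEW.a <= 10
BEGIN
    INSERT INTO T5 VALUES (NEW.b, NEW.a);
END;
""")

conn.execute("INSERT INTO T4 VALUES (5, 'five')")    # fires the trigger
conn.execute("INSERT INTO T4 VALUES (50, 'fifty')")  # WHEN fails, no fire

t5 = conn.execute("SELECT c, d FROM T5").fetchall()
print(t5)  # [('five', 5)]
```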
Displaying Trigger Definition Errors
As for PL/SQL procedures, if you get a message
Warning: Trigger created with compilation errors.
you can see the error messages by typing
show errors trigger <trigger_name>;
Alternatively, you can type SHO ERR (short for SHOW ERRORS) to see the most recent compilation
error. Note that the reported line numbers where the errors occur are not accurate.
Viewing Defined Triggers
To view a list of all defined triggers, use:
select trigger_name from user_triggers;
For more details on a particular trigger:
select trigger_type, triggering_event, table_name, referencing_names, trigger_body
from user_triggers
where trigger_name = '<trigger_name>';
Dropping Triggers
To drop a trigger:
drop trigger <trigger_name>;
Disabling Triggers
To disable or enable a trigger:
alter trigger <trigger_name> {disable|enable};
Aborting Triggers with Error
Triggers can often be used to enforce constraints. The WHEN clause or the body of the trigger can
check for the violation of certain conditions and signal an error accordingly using the Oracle built-in
function RAISE_APPLICATION_ERROR. The action that activated the trigger (insert, update, or delete)
is then aborted. For example, the following trigger enforces the constraint Person.age >= 0:
create table Person (age int);
CREATE TRIGGER PersonCheckAge
AFTER INSERT OR UPDATE OF age ON Person
FOR EACH ROW
BEGIN
IF (:new.age < 0) THEN
RAISE_APPLICATION_ERROR(-20000, 'no negative age allowed');
END IF;
END;
.
RUN;
If we attempted to execute the insertion:
insert into Person values (-3);
we would get the error message:
ERROR at line 1:
ORA-20000: no negative age allowed
ORA-06512: at "MYNAME.PERSONCHECKAGE", line 3
ORA-04088: error during execution of trigger 'MYNAME.PERSONCHECKAGE'
and nothing would be inserted. In general, the effects of both the trigger and the triggering statement
are rolled back.
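The abort-and-roll-back behaviour can be reproduced in SQLite via Python (an assumption of this
sketch: SQLite has no RAISE_APPLICATION_ERROR, so RAISE(ABORT, ...) plays the same role, and a
single SQLite trigger cannot combine INSERT OR UPDATE, so only INSERT is covered here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (age INT);

-- RAISE(ABORT, ...) aborts the triggering statement and rolls back
-- its effects, much like RAISE_APPLICATION_ERROR in the Oracle example.
CREATE TRIGGER PersonCheckAge BEFORE INSERT ON Person
WHEN NEW.age < 0
BEGIN
    SELECT RAISE(ABORT, 'no negative age allowed');
END;
""")

try:
    conn.execute("INSERT INTO Person VALUES (-3)")
    error = None
except sqlite3.IntegrityError as e:
    error = str(e)

count = conn.execute("SELECT COUNT(*) FROM Person").fetchone()[0]
print(error, count)  # no negative age allowed 0
```

As in the Oracle example, the offending insertion is rejected and nothing is stored.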
According to Codd, Date, and all other experts, a key has only one meaning in relational theory: it is
a set of one or more columns whose combined values are unique among all occurrences in a given
table. A key is the relational means of specifying uniqueness.
Why Keys Are Important
Keys are crucial to a table structure for the following reasons:
They ensure that each record in a table is precisely identified. As you already know, a table represents
a singular collection of similar objects or events. (For example, a CLASSES table represents a
collection of classes, not just a single class.) The complete set of records within the table constitutes
the collection, and each record represents a unique instance of the table's subject within that
collection. You must have some means of accurately identifying each instance, and a key is the device
that allows you to do so.
They help establish and enforce various types of integrity. Keys are a major component of table-level
integrity and relationship-level integrity. For instance, they enable you to ensure that a table has
unique records and that the fields you use to establish a relationship between a pair of tables always
contain matching values.
Candidate Key
As stated above, a candidate key is any set of one or more columns whose combined values are
unique among all occurrences (i.e., tuples or rows). Since a null value is not guaranteed to be unique,
no component of a candidate key is allowed to be null.
There can be any number of candidate keys in a table. Relational pundits do not agree on
whether zero candidate keys is acceptable, since that would contradict the (debatable) requirement
that there must be a primary key.
Primary Key
The primary key of any table is any candidate key of that table which the database designer
arbitrarily designates as "primary". The primary key may be selected for convenience, comprehension,
performance, or any other reasons. It is entirely proper (albeit often inconvenient) to change the
selection of primary key to another candidate key.
Alternate Key
The alternate keys of any table are simply those candidate keys which are not currently selected as
the primary key. According to {Date95} (page 115), "... exactly one of those candidate keys [is] chosen
as the primary key [and] the remainder, if any, are then called alternate keys." An alternate key is a
function of all candidate keys minus the primary key.
7.7 Constraints on Attributes and Tuples
Not Null Constraints
presC# INT REFERENCES MovieExec(cert#) NOT NULL
Attribute-Based CHECK Constraints
presC# INT REFERENCES MovieExec(cert#)
CHECK (presC# >= 100000)
gender CHAR(1) CHECK (gender IN ('F', 'M')),
presC# INT CHECK (presC# IN (SELECT cert# FROM MovieExec))
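An attribute-based CHECK constraint like the gender example can be demonstrated in SQLite via
Python (a sketch under that assumption; the table name MovieStar is illustrative, and note that
character literals inside the CHECK must be quoted):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Note the quoted 'F' and 'M': unquoted letters would be column references.
conn.execute("""CREATE TABLE MovieStar (
    name   TEXT,
    gender CHAR(1) CHECK (gender IN ('F', 'M')))""")

conn.execute("INSERT INTO MovieStar VALUES ('Jane', 'F')")     # passes the CHECK
try:
    conn.execute("INSERT INTO MovieStar VALUES ('Bob', 'X')")  # violates it
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True
```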
Creating cursors using the Xlib cursor font, a font of shapes and masks, and pixmaps
Managing cursors
Create CURSOR
Xlib enables clients to use predefined cursors or to create their own cursors. To create a predefined
Xlib cursor, use the CREATE FONT CURSOR routine. Xlib cursors are predefined in
ECW$INCLUDE:CURSORFONT.H. See the X and Motif Quick Reference Guide for a list of the
constants that refer to the predefined Xlib cursors.
The following example creates a sailboat cursor, one of the predefined Xlib cursors, and associates
the cursor with a window:
Cursor fontcursor;
.
.
.
fontcursor = XCreateFontCursor(dpy, XC_sailboat);
XDefineCursor(dpy, win, fontcursor);
The DEFINE CURSOR routine makes the sailboat cursor automatically visible when the pointer is in
window win.
To create client-defined cursors, either create a font of cursor shapes or define cursors using
pixmaps. In each case, the cursor consists of the following components:
Mask---Acts as a clip mask to define how the cursor actually appears in a window
Hotspot---Defines the position on the cursor that reflects movements of the pointing device
7.10 Dynamic SQL
Dynamic SQL is an enhanced form of Structured Query Language (SQL) that, unlike standard (or
static) SQL, facilitates the automatic generation and execution of program statements. This can be
helpful when it is necessary to write code that can adjust to varying databases, conditions, or servers.
It also makes it easier to automate tasks that are repeated many times. Dynamic SQL statements are
stored as strings of characters that are entered when the program runs. They can be entered by the
programmer or generated by the program itself, but unlike static SQL statements, they are not
embedded in the source program. Also in contrast to static SQL statements, dynamic SQL statements
can change from one execution to the next.
Let's go back and review the reasons we use stored procedures and what happens when we use
dynamic SQL. As a starting point we will use this procedure:
CREATE PROCEDURE general_select @tblname nvarchar(127),
                                @key     key_type AS   -- key_type is char(3)
EXEC('SELECT col1, col2, col3
      FROM ' + @tblname + '
      WHERE keycol = ''' + @key + '''')
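The essence of this procedure, building the statement as a string because the table name cannot be
a parameter, can be sketched in Python over SQLite (an assumption of this illustration; the table
orders and its columns are invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (keycol TEXT, col1 INT, col2 INT, col3 INT)")
conn.execute("INSERT INTO orders VALUES ('k1', 1, 2, 3)")

def general_select(conn, tblname, key):
    # The table name must be spliced into the string (placeholders cannot
    # name a table), which is what the stored procedure above does with EXEC.
    # The key value, by contrast, can and should be passed as a parameter.
    sql = "SELECT col1, col2, col3 FROM " + tblname + " WHERE keycol = ?"
    return conn.execute(sql, (key,)).fetchall()

result = general_select(conn, "orders", "k1")
print(result)  # [(1, 2, 3)]
```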
The alternative is to construct the SELECT statement in client code and send it directly to SQL Server.
1. Permissions
If you cannot give users direct access to the tables, you cannot use dynamic SQL; it is as simple as
that. In some environments, you may assume that users can be given SELECT access. But unless
you know for a fact that permissions are not an issue, don't use dynamic SQL for INSERT, UPDATE,
and DELETE statements. I should hasten to add that this applies to permanent tables. If you are only
accessing temp tables, there are never any permission issues.
2. Caching Query Plans
As we have seen, SQL Server caches the query plans for both bare SQL statements and stored
procedures, but is somewhat more accurate in reusing query plans for stored procedures. In SQL 6.5
you could clearly say that dynamic SQL was slower, because there was a recompile each time. In
later versions of SQL Server, the waters are more muddy.
3. Minimizing Network Traffic
In the two previous sections we have seen that dynamic SQL in a stored procedure is not any better
than bare SQL statements from the client. With network traffic it is a different matter. There is never
any network cost for dynamic SQL in a stored procedure. If we look at our example procedure
general_select, neither is there much to gain. The bare SQL code takes about as many bytes as the
procedure call.
But say that you have a complex query which joins six tables with complex conditions, and one of the
tables is one of sales0101, sales0102, etc., depending on which period the user wants data about. This
is a bad table design, which we will return to, but assume for the moment that you are stuck with it.
If you solve this with a stored procedure with dynamic SQL, you only need to pass the period as a
parameter and don't have to pass the query each time. If the query is only passed once an hour the
gain is negligible. But if the query is passed every fifth second and the network is so-so, you are likely
to notice a difference.
4. Using Output Parameters
If you write a stored procedure only to gain the benefit of an output parameter, you do not in any way
affect this by using dynamic SQL. Then again, you can get OUTPUT parameters without writing your
own stored procedures, since you can call sp_executesql directly from the client.
5. Encapsulating Logic
There is not much to add to what we said in our first round on stored procedures. I would like to point
out, however, that once you have decided to use stored procedures, you should keep all knowledge of
the SQL inside them, so passing table names as in general_select is not a good idea. (The exception
here being sysadmin utilities.)
6. Keeping Track of what Is Used
Dynamic SQL is contradictory to this aim. Any use of dynamic SQL will hide a reference, so that it
will not show up in sysdepends. Neither will the reference reveal itself when you build the database
without the referenced object. Still, if you refrain from passing table or column names as parameters,
you at least only have to search the SQL code to find out whether a table is used. Thus, if you use
dynamic SQL, confine table and column names to the procedure code.
UNIT 8
RELATIONAL ALGEBRA
account-number   balance
A-101            500
A-215            700
A-102            400
A-305            350
A-201            900
A-222            700
A-217            750
branch-name   loan-number   amount
Perryridge    L-15          1500
Perryridge    L-16          1300
Comparisons such as =, ≠, <, ≤, >, and ≥ can also be used in the selection predicate. An example
query using a comparison, to find all tuples in which the amount lent is more than $1200, would be
written:
σ_{amount > 1200}(loan)
b) Project
The project operation is a unary operation that returns its argument relation with certain attributes
left out. Since a relation is a set, any duplicate rows are eliminated. Projection is denoted by the
Greek letter pi (π). The attributes that we wish to appear in the result are listed as a subscript to π.
The argument relation follows in parentheses. For example, the query to list all loan numbers and the
amount of each loan is written as:
π_{loan-number, amount}(loan)
The result of the query is the following:
loan-number   amount
L-17          1000
L-23          2000
L-15          1500
L-14          1500
L-93          500
L-11          900
L-16          1300
Another, more complicated, example: the query to find those customers who live in Harrison is
written as:
π_{customer-name}(σ_{customer-city = "Harrison"}(customer))
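Select and project can be sketched directly over Python data (an assumption of this illustration,
not part of the text; the sample rows are invented). Storing the projection result as a set gives the
duplicate elimination that the text describes:

```python
# A relation as a list of dicts; projection returns a set of tuples,
# so duplicate rows are eliminated automatically.
customer = [
    {"customer-name": "Smith", "customer-city": "Harrison"},
    {"customer-name": "Hayes", "customer-city": "Harrison"},
    {"customer-name": "Jones", "customer-city": "Rye"},
]

def select(pred, rel):
    # sigma: keep the tuples satisfying the predicate
    return [t for t in rel if pred(t)]

def project(attrs, rel):
    # pi: restrict each tuple to the listed attributes
    return {tuple(t[a] for a in attrs) for t in rel}

result = project(["customer-name"],
                 select(lambda t: t["customer-city"] == "Harrison", customer))
print(sorted(result))  # [('Hayes',), ('Smith',)]
```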
c) Union
The union operation yields the tuples that appear in either or both of two relations. It is a binary
operation denoted by the symbol ∪.
An example query would be to find the names of all bank customers who have either an account or a
loan or both. To find this result we will need the information in the depositor relation and in the
borrower relation. To find the names of all customers with a loan in the bank we would write
π_{customer-name}(borrower), and to find the names of all customers with an account in the bank we
would write π_{customer-name}(depositor).
Then, by taking the union of these two queries, we have the query we need to obtain the
wanted results. The final query is written as:
π_{customer-name}(borrower) ∪ π_{customer-name}(depositor)
The result of the query is the following:
customer-name
Johnson
Smith
Hayes
Turner
Jones
Lindsay
Jackson
Curry
Williams
Adams
d) Set Difference
The set-difference operation, denoted by −, finds tuples that are in one relation but not in another.
The expression r − s yields a relation containing those tuples that are in r but not in s.
For example, the query to find all customers of the bank who have an account but not a loan is
written as:
π_{customer-name}(depositor) − π_{customer-name}(borrower)
The result of the query is the following:
customer-name
Johnson
Turner
Lindsay
For a set difference to be valid, it must be taken between compatible relations, just as in the union
operation.
8.3 Extended Operators of Relational Algebra
The set-intersection operation is denoted by the symbol ∩. It is not a fundamental operation; however,
it is a more convenient way to write r − (r − s).
An example query using the operation, to find all customers who have both a loan and an account,
can be written as:
π_{customer-name}(borrower) ∩ π_{customer-name}(depositor)
The results of the query are the following:
customer-name
Hayes
Jones
Smith
It has been shown that the set of relational algebra operations {σ, π, ∪, −, ×} is a complete set; that is,
any of the other relational algebra operations can be expressed as a sequence of operations from this
set. For example, the INTERSECTION operation can be expressed by using UNION and DIFFERENCE
as follows:
R ∩ S = (R ∪ S) − ((R − S) ∪ (S − R))
Although, strictly speaking, INTERSECTION is not required, it is inconvenient to specify this complex
expression every time we wish to specify an intersection. As another example, a JOIN operation can
be specified as a CARTESIAN PRODUCT followed by a SELECT operation, as we discussed:
R ⋈_{<condition>} S = σ_{<condition>}(R × S)
Similarly, a NATURAL JOIN can be specified as a CARTESIAN PRODUCT preceded by RENAME and
followed by SELECT and PROJECT operations. Hence, the various JOIN operations are also not
strictly necessary for the expressive power of the relational algebra; however, they are very important
because they are convenient to use and are very commonly applied in database applications.
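Both identities above can be checked mechanically by modelling relations as Python sets of tuples
(a hypothetical sketch, with invented sample data, not part of the text):

```python
from itertools import product

# Relations as sets of tuples.
R = {("Hayes",), ("Jones",), ("Smith",), ("Turner",)}
S = {("Hayes",), ("Jones",), ("Smith",), ("Lindsay",)}

# R ∩ S = (R ∪ S) − ((R − S) ∪ (S − R))
intersection = R & S
rebuilt = (R | S) - ((R - S) | (S - R))
print(intersection == rebuilt)  # True

# A join expressed as a SELECT over a CARTESIAN PRODUCT:
loan = {("L-15", 1500), ("L-16", 1300)}
borrower = {("Hayes", "L-15"), ("Smith", "L-16")}
join = {(n, l, a) for (n, l), (l2, a) in product(borrower, loan) if l == l2}
print(sorted(join))  # [('Hayes', 'L-15', 1500), ('Smith', 'L-16', 1300)]
```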
The relational algebra operations and their usual symbols:
assignment (←)
Cartesian product (×)
division (÷)
natural join (⋈)
project (π)
rename (ρ)
select (σ)
set difference (−)
set intersection (∩)
union (∪)
Let R be a relation over attributes A1, ..., An with domains D1, ..., Dn, so that each tuple of the
relation contains a set of values, one drawn from each Di. A set K of the attributes A1, ..., An is a
key for R if, whenever two tuples agree on every attribute in K, they are the same tuple. A primary
key is a designated key in which no attribute is permitted to be null; every instance of R is required
to satisfy this uniqueness and non-null condition.
Operations on Relations
Taking the view of a relation as a set, we can apply the normal set operations to relations over the
same domain. If the domains differ this is not possible. We have, of course, the normal algebraic
structure to the operations: the null relation over a domain is well defined, and the null tuple is the
sole element of the null relation.
We also have three relational operators we wish to consider: select, join and project.
First we define, for each relation R, the set of predicates over R: functions that map each tuple of an
instance of R to true or false.
Select
Now we define the selection σ_p(R) = { t ∈ R | p(t) } for a predicate p. The result is a relation over
the same domain as R and is a subset of R. We notice that we can use the
same primary key for both R and σ_p(R), and that σ_p(R) ⊆ R.
Join
The join operation R ⋈ S pairs the tuples of R and S that agree on their common attributes:
R ⋈ S = { t ∪ s | t ∈ R, s ∈ S, and t and s agree on the attributes common to R and S }
What is the schema for this? The key? Does it satisfy it?
Project
We define the projection π_A(R), for a subset A of the attributes of R, by the post-condition that each
tuple of the result is a tuple of R restricted to the attributes in A.
If we view R as a set of maps, we can view the projection operator as restricting each of these maps to
a smaller domain.
The schema of π_A(R) is A. If the primary key of R is contained in A, then it forms a key for the
result. Otherwise, if there are no nulls in any tuple of the projection of R, we can use A as a primary
key. Otherwise we cannot in general identify a primary key.
Insertion
For each relation R we define an insertion operation that adds a tuple to an instance of R.
The insertion operation preserves the key invariant, since it refuses any insertion that would break it.
Update
For each relation R we define an update operation.
2. Set Operations
i. Re-state the expression σ_{10 ≤ age ≤ 60}(patients) using set operations.
ii. Re-state the expression σ_{rank = surgeon ∨ rank = oculist}(doctors) using set operations, without
using ∨ and ∧.
iii. Find all the patients who saw doctor 801 but not 802 (i.e. dnum = 801, dnum ≠ 802).
3. Cartesian Product and Join
i. Form peer groups for patients, where a peer group is a pair of patients whose age difference is less
than 10 years (you can use the rename operator ρ_A(R)).
ii. Who are the surgeons who visited the patient 101 (i.e. pnum = 101)?
iii. Who has seen a surgeon in the past two years?
iv. Are there any non-surgeon doctors who performed a surgery (a doctor performed a surgery if the
visit record shows diagnosis = operation for him)?
4. Division
i. Who has seen all the surgeons in the past two months?
ii. Find all patients except for the youngest ones.
The schemas are: patients (pnum, pname, age), doctors (dnum, dname, rank),
visits (pnum, dnum, date, diagnosis).
8.6 Summary
Relational algebra is a procedural query language, which consists of a set of operations that
take one or two relations as input and produce a new relation as their result. The fundamental
operations discussed in this unit are: select, project, union, and set difference.
UNIT 9
MANAGEMENT CONSIDERATIONS
9.1 Introduction
9.2 Objectives
9.3 Organisational Resistance To DBMS Tools
9.4 Conversion From An Old System To A New System
9.5 Evaluation Of A DBMS
9.6 Administration Of A Database Management System
9.7 Self Test
9.8 Summary
9.1 INTRODUCTION
This unit will comprise issues of organizational resistance, the methodology for conversion from an
old system to a new system, the importance of adopting a decentralized distributed approach and
evaluation and administration of such systems.
9.2 OBJECTIVES
After going through this unit, you should be able to:
Identify the factors causing resistance to the induction of new DBMS tools;
Determine the path that must be chosen in converting from an old existing system to a new
system;
List the various factors that are important in evaluating a DBMS system;
List the checkpoints and principles, which must be, adhered to in order that information
quality is assured.
Political observation: The officers and managers at different levels of an organization may feel
that the long-standing political equations and relationships which have supported their upward
movement within the organization are threatened by a new intervention into their styles of working.
Fear of future potential: The very fact that computers can store information in a very compact
manner, and that it can be collated and analyzed very speedily, gives rise to apprehensions of future
adverse use of this information against an individual. Mistakes in decision-making can now be
highlighted and analyzed in detail after long spells of time. This would not have been
possible in manual file-based systems or any system where the data does not flow so readily.
Inter-departmental rivalry, fear of personal inadequacy, incomprehension of the new regime, the loss
of one's own power and the greater freedom given to others, and differences in work styles all add up
to produce resistance to the induction of new information processing tools. Apart from these general
considerations, there are specific reasons to resist installation of a new DBMS.
There are several points of resistance to new DBMS tools:
The selection and acquisition of a DBMS and related tools is one of the most important computer-related decisions made in an organization. It is also one of the most difficult. There are many systems
from which to choose and it is very difficult to obtain the necessary information to make a good
decision. Vendors always have great things to say, convincing argument for their systems, and often
many satisfied customers. Published literature and software listing services are too cursory to provide
sufficient information on which to base a decision. The mere difficulty in gathering information and
making the selection is one point of resistance to acquiring the new DBMS tools.
The initial cost may also be a barrier to acquisition. However, the subsequent investment in training
people, developing applications, and entering and maintaining data will be many times greater.
Selection of an inadequate system can greatly increase these subsequent costs to the point where the
initial acquisition cost becomes irrelevant.
In spite of the apparent resistance to acquisition, the projections for the database industry are
forecasting a multi-billion dollar industry in the 1990's. Even though an organization may acquire a
DBMS, there are still several additional points of resistance to overcome.
Simply having a DBMS does not mean that it will be used. Several factors may contribute to the lack
of use of new DBMS tools.
Lack of familiarity with the tools and what they can do
System developers used to writing COBOL (or other language) programs prefer to build systems using
the tools they already know
The pressure to get new application development projects completed dictates using established tools
and techniques
Systems development personnel have not been thoroughly trained in the use of the new DBMS tools
DP management is afraid of runaway demand on the computing facilities if they allow users to
directly access the data on the host computer using an easy-to-use, high-level retrieval facility
Organizational policies which do not demand appropriate justification for the tools chosen (or not
chosen) for each system development project.
Having pointed out the directions from which resistance can arise to the introduction of a new DBMS,
it may be useful to have a summary of a few pointers which would possibly lead to greater success
in such an endeavor.
Reasons for success:
An incremental approach to building applications, with each new step being reasonably small
and relatively easy to implement.
Corporate-wide planning by a high-level, empowered, competent data administrator.
Simplicity.
Reasons for failure:
Attempts to re-write too many old programs.
Confused thinking.
file-based system to a database system, and to accept the implications of its operation before the
conversion.
Hardware Requirements and Processing Time: The database approach delegates to the database
management system some of the functions that were previously performed by the application
programmer. As a result of this delegation, a computer with a large internal memory and greater
processing power is needed. Powerful computer systems were once a luxury enjoyed only by those
database users who could afford them but, fortunately, this trend is now changing. Recent
developments in hardware technology have made it possible to acquire powerful yet affordable
systems.
For some applications, the need for high-volume transaction processing may force a company to
engineer one or even several systems designed to satisfy this need. This sacrifices certain flexibility
for the system to respond to ad-hoc requests. And it is also argued that because of the easier access
to data in the database, the frequency of access will become higher. Such overuse of computing
resources will cause slips in performance, resulting in an increased demand for computing capacity.
It is sometimes difficult to determine if the increased access to the database is really necessary.
The database approach offers a number of important and practical advantages to an organization.
Reducing data redundancy improves consistency of the data as well as making savings in storage
space. Sharing data often enables new application to be developed without having to create new data
files. Less redundancy and greater sharing also result in less confusion between organizational units
and less file spent by people resolving inconsistencies in reports. Centralized control over data
standards, security restrictions, and so on, facilitates the evolution of information systems and
organizations in response to changing business needs and strategies. Now a day, users with little or
no previous programming experience can, with the aid of powerful user- friendly query languages,
manipulate data to satisfy ad-hoc queries. Data independence helps case program development and
maintenance, raising programmer productivity. All the benefits of the database approach contribute
to reduce costs of application development and improved quality of managerial decisions.
In general, converting to a database involves the following:
Inventorying current systems: data volume, user satisfaction, present condition, and
the cost to maintain or redevelop.
Determining conversion priority from strategic information system plans, building-block systems,
and critical needs to replace systems.
Writing software utility tools to convert the files or databases in the old system to the new
database.
Modifying all application programs to make use of the new data structures.
The reasons which inspire the organization to acquire a DBMS should be clearly documented and
used to determine the priorities, to help in making trade-offs between conflicting objectives, and to
guide the selection of the various features that a candidate DBMS may have, depending upon the
end-user requirements. The evaluation team should also be aware of technical and administrative
issues.
These technical criteria could be the following:
a. SQL implementation
b. Transaction management
c. Programming interface
d. Database server environment
e. Data storage features
f. Database administration
g. Connectivity
h. DBMS integrity
Similarly there could be administrative criteria such as:
1. Required hardware platform
2. Documentation
3. Vendor's financial stability
4. Vendor support
5. Initial cost
6. Recurring cost
Each of these, especially the technical criteria, could be further broken into sub-criteria. For example
the data storage features can be further sub-classified into:
a. Lost database segments
b. Clustered indexes
c. Clustered tables
Once this level of detailing is done, the list of features becomes quite large and may even run into
hundreds. If a dozen products are to be evaluated, we are talking of a fairly large matrix.
At this point, it is important for the evaluation team, and especially its technical members, to
segregate these features, identifying those which are mandatory. Mandatory features are those
which, if not present in the candidate system, mean the system need not be considered further. For
example, 'Does the DBMS provide facilities for both programming and non-programming users?' can
be considered one among several mandatory conditions. Mandatory requirements may also flow from
a desire to preserve the previous investment in information systems made by an organization. The
presence of the mandatory conditions means that the system is a candidate for the rating procedure.
Having done the first stage of creating a feature list, one of the simplest ways to proceed is to
develop a table where the features and related information for each candidate system are listed
against the desired features. Such forms can be used to compare the various systems and, although
this is not enough to conclude an evaluation, it is a useful method for at least broadly ranking and
short-listing the systems. A quantitative flavour can be given to the above approach by
awarding points for features of the simple Yes/No type. If all the features are not equally important
to the organization, then simply summing the points awarded for each feature of a system is not
appropriate. In such a case a rating factor can be assigned to each feature to reflect the relative
level of importance of that feature to the organization. Of course, such rating or scoring should be
done only after the mandatory requirements have been met by the proposed system. Sometimes a
mandatory characteristic may be expressed in the negative, as something which the system must
not have.
The choice of rating factors is a contentious issue and must be decided by looking to the needs of
the organization rather than to the characteristics of any specific candidate system. One approach
towards arriving at a suitable set of rating factors is to follow the Delphi method. In brief, the
Delphi approach requires key people who may be expected to be knowledgeable to make suggestions
as to what would be an appropriate rating factor. These are collected and compiled, averages are
taken, and deviations from the averages are pointed out. This data is then re-circulated to the same
set of people, asking them to reconsider their opinions where their own views varied widely from the
average. Further iterations can then be carried out, and it has been found that in as few as 3 to 4
iterations a good consensus emerges.
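One Delphi round as described above can be sketched as a simple computation; the expert names, suggested factors, and the 25% deviation tolerance below are invented purely for illustration:

```python
# An illustrative sketch of one Delphi round: collect suggested rating
# factors, compute the average, and flag the experts whose suggestions
# deviate most from it so they can reconsider in the next round.

def delphi_round(suggestions, tolerance=0.25):
    """Return the average factor and the experts whose suggestions
    deviate from it by more than `tolerance` (as a fraction of the average)."""
    avg = sum(suggestions.values()) / len(suggestions)
    outliers = {who: val for who, val in suggestions.items()
                if abs(val - avg) > tolerance * avg}
    return avg, outliers

round1 = {"expert_a": 0.30, "expert_b": 0.35, "expert_c": 0.70}
avg, outliers = delphi_round(round1)
print(round(avg, 2))      # 0.45
print(sorted(outliers))   # experts a and c asked to reconsider
```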
One of the weaknesses of the methodologies discussed so far is that they focus on the systems but
not on the cost-benefit aspects. A good evaluation methodology should, if possible, suggest the most
cost-effective solution to the problem. For example, if a system is twice as good as another system
but costs only 40% more, then it ought to be the preferred solution.
In order to carry out a cost-benefit analysis, one has to use a rating function with each feature to
normalize the scores. Rather than having an approach where a feature is characterized as Yes/No,
with marks of 0 or 1 corresponding to its absence or presence, a mark can be given on a scale
appropriate to the feature. This arises in issues such as the number of terminals supported or the
amount of main memory required. Rating functions can be of several types, of which four are
illustrated in Figure 1.
Linear: In a linear rating function the rating increases in proportion to higher marks, starting
from 0.
Broken linear: There are situations where a minimum threshold is essential and, similarly, there
is a saturation value above which no additional value is given. Typically, for concurrent access,
fewer than 3 users would be an unacceptable value and more than 9 is of no additional value.
Binary: This is a Yes/No type, where a system either has or does not have the feature
or some minimum value for the feature.
Inverse: There are some attributes where a higher mark actually implies a lower rating. For
example, in assessing the time to process a standard query, the mark may simply be the time,
scaled in an appropriate manner; therefore a shorter time has a higher rating.
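As a sketch, the four rating-function shapes can be expressed as small functions. The thresholds of 3 and 9 concurrent users come from the text; the scales and nominal marks below are illustrative assumptions:

```python
# Sketches of the four rating-function shapes described above.

def linear(mark, nominal):
    """Rating grows in proportion to the mark; the nominal mark rates 1.0."""
    return mark / nominal

def broken_linear(mark, lo, hi):
    """Below the threshold `lo` the feature is worthless; above `hi`
    no additional credit is given (saturation)."""
    if mark < lo:
        return 0.0
    if mark > hi:
        mark = hi
    return (mark - lo) / (hi - lo)

def binary(present):
    """Yes/No feature."""
    return 1.0 if present else 0.0

def inverse(mark, nominal):
    """Higher mark (e.g. query time) means a lower rating."""
    return nominal / mark

print(linear(4, nominal=2))            # twice the nominal mark -> 2.0
print(broken_linear(2, lo=3, hi=9))    # below the threshold -> 0.0
print(broken_linear(12, lo=3, hi=9))   # saturated -> 1.0
print(inverse(0.5, nominal=1.0))       # half the nominal query time -> 2.0
```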
For each feature, the rating function uses an appropriate and convenient scale of
measurement for determining a system's feature mark. The rating function transforms a system's
feature mark into a normalized rating indicating its value relative to a nominal mark for that feature.
The nominal mark for each feature has a nominal rating of one.
The use of rating functions is more sophisticated and costly to apply than the simplified
methodologies. The greater objectivity and precision obtained must be weighed against the overall
benefits of DBMS acquisition and use. Some features will have no appropriate objective scale on
which to mark the feature. The analyst could use a five-point scale with a linear rating function as
follows:

Feature evaluation    Rating point
Excellent (A)         5
Good (B)              4
Average (C)           3
Fair (D)              2
Poor (E)              1
Variations can expand or contract the rating scale, using a nonlinear rating function, or expand the
points in the feature evaluation scale to achieve greater resolution. In extreme cases, the analyst
could simply use subjective judgment to arrive at a rating directly, remembering that a feature rating
of one applies to a nominal or average system.
Having converted all the marks to ratings, the system score is the product of the rating and the
weight, summed across all features, just as before. The overall score for a nominal system would be
one (since all weights sum to one and all nominal ratings are one). This is important for determining
cost effectiveness, the ratio between the value of a system and its cost. The organization first
determines the value of a system which earns a nominal mark for all features. This is called the
nominal value. Then the actual value of a given system is the product of the overall system score and
the nominal value. The cost effectiveness of a system is the actual value divided by the cost of the
system. System cost is the present-value cost of acquisition, operation and maintenance over the
estimated life of the system.
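The whole calculation can be sketched in a few lines. The feature names, weights, ratings, nominal value and cost below are invented for illustration only:

```python
# A sketch of the weighted-scoring and cost-effectiveness calculation
# described above: score = sum of rating * weight; actual value =
# score * nominal value; cost effectiveness = actual value / cost.

def system_score(weights, ratings):
    """Sum of rating * weight across all features (weights sum to 1)."""
    return sum(weights[f] * ratings[f] for f in weights)

weights = {"sql": 0.4, "transactions": 0.35, "admin": 0.25}

# A nominal system rates 1.0 on every feature, so its score is 1.0.
nominal = {f: 1.0 for f in weights}
assert abs(system_score(weights, nominal) - 1.0) < 1e-9

candidate = {"sql": 1.5, "transactions": 1.0, "admin": 0.8}
score = system_score(weights, candidate)   # 0.6 + 0.35 + 0.2 = 1.15

nominal_value = 100_000        # assumed value of a nominal system
actual_value = score * nominal_value
cost = 80_000                  # assumed present-value lifetime cost
cost_effectiveness = actual_value / cost
print(score, actual_value, cost_effectiveness)
```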
Definition, creation, revision, and retirement of data formally collected and stored within a
shared corporate database.
Making the database available to the using environment through tools such as a DBMS and
related query languages and report writers.
Informing and advising users on the data resources currently available, the proper
interpretation of the data, and the use of the availability tools. This includes educational
materials, training sessions, participation on projects, and special assistance.
Maintaining database integrity including existence control (backup and recovery), definition
control, quality control, update control, concurrency control, and access control.
Monitoring and improving operations and performance, and maintaining an audit trail of
database activities.
The data dictionary is one of the more important tools for the database administrator. It is
used to maintain information relating to the various resources used in information systems
(hence sometimes called an information resource dictionary)--data, input transactions, output
reports, programs, application systems, and users. It can:
Provide a more complete definition of the data stored in the database (than is maintained by
the DBMS).
Enable an organization to assess the impact of a suggested change within the information
system or the database.
The DBA should also have tools to monitor the performance of the database system to
indicate the need for reorganization or revision of the database.
9.7 SELF TEST
1) Determine the path that must be chosen in converting from an old existing system to a new
system;
2) List the various factors that are important in evaluating a DBMS system;
3) Explain role of DBA with the help of an example.
9.8 SUMMARY
There are some aspects of change related to information systems that arouse great
passions, such as political observation and information transparency.
Organizational policies may not demand appropriate justification for the tools
chosen (or not chosen) for each system development project.
The initial cost may also be a barrier to acquisition. However, the subsequent investment
in training people, developing applications, and entering and maintaining data will be
many times more. Selection of an inadequate system can greatly increase these subsequent
costs to the point where the initial acquisition cost becomes irrelevant.
Broken linear: there are situations where a minimum threshold is essential and,
similarly, there is a saturation value above which no additional value is given. Typically,
for concurrent access, fewer than 3 users would be an unacceptable value and more than 9
is of no additional value.
UNIT 10
CONCURRENCY CONTROL
T1                            T2
read(x)  --- x=100
x := x+8 --- x=108
write(x)
                              read(x)  --- x=108
                              x := x*5 --- x=540
                              write(x)           [x=540, y=50]
read(y)  --- y=50
y := y-5 --- y=45
write(y)           [x=540, y=45]
Serializable Schedules
Let T be a set of n transactions T1, T2, ..., Tn . If the n transactions are executed serially (call this
execution S), we assume they terminate properly and leave the database in a consistent state. A
concurrent execution of the n transactions in T (call this execution C) is called serializable if the
execution is computationally equivalent to a serial execution. There may be more than one such serial
execution. That is, the concurrent execution C always produces exactly the same effect on the
database as some serial execution S does. (Note that S is some serial execution of T, not necessarily
the order T1, T2, ..., Tn). A serial schedule is always correct since we assume transactions do not
depend on each other; furthermore, we assume that each transaction, when run in isolation,
transforms a consistent database into a new consistent state, and therefore a set of transactions
executed one at a time (i.e. serially) must also be correct.
Example
1. Given the following schedule, draw a serialization (or precedence) graph and find if the
schedule is serializable.
Solution:
There is a simple technique for testing a given schedule S for serializability. The testing is
based on constructing a directed graph in which each of the transactions is represented by
one node, and an edge from Ti to Tj is drawn if, in the schedule, Ti executes WRITE(X) before
Tj executes READ(X), Ti executes READ(X) before Tj executes WRITE(X), or Ti executes
WRITE(X) before Tj executes WRITE(X). The schedule is serializable if and only if this
precedence graph contains no cycles.
10.2Conflict-Serializability
Two executions are conflict-equivalent, if in both executions all conflicting operations have the
same order
Conflict graph: the nodes represent transactions, and the directed edges represent conflicts
between their operations. An execution is conflict-serializable iff its conflict graph is acyclic.
Example: consider the following schedule over transactions T1, T2 and T3:
W1(a) R2(a) R3(b) W2(c) R3(c) W3(b) R1(b)
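This test can be sketched in a few lines of code. The schedule below is the T1/T2/T3 example from the text, written as (transaction, operation, item) triples; the conflict rule assumed is the standard one (same item, different transactions, at least one write):

```python
# Build the conflict graph for a schedule and test it for cycles.

from collections import defaultdict

def conflict_graph(schedule):
    """Edges Ti -> Tj for each pair of conflicting operations."""
    edges = defaultdict(set)
    for i, (ti, a1, x) in enumerate(schedule):
        for tj, a2, y in schedule[i + 1:]:
            # Conflict: same item, different transactions, at least one write.
            if x == y and ti != tj and 'W' in (a1, a2):
                edges[ti].add(tj)
    return edges

def has_cycle(edges):
    """Depth-first search for a cycle in the directed graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)
    def visit(u):
        color[u] = GREY
        for v in edges[u]:
            if color[v] == GREY or (color[v] == WHITE and visit(v)):
                return True
        color[u] = BLACK
        return False
    return any(color[u] == WHITE and visit(u) for u in list(edges))

schedule = [(1, 'W', 'a'), (2, 'R', 'a'), (3, 'R', 'b'), (2, 'W', 'c'),
            (3, 'R', 'c'), (3, 'W', 'b'), (1, 'R', 'b')]

g = conflict_graph(schedule)
print({k: sorted(v) for k, v in g.items()})   # T1->T2, T2->T3, T3->T1
print("conflict-serializable:", not has_cycle(g))
```

Since the graph contains the cycle T1 -> T2 -> T3 -> T1, this schedule is not conflict-serializable.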
Serializablity (examples)
However, H1 is not conflict-serializable, since its conflict graph contains a cycle: w1(x,1)
occurs before w2(x,2), but w2(x,2), w2(y,1) occurs before r1(y)
Consider the schedule H3 below: H3: w1(x,1), r2(x), c2, w3(y,1), C3, r1(y), C1
This is despite the fact that T2 was executed completely and committed, before T3 even
started
Recoverability of a Schedule
A transaction T1 reads from transaction T2, if T1 reads a value of a data item that was written
into the database by T2
The schedule below is serializable, but not recoverable: H4: r1(x), w1(x), r2(x), w2(y), c2, c1
Cascadelessness of a Schedule
The schedule below is recoverable, but not cascadeless: H5: r1(x), w1(x), r2(x), c1, w2(y), c2
Strictness of a Schedule
The schedule below is cascadeless, but not strict: H6: r1(x), w1(x), r2(y), w2(x), c1, c2
A schedule H is strongly recoverable iff, for every pair of transactions in H, their commitment
order is the same as the order of their conflicting operations.
The schedule below is strict, but not strongly recoverable: H7: r1(x), w2(x), c2, c1
Rigorousness of a Schedule
A schedule H is rigorous if it is strict and no transaction in H writes a data item until all
transactions that previously read this item either commit or abort.
The schedule below is strongly recoverable, but not rigorous: H8: r1(x), w2(x), c1, c2
we can easily store multiple versions and gain the benefits of high concurrency and transaction
isolation without locking.
Unfortunately, the locking protocols of popular database systems, many of which were designed well
over a decade ago, form the core of those systems, and replacing them seems to have been
impossible, despite recent research again finding that storing multiple versions is better than a
single version with locks.
Build-up commences at the start of the transaction. When a transaction is started, a consistent
view of the database is frozen, based on the state after the last committed transaction. This means that
the application will see this consistent view of the database during the entire transaction. This is
accomplished by the use of an internal Transaction Cache, which contains information about all
ongoing transactions in the system. The application "sees" the database through the Transaction
Cache. During the Build-up phase the system also builds up a Read Set documenting the
accesses to the database, and a Write Set of changes to be made, but does not apply any of
these changes to the database. The Build-up phase ends with the calling of the COMMIT
command by the application.
The Commit involves using the Read Set and the Transaction Cache to detect access conflicts
with other transactions. A conflict occurs when another transaction alters data in a way that
would alter the contents of the Read Set for the transaction that is checked. Other transactions
that were committed during the checked transaction's Build-up phase or during this check phase
can cause a conflict. If a transaction conflict is detected, the checked transaction is aborted. No
rollback is necessary, as no changes have been made to the database. An error code is returned
to the application, which can then take appropriate action. Often this will be to retry the
transaction without the user being aware of the conflict.
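The validation step described above can be sketched as follows. The class and function names are illustrative and not drawn from any particular DBMS; a committing transaction is aborted if any transaction that committed during its build-up phase wrote an item in its read set:

```python
# A sketch of backward validation for optimistic concurrency control:
# the Write Set is applied only after the Read Set survives validation.

class Transaction:
    def __init__(self, start_tick):
        self.start_tick = start_tick   # when the build-up phase began
        self.read_set = set()          # items read during build-up
        self.write_set = {}            # item -> new value, not yet applied

committed_writes = []                  # list of (commit_tick, items_written)
database = {"x": 100, "y": 50}
clock = 0

def validate_and_commit(txn):
    global clock
    # Conflict: some transaction committed after txn started and wrote
    # an item that txn read.
    for tick, items in committed_writes:
        if tick > txn.start_tick and items & txn.read_set:
            return False               # abort; nothing to roll back
    clock += 1
    committed_writes.append((clock, set(txn.write_set)))
    database.update(txn.write_set)     # apply phase (done eagerly here)
    return True

t1 = Transaction(start_tick=0)
t1.read_set, t1.write_set = {"x"}, {"x": 108}

t2 = Transaction(start_tick=0)
t2.read_set, t2.write_set = {"x"}, {"x": 500}

print(validate_and_commit(t1))   # True: nothing committed since t1 started
print(validate_and_commit(t2))   # False: t1's commit invalidated t2's read
print(database["x"])             # 108
```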
If no conflicts are detected the operations in the Write Set for the transaction are moved to
another structure, called the Commit Set that is to be secured on disk. All operations for one
transaction are stored on the same page in the Commit Set (if the transaction is not very large).
Before the operations in the Commit Set are secured on permanent storage, the system checks
whether there are any other committed transactions that can be stored on the same page in the Commit
Set. After this, all transactions stored on the Commit Set page are written to disk (to the
transaction databank TRANSDB) in one single I/O operation. This behavior is called a Group
Commit, which means that several transactions are secured simultaneously. When the Commit
Set has been secured on disk (in one I/O operation), the application is informed of the success of
the COMMIT command and can resume its operations.
During the Apply phase the changes are applied to the database, i.e. the databanks and the
shadows are updated. The Background threads in the Database Server carry out this phase.
Even though the changes are applied in the background, the transaction changes are visible to
all applications through the Transaction Cache. Once this phase is finished the transaction is
fully complete. If any kind of hardware failure means that the system is unable to complete this
phase, the phase is automatically restarted as soon as the cause of the failure is corrected.
Most other DBMSs offer pessimistic concurrency control. This type of concurrency control protects a
user's reads and updates by acquiring locks on rows (or possibly database pages, depending on the
implementation); this leads to applications becoming 'contention bound', with performance limited by
other transactions. These locks may force other users to wait if they try to access the locked items.
The user that 'owns' the locks will usually complete their work, committing the transaction and
thereby freeing the locks so that the waiting users can compete to acquire them.
Optimistic Concurrency Control (OCC) offers a number of distinct advantages including:
Deadlocks cannot occur, so the performance overheads of deadlock detection are avoided as
well as the need for possible system administrator intervention to resolve them.
Programming is simplified as transaction aborts only occur at the Commit command whereas
deadlocks can occur at any point during a transaction. Also it is not necessary for the
programmer to take any action to avoid the potentially catastrophic effects of deadlocks, such
as carrying out database accesses in a particular order. This is particularly important as
potential deadlock situations are rarely detected in testing, and are only discovered when
systems go live.
Data cannot be left inaccessible to other users as a result of a user taking a break or being
excessively slow in responding to prompts. Locking systems leave locks set in these
circumstances denying other users access to the data.
Data cannot be left inaccessible as a result of client processes failing or losing their
connections to the server.
Delays caused by locking systems being overly cautious are avoided. This can arise as a result
of larger than necessary lock granularity, but there are also several other circumstances when
locking causes unnecessary delays even when using fine granularity locking.
Through the Group Commit concept, which is applied in SQL, the number of I/Os needed to
secure committed transactions to the disk is reduced to a minimum. The actual updates to the
database are performed in the background, allowing the originating application to continue.
The ROLLBACK statement is supported but, because nothing is written to the actual database
during the transaction Build-up phase, this involves only a re-initialization of structures used
by the transaction control system.
Memory usage: shared memory, the file system, and page eviction. Improvements:
SMP locking optimizations: use of the global kernel_lock was minimized; more subsystem-based
spinlocks are used; more spinlocks are embedded in data structures.
Semaphores are used to serialize address-space access.
More of a spinlock hierarchy was established, with spinlock-granularity trade-offs.
Kernel multi-threading improvements: multiple threads can access address-space data structures
simultaneously. The single mem->msem semaphore was replaced with a multiple-reader/single-writer
semaphore: a reader lock is acquired for reading per-address-space data structures, and an
exclusive write lock is acquired when altering them.
32-bit UIDs and GIDs: the increase from 16 to 32 bits allows up to 4.2 billion users and up to
4.2 billion groups.
64-bit virtual address space: the architectural limit of the virtual address space was expanded to a
full 64 bits. IA64 currently implements 51 bits (16 disjoint 47-bit regions); Alpha currently
implements 43 bits (2 disjoint 42-bit regions); S/390 currently implements 42 bits; a future Alpha
expands to 48 bits (2 disjoint 47-bit regions).
Unified file-system cache: the page cache was unified with read/write functionality, reducing
memory consumption by eliminating double-buffered copies of file-system data and eliminating the
overhead of searching two levels of data cache.
10.6 Managing Hierarchies of Database Elements
These storage types are sometimes called the storage hierarchy. It consists of the archival
database, the physical database, the archival log, and the current log.
Physical database: this is the online copy of the database that is stored in nonvolatile storage and
used by all active transactions.
Current database: the current version of the database is made up of the physical database plus
the modifications implied by buffers in volatile storage.
[Figure: the storage hierarchy. Application data buffers and log buffers reside with the program
code in volatile storage; the physical database is on nonvolatile storage; the current log and
checkpoint are on stable storage; the archive copy of the database and the archive log are on
stable storage.]
Archival database in stable storage: this is the copy of the database at a given time. It contains
the entire database in a quiescent mode and could have been made by a simple dump routine that
dumps the physical database onto stable storage. All transactions that have been executed on the
database from the time of archiving have to be redone in a global recovery; the archival database is
a copy of the database in a quiescent state, and only the committed transactions since the time of
archiving are applied to this database.
Current log: the log information required for recovery from system failures involving loss of volatile
information.
Archival log: used for failures involving loss of nonvolatile information.
The online or current database is made up of all the records that are accessible to the DBMS during
its operation. The current database consists of the data stored in nonvolatile storage plus the
updates in volatile storage that have not yet been propagated to nonvolatile storage.
distributed system, we rely on a lock manager - a server that issues locks on resources. This is
exactly the same as a centralized mutual exclusion server: a client can request a lock and then send
a message releasing a lock on a resource (by resource in this context, we mean some specific block of
data that may be read or written). One thing to watch out for is that we still need to preserve serial
execution: if two transactions are accessing the same set of objects, the results must be the same as
if the transactions executed in some order (transaction A cannot modify some data while transaction
B modifies some other data and then transaction A accesses that modified data; this is concurrent
modification). To ensure serial ordering on resource access, we impose a restriction that states that
a transaction is not allowed to get any new locks after it has released a lock. This is known as
two-phase locking. The first phase of the transaction is a growing phase in which it acquires the
locks it needs. The second phase is the shrinking phase, where locks are released.
Strict two-phase locking
A problem with two-phase locking is that if a transaction aborts, some other transaction may have
already used data from an object that the aborted transaction modified and then unlocked. If this
happens, any such transactions will also have to be aborted. This situation is known as cascading
aborts. To avoid this, we can strengthen our locking by requiring that a transaction will hold all its
locks to the very end: until it commits or aborts rather than releasing the lock when the object is no
longer needed.
This is known as strict two-phase locking.
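A minimal sketch of strict two-phase locking follows, using exclusive locks only; the LockManager and Transaction names are illustrative, and a real server would add shared locks, deadlock handling, and much more:

```python
# Strict two-phase locking: locks are acquired as items are touched
# (growing phase) and every lock is held until commit (shrinking phase
# happens all at once at the end), so no dirty data is ever exposed.

import threading

class LockManager:
    def __init__(self):
        self._locks = {}              # item -> Lock
        self._guard = threading.Lock()

    def acquire(self, item):
        with self._guard:
            lock = self._locks.setdefault(item, threading.Lock())
        lock.acquire()                # blocks until the holder releases
        return lock

class Transaction:
    def __init__(self, mgr):
        self.mgr = mgr
        self.held = []                # growing phase: locks only accumulate

    def lock(self, item):
        self.held.append(self.mgr.acquire(item))

    def commit(self):
        # Strict 2PL: release everything only at the very end.
        for lock in reversed(self.held):
            lock.release()
        self.held.clear()

mgr = LockManager()
t1 = Transaction(mgr)
t1.lock("x")
t1.lock("y")      # second lock acquired while still holding the first
t1.commit()       # all locks released together

t2 = Transaction(mgr)
t2.lock("x")      # would have blocked had t1 not committed
t2.commit()
print("both transactions completed")
```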
Locking granularity
A typical system will have many objects and typically a transaction will access only a small amount of
data at any given time (and it will frequently be the case that a transaction will not clash with other
transactions). The granularity of locking affects the amount of concurrency we can achieve. If we can
have a smaller granularity (lock smaller objects or pieces of objects) then we can generally achieve
higher concurrency. For example, suppose that all of a bank's customers are locked for any
transaction that needs to modify a single customer datum: concurrency is severely limited, because
any other transactions that need to access any customer data will be blocked. If, however, we use a
customer record as the granularity of locking, transactions that access different customer records
will be capable of running concurrently.
Multiple readers/single writer
There is no harm having multiple transactions read from the same object as long as it has not been
modified by any of the transactions. This way we can increase concurrency by having multiple
transactions run concurrently if they are only reading from an object. However, only one transaction
should be allowed to write to an object. Once a transaction has modified an object, no other
transactions should be allowed to read or write the modified object. To support this, we now use two
locks: read locks and write locks. Read locks are also known as shared locks (since they can be
shared by multiple transactions). If a transaction needs to read an object, it will request a read lock
from the lock manager. If a transaction needs to modify an object, it will request a write lock from the
lock manager. If the lock manager cannot grant a lock, then the transaction will wait until it can get
the lock (after the transaction holding the lock has committed or aborted). To summarize lock
granting:

If a transaction has:    another transaction may obtain:
no locks                 read lock or write lock
read lock                read lock (wait for write lock)
write lock               wait for read or write locks
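The lock-granting summary can be written as a small compatibility table; the mode names below are illustrative, with None standing for "no locks held":

```python
# The read/write lock compatibility table: a request is granted only
# when it is compatible with what is already held on the object.

COMPATIBLE = {
    (None, "read"): True,     (None, "write"): True,
    ("read", "read"): True,   ("read", "write"): False,
    ("write", "read"): False, ("write", "write"): False,
}

def may_grant(held, requested):
    """Can `requested` be granted while `held` is outstanding?"""
    return COMPATIBLE[(held, requested)]

assert may_grant(None, "write")          # no locks held: anything goes
assert may_grant("read", "read")         # readers share the object
assert not may_grant("read", "write")    # a writer waits for readers
assert not may_grant("write", "read")    # readers wait for the writer
print("compatibility table OK")
```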
Increasing concurrency: two-version locking
Two-version locking is an optimistic concurrency control scheme that allows one transaction to write
tentative versions of objects while other transactions read from committed versions of the same
objects. Read operations only wait if another transaction is currently committing the same object.
This scheme allows more concurrency than read-write locks, but writing transactions risk waiting (or
rejection) when they attempt to commit. Transactions cannot commit their write operations
immediately if other uncommitted transactions have read the same objects. Transactions that request
to commit in this situation have to wait until the reading transactions have completed.
Two-version locking
The two-version locking scheme requires three types of locks: read, write, and commit locks. Before
an object is read, a transaction must obtain a read lock. Before an object is written, the transaction
must obtain a write lock (same as with two-phase locking). Neither of these locks will be granted if
there is a commit lock on the object. When the transaction is ready to commit:
- all of the transaction's write locks are changed to commit locks
- if any objects used by the transaction have outstanding read locks, the transaction
must wait until the transactions that set these locks have completed and the locks are released.
If we compare the performance difference between two-version locking and strict two-phase locking
(read/write locks):
- read operations in two-version locking are delayed only while transactions are being committed
rather than during the entire execution of transactions (usually the commit protocol takes far less
time than the time to perform the transaction)
- but read operations of one transaction can cause a delay in the committing of
other transactions.
(private workspace) of the objects it updates - copy of the most recently committed version. Write
operations record new values as tentative values. Before a transaction can commit, a validation is
performed on all the data items to see whether the data conflicts with operations of other
transactions. This is the
validation phase.
If the validation fails, then the transaction will have to be aborted and restarted later. If the
validation succeeds, then the changes in the tentative version are made permanent. This is the
update phase. Optimistic control is deadlock-free and allows for maximum parallelism (at the
expense of possibly restarting transactions).
Timestamp ordering
Reed presented another approach to concurrency control in 1983. This is called timestamp
ordering. Each transaction is assigned a unique timestamp when it begins (can be from a physical or
logical clock). Each object in the system has a read and write timestamp associated with it (two
timestamps per object). The read timestamp is the timestamp of the last committed transaction that
read the object. The write timestamp is the timestamp of the last committed transaction that modified
the object (note: these timestamps are taken from the transaction timestamp, i.e., the start of that
transaction).
The rule of timestamp ordering is:
- if a transaction wants to write an object, it compares its own timestamp with the object's read and
write timestamps. If the object's timestamps are older, then the ordering is good.
- if a transaction wants to read an object, it compares its own timestamp with the object's write
timestamp. If the object's write timestamp is older than the current transaction's, then the ordering
is good.
If a transaction attempts to access an object and does not detect proper ordering, the transaction is
aborted and restarted (improper ordering means that a newer transaction came in and modified data
before the older one could access the data or read data that the older one wants to modify).
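These two rules can be sketched directly, assuming each object carries the read and write timestamps described above; the class and function names are illustrative:

```python
# Basic timestamp ordering: an access whose transaction timestamp is
# older than a conflicting committed access causes an abort/restart.

class TSObject:
    def __init__(self):
        self.read_ts = 0    # timestamp of the last transaction that read it
        self.write_ts = 0   # timestamp of the last transaction that wrote it

def try_read(obj, ts):
    if ts < obj.write_ts:          # a newer transaction already wrote it
        return False               # abort and restart with a new timestamp
    obj.read_ts = max(obj.read_ts, ts)
    return True

def try_write(obj, ts):
    if ts < obj.read_ts or ts < obj.write_ts:
        return False               # a newer transaction already saw/wrote it
    obj.write_ts = ts
    return True

x = TSObject()
assert try_write(x, ts=5)          # T5 writes x
assert try_read(x, ts=7)           # T7 reads it: fine, 7 >= 5
assert not try_write(x, ts=6)      # T6 is older than reader T7: abort
assert try_write(x, ts=8)          # T8 may write
print("timestamp ordering rules hold")
```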
10.9 Summary
When two or more transactions are running concurrently, the steps of the transactions
would normally be interleaved. The interleaved execution of transactions is decided by the
database scheduler, which receives a stream of user requests that arise from the active
transactions. A particular sequencing (usually interleaved) of the actions of a set of
transactions is called a schedule. A serial schedule is a schedule in which all the
operations of one transaction are completed before another transaction can begin (that is,
there is no interleaving).
Two executions are conflict-equivalent, if in both executions all conflicting operations have the
same order.
To allow multiple concurrent transactions access to the same data, most database servers use
a two-phase locking protocol. Each transaction locks sections of the data that it reads or
updates to prevent others from seeing its uncommitted changes. Only when the transaction is
committed or rolled back can the locks be released. This was one of the earliest methods of
concurrency control, and is used by most database systems.
Several object-oriented databases, which were developed more recently, have incorporated
OCC within their designs to gain the performance advantages inherent in this
technological approach.
UNIT 11
TRANSACTION MANAGEMENT
But decision-support requirements have evolved far beyond canned reports and known queries.
Today's data warehouses must give organizations the in-depth and accurate information they need to
personalize customer interactions at all touch points and convert browsers to buyers. An RDBMS
designed for transaction processing can't keep up with the demands placed on data warehouses:
support for high concurrency, mixed-workload, detail data, fast query response, fast data load, ad
hoc queries, and high-volume data mining.
The notion of concurrency control is closely tied to the notion of a "transaction". A transaction defines a set of "indivisible" steps, that is, commands with the Atomicity, Consistency, Isolation, and Durability (ACID) properties:
Atomicity: Either all or none of the steps of the transaction occur, so that the invariants of the shared objects are maintained. A transaction is typically aborted by the system in response to failures, but it may also be aborted by a user to "undo" its actions. In either case, the user is informed about the success or failure of the transaction.
Consistency: A transaction takes a shared object from one legal state to another, that is, maintains
the invariant of the shared object.
Isolation: Events within a transaction are hidden from other concurrently executing transactions. Techniques for achieving isolation are called synchronization schemes. They determine how these transactions are scheduled, that is, what the relationships are between the times at which the different steps of these transactions execute. Isolation is required to ensure that concurrent transactions do not cause an illegal state in the shared object and to prevent cascaded rollbacks when a transaction aborts.
Durability: Once the system tells the user that a transaction has completed successfully, it ensures
that values written by the database system persist until they are explicitly overwritten by other
transactions.
Consider the schedules S1, S2, S3, S4 and S5 given below. Draw the precedence graphs for each
schedule and state whether each schedule is (conflict) serializable or not. If a schedule is serializable,
write down the equivalent serial schedule(s).
S1 : read1(X), read3(X), write1(X), read2(X), write3(X).
S2 : read1(X), read3(X), write3(X), write1(X), read2(X).
S3 : read3(X), read2(X), write3(X), read1(X), write1(X).
S4 : read3(X), read2(X), read1(X), write3(X), write1(X).
S5 : read2(X), write3(X), read2(Y), write4(Y), write3(Z), read1(Z), write4(X), read1(X), write2(Y), read1(Y).
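Schedules like the ones above can also be checked mechanically: build the precedence graph by adding an edge Ti -> Tj whenever an operation of Ti conflicts with (precedes, touches the same item as, and at least one of the pair writes) a later operation of Tj, then test the graph for a cycle. The sketch below is a straightforward Python rendition of that procedure; the encoding of a schedule as (operation, transaction, item) triples is our own convention for the example.

```python
# Conflict-serializability test: build the precedence graph and look
# for a cycle.  A schedule is a list of (op, txn, item) triples.

def precedence_edges(schedule):
    edges = set()
    for i, (op1, t1, x1) in enumerate(schedule):
        for op2, t2, x2 in schedule[i + 1:]:
            # Two operations conflict if they belong to different
            # transactions, touch the same item, and at least one writes.
            if t1 != t2 and x1 == x2 and 'w' in (op1, op2):
                edges.add((t1, t2))
    return edges

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    def reachable(start, goal, seen):
        for nxt in graph.get(start, ()):
            if nxt == goal or (nxt not in seen and reachable(nxt, goal, seen | {nxt})):
                return True
        return False
    return any(reachable(t, t, set()) for t in graph)

def conflict_serializable(schedule):
    return not has_cycle(precedence_edges(schedule))

S3 = [('r', 3, 'X'), ('r', 2, 'X'), ('w', 3, 'X'), ('r', 1, 'X'), ('w', 1, 'X')]
S4 = [('r', 3, 'X'), ('r', 2, 'X'), ('r', 1, 'X'), ('w', 3, 'X'), ('w', 1, 'X')]
print(conflict_serializable(S3))   # True: equivalent to the serial order T2, T3, T1
print(conflict_serializable(S4))   # False: T1 -> T3 -> T1 forms a cycle
```

For S3 the edges are T2 -> T3, T2 -> T1 and T3 -> T1, which is acyclic, so T2 T3 T1 is an equivalent serial schedule; S4 adds T1 -> T3 as well, closing a cycle.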
Serializability is the classical concurrency scheme. It ensures that a schedule for executing concurrent transactions is equivalent to one that executes the transactions serially in some order. It assumes that all accesses to the database are done using read and write operations. A schedule is called "correct" if we can find a serial schedule that is "equivalent" to it. Given a set of transactions T1...Tn, two schedules S1 and S2 of these transactions are equivalent if the following conditions are satisfied:
Read-Write Synchronization: If a transaction reads a value written by another transaction in one
schedule, then it also does so in the other schedule.
Write-Write Synchronization: If a transaction overwrites the value of another transaction in one
schedule, it also does so in the other schedule.
Recoverability for changes to the other control file record sections is provided by maintaining all
the information in duplicate. Two physical blocks represent each logical block. One contains the
current information, and the other contains either an old copy of the information, or a pending
version that is yet to be committed. To keep track of which physical copy of each logical block
contains the current information, Oracle maintains a block version bitmap with the database
information entry in the first record section of the control file.
Recovery is an algorithmic process and should be kept as simple as possible, since complex
algorithms are likely to introduce errors. Therefore, an encoding scheme should be designed around a
set of principles intended to make recovery possible with simple algorithms. For processes such as
tag removal, simple mappings are more straightforward and less error prone than, say, algorithms
which require rearrangement of the sequence of elements, or which are context-dependent, etc.
Therefore, in order to provide a coherent and explicit set of recovery principles, various recovery
algorithms and related encoding principles need to be worked out, taking into account such things
as:
The role and nature of mappings (tags to typography, normalized characters, spellings, etc., with the
original ...);
Definitions and separability of the source and annotation (such as linguistic annotation,
notes, etc.);
11.3 View Serializability
In this section we assume serializability as the underlying consistency criterion.
Serializability requires that concurrent execution of transactions have the same effect as a serial
schedule. In a serial schedule, transactions are executed one at a time. Given that the database is
initially consistent and each transaction is a unit of consistency, serial execution of all transactions
will give each transaction a consistent view of the database and will leave the database in a
consistent state. Since serializable execution is computationally equivalent to serial execution,
serializable execution is also deemed correct.
Two schedules H and H' over the same set of transactions are view equivalent if:
1. For each data item x, if a transaction Ti reads the initial value of x in H, then Ti also reads the initial value of x in H'.
2. For any unaborted Ti, Tj and for any x, if Ti reads x from Tj in H, then Ti reads x from Tj in H'.
3. For each x, if wi[x] is the final write of x in H, then it is also the final write of x in H'.
Assume that there is a transaction Tb which initializes the values of all the data objects. A schedule is view serializable if it is view equivalent to a serial schedule. Some examples:
r3[x] w4[x] w3[x] w6[x]: T3 reads from Tb; the final write of x is w6[x]; view equivalent to the serial schedule T3 T4 T6.
r3[x] w4[x] r7[x] w3[x] w7[x]: T3 reads from Tb; T7 reads from T4; the final write of x is w7[x]; view equivalent to T3 T4 T7.
r3[x] w4[x] w3[x]: T3 reads from Tb; the final write of x is w3[x]; not view serializable.
w1[x] r2[x] w2[x] r1[x]: T2 reads from T1; T1 reads from T2; not view serializable.
To test for view serializability, assume Tb issues writes for all data objects (as the first transaction) and Tf reads the values of all data objects (as the last transaction), and construct a labeled precedence graph:
1. Add an edge Ti -> Tj labeled 0 if transaction Tj reads from Ti.
2. For each data item Q such that Tj reads Q from Ti and Tk executes write(Q), where Tk != Tb and i != j != k, do the following: (a) if Ti = Tb and Tj != Tf, insert the edge Tj -> Tk labeled 0; (b) if Ti != Tb and Tj = Tf, insert the edge Tk -> Ti labeled 0; (c) if Ti != Tb and Tj != Tf, insert the pair of edges Tk -> Ti and Tj -> Tk with a new label p > 0.
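The reads-from relation and final writes used in the examples above can be computed directly from a schedule. The sketch below does exactly that and compares two schedules for view equivalence; the encoding of a schedule as (op, txn, item) triples is our own, and reading the initial value (i.e. from Tb) is represented by None. Note this only compares two given schedules; a full view-serializability test would have to try serial orders (the general problem is NP-complete).

```python
# View equivalence: two schedules (lists of (op, txn, item) triples)
# are view equivalent when every read observes the same write (or the
# initial value, denoted None) in both, and the final write of every
# item is the same in both.

def view_facts(schedule):
    reads, last_write = set(), {}
    for op, txn, item in schedule:
        if op == 'r':
            # The read observes the most recent earlier write, if any.
            reads.add((txn, item, last_write.get(item)))
        else:
            last_write[item] = txn
    return reads, last_write   # last_write now holds the final writers

def view_equivalent(h1, h2):
    return view_facts(h1) == view_facts(h2)

# r3[x] w4[x] w3[x] w6[x]  versus the serial schedule T3 T4 T6:
H      = [('r', 3, 'x'), ('w', 4, 'x'), ('w', 3, 'x'), ('w', 6, 'x')]
serial = [('r', 3, 'x'), ('w', 3, 'x'), ('w', 4, 'x'), ('w', 6, 'x')]
print(view_equivalent(H, serial))   # True: same reads-from, same final write
```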
11.4 Resolving Deadlocks
Use the Monitor Display to understand and resolve deadlocks. This section demonstrates how this is done. The steps assume that the deadlock is easy to recreate with the tested application running within the Optimizeit Thread Debugger. If this is not the case, use the Monitor Usage Analyzer instead.
To resolve a deadlock:
1. Recreate the deadlock.
2. Switch to the Monitor Display.
3. Identify which thread is not making progress. Usually, the thread is yellow because it is blocking on a monitor. Call that thread the blocking thread.
4. Select the Connection button to identify where the blocking thread is not making progress. Double-click the method to display the source code for the blocking method, as well as the methods calling it. This provides some context for where the deadlock occurs.
5. Identify which thread owns the unavailable monitor. Call this the locking thread.
6. Identify why the locking thread does not release the monitor. This can happen in the
following cases:
The locking thread is itself trying to acquire a monitor owned directly or indirectly by the
blocking thread. In this case, a bug exists since both the locking and the blocking threads
enter monitors in a different order. Changing the code to always enter monitors in the
same order will resolve this deadlock.
The locking thread is not releasing the monitor because it remains busy executing the
code. In this case, the locking thread is green because it uses some CPU. This type of bug
is not a real deadlock. It is an extreme contention issue caused by the locking thread
holding the monitor for too long, sometimes called thread starvation.
The locking thread is waiting for an I/O operation. In this case the locking thread is
purple. It is dangerous for a thread to perform an I/O operation while holding a monitor,
unless the only purpose of that monitor is to protect the objects used to perform the I/O.
A blocking I/O operation may never occur, causing the program to hang. Often these
situations can be resolved by releasing the monitor before performing the I/O.
The locking thread is waiting for another monitor. In this case, the locking thread is red.
It is equally dangerous to wait for a monitor while holding another monitor. The monitor
may never be notified, causing a deadlock. Often this situation can be resolved by
releasing the monitor that the blocking thread wants to acquire before waiting on the
monitor.
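The first case above, two threads entering monitors in a different order, is the classic deadlock, and the stated fix, always entering monitors in the same order, can be illustrated directly. This is a minimal sketch; the counter and thread bodies are invented for the illustration.

```python
# Deadlock avoidance by lock ordering: every thread that needs both
# locks acquires them in the same globally agreed order, so a cycle
# of threads each waiting on the other's lock can never form.

import threading

lock_a, lock_b = threading.Lock(), threading.Lock()
counter = {'transfers': 0}

def transfer(first, second):
    # Sort the pair so both threads take the locks in the same order,
    # regardless of the order in which they were requested.
    lo, hi = sorted((first, second), key=id)
    with lo:
        with hi:
            counter['transfers'] += 1

t1 = threading.Thread(target=transfer, args=(lock_a, lock_b))
t2 = threading.Thread(target=transfer, args=(lock_b, lock_a))
t1.start(); t2.start()
t1.join(); t2.join()
print(counter['transfers'])   # 2: both threads complete, no deadlock
```

Without the sorting step, t1 could hold lock_a while t2 holds lock_b, each waiting forever for the other; imposing one global order makes that wait cycle impossible.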
11.5 Distributed Databases
To support the data collection needs of large networks, a distributed multi-tier architecture is
necessary. The advantage of this approach is that massive scalability can be implemented in an
incremental fashion, easing the operations and financial burden. Web NMS Server has been designed
to scale from a single server solution to a distributed architecture supporting very large scale needs.
The architecture supports one or more relational databases (RDBMS).
A single back-end server collects the data and stores it in a local or remote database. The system
readily supports tens of thousands of managed objects. For example, on a 400 MHz Pentium
Windows NT system, the polling engine can collect over 4000 collected variables/minute, including
storage in the MySQL database on Windows.
The bottleneck for data collection is usually the database inserts, which limits the number of entries
per second that can be inserted into the database. As we discuss below, with a distributed database,
considerably higher performance is possible through distributing the storage of data into multiple
databases.
Based on tests with different modes, one central database on commodity hardware can handle up to
100 collected variables/second; with distributed databases, this can be scaled much higher. The
achievable rate depends on the number of databases and the number and type of servers used. With
distributed databases, there is often a need to aggregate data in a single central store for reporting
and other purposes. Multiple approaches are feasible here:
Roll-up data periodically to the central database from the different distributed databases.
Use Native database distribution for centralized views, e.g. Oracle SQL Net. This is vendor
dependent, but can provide easy consolidation of data from multiple databases.
Aggregate data using JDBC only when creating a report. This would require the report writer
to take care of collecting the data from the different databases for the report.
The solution is Distributed Polling. You can adopt this technique when you are able to distinguish
the network elements geographically. You can form a group of network elements and decide to have
one Distributed Poller for them.
This section describes Distributed Polling architecture available with Web NMS Server. It discusses
the design, and the choices available in implementing the distributed solution. It provides guidelines
on setting up the components of the distributed system.
You have the Web NMS server running on one machine and a Distributed Poller running on each of the other machines.
Each Poller is identified by a name and has an associated database (labelled as Secondary
RDBMS in the diagram)
You create PolledData and specify the Poller name if you want to perform data collection for
that PolledData via the distributed poller. In case you want Web NMS Polling Engine to collect
data you don't specify any Poller name. By default, PolledData will not be associated with any
of the Pollers.
Once you associate the PolledData with the Poller and start the Poller, data collection is done by the Poller and the collected data is stored in the Poller database (Secondary RDBMS).
To the user, the distributed database system should appear exactly like a non-distributed database
system.
Advantages of distributed database systems are:
o Local autonomy (in enterprises that are distributed already)
o Improved performance (since data is stored close to where it is needed, and a query may be split across sites and run in parallel)
o Economics
o Expandability
o Shareability
Disadvantages of distributed database systems are:
o Cost (software development can be much more complex and therefore costly; also, the exchange of messages between sites adds communication overhead)
o Security (since the system is distributed, the chances of security lapses are greater)
o Difficult to change (since all sites have control of their own sites)
Replication improves availability since the system would continue to be fully functional even if a site
goes down. Replication also allows increased parallelism since several sites could be operating on the
same relations at the same time. Replication does result in increased overheads on update.
Fragmentation may be horizontal, vertical or hybrid (or mixed). Horizontal fragmentation splits a
relation by assigning each tuple of the relation to a fragment of the relation. Often horizontal
fragmentation is based on predicates defined on that relation.
Vertical fragmentation splits the relation by decomposing it into several subsets of the attributes. Relation R produces fragments R1, R2, ..., Rn, each of which contains a subset of the attributes of R as well as the primary key of R. The aim of vertical fragmentation is to put together those attributes that are accessed together.
Mixed fragmentation uses both vertical and horizontal fragmentation.
To obtain a sensible fragmentation design, it is necessary to know some information about the
database as well as about applications. It is useful to know the predicates used in the application
queries - at least the 'important' ones.
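Both kinds of fragmentation can be tried out with any SQL engine. The sketch below uses SQLite through Python's standard sqlite3 module; the employee table and the fragmentation predicate (split by city) are invented for the illustration.

```python
# Horizontal fragmentation assigns whole tuples to fragments by a
# predicate; vertical fragmentation projects attribute subsets, each
# carrying the primary key so the relation can be rejoined.
import sqlite3

db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE emp (id INTEGER PRIMARY KEY, name TEXT, city TEXT, salary INTEGER)")
db.executemany("INSERT INTO emp VALUES (?, ?, ?, ?)",
               [(1, 'Asha', 'Delhi', 500), (2, 'Ravi', 'Mumbai', 700)])

# Horizontal fragments, split on the predicate city = 'Delhi':
db.execute("CREATE TABLE emp_delhi AS SELECT * FROM emp WHERE city = 'Delhi'")
db.execute("CREATE TABLE emp_other AS SELECT * FROM emp WHERE city <> 'Delhi'")

# Vertical fragments, each keeping the primary key id:
db.execute("CREATE TABLE emp_pay AS SELECT id, salary FROM emp")
db.execute("CREATE TABLE emp_bio AS SELECT id, name, city FROM emp")

# Reconstruction: UNION of horizontal fragments, JOIN of vertical ones.
horiz = db.execute("SELECT COUNT(*) FROM (SELECT * FROM emp_delhi "
                   "UNION SELECT * FROM emp_other)").fetchone()[0]
vert = db.execute("SELECT COUNT(*) FROM emp_bio JOIN emp_pay USING (id)").fetchone()[0]
print(horiz, vert)   # 2 2: both reconstructions recover every tuple
```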
Cost models
Computing cost itself can be complex, since the cost is a weighted combination of the I/O, CPU and communication costs. Often one of two cost models is used: one may wish to minimize either the total cost (time) or the response time. Fragmentation and replication add further complexity to finding an optimum query plan.
Date's 12 Rules for Distributed Databases
A distributed RDBMS should in all other respects behave exactly like a non-distributed RDBMS. This is sometimes called Rule 0.
Distributed Database Characteristics
According to Oracle, these are the database characteristics and how Oracle 7 technology meets each
point:
1. Local autonomy The data is owned and managed locally. Local operations remain purely local. One
site (node) in the distributed system does not depend on another site to function successfully.
2. No reliance on a central site All sites are treated as equals. Each site has its own data dictionary.
3. Continuous operation Incorporating a new site has no effect on existing applications and does not
disrupt service.
4. Location independence Users can retrieve and update data independent of the site.
5. Partitioning [fragmentation] independence Users can store parts of a table at different locations.
Both horizontal and vertical partitioning of data is possible.
6. Replication independence Stored copies of data can be located at multiple sites. Snapshots, a type
of database object, can provide both read-only and updatable copies of tables. Symmetric replication
using triggers makes readable and writable replication possible.
7. Distributed query processing Users can query a database residing on another node. The query is
executed at the node where the data is located.
8. Distributed transaction management A transaction can update, insert, or delete data from multiple
databases. The two-phase commit mechanism in Oracle ensures the integrity of distributed
transactions. Row-level locking ensures a high level of data concurrency.
9. Hardware independence Oracle7 runs on all major hardware platforms.
10. Operating system independence A specific operating system is not required. Oracle7 runs under a
variety of operating systems.
11. Network independence Oracle's SQL*Net supports most popular networking software.
Network independence allows communication across homogeneous and heterogeneous networks.
Oracle's MultiProtocol Interchange enables applications to communicate with databases across
multiple network protocols.
12. DBMS independence DBMS independence is the ability to integrate different databases. Oracle's
Open Gateway technology supports ODBC-enabled connections to non-Oracle databases.
11.6 Distributed Commit
To create a new user, test, and a corresponding default schema you must be connected as the ADMIN
user and then use:
CREATE USER test;
CREATE SCHEMA test AUTHORIZATION test; --sets the default schema
COMMIT;
and then connect to the new user/schema using:
CONNECT TO '' USER 'test'
Notice that the COMMIT was needed before the CONNECT because re-connecting would otherwise
rollback any uncommitted changes.
In this example the sequence of events is as follows:
The coordinator at Client A registers automatically with the Transaction Manager database at Server
B, using TM_DATABASE=TMB.
The application requester at Client A issues a DUOW request to Servers C and E. For example, the
following REXX script illustrates this:
/**/
'set DB2OPTIONS=+c' /* in order to turn off autocommit */
'db2 set client connect 2 syncpoint twophase'
'db2 connect to DBC user USERC using PASSWRDC'
'db2 create table twopc (title varchar(50), artno smallint not null)'
'db2 insert into twopc (title,artno) values("testCCC",99)'
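The two-phase commit protocol behind such distributed units of work can be sketched in a few lines: the coordinator first asks every participant to prepare (vote), and only if all vote yes does it tell them to commit; a single no vote aborts the transaction everywhere. The participant class below is a toy model of the protocol, not the DB2 or Oracle implementation.

```python
# Two-phase commit (toy model).  Phase 1: every participant votes on
# whether it can commit.  Phase 2: commit everywhere only if every
# vote was yes; otherwise roll back everywhere.

class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.state = name, can_commit, 'active'

    def prepare(self):                 # phase 1: vote yes/no
        self.state = 'prepared' if self.can_commit else 'aborted'
        return self.can_commit

    def finish(self, commit):          # phase 2: apply the decision
        if self.state != 'aborted':
            self.state = 'committed' if commit else 'aborted'

def two_phase_commit(participants):
    all_yes = all(p.prepare() for p in participants)
    for p in participants:
        p.finish(all_yes)
    return all_yes

servers = [Participant('ServerC'), Participant('ServerE')]
print(two_phase_commit(servers), [p.state for p in servers])
# True ['committed', 'committed']
```

Because the decision is taken only after every vote is in, no participant can end up committed while another is rolled back, which is exactly the integrity guarantee the text attributes to the two-phase commit mechanism.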
11.7 Distributed Locking
The intent of this white paper is to convey information regarding database locks as they apply to transactions in general, and the more specific case of how they are implemented by the Progress server. We'll begin with a general overview discussing why locks are needed and how they affect
transactions. Transactions and locking are outlined in the SQL standard, so no introduction would be complete without discussing the guidelines set forth there. Once we have a grasp on the general concepts of locking, we'll dive into lock modes, such as table and record locks, and their effect on different types of database operations. Next, the subject of timing will be introduced: when locks are obtained and when they are released. From here we'll get into lock contention and deadlocks, which occur when multiple operations or transactions all attempt to get locks on the same resource at the same time. And to conclude our discussion on locking, we'll take a look at how we can see locks in our application so we know which transactions obtain which types of locks. Finally, this white paper describes differences in locking behavior between previous and current versions of Progress, and differences in locking behavior when both 4GL and SQL92 clients are accessing the same resources.
Locks
The answer to why we lock is simple: if we didn't, there would be no consistency. Consistency provides us with successive, reliable, and uniform results, without which applications such as banking and reservation systems, and manufacturing, chemical, and industrial data collection and processing, could not exist. Imagine a banking application where two clerks attempt to update an account balance at the same time: one credits the account and the other debits the account. While one clerk reads the account balance of $200 to credit the account $100, the other clerk has already completed the debit of $100 and updated the account balance to $100. When the first clerk finishes the credit of $100 to the balance of $200 and updates the balance to $300, it will be as if the debit never happened. Great for the customer; however, the bank wouldn't be in business for long.
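The clerk scenario is the classic lost update. Holding a lock on the account across the whole read-modify-write sequence serializes the two clerks, so the credit and debit always net out. This is a minimal sketch; the account structure and thread bodies are invented for the illustration.

```python
# Lost-update prevention: each clerk locks the account for the whole
# read-modify-write, so a $100 credit and a $100 debit against a $200
# balance always net out to $200.

import threading

account = {'balance': 200}
account_lock = threading.Lock()

def post(amount):
    with account_lock:                        # lock held across read and write
        balance = account['balance']          # read
        account['balance'] = balance + amount # write based on that read

credit = threading.Thread(target=post, args=(100,))
debit = threading.Thread(target=post, args=(-100,))
credit.start(); debit.start()
credit.join(); debit.join()
print(account['balance'])   # 200: neither update is lost
```

If the lock were released between the read and the write, one thread could write a balance computed from a stale read and silently discard the other's update, exactly as in the two-clerk story.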
What objects are we locking?
What database objects get locked is not as simple to answer as why they're locked. From a user perspective, objects such as the information schema, user tables, and user records are locked while
being accessed to maintain consistency. There are other lower level objects that require locks that are
handled by the RDBMS; however, they are not visible to the user. For the purposes of this discussion
we will focus on the objects that the user has visibility of and control over.
Transactions
Now that we know why and what we lock, let's talk a bit about when we lock. A transaction is a unit of work; there is a well-defined beginning and end to each unit of work. At the beginning of each transaction certain locks are obtained, and at the end of each transaction they are released. During any given transaction, the RDBMS, on behalf of the user, can escalate, de-escalate, and even release locks as required. We'll talk about this in more detail later when we discuss lock modes. The aforementioned is all true in the case of a normal, successful transaction; however, in the case of an abnormally terminated transaction things are handled a bit differently. When a transaction fails, for any reason, the actions performed by the transaction need to be backed out, the changes undone. To accomplish this, most RDBMSs use what are known as save points. A save point marks the last known good point prior to the abnormal termination; typically this is the beginning of the transaction. It is the RDBMS's job to undo the changes back to the previous save point as well as to ensure the proper locks are held until the transaction is completely undone. So, as you can see, transactions that are in the process of being undone (rolled back) are still transactions nonetheless and still need locks to maintain data consistency. Locking certain objects for the duration of a transaction ensures database consistency and isolation from other concurrent transactions, preventing the banking situation we described previously. Transactions are the basis for the ACID properties:
ATOMICITY guarantees that all operations within a transaction are performed or none of them are performed.
CONSISTENCY is the concept that allows an application to define consistency points and validate the correctness of data transformations from one state to the next.
ISOLATION guarantees that concurrent transactions have no effect on each other.
DURABILITY guarantees that all transaction updates are preserved.
11.8 Summary
Transaction processing is a type of computer processing in which the computer responds immediately to user requests. Each request is considered to be a transaction. Automatic teller machines for banks are an example of transaction processing. The opposite of transaction processing is batch processing, in which a batch of requests is stored and then executed all at one time. Transaction processing requires interaction with a user, whereas batch processing can take place without a user being present.
Serializability is the classical concurrency scheme. It ensures that a schedule for executing concurrent transactions is equivalent to one that executes the transactions serially in some order. It assumes that all accesses to the database are done using read and write operations. A schedule is called "correct" if we can find a serial schedule that is "equivalent" to it. Given a set of transactions T1...Tn, two schedules S1 and S2 of these transactions are equivalent if the following conditions are satisfied: Read-Write Synchronization: If a transaction reads a value written by another transaction in one schedule, then it also does so in the other schedule. Write-Write Synchronization: If a transaction overwrites the value of another transaction in one schedule, it also does so in the other schedule.