
Table of Contents

Chapter 1: Introduction..............................................................................................................................................................6
1. Data Abstraction............................................................................................................................................................6
2. Instances and Schemas..................................................................................................................................................6
3. Data Models..................................................................................................................................................................7
4. Database Languages......................................................................................................................................................7
a. Data-Manipulation Language...................................................................................................................................7
b. Data-Definition Language........................................................................................................................................8
Chapter 2: Entity Relationship Model........................................................................................................................................9
1. The Entity......................................................................................................................................................................9
a. Entity Type...............................................................................................................................................................9
b. Entity Instance........................................................................................................................................................10
c. Entity Set.................................................................................................................................................................10
2. Attributes.....................................................................................................................................................................10
a. Simple or Composite Attributes.............................................................................................................................11
b. Single valued or multi-valued Attributes................................................................................................................11
c. Stored or Derived Attributes...................................................................................................................................11
3. The Keys.....................................................................................................................................................................12
a. Super key................................................................................................................................................................12
b. Candidate key.........................................................................................................................................................13
c. Primary Key............................................................................................................................................................13
d. Alternate Keys........................................................................................................................................................14
e. Secondary Key........................................................................................................................................................14
f. Foreign Key............................................................................................................................................................14
4. Relationships...............................................................................................................................................................15
a. Role of the entity in Relationship...........................................................................................................................16
b. Cardinality:.............................................................................................................................................................17
c. Dependencies..........................................................................................................................................................17
d. Relationship Degree................................................................................................................................................18
Chapter 3: Relational Data Model............................................................................................................................................19
1. Properties of database relations...................................................................................................................................19
2. Significance of Constraints:........................................................................................................................................20
a. Null Constraints......................................................................................................................................................20
b. Default Value..........................................................................................................................................................21
c. Domain Constraint..................................................................................................................................................21
Chapter 4: ER to Relational mapping.......................................................................................................................................22
1. Mapping Entity Types.................................................................................................................................................22
a. Composite Attributes..............................................................................................................................................22
b. Multi-valued Attributes...........................................................................................................................................22
2. Mapping Relationships................................................................................................................................................23
a. Binary Relationships...............................................................................................................................................23
b. Unary Relationship.................................................................................................................................................24
c. Super / Subtype Relationship:.................................................................................................................................24
Chapter 5: Relational Algebra..................................................................................................................................................25
1. Unary Operations........................................................................................................................................................25
a. The Select Operation:.............................................................................................................................................25
b. The Project Operator...............................................................................................................................................25
1. Binary Operations:......................................................................................................................................................26
a. The Union Operation:.............................................................................................................................................26
b. The Intersection Operation.....................................................................................................................................26
c. Difference Operator:...............................................................................................................................................27
d. Cartesian product:...................................................................................................................................................27
2. Join Operation:............................................................................................................................................................27
a. Theta Join................................................................................................................................................................28
b. EquiJoin..................................................................................................................................................................28
c. Natural Join:............................................................................................................................................................28
d. Outer Join................................................................................................................................................................29
Chapter 6: Relational Calculus.................................................................................................................................................30
1. Tuple Oriented Relational Calculus:...........................................................................................................................30
2. Domain Oriented Relational Calculus:.......................................................................................................................30
a. Functional Dependency..........................................................................................................................................30
b. Normalization.........................................................................................................................................................31
c. Level of Normalization...........................................................................................................................................32
Chapter 7: SQL.........................................................................................................................................................................36
1. Sql queries...................................................................................................................................................................36
a. Join operators..........................................................................................................................................................36
a. Sub queries..............................................................................................................................................................37
2. Sql Functions...............................................................................................................................................................40
a. Date and time functions..........................................................................................................................................40
b. Numeric Functions..................................................................................................................................................41
c. String Functions......................................................................................................................................................41
d. Conversion functions..............................................................................................................................................41
3. Procedural SQL...........................................................................................................................................................42

a. Triggers...................................................................................................................................................................43
b. Stored Procedures...................................................................................................................................................44
c. PL/SQL stored procedures......................................................................................................................................45
Chapter 8: Indexing and Storage..............................................................................................................................................46
1. Introduction.................................................................................................................................................................46
2. Properties of Index......................................................................................................................................................46
3. Structure of Index........................................................................................................................................................47
4. Creating Index.............................................................................................................................................................47
5. Methods of Indexing...................................................................................................................................................47
a. Ordered Indices.......................................................................................................................................................47
b. Primary indexes......................................................................................................................................................48
c. Clustered indexes....................................................................................................................................................49
d. Non-clustered or Secondary Indexing....................................................................................................................49
e. Multilevel Indexing.................................................................................................................................................50
4. Is indexing similar to hashing?....................................................................................................................................50
Chapter 9: Query processing and Optimization.......................................................................................................................51
1. Query processing.........................................................................................................................................................51
a. Parsing....................................................................................................................................................................51
b. SQL Execution Phase.............................................................................................................................................52
c. SQL Fetching Phase................................................................................................................................................52
2. Query Processing Bottlenecks.....................................................................................................................................52
3. Query optimization......................................................................................................................................................53
Chapter 10: Transaction Management.....................................................................................................................................55
1. Transactions................................................................................................................................................................55
2. Transaction Operations...............................................................................................................................................55
3. Transaction States.......................................................................................................................................................55
4. Properties of Transaction............................................................................................................................................55
5. Schedules and Conflicts..............................................................................................................................................56
Conflicts in Schedules.....................................................................................................................................................56
Equivalence of Schedules................................................................................................................................................57
6. Concurrency controlling techniques............................................................................................................................57
a. Locking Based Concurrency Control Protocols.....................................................................................................57
b. Time stamping Concurrency Control.....................................................................................................................58
7. Transaction Management with SQL...........................................................................................................................58
4. Deadlocks....................................................................................................................................................................59
a. Deadlock prevention...............................................................................................................................................59
b. Deadlock detection.................................................................................................................................................59

c. Deadlock avoidance................................................................................................................................................60
Chapter 11: Distributed DBMS................................................................................................................................................61
1. Features of Distributed database.................................................................................................................................61
2. Features of Distributed DBMS....................................................................................................................................61
3. Factors Encouraging DDBMS....................................................................................................................................61
4. Advantages of Distributed Databases.........................................................................................................................62
5. Adversities of Distributed Databases..........................................................................................................................62
6. Distributed Database Vs Centralized Database...........................................................................................................62
7. Types of Distributed Databases...................................................................................................................................63
a. Homogeneous Distributed Databases.....................................................................................................................63
b. Heterogeneous Distributed Databases....................................................................................................................63
8. Distributed DBMS Architectures................................................................................................................................64
a. Client - Server Architecture for DDBMS...............................................................................................................64
b. Peer- to-Peer Architecture for DDBMS..................................................................................................................64
c. Multi - DBMS Architectures..................................................................................................................................65
9. Query Optimization.....................................................................................................................................................66
a. Query Optimization Issues in DDBMS..................................................................................................................66
b. Query Processing....................................................................................................................................................66
10. Distributed Transactions.........................................................................................................................................67
a. Commits Protocols..................................................................................................................................................67
b. Concurrency Control in Distributed Systems.........................................................................................................69
c. Deadlock Handling in Distributed Systems............................................................................................................70
Chapter 12: Object Oriented Database.....................................................................................................................................72
1. Characteristics of Object oriented database................................................................................................................72
2. Key DIFFERENCE between OODBMS and RDBMS...............................................................................................72
3. Object Oriented Data Model(OODM)........................................................................................................................73
4. Components of Object Oriented Data Model:.............................................................................................................73
a. Object Structure:.....................................................................................................................................................73
b. Object Classes:........................................................................................................................................................73
5. Object, Attributes and Identity....................................................................................................................................74
6. Object oriented methodologies....................................................................................................................................74
8. Advantages of Object oriented data model over Relational model.............................................................................74
9. Advantages of OODB over RDBMS Object..............................................................................................................75
Chapter 13: Database Security and Access Control.................................................................................................................76
1. Two types of database security mechanisms:.............................................................................................................76
a. Discretionary security mechanisms........................................................................................................................76
b. Mandatory security mechanisms............................................................................................................................76

2. Security Issues in Databases.......................................................................................................................................76
3. Database Security and the DBA..................................................................................................................................76
4. Discretionary Privileges..............................................................................................................................................77
a. The account level:...................................................................................................................................................77
b. The relation (or table level):...................................................................................................................................77
5. Granting and Revoking of Privileges..........................................................................................................................77
6. Access Control............................................................................................................................................................78
a. Discretionary access control (DAC).......................................................................................................................78
b. MANDATORY ACCESS CONTROL (MAC)......................................................................................................79
c. Role-Based Access Control....................................................................................................................................80
Chapter 14: XML and Web Services.......................................................................................................................................81
1. Comparison with Relational Data...............................................................................................................................81
2. Structure of XML Data...............................................................................................................................................82
3. XML Document Schema.............................................................................................................................................83
a. Document Type Definition.....................................................................................................................................84
c. XML Schema..........................................................................................................................................................85
4. Querying and Transformation.....................................................................................................................................87
a. Tree Model of XML...............................................................................................................................................87
b. XPath......................................................................................................................................................................88
d. XQuery....................................................................................................................................................................89
5. Relational Databases...................................................................................................................................................91
a. Storing as string......................................................................................................................................................91
b. Tree Representation................................................................................................................................................91
c. Map to Relations.....................................................................................................................................................91
6. SQL Extension............................................................................................................................................................92
7. Application Program Interface....................................................................................................................................93
8. XML Applications......................................................................................................................................................93
a. Storing and exchanging data with complex structures...........................................................................................93
b. Web Services..........................................................................................................................................................93
c. Data Mediation.......................................................................................................................................................94

Chapter 1: Introduction
A database-management system (DBMS) is a collection of interrelated data and a set of programs to access
those data. The collection of data, usually referred to as the database, contains information relevant to an
enterprise. The primary goal of a DBMS is to provide a way to store and retrieve database information that is
both convenient and efficient. Management of data involves both defining structures for storage of information
and providing mechanisms for the manipulation of information. In addition, the database system must ensure the
safety of the information stored, despite system crashes or attempts at unauthorized access.

1. Data Abstraction
For the system to be usable, it must retrieve data efficiently. The need for efficiency has led designers to use
complex data structures to represent data in the database. Since many database-system users are not computer
trained, developers hide the complexity from users through several levels of abstraction, to simplify users’
interactions with the system:

 Physical level. The lowest level of abstraction describes how the data are actually stored. The physical
level describes complex low-level data structures in detail.
 Logical level. The next-higher level of abstraction describes what data are stored in the database, and
what relationships exist among those data. The logical level thus describes the entire database in terms
of a small number of relatively simple structures. Although implementation of the simple structures at
the logical level may involve complex physical-level structures, the user of the logical level does not
need to be aware of this complexity. This is referred to as physical data independence. Database
administrators, who must decide what information to keep in the database, use the logical level of
abstraction.
 View level. The highest level of abstraction describes only part of the entire database. Even though the
logical level uses simpler structures, complexity remains because of the variety of information stored in
a large database. Many users of the database system do not need all this information; instead, they need
to access only a part of the database. The view level of abstraction exists to simplify their interaction
with the system. The system may provide many views for the same database.
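
As a small illustration of these levels, consider the following hedged SQL sketch (the instructor table and the view name are assumed for this example, not taken from a particular system). The CREATE TABLE statement describes data at the logical level, the view exposes only part of the data at the view level, and the physical level, how rows and indexes are laid out in files, stays hidden behind both:

    -- Logical level: what data are stored and how they are structured.
    CREATE TABLE instructor (
        id        CHAR(5),
        name      VARCHAR(50),
        dept_name VARCHAR(20),
        salary    NUMERIC(8,2)
    );

    -- View level: only part of the database, here hiding the salary column.
    CREATE VIEW instructor_public AS
        SELECT id, name, dept_name
        FROM instructor;

    -- Physical level: file organization, record layout, and indexes are chosen
    -- by the DBMS/DBA and are not visible in either statement above.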

2. Instances and Schemas


Databases change over time as information is inserted and deleted. The collection of information stored in the
database at a particular moment is called an instance of the database. The overall design of the database is
called the database schema. Schemas are changed infrequently, if at all. The concept of database schemas and
instances can be understood by analogy to a program written in a programming language. A database schema
corresponds to the variable declarations (along with associated type definitions) in a program. Each variable has
a particular value at a given instant. The values of the variables in a program at a point in time correspond to an
instance of a database schema.
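
Continuing the analogy, a brief hedged sketch in SQL (the department table is illustrative only): the CREATE TABLE statement plays the role of the schema, like a variable declaration, while the rows present after the INSERT statements form one instance of that schema, like the variables' current values:

    -- Schema: the declaration, analogous to variable/type declarations.
    CREATE TABLE department (
        dept_name VARCHAR(20),
        building  VARCHAR(15),
        budget    NUMERIC(12,2)
    );

    -- Instance: the data stored at this particular moment, analogous to the
    -- current values of the program's variables.
    INSERT INTO department VALUES ('Physics', 'Watson',  70000.00);
    INSERT INTO department VALUES ('History', 'Painter', 50000.00);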

Database systems have several schemas, partitioned according to the levels of abstraction. The physical schema
describes the database design at the physical level, while the logical schema describes the database design at the
logical level. A database may also have several schemas at the view level, sometimes called subschemas that
describe different views of the database. The physical schema is hidden beneath the logical schema, and can
usually be changed easily without affecting application programs. Application programs are said to exhibit
physical data independence if they do not depend on the physical schema, and thus need not be rewritten if the
physical schema changes.

3. Data Models
Underlying the structure of a database is the data model: a collection of conceptual tools for describing data,
data relationships, data semantics, and consistency constraints. A data model provides a way to describe the
design of a database at the physical, logical, and view levels. There are a number of different data models that
we shall cover in the text. The data models can be classified into four different categories:

 Relational Model. The relational model uses a collection of tables to represent both data and the
relationships among those data. Each table has multiple columns, and each column has a unique name.
Tables are also known as relations. The relational model is an example of a record-based model.
Record-based models are so named because the database is structured in fixed-format records of several
types. Each table contains records of a particular type. Each record type defines a fixed number of
fields, or attributes. The columns of the table correspond to the attributes of the record type. The
relational data model is the most widely used data model, and a vast majority of current database
systems are based on the relational model.
 Entity-Relationship Model. The entity-relationship (E-R) data model uses a collection of basic objects,
called entities, and relationships among these objects. An entity is a “thing” or “object” in the real world
that is distinguishable from other objects. The entity-relationship model is widely used in database
design.
 Object-Based Data Model. Object-oriented programming (especially in Java, C++, or C#) has become
the dominant software-development methodology. This led to the development of an object-oriented
data model that can be seen as extending the E-R model with notions of encapsulation, methods
(functions), and object identity. The object-relational data model combines features of the object-
oriented data model and relational data model.
 Semi structured Data Model. The semi structured data model permits the specification of data where
individual data items of the same type may have different sets of attributes. This is in contrast to the data
models mentioned earlier, where every data item of a particular type must have the same set of
attributes. The Extensible Markup Language (XML) is widely used to represent semi structured data.

Historically, the network data model and the hierarchical data model preceded the relational data model. These
models were tied closely to the underlying implementation, and complicated the task of modeling data.

4. Database Languages
A database system provides a data-definition language to specify the database schema and a data-manipulation
language to express database queries and updates. In practice, the data-definition and data-manipulation
languages are not two separate languages; instead they simply form parts of a single database language, such as
the widely used SQL language.

a. Data-Manipulation Language
A data-manipulation language (DML) is a language that enables users to access or manipulate data as organized
by the appropriate data model. The types of access are: Retrieval of information stored in the database, Insertion
of new information into the database, Deletion of information from the database and Modification of
information stored in the database. There are basically two types:

 Procedural DMLs require a user to specify what data are needed and how to get those data.
 Declarative DMLs (also referred to as nonprocedural DMLs) require a user to specify what data are
needed without specifying how to get those data.
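
As a hedged illustration of the declarative style (SQL is the most widely used declarative DML; the instructor table and its columns are assumed for this example), the query below states only what data are wanted, and the system decides how to retrieve them, for instance by scanning the table or using an index:

    -- Declarative: state WHAT is needed, not HOW to fetch it.
    SELECT name, salary
    FROM   instructor
    WHERE  dept_name = 'Physics';
    -- A procedural DML would instead spell out the steps: open the file,
    -- read each record, test its department field, and collect the matches.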

Declarative DMLs are usually easier to learn and use than are procedural DMLs. However, since a user does not
have to specify how to get the data, the database system has to figure out an efficient means of accessing data. A
query is a statement requesting the retrieval of information. The portion of a DML that involves information
retrieval is called a query language. Although technically incorrect, it is common practice to use the terms query
language and data-manipulation language synonymously. There are a number of database query languages in
use, either commercially or experimentally. The query processor component of the database system translates
DML queries into sequences of actions at the physical level of the database system.

b. Data-Definition Language
We specify a database schema by a set of definitions expressed by a special language called a data-definition
language (DDL). The DDL is also used to specify additional properties of the data. We specify the storage
structure and access methods used by the database system by a set of statements in a special type of DDL called
a data storage and definition language. These statements define the implementation details of the database
schemas, which are usually hidden from the users. The data values stored in the database must satisfy certain
consistency constraints. For example, suppose the university requires that the account balance of a department
must never be negative. The DDL provides facilities to specify such constraints. The database system checks
these constraints every time the database is updated. In general, a constraint can be an arbitrary predicate
pertaining to the database. However, arbitrary predicates may be costly to test. Thus, database systems
implement integrity constraints that can be tested with minimal overhead:

Domain Constraints: A domain of possible values must be associated with every attribute (for example, integer
types, character types, date/time types). Declaring an attribute to be of a particular domain acts as a constraint on
the values that it can take. Domain constraints are the most elementary form of integrity constraint. They are
tested easily by the system whenever a new data item is entered into the database.
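
A minimal sketch of domain constraints in SQL (table and column names assumed): each column is declared over a domain such as a numeric or character type, and a CHECK clause can narrow the permitted values further, here enforcing the rule that a department balance must never be negative:

    CREATE TABLE department (
        dept_name VARCHAR(20),
        building  VARCHAR(15),
        -- Domain: numeric values only; the CHECK narrows the domain further.
        budget    NUMERIC(12,2) CHECK (budget >= 0)
    );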

Referential Integrity: There are cases where we wish to ensure that a value that appears in one relation for a
given set of attributes also appears in a certain set of attributes in another relation (referential integrity). For
example, the department listed for each course must be one that actually exists. More precisely, the dept name
value in a course record must appear in the dept name attribute of some record of the department relation.
Database modifications can cause violations of referential integrity. When a referential integrity constraint is
violated, the normal procedure is to reject the action that caused the violation.
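
A hedged sketch of the course and department example in SQL (table layouts assumed): the FOREIGN KEY clause declares the referential-integrity constraint, and an insertion or update that violates it is rejected by the system:

    CREATE TABLE department (
        dept_name VARCHAR(20) PRIMARY KEY,
        building  VARCHAR(15)
    );

    CREATE TABLE course (
        course_id CHAR(8) PRIMARY KEY,
        title     VARCHAR(50),
        dept_name VARCHAR(20),
        -- dept_name in course must appear in department (referential integrity).
        FOREIGN KEY (dept_name) REFERENCES department (dept_name)
    );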

Assertions: An assertion is any condition that the database must always satisfy. Domain constraints and
referential-integrity constraints are special forms of assertions. However, there are many constraints that we
cannot express by using only these special forms. For example, “Every department must have at least five
courses offered every semester” must be expressed as an assertion. When an assertion is created, the system
tests it for validity. If the assertion is valid, then any future modification to the database is allowed only if it does
not cause that assertion to be violated.
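
Standard SQL provides the CREATE ASSERTION statement for such conditions, although few database systems actually implement it. A hedged sketch of the "at least five courses per department" rule follows (relation and column names assumed, and the per-semester detail omitted for brevity):

    CREATE ASSERTION dept_min_courses CHECK (
        NOT EXISTS (
            SELECT d.dept_name
            FROM   department d
            WHERE  (SELECT COUNT(*)
                    FROM   course c
                    WHERE  c.dept_name = d.dept_name) < 5
        )
    );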

Authorization: We may want to differentiate among the users as far as the type of access they are permitted on
various data values in the database. These differentiations are expressed in terms of authorization, the most
common being: read authorization, which allows reading, but not modification, of data; insert authorization,
which allows insertion of new data, but not modification of existing data; update authorization, which allows
modification, but not deletion, of data; and delete authorization, which allows deletion of data. We may assign
the user all, none, or a combination of these types of authorization.
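
In SQL these authorizations are given and withdrawn with the GRANT and REVOKE statements; a brief hedged sketch (the user names and table are illustrative only):

    -- Read authorization only.
    GRANT SELECT ON department TO ahmed;

    -- Insert and update authorization, but not delete.
    GRANT INSERT, UPDATE ON department TO maria;

    -- Withdraw the update authorization later.
    REVOKE UPDATE ON department FROM maria;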

Chapter 2: Entity Relationship Model
The entity-relationship (E-R) data model was developed to facilitate database design by allowing specification
of an enterprise schema that represents the overall logical structure of a database. The E-R model is very useful
in mapping the meanings and interactions of real-world enterprises onto a conceptual schema. The E-R data
model employs three basic concepts: entity sets, relationship sets, and attributes.

1. The Entity
The entity is the basic building block of the E-R data model. The term entity is used with three different meanings, for three different concepts, which are:

a. Entity Type
The entity type can be defined as a name/label assigned to items/objects that exist in an environment and that
have similar properties. It could be a person, place, event, or even a concept; that is, an entity type can be defined for physical as well as non-physical things. An entity type is distinguishable from other entity types on the basis of its properties, and these same properties provide the basis for identifying an entity type. We can identify or associate certain properties with each of the things existing in that environment.

Generally, the entity types and their distinguishing properties are established by nature, by the very existence of the
things. For example, a cricket bat is a sports item, a computer is an electronic device, a shirt is a clothing item
etc. However, many times the grouping of things in an environment is dictated by the specific interest of the
organization or system that may supersede the natural classification of entity types. For example, in an
organization, entity types may be identified as donated items, purchased items, manufactured items; then the
items of varying nature may belong to these entity types, such as air conditioners, tables, frying pans, shoes, and cars.

The process of identifying entity types, their properties and relationships between them is called abstraction. The
abstraction process is also supported by the requirements gathered during the initial study phase. Once entity types are identified through the abstraction process, the items possessing the properties associated with a particular entity type are said to belong to that entity type, that is, to be instances of that entity type.

Weak Entity Types: (dependent ETs)


A weak entity type is one whose instances cannot exist without being linked with instances of some other entity type, i.e., they cannot exist independently. For example, suppose that in an organization we want to maintain data about the vehicles owned by the employees. A particular vehicle can exist in this organization only if its owner already exists there as an employee. Similarly, if the employee leaves the job and the organization decides to delete the record of that employee, then the record of the vehicle will also be deleted, since it cannot exist without being linked to an instance of employee.
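
In relational terms this dependency is commonly enforced with a foreign key that cascades deletions. A hedged sketch (table and column names assumed): deleting an employee also removes the vehicles owned by that employee:

    CREATE TABLE EMPLOYEE (
        empId   INTEGER PRIMARY KEY,
        empName VARCHAR(50)
    );

    CREATE TABLE VEHICLE (
        regNo   VARCHAR(12),
        model   VARCHAR(30),
        ownerId INTEGER NOT NULL,
        PRIMARY KEY (ownerId, regNo),
        -- A vehicle cannot exist without its owning employee; deleting the
        -- employee deletes the dependent vehicle rows as well.
        FOREIGN KEY (ownerId) REFERENCES EMPLOYEE (empId) ON DELETE CASCADE
    );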

Strong Entity Type: (Regular ETs)


An entity type whose instances can exist independently, that is, without being linked to the instances of any
other entity type, is called a strong entity type. A major property of strong entity types is that they have their
own identification, which is not always the case with weak entity types. For example, employee in the previous
example is an independent or strong entity type, since its instances can exist independently.

Naming Entity Types: Following are some recommendations for naming entity types. But they are just
recommendations, practices considered good in general. If one, some, or all of them are ignored in a design, the design will still be valid provided it otherwise satisfies the requirements, but good designs usually follow these
practices:

 Singular nouns are recommended, but plurals can also be used.
 Organization-specific names, like customer, client, or gahak, will all work.
 Writing names in capitals is generally followed, but lowercase will also work.
 Abbreviations can be used; be consistent and avoid confusing abbreviations.

Symbols for Entity Types: A rectangle is used to represent an entity type in the E-R data model. For strong entity types a single-lined rectangle is used, whereas a double-lined rectangle is drawn to represent a weak entity type.

b. Entity Instance
A particular object belonging to a particular entity type. How does an item become an instance of, or belong to, an entity type? By possessing the defining properties associated with that entity type. For example, the following table lists entity types and their defining properties:

Each entity instance possesses certain values against the properties of the entity type to which it belongs. For
example, in the above table we have identified that entity type EMPLOYEE has name, father name, registration
number, qualification, designation. Now an instance of this entity type will have values against each of these
properties, like (M. Sajjad, Abdul Rehman, EN-14289, BCS, and Programmer) may be one instance of entity
type EMPLOYEE. There could be many others.
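
In a relational implementation an entity instance corresponds to one row of a table. A hedged sketch (column types assumed) that records the EMPLOYEE instance mentioned above:

    CREATE TABLE EMPLOYEE (
        name        VARCHAR(50),
        fatherName  VARCHAR(50),
        regNo       VARCHAR(10) PRIMARY KEY,
        qual        VARCHAR(20),
        designation VARCHAR(30)
    );

    -- One entity instance of the entity type EMPLOYEE.
    INSERT INTO EMPLOYEE
    VALUES ('M. Sajjad', 'Abdul Rehman', 'EN-14289', 'BCS', 'Programmer');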

c. Entity Set
A group of entity instances of a particular entity type is called an entity set. For example, all employees of an
organization form an entity set. Likewise, all students, all courses, and so on each form entity sets of different entity types.

As has been mentioned before, the term entity is used for all three of the concepts above, and this is not wrong. Most of the time it refers to an entity type, next most often to an entity instance, and least often to an entity set. We will be precise most of the time; where we are not, you can judge the particular meaning from the context.

2. Attributes
Def 1: An attribute is any detail that serves to identify, qualify, classify, quantify, or otherwise express the state
of an entity occurrence or a relationship.

Def 2: Attributes are data objects that either identify or describe entities.

Identifying the entity type first and then assigning attributes, or the other way round: it is a "chicken or egg" problem. It works both ways, differently for different people. It is possible that we first identify an entity type and then describe it in real terms, that is, through its attributes, keeping in view the requirements of different user groups. Or it could be the other way round: we list the attributes included in different users' requirements and then group different attributes to establish entity types. Attributes are specific pieces of information which need to be known or held. An attribute is either required or optional. When it is required, we must have a value for it; a value must be known for each entity occurrence. When it is optional, we may have a value for it; a value may be known for each entity occurrence. Attributes may be of different types.

a. Simple or Composite Attributes


An attribute that is a single whole is a simple attribute. The value of a simple attribute is considered as a whole, not as comprising other attributes or components. For example, the attributes stName, stFatherName, and stDateOfBirth of an entity type STUDENT are examples of simple attributes. On the other hand, if an attribute consists of a collection of other simple or composite attributes, then it is called a composite attribute. For example, an stAdres attribute may comprise houseNo, streetNo, areaCode, city, etc. In this case stAdres will be a composite attribute.
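
When such a composite attribute is later mapped to a relation (as Chapter 4 discusses), one common choice is to store only its simple components as columns. A hedged sketch using the names from the example above:

    CREATE TABLE STUDENT (
        stName   VARCHAR(50),
        -- stAdres is composite, so its simple components become separate columns.
        houseNo  VARCHAR(10),
        streetNo VARCHAR(10),
        areaCode VARCHAR(10),
        city     VARCHAR(30)
    );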

b. Single valued or multi-valued Attributes


Some attributes have a single value at a time, whereas others may have multiple values. For example, the hobby attribute of STUDENT and the skills attribute of EMPLOYEE are multi-valued attributes, since a student may have multiple hobbies and an employee may have multiple skills. On the other hand, name, father name, and designation are generally single-valued attributes.

c. Stored or Derived Attributes


Normally attributes are stored attributes, that is, their values are stored and accessed as such from the database.
However, sometimes attributes’ values are not stored as such, rather they are computed or derived based on
some other value. This other value may be stored in the database or obtained some other way. For example, we
may store the name, father name, address of employees, but age can be computed from date of birth. The
advantage of declaring age as a derived attribute is that whenever we access the age, we get the accurate, current age of the employee, since it is computed right at the time it is accessed.
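
A hedged sketch of a derived age attribute (assuming PostgreSQL-style date functions; the table layout is illustrative): the date of birth is stored, while the age is computed each time it is queried, so it is always current:

    CREATE TABLE EMPLOYEE (
        name        VARCHAR(50),
        fatherName  VARCHAR(50),
        address     VARCHAR(100),
        dateOfBirth DATE
    );

    -- age is derived at query time; it is never stored.
    SELECT name,
           EXTRACT(YEAR FROM age(CURRENT_DATE, dateOfBirth)) AS age
    FROM   EMPLOYEE;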

How a particular attribute is stored or defined is decided first by the environment, and then it is the designer's decision; your decision. The organization or system will not object; in fact, it will not even know the form in which you have defined an attribute. You have to make sure that the system works properly and fulfills the requirements; beyond that, you do it as per your convenience and in an efficient way.

3. The Keys
Attributes act as differentiating agents among different entity types, that is, the differences between entity types
must be expressed in terms of attributes. An entity type can have many instances; each instance has got a certain
value against each attribute defined as part of that particular entity type. A key is a set of attributes that can be
used to identify or access a particular entity instance of an entity type (or out of an entity set). An entity type
may have many instances, from a few to several thousand and even more. Out of these many instances, when we want to pick a particular, single instance, and many times we do need to, the key is the solution.

For example, think of the whole population of Pakistan, with the data of all Pakistanis lying in one place, say with NADRA. If at some time we need to identify a particular person out of all this data, how can we do that? Can we use the name? Think of any name, like Mirza Zahir Iman Afroz; we may find many people with this name in Pakistan. Another option is the combination of name and father name; then again, for Amjad Malik s/o Mirza Zahir Iman Afroz there could be many such pairs. However, if you think of the National ID Card number, then no matter what the population of Pakistan is, you will always be able to pick out precisely a single person. That is the key. While defining an entity type we also generally define the key of that entity type. A key can be simple, that is, consisting of a single attribute, or it can be composite, consisting of two or more attributes. Following are the major types of keys:

a. Super key
A super key is a set of one or more attributes which, taken collectively, allow us to identify uniquely an entity instance in the entity set. This definition is the same as that of a key; it means that the super key is the most general type
of key.

For example, consider the entity type STUDENT with attributes registration number, name, father name,
address, phone, class, admission date. Which attribute can we use to uniquely identify any instance of the STUDENT entity type? Of course, none of name, father name, address, phone number, class, or admission date can be used for this purpose. Why? Suppose we consider name as the super key, and a situation arises where we need to contact the parents of a particular student. If we ask our registration department to give us the phone number of the student whose name is Ilyas Hussain, the registration department conducts a search and comes up with 10 different students named Ilyas Hussain; it could be any one of them. So the value of the name attribute cannot be used to pick a particular instance. The same happens with the other attributes.

However, if we use the registration number, then it is 100% sure that with a particular value of registration
number we will always find exactly a single, unique entity instance. Once you have identified the instance, you have all its attributes available: name, father name, everything. The entity type STUDENT and its attributes are shown graphically in figure 1, with its super key "regNo" underlined.

One specific characteristic of the super key is that, as per its definition, any combination of attributes with a super key is also a super key. In the example just discussed, where we have identified regNo as a super key,
now if we consider any combination of regNo with any other attribute of STUDENT entity type, the
combination will also be a super key. For example, “regNo, name”, “regNo, fName, address”, “name, fName,
regNo” and many others, all are super keys.

b. Candidate key
A super key of which no proper subset is itself a super key is called a candidate key; in other words, a minimal super key is a candidate key. It means that there are two conditions for a candidate key: first, it identifies the entity instances uniquely, as is required of a super key; second, it should be minimal, that is, no proper subset of the candidate key is itself a key. So if we have a simple super key, that is, one consisting of a single attribute, it is definitely a candidate key. However, if we have a composite super key such that removing any attribute from it leaves something that is no longer a super key, then that composite super key is also a candidate key, since it is a minimal super key.

For example, one of the super keys that we identified for the entity type STUDENT of figure 1 is "regNo, name". This super key is not a candidate key: although the combination does identify entity instances uniquely, if we remove the attribute name from it, the remaining regNo alone is still sufficient to identify the instances uniquely. So "regNo, name" has a proper subset (regNo) that can act as a super key, which violates the minimality condition. The composite key "regNo, name" is therefore a super key but not a candidate key. From here we can also establish the fact that every candidate key is a super key, but not the other way round.

c. Primary Key
A candidate key chosen by the database designer to act as the key is the primary key. An entity type may have more than one candidate key; in that case the database designer has to designate one of them as the primary key, since there is always only a single primary key in an entity type. If there is just one candidate key then obviously the same will be declared as the primary key. The primary key can also be defined as the successful candidate key. Figure 2 contains the entity type STUDENT of figure 1, but with an additional attribute nIdNumber.

In figure 2 we can identify two different attributes that can individually identify the entity instances of STUDENT: regNo and nIdNumber. Both are minimal super keys, so both are candidate keys. Now in this situation we have two candidate keys. The one that we choose will be declared the primary key; the other will be the alternate key. Either of the candidate keys can be selected as the primary key; it mainly depends on the
database designer which choice he/she makes. The relation that holds between super and candidate keys also
holds between candidate and primary keys, that is, every primary key (PK) is a candidate key and every
candidate key is a super key.

A certain value that may be associated with any attribute is NULL, which means "not given" or "not defined". A major characteristic of the PK is that it cannot have the NULL value. If the PK is composite, then none of the attributes included in the PK can have a NULL value; for example, if we are using "name, fName" as the PK of entity type STUDENT, then none of the instances may have a NULL value in either name or fName, or both.

Entity Integrity Constraint: It states that in a relation no attribute of a primary key (PK) can have a null value. If a PK consists of a single attribute, this constraint obviously applies to that attribute, so it cannot have the Null value. However, if a PK consists of multiple attributes, then none of the attributes of this PK can have the Null value in any of the instances.
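
Anticipating the SQL syntax covered in a later chapter, a minimal sketch of how the primary key (and with it the entity integrity constraint) could be declared for the STUDENT example follows; the data types are assumptions.

CREATE TABLE STUDENT (
    regNo   VARCHAR(10)  NOT NULL,
    name    VARCHAR(50),
    fName   VARCHAR(50),
    address VARCHAR(100),
    PRIMARY KEY (regNo)   -- entity integrity: regNo must be unique and can never be NULL
);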

d. Alternate Keys
Candidate keys which are not chosen as the primary key are known as alternate keys. For example, we have two candidate keys of STUDENT in figure 2, regNo and nIdNumber; if we select regNo as the PK, then nIdNumber will be the alternate key.

e. Secondary Key
Many times we need to access certain instances of an entity type using the value(s) of one or more attributes other than the PK. The difference between accessing instances through the value of a key attribute and through a non-key attribute is that a search on the value of the PK always returns a single instance (if it exists), whereas uniqueness is not guaranteed in the case of a non-key attribute. Such an attribute, through which we access the instances of an entity type and which does not necessarily return a unique instance, is called a secondary key.

For example, we want to see how many of our students belong to Multan; in that case we will access those instances of the STUDENT entity type that contain “Multan” in their address. Here address is called a secondary key, since we are accessing instances on the basis of its value and there is no compulsion that we will get a single instance. Keep one thing in mind here: a particular access on the value of a secondary key MAY return a single instance, but that is considered a matter of chance, or a consequence of that particular state of the entity set. It is not necessary for a secondary key to return a unique instance, whereas in the case of super, candidate, primary and alternate keys it is compulsory that they always return a unique instance against a particular value.

f. Foreign Key
A foreign key (FK) is an attribute whose values match the primary key values in the related table. For Example,
consider the following two tables EMP and DEPT:

EMP (empId, empName, qual, depId)

DEPT (depId, depName, numEmp)

In this example there are two relations; EMP holds the records of employees, whereas DEPT holds the records of the different departments of an organization. In EMP the primary key is empId, whereas in DEPT the primary key is depId. The depId, which is the primary key of DEPT, is also present in EMP, so there it is a foreign key.

The primary key is represented by underlining with a solid line, whereas foreign key is underlined by dashed or
dotted line.

Referential Integrity Constraint: This constraint is applied to foreign keys. Foreign key is an attribute or
attribute combination of a relation that is the primary key of another relation. This constraint states that if a
foreign key exists in a relation, either the foreign key value must match the primary key value of some tuple in
its home relation or the foreign key value must be completely null.
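
A minimal SQL sketch of the EMP and DEPT tables above, showing how the foreign key carries the referential integrity constraint; the data types are assumptions.

CREATE TABLE DEPT (
    depId   CHAR(4)     NOT NULL,
    depName VARCHAR(40),
    numEmp  INT,
    PRIMARY KEY (depId)
);

CREATE TABLE EMP (
    empId   CHAR(6)     NOT NULL,
    empName VARCHAR(40),
    qual    VARCHAR(40),
    depId   CHAR(4),    -- foreign key; may be NULL if the employee has no department yet
    PRIMARY KEY (empId),
    FOREIGN KEY (depId) REFERENCES DEPT (depId)  -- value must match some DEPT row or be NULL
);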
Other constraints are: there can be zero, one or multiple foreign keys in a table, depending on how many tables a particular table is related with. For example, in the above example the EMP table is related with the DEPT table, so there is one foreign key, depId, whereas the DEPT table does not contain any foreign key. Similarly, the EMP table may also be linked with a DESIG table storing designations; in that case EMP will have another foreign key, and so on.

The foreign key attribute and the attribute present in the other relation as the primary key can have different names, but both must have the same domain. In the DEPT/EMP example both the PK and FK have the same name; they could have been different and it would not have made any difference, however they must have the same domain.

4. Relationships
After two or more entities are identified and defined with attributes, the participants determine if a relationship
exists between the entities. A relationship is any association, linkage, or connection between the entities of
interest to the business; it is a two directional, significant association between two entities, or between an entity
and itself. Each relationship has a name, an optionality (optional or mandatory), and a degree (how many). A
relationship is described in real terms. Assigning a name, optionality, and a degree to a relationship helps
confirm the validity of that relationship. Relationship represents an association between two or more entities. An
example of a relationship would be:

 Employees are assigned to projects
 Projects have subtasks
 Departments manage one or more projects

Relationships are the connections and interactions between entity instances, e.g. DEPT_EMP associates Department and Employee. A relationship type is an abstraction of a relationship, i.e. a set of relationship instances sharing common attributes. Entities enrolled in a relationship are called its participants. The participation of an entity in a relationship is total when all entities of that set must participate in the relationship; otherwise it is partial. For example, if every Part is supplied by a Supplier then the SUPP_PART relationship is total; if certain parts are available without a supplier, then it is partial.

Naming Relationships: If there is no proper name for the association in the system, then the participants’ names or abbreviations are used. STUDENT and CLASS have the ENROLL relationship; however, it can also be named STD_CLS.

Roles: The entity sets of a relationship need not be distinct. For example, the labels “manager” and “worker” are called “roles”; they specify how employee entities interact via the “works-for” relationship set. Roles are indicated in ER diagrams by labeling the lines that connect diamonds to rectangles. Roles are optional; they clarify the semantics of a relationship.

Symbol for Relationships:

 Shown as a diamond; the diamond is doubled if one of the participants is dependent on the other.
 Participants are connected by continuous lines, labeled to indicate cardinality.
 In partial relationships, roles (if identifiable) are written on the line connecting the partially participating entity rectangle to the relationship diamond.
 Total participation is indicated by double lines.

a. Role of the entity in Relationship


The way an entity is involved in a relationship is called the role of the entity in the relationship. These details provide more semantics of the database. The role is generally clear from the relationship, but in some cases it is necessary to mention the role explicitly. There are two situations in which the role must be mentioned explicitly:

Recursive Relationship:
A recursive relationship is one in which a relationship can exist between occurrences of the same entity set. This is the situation when an instance of an entity is associated with another instance of the same entity type. Such a link initiates from one entity and terminates on the same entity.

The figure above shows a recursive relationship which tells that, in the faculty of a certain institute, one faculty member from among the same faculty can be the head of the faculty. The roles mentioned on the relationship tell that many Faculty instances are headed by one entity instance from the same Faculty relation.

Multiple Relationships:
This is the second situation in which the role needs to be mentioned on the relationship link: when there is more than one relationship between the same entities.

As an example we can have a relationship of Faculty members and students as one faculty member may teach a
number of students and at the same time one student may have been taught by a number of faculty members.
This is one side of the picture. Now on the other side we can say that a faculty member may be supervising a
number of students for their final projects. It shows two types of associations between the faculty and the
students. So in this type of situation it is necessary to mention the role of the entities involved in the
relationship.

b. Cardinality:
The term connectivity is used to describe the relationship classification. Cardinality expresses the minimum and maximum number of entity occurrences associated with one occurrence of the related entity. In the ERD, cardinality is indicated by placing the appropriate numbers beside the entities, using the format (x,y). The minimum cardinality is the counterpart of the maximum cardinality: it shows the least number of instances of one entity that must be associated with an instance of the related entity. In simple words, the minimum cardinality tells whether the link between two relations is optional or compulsory. It is very important to determine the minimum cardinality when designing a database, because it defines the way the database system will be implemented.

c. Dependencies
A dependency is a type of constraint. For example, once we define the cardinality of a relationship between two entities, it acts as a constraint or check that the cardinality must be respected while populating data in the relations. Similarly, a dependency is a constraint. There are a number of dependency types, which are described below:

Existence dependency:
This is the type of dependency in which one entity instance needs an instance of another entity for its existence. As we saw earlier in the case of the employees of an organization and the projects associated with them, the employees are dependent on the projects: if no project is assigned to an employee, it cannot exist. In other words, at any given time an employee must be working on at least one project.

Identifier Dependency:
It means that the dependent entity in the case of existence dependency does not have its own identifier, and an external identifier is used to pick data for that entity. To define a key for this entity, the key of the parent entity is used as part of its key, typically forming a composite key.

Referential Dependency:
This is the situation when the dependent entity has its own key for unique identification, but the reference to the parent entity is shown with the help of an attribute of the parent entity. That is, to show the link with the parent entity there is an attribute in this entity, and a record in this entity cannot exist without a corresponding record in the parent entity, despite having its own identifier attribute. This type of identifying attribute in the weak entity is known as a foreign key.

d. Relationship Degree
A relationship degree indicates the number of entities or participants associated with a relationship. A unary
relationship exists when an association is maintained within a single entity. A binary relationship exists when
two entities are associated. A ternary relationship exists when three entities are associated. Although higher
degrees exist, they are rare and are not specifically named.

Unary Relationships
In the case of the unary relationship shown in Figure, an employee within the EMPLOYEE entity is the manager
for one or more employees within that entity. In this case, the existence of the “manages” relationship means
that EMPLOYEE requires another EMPLOYEE to be the manager—that is, EMPLOYEE has a relationship
with itself. Such a relationship is known as a recursive relationship.

Binary Relationships
A binary relationship exists when two entities are associated in a relationship. Binary relationships are most
common. In fact, to simplify the conceptual design, whenever possible, most higher-order (ternary and higher)
relationships are decomposed into appropriate equivalent binary relationships. In Figure 4.15, the relationship “a
PROFESSOR teaches one or more CLASSes” represents a binary relationship.

Ternary and Higher-Degree Relationships


Although most relationships are binary, the use of ternary and higher-order relationships does allow the designer
some latitude regarding the semantics of a problem. A ternary relationship implies an association among three
different entities. For example, note the relationships (and their consequences) in Figure, which are represented
by the following business rules:

 A DOCTOR writes one or more PRESCRIPTIONs.
 A PATIENT may receive one or more PRESCRIPTIONs.
 A DRUG may appear in one or more PRESCRIPTIONs. (To simplify this example, assume that the business rule states that each prescription contains only one drug. In short, if a doctor prescribes more than one drug, a separate prescription must be written for each drug.)

Chapter 3: Relational Data Model
The RDM is popular due to its two major strengths: simplicity and a strong mathematical foundation. The RDM is simple because there is just one structure, the relation or table, and even this single structure is very easy to understand, so a user of even moderate ability can understand it easily. Secondly, it has a strong mathematical foundation that gives many advantages, like:

 Anything included/defined in RDM has got a precise meaning since it is based on mathematics, so there
is no confusion.
 If we want to test something regarding RDM we can test it mathematically, if it works mathematically it
will work with RDM (apart from some exceptions).
 The mathematics not only provided the RDM the structure (relation) but also well-defined manipulation
languages (relational algebra and relational calculus).
 It provided RDM certain boundaries, so any modification or addition we want to make in RDM, we
have to see if it complies with the relational mathematics or not. We cannot afford to cross these
boundaries since we will be losing the huge advantages provided by the mathematical backup.

“An IBM scientist E.F. Codd proposed the relational data model in 1970. At that time most database systems
were based on one of two older data models (the hierarchical model and the network model); the relational
model revolutionized the database field and largely replaced these earlier models. Prototype relational database
management systems were developed in pioneering research projects at IBM and UC-Berkeley by the mid-70s,
and several vendors were offering relational database products shortly thereafter.”

The RDM is mainly used for designing/defining external and conceptual schemas; however to some extent
physical schema is also specified in it. Separation of conceptual and physical levels makes data and schema
manipulation much easier, contrary to previous data models. So the relational data model also truly supports
“Three Level Schema Architecture”.

Introduction to the Relational Data model: The RDM is based on a single structure and that is a relation.
Speaking in terms of the E-R data model, both the entity types and relationships are represented using relations
in RDM. The relation in RDM is similar to the mathematical relation however database relation is also
represented in a two dimensional structure called table. A table consists of rows and columns. Rows of a table
are also called tuples. A row or tuple of a table represents a record or an entity instance, whereas the columns of
the table represent the properties or attributes.

1. Properties of database relations


There are six basic properties of the database relations which are:

 Each cell of a table contains atomic/single value

A cell is the intersection of a row and a column, so it represents a value of an attribute in a particular row. The
property means that the value stored in a single cell is considered as a single value. In real life we see many
situations when a property/attribute of any entity contains multiple values, like, degrees that a person has, the
cars owned by a person, the jobs of an employee. All these attributes have multiple values; these values cannot
be placed as the value of a single attribute or in a cell of the table. It does not mean that the RDM cannot handle
such situations, however, there are some special means that we have to adopt in these situations, and they cannot
be placed as the value of an attribute because an attribute can contain only a single value.

 Each column has a distinct name; the name of the attribute it represents

Each column has a heading that is basically the name of the attribute that the column represents. It has to be
unique, that is, a table cannot have duplicated column/attribute names.

 The values of the attributes come from the same domain

Each attribute is assigned a domain along with the name when it is defined. The domain represents the set of
possible values that an attribute can have. Once the domain has been assigned to an attribute, then all the rows
that are added into the table will have the values from the same domain for that particular column. For example,
the attribute doB (date of birth) is assigned the domain “Date”, now all the rows have the date value.

 The order of the columns is immaterial: If the order of the columns in a table is changed, the table
still remains the same. Order of the columns does not matter.
 The order of the rows is immaterial: As with the columns, if rows’ order is changed the table remains
the same.
 Each row/tuple/record is distinct, no two rows can be same: Two rows of a table cannot be same.
The value of even a single attribute has to be different that makes the entire row distinct.

There are three components of the RDM: the construct (the relation), the manipulation language (SQL) and the integrity constraints (entity integrity and referential integrity).

2. Significance of Constraints:
By definition a PK is a minimal identifier that is used to identify tuples uniquely. If we were to allow a null value for any part of the primary key, we would be demonstrating that not all of the attributes are needed to distinguish between tuples, which would contradict the definition. The referential integrity constraint plays a vital role in maintaining the correctness, validity or integrity of the database, which means that we have to ensure its proper enforcement to keep the database consistent and correct. How? In the DEPT/EMP example above, depId in EMP is the foreign key; it is used as the link between the two tables. In every instance of the EMP table the attribute depId will have a value, and this value is used to get the name and other details of the department in which a particular employee works. If the value of depId in EMP is Null in a row or tuple, it means that this particular row is not related to any instance of DEPT; in real-world terms, this particular employee has not yet been assigned a department, or his/her department has not been specified. These are the two possible situations reflected by a legal value or a Null value of the foreign key attribute. Now consider the situation when the referential integrity constraint is violated, that is, EMP.depId contains a value that does not match any value of DEPT.depId. In this situation, if we want to know the department of an employee, we find that there is no department with this Id; that means an employee has been assigned a department that does not exist in the organization, an illegal department.

a. Null Constraints
A Null value of an attribute means that the value of the attribute is not yet given or not yet defined; it can be assigned or defined later. Through the Null constraint we can control whether an attribute can have a Null value or not. This is important, and we have to make careful use of this constraint. The constraint is included in the definition of a table (or, more precisely, of an attribute). By default a non-key attribute can have a Null value; however, if we declare an attribute as Not Null, then that attribute must be assigned a value while entering a record/tuple into the table containing it. It is generally an organizational constraint. For example, in a bank a potential customer has to fill in a form that may comprise many entries, some of which are necessary to fill in, like the residential address or the national Id card number, while others may be optional, like the fax number. When defining a database system for such a bank, if we create a CLIENT table then we will declare the compulsory attributes as Not Null, so that a record cannot be successfully entered into the table unless at least those attributes are specified.

b. Default Value
This constraint means that if we do not give a value to a particular attribute, it will be given a certain (default) value. It is generally used for efficiency in the data entry process. Sometimes an attribute has a value that applies in most cases. For example, while entering data for students, one attribute holds the current semester of the student. The value of this attribute changes as the student passes through different exams or semesters during his or her degree. However, when a student is registered for the first time, he or she is generally registered in the first semester, so in new records the value of the current-semester attribute is generally 1. Rather than expecting the person entering the data to type 1 in every record, we can place a default value of 1 on this attribute. The person can then simply skip the attribute, and it will automatically assume its default value.
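
The Not Null and Default constraints of the two examples above can be sketched in SQL roughly as follows; the table and column names (CLIENT, resAddress, faxNo, currSemester) and the data types are illustrative assumptions.

CREATE TABLE CLIENT (
    clientId   CHAR(8)      NOT NULL,
    nIdNumber  CHAR(13)     NOT NULL,   -- must be filled in when the record is entered
    resAddress VARCHAR(100) NOT NULL,   -- must be filled in when the record is entered
    faxNo      VARCHAR(15),             -- optional, may be left NULL
    PRIMARY KEY (clientId)
);

-- assuming a STUDENT table like the Chapter 2 sketch already exists,
-- a default value for the current-semester attribute can be added to it
ALTER TABLE STUDENT ADD currSemester INT DEFAULT 1;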

c. Domain Constraint
This is an essential constraint that applies to every attribute, that is, every attribute has a domain. Domain means the set of possible values that an attribute can have. For example, some attributes have numeric values, like salary, age and marks. Some attributes hold text or character values, like name and address. Yet others have date values, like date of birth or joining date. The domain specification limits the nature of values that an attribute can have. A domain is specified by associating a data type with an attribute while defining it; the exact data type name or specification depends on the particular tool being used. The domain helps to maintain the integrity of the data by allowing only legal types of values for an attribute. For example, if the age attribute has been assigned a numeric data type, then it will not be possible to assign a text or date value to it. As a database designer, it is your job to assign an appropriate data type to each attribute. Another perspective that needs to be considered is that the values assigned to attributes should be stored efficiently, that is, the domain should not allocate unnecessarily large space for the attribute.

For example, age has to be numeric, but different tools support different numeric data types that permit different ranges of values and require different amounts of storage space. Frequently supported numeric types include Byte, Integer and Long Integer, which support different ranges of values and take different numbers of bytes to store. Now, if we declare the age attribute as Long Integer it will certainly serve the purpose, but it allocates unnecessarily large space; a Byte type would have been sufficient, since you won’t find students or employees with an age beyond its upper limit of 255. We can further restrict the domain of an attribute by applying a check constraint on it. For example, even with the age attribute assigned the type Byte, a person may by mistake enter the age of a student as 200; if this is in years it is not a realistic age, yet it is legal from the domain-constraint perspective. So we can narrow the range supported by the domain by applying a check constraint, limiting it to, say, 30 or 40, whatever the rule of the organization is. The domain should, however, be a bit larger than what is required today. In short, domain is also a very useful constraint.
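
A check constraint narrowing the domain of an age attribute, as discussed above, could be sketched as follows; the table name PERSON and the upper limit of 40 are assumptions taken from the discussion.

CREATE TABLE PERSON (
    personId CHAR(6) NOT NULL,
    age      SMALLINT CHECK (age > 0 AND age <= 40),  -- narrower than what the numeric type alone allows
    PRIMARY KEY (personId)
);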

Chapter 4: ER to Relational mapping
Logical database design is obtained from the conceptual database design. We have seen that we initially studied the whole system through different means; then we identified the different entities, their attributes and the relationships between them; then, with the help of the E-R data model and the tools available in it, we produced an E-R diagram. This is our conceptual database design. Then, since we are using the relational data model, we come to the phase of designing the logical database with the relational data model.

1. Mapping Entity Types


Following are the rules for mapping entity types: each regular entity type (ET) is transformed straightaway into a relation. Each entity type that was identified is simply converted into a relation, and the relation keeps the same name as the entity type. The PK of the entity type is declared as the PK of the relation and is underlined. The simple attributes of the ET are included in the relation.

a. Composite Attributes
These are attributes which are a combination of two or more attributes. For example, address can be a composite attribute, as it can have a house number, street number, city code and country; similarly, name can be a combination of first and last names. In the relational data model composite attributes are treated differently: since tables can contain only atomic values, composite attributes need to be represented as a separate relation.

b. Multi-valued Attributes
These are the attributes that can have more than one value for a single entity instance. For example, a student can have more than one hobby, like riding, reading, listening to music, etc. Such attributes are treated differently in the relational data model. Following are the rules for multi-valued attributes (a SQL sketch follows the list):

 An entity type with a multi-valued attribute is transformed into two relations.
 One contains the entity type and the other simple attributes, whereas the second one holds the multi-valued attribute. In this way only a single atomic value is stored against every attribute.
 The primary key of the second relation is the primary key of the first relation plus the attribute value itself, so in the second relation the primary key is the combination of two attributes.
 All values are accessed through the reference of the primary key, which also serves as a foreign key.
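
A rough SQL sketch of this two-relation mapping for the student/hobby example; all table and column names and data types are assumptions.

CREATE TABLE STUDENT (
    stId   CHAR(6)     NOT NULL,
    stName VARCHAR(40),
    PRIMARY KEY (stId)
);

CREATE TABLE STUDENT_HOBBY (
    stId  CHAR(6)     NOT NULL,
    hobby VARCHAR(30) NOT NULL,
    PRIMARY KEY (stId, hobby),                    -- PK of the first relation plus the attribute value itself
    FOREIGN KEY (stId) REFERENCES STUDENT (stId)  -- the reference that also serves as the foreign key
);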

2. Mapping Relationships
Before establishing any relationship between relations, it is necessary to study the cardinality and degree of the relationship. A relation is a structure obtained by converting an entity type of the E-R model into a relation, whereas a relationship is between two relations of the relational data model. Relationships in the relational data model are mapped according to their degree and cardinalities.

a. Binary Relationships
These are relationships established between two entity types. Following are the types of cardinalities:

One to Many:
In this type of cardinality one instance of the first entity type is mapped with many instances of the second entity type, and inversely one instance of the second entity type is mapped with one instance of the first entity type. The participating entity types are transformed into relations as has already been discussed. The relationship in this case is implemented by placing the PK of the entity type (or corresponding relation) on the one side of the relationship into the entity type on the many side of the relationship as a FK. By declaring the PK-FK link between the two relations, the referential integrity constraint is implemented automatically, which means that the value of the FK is either null or matches a value in its home relation. For example, consider the binary relationship given in figure 1 involving two entity types PROJECT and EMPLOYEE. There is a one to many relationship between these two: on any one project many employees can work, and one employee can work on only one project.

PROJECT (prId, prDura, prCost) EMPLOYEE (empId, empName, empSal, prId)

Minimum Cardinality: This is a very important point, as the minimum cardinality on the one side needs special attention. In the previous example an employee cannot exist if no project is assigned, so in that case the minimum cardinality has to be one. On the other hand, if an instance of EMPLOYEE can exist without being linked with an instance of PROJECT, then the minimum cardinality has to be zero. If the minimum cardinality is zero, then the FK is defined as normal and it can have the Null value; if it is one, then we have to declare the FK attribute(s) as Not Null. The Not Null constraint makes it compulsory to enter a value in the attribute(s), whereas the FK constraint enforces that the value is a legal one. So you have to check the minimum cardinality while implementing a one to many relationship.
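
A sketch of the PROJECT/EMPLOYEE mapping above in SQL; the data types are assumptions, and the NOT NULL on the foreign key applies only when the minimum cardinality on the PROJECT side is one.

CREATE TABLE PROJECT (
    prId   CHAR(5) NOT NULL,
    prDura INT,
    prCost DECIMAL(10,2),
    PRIMARY KEY (prId)
);

CREATE TABLE EMPLOYEE (
    empId   CHAR(6) NOT NULL,
    empName VARCHAR(40),
    empSal  DECIMAL(8,2),
    prId    CHAR(5) NOT NULL,   -- drop NOT NULL if the minimum cardinality is zero
    PRIMARY KEY (empId),
    FOREIGN KEY (prId) REFERENCES PROJECT (prId)
);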

Many to Many Relationship:


In this type of relationship one instance of the first entity type can be mapped with many instances of the second entity type, and similarly one instance of the second entity type can be mapped with many instances of the first. In a many to many relationship a third table is created for the relationship, which is also called an associative entity type. Generally, the primary keys of the participating entity types together form the PK of the third table. For example, there are two entity types BOOK and STD (student). Many students can borrow a book and similarly many books can be issued to a student, so there is a many to many relationship. There is therefore a third relation whose primary key is formed by combining the primary keys of BOOK and STD; we have named it TRANS (transaction). Following are the attributes of these relations:

STD (stId, sName, sFname) BOOK (bkId, bkTitle, bkAuth) TRANS (stId,bkId, isDate,rtDate)
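
The three relations above could be declared roughly as follows; the data types are assumptions.

CREATE TABLE STD (
    stId   CHAR(6) NOT NULL,
    sName  VARCHAR(40),
    sFname VARCHAR(40),
    PRIMARY KEY (stId)
);

CREATE TABLE BOOK (
    bkId    CHAR(6) NOT NULL,
    bkTitle VARCHAR(80),
    bkAuth  VARCHAR(40),
    PRIMARY KEY (bkId)
);

CREATE TABLE TRANS (
    stId   CHAR(6) NOT NULL,
    bkId   CHAR(6) NOT NULL,
    isDate DATE,
    rtDate DATE,
    PRIMARY KEY (stId, bkId),   -- combination of the two participating primary keys
    FOREIGN KEY (stId) REFERENCES STD (stId),
    FOREIGN KEY (bkId) REFERENCES BOOK (bkId)
);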

One to One Relationship:
This is a special form of the one to many relationship, in which one instance of the first entity type is mapped with one instance of the second entity type and the other way round. In this relationship the PK of one entity type has to be included in the other as a foreign key. Normally the primary key of the side whose participation is optional is included as the foreign key in the side whose participation is compulsory, as in the example below. For example, there are two entities STD and STAPPLE (student application for scholarship). The relationship from STD to STAPPLE is optional, whereas from STAPPLE to STD it is compulsory. That means every instance of STAPPLE must be related with one instance of STD, whereas it is not a must for an instance of STD to be related to an instance of STAPPLE; however, if it is related, then it is related to exactly one instance of STAPPLE, that is, one student can submit just one scholarship application. This relationship is shown in the figure:

STD (stId, stName) STAPPLE (scId, scAmount, stId)
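
A sketch of the STD/STAPPLE schemas above; one common way to enforce the one-to-one restriction, not spelled out in the text, is a UNIQUE constraint on the foreign key. Data types are assumptions.

CREATE TABLE STD (
    stId   CHAR(6) NOT NULL,
    stName VARCHAR(40),
    PRIMARY KEY (stId)
);

CREATE TABLE STAPPLE (
    scId     CHAR(6)      NOT NULL,
    scAmount DECIMAL(9,2),
    stId     CHAR(6)      NOT NULL UNIQUE,  -- each student can have at most one application
    PRIMARY KEY (scId),
    FOREIGN KEY (stId) REFERENCES STD (stId)
);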

b. Unary Relationship
These are relationships that involve a single entity type; they are also called recursive relationships. Unary relationships may have one to one, one to many or many to many cardinalities. In unary one to one and one to many relationships, the PK of the entity type is used as a foreign key in the same relation, obviously with a different name, since the same attribute name cannot be used twice in one table. The example is given in figure 3:

In a many to many unary relationship another relation is created with a composite key. For example, an entity type PART may have a many to many recursive relationship, meaning one part consists of many parts and one part may be used in many parts. The treatment of such a relationship is shown in figure 4 below:
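
A sketch of the two unary cases in SQL; since the referenced figures are not reproduced here, all table and column names (FACULTY, headedBy, PART_STRUCTURE and so on) are assumptions.

CREATE TABLE FACULTY (
    facId    CHAR(6) NOT NULL,
    facName  VARCHAR(40),
    headedBy CHAR(6),                                 -- same domain as facId, but a different name
    PRIMARY KEY (facId),
    FOREIGN KEY (headedBy) REFERENCES FACULTY (facId)
);

CREATE TABLE PART (
    partId   CHAR(8) NOT NULL,
    partName VARCHAR(40),
    PRIMARY KEY (partId)
);

CREATE TABLE PART_STRUCTURE (
    parentPartId CHAR(8) NOT NULL,
    childPartId  CHAR(8) NOT NULL,
    PRIMARY KEY (parentPartId, childPartId),           -- composite key for the many to many case
    FOREIGN KEY (parentPartId) REFERENCES PART (partId),
    FOREIGN KEY (childPartId)  REFERENCES PART (partId)
);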

c. Super / Subtype Relationship:


Separate relations are created for the super type and for each subtype; if there is one super type and three subtypes, then four relations are created. After creating these relations, attributes are assigned: common attributes go to the super type and specialized attributes go to the concerned subtypes. The PK of the super type is included in all relations, where it serves both as a link and for identity. To link the super type with the concerned subtype a descriptive attribute is required, which is called the discriminator; it is used to identify which subtype is to be linked. For example, there is an entity type EMP which is a super type, and there are three subtypes: salaried, hourly and consultants. A discriminator is therefore required that can identify which subtype is to be consulted; for instance, a special character can be added with empId to identify the concerned subtype.
Chapter 5: Relational Algebra
The relational algebra is a procedural query language. It consists of a set of operations that take one or two
relations as input and produce a new relation as their result. There are five basic operations of relational algebra.
They are broadly divided into two categories:

1. Unary Operations
These are the operations that involve only one relation or table: Select and Project.

a. The Select Operation:


The select operation is performed to select certain rows or tuples of a table, so it performs its action on the table
horizontally. This command works on a single table and takes rows that meet a specified condition. Lower
Greek letter sigma (σ) is used to denote the selection.

All the attributes of the resulting table are the same, which means that the degrees of the new and old tables are the same; only the rows/tuples that satisfy the given condition are picked up, so degree(σc(R)) = degree(R).

The select operation is commutative: σc1(σc2(R)) = σc2(σc1(R))

For example, consider a STUDENT table. Applying a selection with the condition that the semester is greater than three, σ semester > 3 (STUDENT), produces a resulting relation that contains the records of only those students whose semester is greater than three.

In selection operation the comparison operators like: =, <=, >=, <> can be used in the predicate. Similarly, we
can also combine several simple predicates into a larger predicate using the connectives and (∧) and or (∨).
Some other examples on the STUDENT table are: σstId = ‘S1015’ (STUDENT) and σ prName <> ‘MCS’ (STUDENT).
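
Anticipating the SQL of Chapter 7, these selections correspond roughly to the following queries; the column name semester is an assumption based on the example above.

SELECT * FROM STUDENT WHERE semester > 3;
SELECT * FROM STUDENT WHERE stId = 'S1015';
SELECT * FROM STUDENT WHERE prName <> 'MCS';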

b. The Project Operator


The PROJECT operation is used to select a certain set of attributes from a relation, that is, it produces a vertical subset of the table, extracting the values of the specified columns, eliminating duplicates, and placing the values in a new table. It is a unary operation that returns its argument relation with certain attributes left out. Projection is denoted by the Greek letter pi (∏). For example, consider a relation FACULTY.

Some other examples are: ∏ Fname, Rank (Faculty) and ∏ Facid, Salary,Rank (Faculty)

Composition of Relational Operators:

The relational operators like select and project can also be used in nested form, iteratively; since the result of an operation is a relation, this result can be used as the input for another operation. For example, if we want the names and departments of the faculty members who are assistant professors, we perform both the select and project operations on the FACULTY table. First the selection operator is applied to select the assistant professors; the operation outputs a relation that is given as input to the projection operation for the required attributes.

∏ facName, dept (σ rank=’Asst Prof’ (FACULTY))

If we change the sequence of operations and bring the projection first, then the relation provided as input to the select operation will not have the rank attribute, so the selection operator cannot be applied and there would be an error. So although the sequence can be changed, the attributes required for the selection or the projection must still be present.
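
In SQL the composed expression above corresponds roughly to the following query, where the projection becomes the SELECT list and the selection becomes the WHERE clause.

SELECT facName, dept
FROM   FACULTY
WHERE  rank = 'Asst Prof';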

2. Binary Operations:
The binary operations are also called set operations. These are the operations that involve pairs of relations and are therefore called binary operations. The input for these operations is two relations, and they produce a new relation without changing the original relations. These operations are:

a. The Union Operation:


It is denoted by U. The first requirement for the union operator is that both the relations should be union compatible, which means the relations must meet the following two conditions:

 Both the relations should be of same degree, which means that the number of attributes in both relations
should be exactly same
 The domains of corresponding attributes in both the relations should be same.

If R and S are two union-compatible relations, then their union is the set of tuples that are in R or in S or in both, with no duplicate tuples. The union operator is commutative: R U S = S U R.

For example, suppose COURSE1 and COURSE2 store the courses being offered at two different campuses of an institute. To know which courses are being offered at the two campuses taken together, we take the union of the two tables:

b. The Intersection Operation


The intersection operation also has the requirement that both the relations should be union compatible, which means they are of the same degree and have the same domains. It is represented by ∩. If R and S are two relations and we take the intersection of these two relations, then the resulting relation is the set of tuples that are common to both R and S. Just like union, intersection is also commutative. R ∩ S = S ∩ R
For Example, COURSE1 ∩ COURSE2

The union and intersection operators are used less as compared to selection and projection operators.

c. Difference Operator:
If R and S are two relations which are union compatible, then the difference of these two relations is the set of tuples that appear in R but do not appear in S. It is denoted by the minus sign (−).
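
In SQL the three set operations over the union-compatible COURSE1 and COURSE2 tables would be written roughly as follows (EXCEPT is called MINUS in some products, and not every product supports INTERSECT and EXCEPT).

SELECT * FROM COURSE1
UNION
SELECT * FROM COURSE2;     -- courses offered at either campus, duplicates removed

SELECT * FROM COURSE1
INTERSECT
SELECT * FROM COURSE2;     -- courses common to both campuses

SELECT * FROM COURSE1
EXCEPT
SELECT * FROM COURSE2;     -- courses offered at the first campus but not the second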

d. Cartesian product:
It is also known as CROSS JOIN or CROSS PRODUCT. The relations need not be union compatible. It is denoted by X. If R has m columns and C rows and S has n columns and D rows, then the Cartesian product R X S has (m+n) columns and (C*D) rows.

The resulting relation contains all the attributes of R and all of S. Moreover, every row of R is merged with every row of S. The Cartesian product is also commutative and associative.

The Cartesian product creates tuples with the combined attributes of the two relations, so the operation applied by itself is generally meaningless. It is useful when it is followed by a SELECT operation that selects only related tuples from the two relations according to the selection condition.

3. Join Operation:
The CARTESIAN product followed by SELECT is commonly used to select related tuples from two relations.
Hence a special operation called JOIN was created to specify this sequence as a single operation.

a. Theta Join
In a theta join we apply a condition on the input relation(s), and only the selected rows take part in the cross product and are included in the output. In a normal cross product all the rows of one relation are mapped/merged with all the rows of the second relation, but here only the selected rows of a relation are cross-producted with the second relation. It is denoted as R ⋈θ S, where theta (θ) stands for one of the comparison operators.

Example: there are two relations FACULTY and COURSE. First a select operation is applied on the FACULTY relation to select certain specific rows; then these rows are cross-producted with COURSE.

b. EquiJoin
This is the most frequently used type of join. In an equijoin, relations are joined on the basis of a common attribute, typically the primary key of one relation that appears as a foreign key in the other. The common attribute appears twice in the output, but only for those rows that satisfy the equality condition. If the primary and foreign keys of the two relations have the same name, then in the output relation the relation name precedes the attribute name. An equijoin is a theta join R ⋈θ S in which θ is the equality operator.

c. Natural Join:
This is the most common and general form of join. It is the same as the equijoin, except that in a natural join the common attribute appears only once. It does not matter which copy of the common attribute becomes part of the output relation, as the values in both are the same. It is denoted R ⋈ S.

Semi Join: In a semi join, we first take the natural join of the two relations and then project the attributes of the first relation only. So, after joining and matching on the common attribute of both relations, only the attributes of the first relation are projected. For example:

d. Outer Join
In a (full) outer join all the tuples of both the left and the right relation are part of the output. Tuples of the left relation that have no match in the right relation get Null values against the attributes of the right relation, and vice versa. It is denoted R ⟗ S.

Left Outer Join: In a left outer join all the tuples of the left relation remain part of the output. The tuples that have a matching tuple in the second relation are combined with the corresponding tuple from the second relation; however, the tuples of the left relation that do not have a matching record in the right relation have Null values against the attributes of the right relation. The left outer join is thus the equijoin plus the non-matching rows of the left-side relation, with Null against the attributes of the right-side relation.

Right Outer Join: In a right outer join all the tuples of the right relation remain part of the output relation, whereas the tuples of the right relation that do not match any tuple of the left relation have Null values against the attributes of the left relation. In other words, a right outer join always contains all the tuples of the right relation, and the non-matching ones are padded with Null on the left relation's attributes.
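
Anticipating the SQL of the next chapter, the left and right outer joins of the FACULTY and COURSE examples could be sketched as follows; the join column facId is an assumption.

SELECT *
FROM FACULTY LEFT OUTER JOIN COURSE ON FACULTY.facId = COURSE.facId;   -- keeps every FACULTY row

SELECT *
FROM FACULTY RIGHT OUTER JOIN COURSE ON FACULTY.facId = COURSE.facId;  -- keeps every COURSE row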

Chapter 6: Relational Calculus
Relational Calculus is a nonprocedural formal relational data manipulation language in which the user simply
specifies what data should be retrieved, but not how to retrieve it. It is an alternative standard for relational data
manipulation languages. The relational calculus is not related to the familiar differential and integral calculus in
mathematics, but takes its name from a branch of symbolic logic called the predicate calculus. It has the following two forms:

1. Tuple Oriented Relational Calculus:


In tuple oriented relational calculus we are interested primarily in finding relation tuples for which a predicate is
true. To do so we need tuple variables. A tuple variable is a variable that takes on only the tuples of some
relation or relations as its range of values. It actually corresponds to a mathematical domain. We specify the
range of a tuple variable by a statement such as: - RANGE OF S IS STUDENT

Here, S is the tuple variable and STUDENT is the range, so that S always represents a tuple of STUDENT. A query is expressed as {S | P(S)}, read as: find the set of all tuples S such that P(S) is true, where P is the predicate condition. For example, if the range of R is STUDENT, then {R | R.Credits > 50} finds the stuId, stuName, major, etc. of all students having more than 50 credits.

2. Domain Oriented Relational Calculus:


a. Functional Dependency
A functional dependency is a type of relationship between attributes. If A and B are attributes or sets of
attributes of relation R, we say that B is functionally dependent on A if each value of A in R has associated with
it exactly one value of B in R.

We write this as A → B, read as “A functionally determines B” or “A determines B”. This does not mean that A causes B or that the value of B can be calculated from the value of A by a formula, although sometimes that is the case. It simply means that if we know the value of A and we examine the table of relation R, we will find only one value of B in all the rows that have the given value of A at any one time. Thus, if two rows have the same A value, they must also have the same B value. However, for a given B value, there may be several different A values. When a functional dependency exists, the attribute or set of attributes on the left side of the arrow is called the determinant, and the attribute or set of attributes on the right side is called the dependent. For example, in a relation R with attributes (a, b, c, d, e) we might have a → b, c, d and d → e.

For Example: The functional dependency of different attributes of students: -

STD (stId,stName,stAdr,prName,credits)

stId  stName,stAdr,prName,credits prName  credits

Now in this example, if we know the stId, we can tell the complete information about that student. Similarly, if we know the prName, we can tell the credit hours for that particular program.

Functional Dependencies and Keys: We can determine the keys of a relation after seeing its functional dependencies. The determinant of a functional dependency that determines all attributes of a table is a super key. A super key is an attribute or a set of attributes that identifies an entity uniquely; in a table, a super key is any column or set of columns whose values can be used to distinguish one row from another. A minimal super key is a candidate key: if a determinant of a functional dependency determines all attributes of the relation then it is definitely a super key, and if no proper subset of this determinant is itself a super key then it is a candidate key. So the functional dependencies help to identify the keys. We have an example as under:

EMP (eId,eName,eAdr,eDept,prId,prSal)

eId  (eName,eAdr,eDept) eId,prId  prSal

Now in this example, in the employee relation eId is the key, from which we can uniquely determine the employee name, address and department. Similarly, if we know the employee Id and the project Id, we can find the project salary as well. So FDs help in finding out the keys and their relations as well.

Inference Rules: Rules of inference for functional dependencies, called inference axioms or Armstrong's axioms after their developer, can be used to find all the FDs logically implied by a set of FDs. These rules are sound, meaning that they are an immediate consequence of the definition of functional dependency and that any FD derived from a given set of FDs using them is true. They are also complete, meaning they can be used to derive every valid inference about dependencies. Let A, B, C and D be subsets of attributes of a relation R; then the different inference rules are as follows:

 Reflexivity: If B is a subset of A, then A → B. This also implies that A → A always holds. Functional dependencies of this type are called trivial dependencies.

For example: stName, stAdr → stName and stName → stName

 Augmentation: If we have A → B, then AC → BC.

For example: if stId → stName, then stId, stAdr → stName, stAdr

 Transitivity: If A → B and B → C, then A → C.

For example: if stId → prName and prName → credits, then stId → credits

 Additivity or Union: If A → B and A → C, then A → BC.

For example: if empId → eName and empId → qual, then we can write empId → eName, qual

 Projectivity or Decomposition: If A → BC, then A → B and A → C.

For example: if empId → eName, qual, then we can write empId → eName and empId → qual

 Pseudotransitivity: If A → B and CB → D, then AC → D.

For example: if stId → stName and stName, fName → stAdr, then we can write stId, fName → stAdr

b. Normalization
There are four types of anomalies of concern: redundancy, insertion, deletion and updation anomalies. Normalization is not compulsory, but it is strongly recommended, because a normalized design makes the maintenance of the database much easier. While carrying out the process of normalization, it should be applied to each table of the database. It is performed after the logical database design.

This process is also being followed informally during conceptual database design as well. Normalization is
based on the concept of functional dependency.

c. Level of Normalization
There are different forms or levels of normalization, called first, second and so on. Each normal form has certain requirements or conditions that must be fulfilled; if a table or relation fulfills the conditions of a particular form, it is said to be in that normal form. The process is applied to each relation of the database. The lowest form that all the tables are in is called the normal form of the entire database. The main objective of normalization is to place the database in the highest possible normal form.

There are two goals of the normalization process: eliminate redundant data (for example, storing the same data
in more than one table) and ensure data dependencies make sense (only storing related data in a table). Both of
these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically
stored.

First Normal Form: A relation is in first normal form if and only if every attribute is single valued for each
tuple. This means that each attribute in each row, or each cell of the table, contains only one value. No repeating
fields or groups are allowed. An alternative way of describing first normal form is to say that the domains of
attributes of a relation are atomic, that is, they consist of single units that cannot be broken down further. There is no multi-valued attribute (repeating group) in the relation, because multiple values create problems in performing operations like select or join. For example, consider a relation Student.

Now this table is in first normal form, and every attribute holds a single value in every tuple.

Second Normal Form: A relation is in second normal form (2NF) if and only if it is in first normal form and all
the nonkey attributes are fully functionally dependent on the key. Clearly, if a relation is in 1NF and the key
consists of a single attribute, the relation is automatically in 2NF. The only time we have to be concerned about
2NF is when the key is composite. Second normal form (2NF) addresses the concept of removing duplicative
data. It removes subsets of data that apply to multiple rows of a table and place them in separate tables. It creates
relationships between these new tables and their predecessors through the use of foreign keys.

CLASS (crId, stId, stName, fId, room, grade)

crId, stId → stName, fId, room, grade
stId → stName
crId → fId, room

Now in this relation the key is the combination of course Id and student Id. The requirement of 2NF is that all non-key attributes should be fully dependent on the whole key; there should be no partial dependency of any attribute. But in this relation the student name depends on the student Id alone, and similarly the faculty Id and room depend on the course Id alone, so the relation is not in second normal form. At this level of normalization, each column in a table that is not a determiner of the contents of another column must itself be a function of the other columns in the table. For example, in a table with three columns containing customer ID, product sold, and price of the product when sold, the price would be a function of the customer ID (entitled to a discount) and the specific product. If a relation is not in 2NF then there are some anomalies, which are: redundancy, insertion anomaly, deletion anomaly and updation anomaly.

The general requirements of 2NF are:-

 Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
 Create relationships between these new tables and their predecessors through the use of foreign keys.

Consider the following table which has the anomalies:-

First, note that the table is in 1NF, because there are no multi-valued entries in any tuple and all cells contain atomic values. The first anomaly is redundancy: in this CLASS table the course Id C3456 is repeated for faculty Id F2345, and similarly the room number 104 is repeated twice. Second is the insertion anomaly: suppose we want to insert a course in the table, but this course has not yet been registered by any student. We cannot enter a student Id, because no student has registered for this course yet, and so we cannot insert the course either. This is called the insertion anomaly, which leaves the database in a wrong state.

Next is the deletion anomaly. Suppose there is a course in which only one student is enrolled. Now, for some reason, we want to delete the record of that student. But here the information about the course will also be deleted, which again puts the database in an incorrect state: we in fact want to delete only the student record, but along with it the course information has also been deleted, so the database no longer reflects the actual system.

Next is the updation anomaly. Suppose a course has been registered by 50 students and now we want to change its class room; in this case we will have to change the records of all 50 students. This is the updation anomaly. The process for transforming a 1NF table to 2NF is:

 Identify any determinants other than the composite key, and the columns they determine.
 Create and name a new table for each determinant and the unique columns it determines.
 Move the determined columns from the original table to the new table. The determinant becomes the primary key of the new table.
 Delete the columns you just moved from the original table except for the determinant which will serve
as a foreign key.
 The original table may be renamed to maintain semantic meaning.

Now the table has been decomposed into three tables to remove all these anomalies from the table as under:-

STD (stId, stName), COURSE (crId, fId, room) and CLASS (crId, stId, grade)

So now these three tables are in second normal form. There are no anomalies in this design now, and we say that it is in 2NF.
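
A sketch of the decomposed 2NF design in SQL; the data types are assumptions.

CREATE TABLE STD (
    stId   CHAR(6) NOT NULL,
    stName VARCHAR(40),
    PRIMARY KEY (stId)
);

CREATE TABLE COURSE (
    crId CHAR(6) NOT NULL,
    fId  CHAR(6),
    room VARCHAR(10),
    PRIMARY KEY (crId)
);

CREATE TABLE CLASS (
    crId  CHAR(6) NOT NULL,
    stId  CHAR(6) NOT NULL,
    grade CHAR(2),
    PRIMARY KEY (crId, stId),
    FOREIGN KEY (crId) REFERENCES COURSE (crId),
    FOREIGN KEY (stId) REFERENCES STD (stId)
);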

Third Normal Form: A relational table is in third normal form (3NF) if it is already in 2NF and every non-key
column is non-transitively dependent upon its primary key. In other words, all non-key attributes are
functionally dependent only upon the primary key.

Transitive Dependency: A transitive dependency is one in which an attribute is determined indirectly, through another attribute; it occurs when one non-key attribute determines another non-key attribute. For third normal form we concentrate on relations with one candidate key, and we eliminate transitive dependencies. Transitive dependencies cause insertion, deletion, and update anomalies. We will now see this with an example:

Here the table is in second normal form, as there is no partial dependency of any attribute; the key is the student Id. The problem is one of transitive dependency, in which a non-key attribute is determined by another non-key attribute: here the program credits are determined by the program name, so the relation is not in 3NF. This also causes the same four anomalies, which are now due to the transitive dependency. For example:

Now in this table all four anomalies exist, so we will have to remove them by decomposing the table after removing the transitive dependency. We will see this as under:

The process of transforming a table into 3NF is:

 Identify any determinants, other than the primary key, and the columns they determine.
 Create and name a new table for each determinant and the unique columns it determines.
 Move the determined columns from the original table to the new table. The determinant becomes the primary key of the new table.
 Delete the columns you just moved from the original table except for the determinant, which will serve
as a foreign key.
 The original table may be renamed to maintain semantic meaning.

STD (stId, stName, stAdr, prName) PROGRAM (prName, prCrdts)

We have now decomposed the relation into two relations of student and program. So the relations are in third
normal form and are free of all the anomalies

Boyce - Codd Normal Form: A relation is in Boyce-Codd normal form if and only if every determinant is a
candidate key. A relation R is said to be in BCNF if whenever X -> A holds in R, and A is not in X, then X is a
candidate key for R. It should be noted that most relations that are in 3NF are also in BCNF. Infrequently, a 3NF
relation is not in BCNF and this happens only if

 the candidate keys in the relation are composite keys (that is, they are not single attributes),
 there are more than one candidate keys in the relation, and
 the keys are not disjoint, that is, some attributes in the keys are common.

The BCNF differs from the 3NF only when there are more than one candidate keys and the keys are composite
and overlapping. Consider for example, the relationship: enrol (sno, sname, cno, cname, date-enrolled)

Let us assume that the relation has the following candidate keys:

(sno,cno) (sno,cname) (sname,cno) (sname, cname) (we have assumed sname and cname are unique identifiers).
The relation is in 3NF but not in BCNF because there are dependencies (sno -> sname ) (cno -> cname)

Here, attributes that are part of one candidate key are dependent on part of another candidate key. Such dependencies indicate that although the relation is about some entity or association that is identified by the candidate keys, e.g. (sno, cno), there are attributes that are not about the whole thing that the keys identify. For example, the above relation is about an association (enrolment) between students and subjects, and therefore the relation needs to include only one identifier for students and one identifier for subjects. Providing two identifiers for students (sno, sname) and two for subjects (cno, cname) means that some information about the students and subjects which is not needed here is being stored. This results in repetition of information and the anomalies that we discussed at the beginning of this chapter. If we wish to include further information about students and courses in the database, it should not be done by putting the information in the present relation but by creating new relations that represent the entities student and subject. These difficulties may be overcome by decomposing the above relation into the following three relations: (sno, sname) (cno, cname) (sno, cno, date-of-enrolment)

We now have a relation that only has information about students, another only about subjects and the third only
about enrolments. All the anomalies and repetition of information have been removed.

Higher Normal Forms: Beyond BCNF there exist the fourth, fifth, and domain-key normal forms. Tables
normalized up to BCNF are usually in the required form, but if we wish we can move on to the fourth and fifth
normal forms as well. 4NF deals with multivalued dependencies, 5NF deals with possible lossless
decompositions, and DKNF further reduces the chances of any possible inconsistency.

Chapter 7: SQL

1. SQL queries
a. Join operators
The relational join operation merges rows from two tables and returns the rows that meet one of the following
conditions:

 Have common values in common columns (natural join).


 Meet a given join condition (equality or inequality).
 Have common values in common columns or have no matching values (outer join).
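
The figure with the query itself is not reproduced in this text; a representative query of the kind discussed in the
following points, joining the PRODUCT and VENDOR tables on V_CODE (column names taken from the
discussion below), might look like this:

SELECT P_CODE, P_DESCRIPT, V_NAME
FROM   PRODUCT, VENDOR
WHERE  PRODUCT.V_CODE = VENDOR.V_CODE;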

The preceding SQL join syntax is sometimes referred to as an “old-style” join. Note that the FROM clause
contains the tables being joined and that the WHERE clause contains the condition(s) used to join the tables.
Note the following points about the preceding query:

 The FROM clause indicates which tables are to be joined. If three or more tables are included, the join
operation takes place two tables at a time, from left to right. For example, if you are joining tables T1,
T2, and T3, the first join is table T1 with T2; the results of that join are then joined to table T3.
 The join condition in the WHERE clause tells the SELECT statement which rows will be returned. In
this case, the SELECT statement returns all rows for which the V_CODE values in the PRODUCT and
VENDOR tables are equal.
 The number of join conditions is always equal to the number of tables being joined minus one. For
example, if you join three tables (T1, T2, and T3), you will have two join conditions (j1 and j2). All join
conditions are connected through an AND logical operator. The first join condition (j1) defines the join
criteria for T1 and T2. The second join condition (j2) defines the join criteria for the output of the first
join and T3.

Join operations can be classified as inner joins and outer joins. The inner join is the traditional join in which
only rows that meet a given criterion are selected. The join criterion can be an equality condition (also called a
natural join or an equijoin) or an inequality condition (also called a theta join). An outer join returns not only the
matching rows but also the rows with unmatched attribute values for one table or both tables to be joined. The
SQL standard also introduces a special type of join, called a cross join, that returns the same result as the
Cartesian product of two sets or tables.
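
For comparison, and assuming the same hypothetical PRODUCT and VENDOR tables, the ANSI join syntax
expresses an inner join and a left outer join as follows (a sketch, not taken from the original figures):

SELECT P_CODE, V_NAME
FROM   PRODUCT JOIN VENDOR ON PRODUCT.V_CODE = VENDOR.V_CODE;

SELECT V_NAME, P_CODE
FROM   VENDOR LEFT OUTER JOIN PRODUCT ON VENDOR.V_CODE = PRODUCT.V_CODE;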

b. Subqueries

Where subquery
The most common type of subquery uses an inner SELECT subquery on the right side of a WHERE comparison
expression. For example, to find all products with a price greater than or equal to the average product price, you
write the following query:
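
The original figure with the query is not shown here; a sketch of such a query, assuming a PRODUCT table with
P_CODE and P_PRICE columns, would be:

SELECT P_CODE, P_PRICE
FROM   PRODUCT
WHERE  P_PRICE >= (SELECT AVG(P_PRICE) FROM PRODUCT);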

Note that this type of query, when used in a >, =, or <= conditional expression, requires a subquery that returns
only one single value (one column, one row). The value generated by the subquery must be of a “comparable”
data type; if the attribute to the left of the comparison symbol is a character type, the subquery must return a
character string. Also, if the subquery returns more than a single value, the DBMS will generate an error.

In the preceding example, the inner query finds the P_CODE for the product “Claw hammer.” The P_CODE is
then used to restrict the selected rows to only those where the P_CODE in the LINE table matches the P_CODE
for “Claw hammer.” But what happens if the original query encounters the “Claw hammer” string in more than
one product description? You get an error message. To compare one value to a list of values, you must use the
IN operator.

IN subqueries
What would you do if you wanted to find all customers who purchased a “hammer” or any kind of saw or saw
blade? Note that the product table has two different types of hammers: “Claw hammer” and “Sledge hammer.”
Also note that there are multiple occurrences of products that contain “saw” in their product descriptions. There
are saw blades, jigsaws, and so on. In such cases, you need to compare the P_CODE not to one product code
(single value) but to a list of product code values. When you want to compare a single attribute to a list of
values, you use the IN operator. When the P_CODE values are not known beforehand, but they can be derived
using a query, you must use an IN subquery. The following example lists all customers who have purchased
hammers, saws, or saw blades.
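
The referenced example is not reproduced; assuming CUSTOMER, INVOICE, LINE, and PRODUCT tables
with the column names used elsewhere in this chapter (P_DESCRIPT holding the product description), such a
query could be sketched as:

SELECT DISTINCT CUS_CODE, CUS_LNAME
FROM   CUSTOMER NATURAL JOIN INVOICE NATURAL JOIN LINE
WHERE  P_CODE IN
       (SELECT P_CODE FROM PRODUCT
        WHERE P_DESCRIPT LIKE '%hammer%' OR P_DESCRIPT LIKE '%saw%');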

Having subqueries
Just as you can use subqueries with the WHERE clause, you can use a subquery with a HAVING clause.
Remember that the HAVING clause is used to restrict the output of a GROUP BY query by applying
conditional criteria to the grouped rows. For example, to list all products with the total quantity sold greater than
the average quantity sold, you would write the following query:
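
The query itself is not included in this text; a sketch, assuming a LINE table with P_CODE and a LINE_UNITS
(quantity sold) column, would be:

SELECT P_CODE, SUM(LINE_UNITS) AS TOTALUNITS
FROM   LINE
GROUP  BY P_CODE
HAVING SUM(LINE_UNITS) > (SELECT AVG(LINE_UNITS) FROM LINE);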

ANY and ALL subqueries (Multirow subquery)


So far, you have learned that you must use an IN subquery when you need to compare a value to a list of values.
But the IN subquery uses an equality operator; that is, it selects only those rows that match (are equal to) at least
one of the values in the list. What happens if you need to do an inequality comparison (> or <) of one value to a
list of values? For example, suppose that you want to know which products have a product cost that is greater
than all individual product costs for products provided by vendors from Florida.
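
The referenced query (analyzed in the points below) is not reproduced; a sketch, using the P_QOH * P_PRICE
product-cost expression and the VENDOR table's V_STATE column mentioned later in this chapter, could be:

SELECT P_CODE, P_QOH * P_PRICE AS PRODUCT_COST
FROM   PRODUCT
WHERE  P_QOH * P_PRICE > ALL
       (SELECT P_QOH * P_PRICE
        FROM   PRODUCT
        WHERE  V_CODE IN (SELECT V_CODE FROM VENDOR WHERE V_STATE = 'FL'));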

It’s important to note the following points about the query and its output in Figure:

 The query is a typical example of a nested query.


 The query has one outer SELECT statement with a SELECT subquery (call it sqA) containing a second
SELECT subquery (call it sqB ).
 The last SELECT subquery (sqB ) is executed first and returns a list of all vendors from Florida.
 The first SELECT subquery (sqA) uses the output of the SELECT subquery (sqB ). The sqA subquery
returns the list of product costs for all products provided by vendors from Florida.
 The use of the ALL operator allows you to compare a single value (P_QOH * P_PRICE) with a list of
values returned by the first subquery (sqA) using a comparison operator other than equals.
 For a row to appear in the result set, it has to meet the criterion P_QOH * P_PRICE > ALL of the
individual values returned by the subquery sqA. The values returned by sqA are a list of product costs.
In fact, “greater than ALL” is equivalent to “greater than the highest product cost of the list.” In the
same way, a condition of “less than ALL” is equivalent to “less than the lowest product cost of the list.”

Another powerful operator is the ANY multirow operator (the near cousin of the ALL multirow operator). The
ANY operator allows you to compare a single value to a list of values, selecting only the rows for which the
inventory cost is greater than any value of the list or less than any value of the list. You could use the equal to
ANY operator, which would be the equivalent of the IN operator.

From Subqueries
As you already know, the FROM clause specifies the table(s) from which the data will be drawn. Because the
output of a SELECT statement is another table (a “virtual” table), you can use a SELECT subquery in the FROM clause.
For example, assume that you want to know all customers who have purchased products 13-Q2/P2 and 23109-
HB. All product purchases are stored in the LINE table. It is easy to find out who purchased any given product
by searching the P_CODE attribute in the LINE table. But in this case, you want to know all customers who
purchased both products, not just one. You could write the following query:

SELECT DISTINCT CUSTOMER.CUS_CODE, CUSTOMER.CUS_LNAME
FROM   CUSTOMER,
       (SELECT INVOICE.CUS_CODE FROM INVOICE NATURAL JOIN LINE
        WHERE P_CODE = '13-Q2/P2') CP1,
       (SELECT INVOICE.CUS_CODE FROM INVOICE NATURAL JOIN LINE
        WHERE P_CODE = '23109-HB') CP2
WHERE  CUSTOMER.CUS_CODE = CP1.CUS_CODE
AND    CP1.CUS_CODE = CP2.CUS_CODE;

The first subquery returns all customers who purchased product 13-Q2/P2, while the second subquery returns all
customers who purchased product 23109-HB. So in this FROM subquery, you are joining the CUSTOMER
table with two virtual tables. The join condition selects only the rows with matching CUS_CODE values in each
table (base or virtual).

Attribute list subqueries


The SELECT statement uses the attribute list to indicate what columns to project in the resulting set. Those
columns can be attributes of base tables, computed attributes, or the result of an aggregate function. The
attribute list can also include a subquery expression, also known as an inline subquery. A subquery in the
attribute list must return one single value; otherwise, an error code is raised. For example, a simple inline query
can be used to list the difference between each product’s price and the average product price:
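
The figure is not shown here; a sketch of such an inline (attribute list) subquery over the PRODUCT table would
be:

SELECT P_CODE, P_PRICE,
       (SELECT AVG(P_PRICE) FROM PRODUCT) AS AVGPRICE,
       P_PRICE - (SELECT AVG(P_PRICE) FROM PRODUCT) AS DIFF   -- full expression, not the alias
FROM   PRODUCT;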

In Figure, note that the inline query output returns one single value (the average product’s price) and that the
value is the same in every row. Note also that the query used the full expression instead of the column aliases
when computing the difference. In fact, if you try to use the alias in the difference expression, you will get an
error message. The column alias cannot be used in computations in the attribute list when the alias is defined in
the same attribute list. That DBMS requirement is the result of the way the DBMS parses and executes queries.

Correlated subqueries
Until now, all subqueries you have learned execute independently. That is, each subquery in a command
sequence executes in a serial fashion, one after another. The inner subquery executes first; its output is used by
the outer query, which then executes until the last outer query executes (the first SQL statement in the code). In
contrast, a correlated subquery is a subquery that executes once for each row in the outer query. That process is
similar to the typical nested loop in a programming language.

That process is the opposite of that of the subqueries seen so far. The query is called a correlated subquery
because the inner query is related to the outer query: the inner query references a column of the outer query. To
see the correlated subquery in action, suppose that you want to know all product sales in
which the units sold value is greater than the average units sold value for that product. In that case, the following
procedure must be completed:

 Compute the average-units-sold value for a product.


 Compare the average computed in Step 1 to the units sold in each sale row and then select only the rows
in which the number of units sold is greater.
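
The query discussed below is not reproduced in this text; a sketch of the correlated subquery, assuming a LINE
table with P_CODE and LINE_UNITS columns and using table aliases, would be:

SELECT LS.P_CODE, LS.LINE_UNITS
FROM   LINE LS
WHERE  LS.LINE_UNITS >
       (SELECT AVG(LA.LINE_UNITS)        -- runs once per row of the outer LINE table
        FROM   LINE LA
        WHERE  LA.P_CODE = LS.P_CODE);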

In the top query and its result in Figure, note that the LINE table is used more than once, so you must use table
aliases. In that case, the inner query computes the average units sold of the product that matches the P_CODE of
the outer query P_CODE. That is, the inner query runs once, using the first product code found in the (outer)
LINE table, and returns the average sale for that product. When the number of units sold in that (outer) LINE
row is greater than the average computed, the row is added to the output. Then the inner query runs again, this
time using the second product code found in the (outer) LINE table. The process repeats until the inner query
has run for all rows in the (outer) LINE table. In that case, the inner query will be repeated as many times as
there are rows in the outer query.

2. SQL Functions
a. Date and time functions

b. Numeric Functions

c. String Functions

d. Conversion functions
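
The original tables listing the functions in each of the four categories above are not reproduced here. As a rough
illustration only (function names vary by vendor; Oracle-style names are assumed), typical members of each
category can be used as follows:

SELECT SYSDATE                      AS today,           -- date and time function
       ROUND(P_PRICE, 1)            AS rounded_price,   -- numeric function
       UPPER(P_DESCRIPT)            AS descript_upper,  -- string function
       TO_CHAR(P_PRICE, '9999.99')  AS price_text       -- conversion function
FROM   PRODUCT;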

3. Procedural SQL
Thus far, you have learned to use SQL to read, write, and delete data in the database. For example, you learned
to update values in a record, to add records, and to delete records. Unfortunately, SQL does not support the
conditional execution of procedures that are typically supported by a programming language. SQL also fails to
support the looping operations in programming languages that permit the execution of repetitive actions
typically encountered in a programming environment.

Traditionally, if you wanted to perform a conditional (IF-THEN-ELSE) or looping (DO-WHILE) type of


operation (that is, a procedural type of programming), you would use a programming language such as Visual
Basic .NET, C#, or COBOL. That’s why many older (so-called legacy) business applications are based on
enormous numbers of COBOL program lines. Although that approach is still common, it usually involves the
duplication of application code in many programs. Therefore, when procedural changes are required, program
modifications must be made in many different programs. An environment characterized by such redundancies
often creates data management problems. A better approach is to isolate critical code and then have all
application programs call the shared code. The advantage of that modular approach is that the application code
is isolated in a single program, thus yielding better maintenance and logic control. In any case, the rise of
distributed databases and object-oriented databases required that more application code be stored and executed
within the database. To meet that requirement, most RDBMS vendors created numerous programming language
extensions. Those extensions include:

 Flow-control procedural programming structures (IF-THEN-ELSE, DO-WHILE) for logic representation.
 Variable declaration and designation within the procedures.
 Error management.

To remedy the lack of procedural functionality in SQL and to provide some standardization within the many
vendor offerings, the SQL-99 standard defined the use of persistent stored modules. A persistent stored module
(PSM) is a block of code containing standard SQL statements and procedural extensions that is stored and
executed at the DBMS server. The PSM represents business logic that can be encapsulated, stored, and shared
among multiple database users. A PSM lets an administrator assign specific access rights to a stored module to
ensure that only authorized users can use it. Support for persistent stored modules is left to each vendor to
implement. In fact, for many years, some RDBMSs (such as Oracle, SQL Server, and DB2) supported stored
procedure modules within the database before the official standard was promulgated. MS SQL Server
implements persistent stored modules via Transact-SQL and other language extensions, the most notable of
which are the .NET family of programming languages. Procedural SQL (PL/SQL) is a language that makes it
possible to use and store procedural code and SQL statements within the database and to merge SQL and
traditional programming constructs, such as variables, conditional processing (IF-THEN-ELSE), basic loops
(FOR and WHILE loops), and error trapping. The procedural code is executed as a unit by the DBMS when it is
invoked (directly or indirectly) by the end user. End users can use PL/SQL to create:

 Anonymous PL/SQL blocks.


 Triggers
 Stored procedures
 PL/SQL functions
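
As a small illustration of the first item, an anonymous PL/SQL block (a sketch, assuming the PRODUCT table
used earlier) might look like this:

DECLARE
  v_count NUMBER;                              -- local variable declaration
BEGIN
  SELECT COUNT(*) INTO v_count FROM PRODUCT;
  IF v_count = 0 THEN
    DBMS_OUTPUT.PUT_LINE('No products on file');
  ELSE
    DBMS_OUTPUT.PUT_LINE('Products on file: ' || v_count);
  END IF;
END;
/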

a. Triggers
Automating business procedures and automatically maintaining data integrity and consistency are critical in a
modern business environment. One of the most critical business procedures is proper inventory management.
For example, you want to make sure that current product sales can be supported with sufficient product
availability. Therefore, it is necessary to ensure that a product order be written to a vendor when that product’s
inventory drops below its minimum allowable quantity on hand. Better yet, how about ensuring that the task is
completed automatically? To accomplish automatic product ordering, you first must make sure the product’s
quantity on hand reflects an up-to-date and consistent value. After the appropriate product availability
requirements have been set, two key issues must be addressed:

 Business logic requires an update of the product quantity on hand each time there is a sale of that
product.
 If the product’s quantity on hand falls below its minimum allowable inventory (quantity-on-hand) level,
the product must be reordered.

To accomplish those two tasks, you could write multiple SQL statements: one to update the product quantity on
hand and another to update the product reorder flag. Next, you would have to run each statement in the correct
order each time there was a new sale. Such a multistage process would be inefficient because a series of SQL
statements must be written and executed each time a product is sold. Even worse, that SQL environment
requires that somebody must remember to perform the SQL tasks. A trigger is procedural SQL code that is
automatically invoked by the RDBMS upon the occurrence of a given data manipulation event. It is useful to
remember that:

 A trigger is invoked before or after a data row is inserted, updated, or deleted.


 A trigger is associated with a database table.
 Each database table may have one or more triggers.
 A trigger is executed as part of the transaction that triggered it.

Triggers are critical to proper database operation and management. For example:

 Triggers can be used to enforce constraints that cannot be enforced at the DBMS design and
implementation levels.
 Triggers add functionality by automating critical actions and providing appropriate warnings and
suggestions for remedial action. In fact, one of the most common uses for triggers is to facilitate the
enforcement of referential integrity.
 Triggers can be used to update table values, insert records in tables, and call other stored procedures.

Triggers play a critical role in making the database truly useful; they also add processing power to the RDBMS
and to the database system as a whole. Oracle recommends triggers for:

 Auditing purposes (creating audit logs).


 Automatic generation of derived column values.
 Enforcement of business or security constraints.
 Creation of replica tables for backup purposes

A trigger definition contains the following parts:

 The triggering timing: BEFORE or AFTER. This timing indicates when the trigger’s PL/SQL code
executes; in this case, before or after the triggering statement is completed.
 The triggering event: the statement that causes the trigger to execute (INSERT, UPDATE, or DELETE).
 The triggering level: There are two types of triggers: statement-level triggers and row-level triggers.
o A statement-level trigger is assumed if you omit the FOR EACH ROW keywords. This type of
trigger is executed once, before or after the triggering statement is completed. This is the default
case.
o A row-level trigger requires use of the FOR EACH ROW keywords. This type of trigger is
executed once for each row affected by the triggering statement. (In other words, if you update
10 rows, the trigger executes 10 times.)
 The triggering action: The PL/SQL code enclosed between the BEGIN and END keywords. Each
statement inside the PL/SQL code must end with a semicolon “;”
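
Putting these parts together, a hypothetical row-level trigger that flags a product for reordering when its quantity
on hand falls to or below its minimum (the columns P_QOH, P_MIN, and P_REORDER are assumptions, not
taken from the original) could be sketched in Oracle PL/SQL as:

CREATE OR REPLACE TRIGGER trg_product_reorder
BEFORE INSERT OR UPDATE OF P_QOH ON PRODUCT   -- triggering timing and event
FOR EACH ROW                                  -- triggering level: row-level
BEGIN
  -- triggering action
  IF :NEW.P_QOH <= :NEW.P_MIN THEN
    :NEW.P_REORDER := 1;                      -- flag the product for reordering
  ELSE
    :NEW.P_REORDER := 0;
  END IF;
END;
/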

b. Stored Procedures
A stored procedure is a named collection of procedural and SQL statements. Just like database triggers, stored
procedures are stored in the database. One of the major advantages of stored procedures is that they can be used
to encapsulate and represent business transactions. For example, you can create a stored procedure to represent a
product sale, a credit update, or the addition of a new customer. By doing that, you can encapsulate SQL
statements within a single stored procedure and execute them as a single transaction. There are two clear
advantages to the use of stored procedures:

 Stored procedures substantially reduce network traffic and increase performance. Because the procedure
is stored at the server, there is no transmission of individual SQL statements over the network. The use
of stored procedures improves system performance because all transactions are executed locally on the
RDBMS, so each SQL statement does not have to travel over the network.
 Stored procedures help reduce code duplication by means of code isolation and code sharing (creating
unique PL/SQL modules that are called by application programs), thereby minimizing the chance of
errors and the cost of application development and maintenance.

To create a stored procedure, you use the following syntax:
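
The syntax figure is not reproduced here; in Oracle PL/SQL the general form is roughly CREATE OR
REPLACE PROCEDURE name [(argument [IN|OUT] data-type, ...)] IS [variable declarations] BEGIN ...
END;. A small hypothetical example (PRODUCT table assumed):

CREATE OR REPLACE PROCEDURE prc_prod_discount (p_percent IN NUMBER)
IS
BEGIN
  -- reduce all product prices by the given percentage
  UPDATE PRODUCT
  SET    P_PRICE = P_PRICE * (1 - p_percent / 100);
  DBMS_OUTPUT.PUT_LINE('Prices reduced by ' || p_percent || '%');
END;
/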

Note the following important points about stored procedures and their syntax:

 Argument specifies the parameters that are passed to the stored procedure. A stored procedure could
have zero or more arguments or parameters.
 IN/OUT indicates whether the parameter is for input, output, or both.
 Data-type is one of the procedural SQL data types used in the RDBMS. The data types normally match
those used in the RDBMS table-creation statement.
 Variables can be declared between the keywords IS and BEGIN. You must specify the variable name,
its data type, and (optionally) an initial value.

To execute the stored procedure, you must use the following syntax: EXEC procedure_name[(parameter_list)];
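
For the hypothetical procedure sketched above, the call would therefore be:

EXEC prc_prod_discount(10);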

c. PL/SQL stored functions


Using programmable or procedural SQL, you can also create your own stored functions. Stored procedures and
functions are very similar. A stored function is basically a named group of procedural and SQL statements that
returns a value (indicated by a RETURN statement in its program code). To create a function, you use the
following syntax:
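
The syntax figure is not reproduced; a small hypothetical stored function in Oracle PL/SQL, returning the
inventory value of a given product (PRODUCT table assumed), might look like this:

CREATE OR REPLACE FUNCTION fn_prod_value (p_code_in IN VARCHAR2)
RETURN NUMBER
IS
  v_value NUMBER;                    -- variable declared between IS and BEGIN
BEGIN
  SELECT P_QOH * P_PRICE INTO v_value
  FROM   PRODUCT
  WHERE  P_CODE = p_code_in;
  RETURN v_value;                    -- the RETURN statement supplies the function's value
END;
/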

Stored functions can be invoked only from within stored procedures or triggers and cannot be invoked from
SQL statements (unless the function follows some very specific compliance rules). Remember not to confuse
built-in SQL functions (such as MIN, MAX, and AVG) with stored functions.

Chapter 8: Indexing and Storage
1. Introduction
In a book, the index is an alphabetical listing of topics, along with the page number where the topic appears. The
idea of an INDEX in a Database is similar. Any subset of the fields of a relation can be the search key for an
index on the relation. Search key is not the same as key (e.g. doesn’t have to be unique ID). An index contains a
collection of data entries, and supports efficient retrieval of all records with a given search key value k.
Typically, an index also contains auxiliary information that directs searches to the desired data entries. There can be
multiple (different) indexes per file. Indexes on primary key and on attribute(s) in the unique constraint are
automatically created.

In a database, an index allows the database program to find data in a table without scanning the entire table. An
index in a database is a list of values in a table with the storage locations of rows in the table that contain each
value. Indexes can be created on either a single column or a combination of columns in a table and are
implemented in the form of B-trees. An index contains an entry with one or more columns (the search key) from
each row in a table. A B-tree is sorted on the search key, and can be searched efficiently on any leading subset
of the search key. For example, an index on columns A, B, C can be searched efficiently on A, on A, B, and A,
B, C. Databases contain individual indexes for selected types or columns of data. When you create a database
and tune it for performance, you should create indexes for the columns used in queries to find data.

In the pubs sample database provided with Microsoft® SQL Server™ 2000, the employee table has an index on
the emp_id column. The following illustration shows how the index stores each emp_id value and points to
the rows of data in the table with each value.

When SQL Server executes a statement to find data in the employee table based on a specified emp_id value, it
recognizes the index for the emp_id column and uses the index to find the data. If the index is not present, it
performs a full table scan starting at the beginning of the table and stepping through each row, searching for the
specified emp_id value. SQL Server automatically creates indexes for certain types of constraints (for example,
PRIMARY KEY and UNIQUE constraints). You can further customize the table definitions by creating indexes
that are independent of constraints.

The performance benefits of indexes, however, do come with a cost. Tables with indexes require more storage
space in the database. Also, commands that insert, update, or delete data can take longer and require more
processing time to maintain the indexes. When you design and create indexes, you should ensure that the
performance benefits outweigh the extra cost in storage space and processing resources.

2. Properties of Index
Following are the major properties of indexes:

 Indexes can be defined even when there is no data in the table.
 Existing values are checked on execution of this command.
 An index supports selections of the form: field <operator> constant.
 It supports equality selections; either “tree” or “hash” indexes help here.
 It supports range selections (the operator is one among <, >, <=, >=, BETWEEN).

3. Structure of Index
We can create indices using some columns of the database.

 The search key is the index’s first column, and it contains a copy of the table’s candidate key or primary
key. The key values are saved in sorted order so that the related data can be quickly accessed.
 The data reference is the index’s second column. It contains a group of pointers that point to the disk
block where the value of the corresponding key can be found.

4. Creating Index
CREATE [ UNIQUE ] [ CLUSTERED | NONCLUSTERED ] INDEX index_name
ON { table | view } ( column [ ASC | DESC ] [ ,...n ] )

We will now see an example of creating an index as under:

CREATE UNIQUE INDEX pr_prName ON program(prName)

It can also be created on composite attributes. We will see it with an example.

CREATE UNIQUE INDEX St_Name ON Student(stName ASC, stFName DESC)

5. Methods of Indexing

a. Ordered Indices
To make searching easier and faster, the indices are frequently arranged/sorted. Ordered indices are indices that
have been sorted.

Example

Let’s say we have a table of employees with thousands of records, each of which is ten bytes long. If their IDs
begin with 1, 2, 3, …, etc., and we are looking for the employee with ID 543:

 In the case of a DB without an index, we must search the disk block from the beginning until it reaches
543; the DBMS will read the record after reading 543*10 = 5430 bytes.
 In the case of an index, we perform the search using the index and (assuming each index entry occupies
2 bytes) the DBMS reads the record after reading only 542*2 = 1084 bytes, which is significantly less
than in the previous case.

b. Primary indexes

Primary indexing refers to the process of creating an index based on the table’s primary key. These primary keys
are specific to each record and establish a 1:1 relationship between them. Thus, if we just know the primary key
attribute value of the first record in each block, we can determine quite quickly whether a given record can be
found in some block or not. This is the idea used to generate the Primary Index file for the table. The searching
operation is fairly efficient because primary keys are stored in sorted order. There are two types of primary
indexes: dense indexes and sparse indexes.

Dense Index
Every search key value in the data file has an index record in the dense index. It speeds up the search process.
The total number of records present in the index table and the main table are the same in this case. It requires
extra space to hold the index record. A pointer to the actual record on the disk and the search key are both
included in the index records.

Sparse Index
Only a few of the search key values in the data file have index records, and each index record points to a block.
Rather than pointing to every record in the main table, the index points only to selected records, leaving gaps
between the indexed entries.

c. Clustered indexes
A clustered index determines the storage order of data in a table. A clustered index is analogous to a telephone
directory, which arranges data by last name. Because the clustered index dictates the physical storage order of
the data in the table, a table can contain only one clustered index. However, the index can comprise multiple
columns (a composite index), like the way a telephone directory is organized by last name and first name. A
clustered index is particularly efficient on columns often searched for ranges of values. Once the row with the
first value is found using the clustered index, rows with subsequent indexed values are guaranteed to be
physically adjacent.

For example, if an application frequently executes a query to retrieve records between a range of dates, a
clustered index can quickly locate the row containing the beginning date, and then retrieve all adjacent rows in
the table until the last date is reached. This can help increase the performance of this type of query. Also, if there
is a column (or columns) that is frequently used to sort the data retrieved from a table, it can be advantageous to
cluster (physically sort) the table on that column to save the cost of a sort each time the column is queried.

For example, students studying in each semester are grouped together, i.e. 1st semester students, 2nd semester
students, 3rd semester students, etc. are stored as groups.

d. Non-clustered or Secondary Indexing


When using sparse indexing, the size of the mapping grows in sync with the size of the table. These mappings
are frequently stored in primary memory to speed up address fetching. The secondary memory then searches the
actual data using the address obtained through mapping. Fetching the address becomes slower as the mapping
size increases. The sparse index will be ineffective in this scenario, so secondary indexing is used to solve this
problem. Following are the three implementation approaches of indexes:

 Inverted files or inversions
 Linked lists
 B+ Trees

Another level of indexing is introduced in secondary indexing to reduce the size of the mapping. The massive
range for the columns is chosen first in this method, resulting in a small mapping size at the first level. Each
range is then subdivided into smaller groups. Because the first level’s mapping is kept in primary memory,
fetching the addresses is faster. The second-level mapping, as well as the actual data, is kept in secondary
memory (or hard disk).

e. Multilevel Indexing
With the growth of the size of the database, indices also grow. Because the index is stored in main memory, a
single-level index might become too large to store there and might require multiple disk accesses. Multilevel
indexing segregates the main index block into various smaller blocks so that each block can be stored in a single
disk block. The outer blocks are divided into inner blocks, which in turn point to the data blocks; this structure
can easily be stored in main memory with little overhead. To search, use binary search on the outer index, scan
the index block found until the correct index record is located, and then use that index record as before, scanning
the block it points to for the desired record. For very large files, additional levels of indexing may be required.
Indices must be updated at all levels when insertions or deletions require it. Frequently, each level of index
corresponds to a unit of physical storage.

6. Is indexing similar to hashing?


Indexing is a data structure technique to efficiently retrieve records from the database files based on some
attributes on which the indexing took place. On the other hand, hashing is an effective technique to calculate the
direct location of a data record on the disk without using index structure.

Hashing uses mathematical methods called hash functions to generate direct locations of data records on the
disc, whereas indexing uses data references that contain the address of the disc block with the value
corresponding to the key. As a result, there is a significant difference between hashing and indexing.

Another difference between indexing and hashing is that hashing works better than indexing for very large
databases.

Chapter 9: Query processing and Optimization
1. Query processing
In simple terms, the DBMS processes a query in three phases:

1. Parsing. The DBMS parses the SQL query and chooses the most efficient access/execution plan.
2. Execution. The DBMS executes the SQL query using the chosen execution plan.
3. Fetching. The DBMS fetches the data and sends the result set back to the client.

The processing of SQL DDL statements (such as CREATE TABLE) is different from the processing required by
DML statements. The difference is that a DDL statement actually updates the data dictionary tables or system
catalog, while a DML statement (SELECT, INSERT, UPDATE, and DELETE) mostly manipulates end-user
data. Figure shows the general steps required for query processing.

a. Parsing
The optimization process includes breaking down—parsing—the query into smaller units and transforming the
original SQL query into a slightly different version of the original SQL code, but one that is fully equivalent and
more efficient. Fully equivalent means that the optimized query results are always the same as the original
query. More efficient means that the optimized query will almost always execute faster than the original query.
(Note that it almost always executes faster because, as explained earlier, many factors affect the performance of
a database. Those factors include the network, the client computer’s resources, and other queries running
concurrently in the same database.) To determine the most efficient way to execute the query, the DBMS may
use the database statistics. The SQL parsing activities are performed by the query optimizer, which analyzes the
SQL query and finds the most efficient way to access the data. This process is the most time-consuming phase
in query processing. Parsing a SQL query requires several steps, in which the SQL query is:

 Validated for syntax compliance.


 Validated against the data dictionary to ensure that tables and column names are correct.
 Validated against the data dictionary to ensure that the user has proper access rights.
 Analyzed and decomposed into more atomic components.
 Optimized through transformation into a fully equivalent but more efficient SQL query.

 Prepared for execution by determining the most efficient execution or access plan.

Once the SQL statement is transformed, the DBMS creates what is commonly known as an access or execution
plan. An access plan is the result of parsing a SQL statement; it contains the series of steps a DBMS will use to
execute the query and to return the result set in the most efficient way. First, the DBMS checks to see if an
access plan already exists for the query in the SQL cache. If it does, the DBMS reuses the access plan to save
time. If it doesn’t, the optimizer evaluates various plans and makes decisions about what indexes to use and how
to best perform join operations. The chosen access plan for the query is then placed in the SQL cache and made
available for use and future reuse.

b. SQL Execution Phase


In this phase, all I/O operations indicated in the access plan are executed. When the execution plan is run, the
proper locks—if needed—are acquired for the data to be accessed, and the data are retrieved from the data files
and placed in the DBMS’s data cache. All transaction management commands are processed during the parsing
and execution phases of query processing.

c. SQL Fetching Phase


After the parsing and execution phases are completed, all rows that match the specified condition(s) are
retrieved, sorted, grouped, and/or aggregated (if required). During the fetching phase, the rows of the resulting
query result set are returned to the client. The DBMS might use temporary table space to store temporary data.
In this stage, the database server coordinates the movement of the result set rows from the server cache to the
client cache. For example, a given query result set might contain 9,000 rows; the server would send the first 100
rows to the client and then wait for the client to request the next set of rows, until the entire result set is sent to
the client.

2. Query Processing Bottlenecks


The main objective of query processing is to execute a given query in the fastest way possible with the least
amount of resources. As you have seen, the execution of a query requires the DBMS to break down the query
into a series of interdependent I/O operations to be executed in a collaborative manner. The more complex a
query is, the more complex the operations are, and the more likely it is that there will be bottlenecks. A query
processing bottleneck is a delay introduced in the processing of an I/O operation that causes the overall system
to slow down. In the same way, the more components a system has, the more interfacing among the components
is required, and the more likely it is that there will be bottlenecks. Within a DBMS, there are five components
that typically cause bottlenecks:

 CPU. The CPU processing power of the DBMS should match the system’s expected work load. A high
CPU utilization might indicate that the processor speed is too slow for the amount of work performed.
However, heavy CPU utilization can be caused by other factors, such as a defective component, not
enough RAM (the CPU spends too much time swapping memory blocks), a badly written device driver,
or a rogue process. A CPU bottleneck will affect not only the DBMS but all processes running in the
system.
 RAM. The DBMS allocates memory for specific usage, such as data cache and SQL cache. RAM must
be shared among all running processes (operating system, DBMS, and all other running processes). If
there is not enough RAM available, moving data among components that are competing for scarce
RAM can create a bottleneck.
 Hard disk. Another common cause of bottlenecks is hard disk speed and data transfer rates. Current
hard disk storage technology allows for greater storage capacity than in the past; however, hard disk
space is used for more than just storing end-user data. Current operating systems also use the hard disk
for virtual memory, which refers to copying areas of RAM to the hard disk as needed to make room in
RAM for more urgent tasks. Therefore, the greater the hard disk storage space and the faster the data
transfer rates, the less the likelihood of bottlenecks.
 Network. In a database environment, the database server and the clients are connected via a network.
All networks have a limited amount of bandwidth that is shared among all clients. When many network
nodes access the network at the same time, bottlenecks are likely.
 Application code. Not all bottlenecks are caused by limited hardware resources. One of the most
common sources of bottlenecks is badly written application code. No amount of coding will make a
poorly designed database perform better. We should also add: you can throw unlimited resources at a
badly written application, and it will still perform as a badly written application!

3. Query optimization
Query optimization is the central activity during the parsing phase in query processing. In this phase, the DBMS
must choose what indexes to use, how to perform join operations, which table to use first, and so on. Each
DBMS has its own algorithms for determining the most efficient way to access the data. The query optimizer
can operate in one of two modes:

 A rule-based optimizer uses preset rules and points to determine the best approach to execute a query.
The rules assign a “fixed cost” to each SQL operation; the costs are then added to yield the cost of the
execution plan. For example, a full table scan has a set cost of 10, while a table access by row ID has a
set cost of 3.
 A cost-based optimizer uses sophisticated algorithms based on the statistics about the objects being
accessed to determine the best approach to execute a query. In this case, the optimizer process adds up
the processing cost, the I/O costs, and the resource costs (RAM and temporary space) to come up with
the total cost of a given execution plan.

The optimizer’s objective is to find alternative ways to execute a query—to evaluate the “cost” of each
alternative and then to choose the one with the lowest cost. To understand the function of the query optimizer,
let’s use a simple example. Assume that you want to list all products provided by a vendor based in Florida. To
acquire that information, you could write the following query:

SELECT P_CODE, P_DESCRIPT, P_PRICE, V_NAME, V_STATE
FROM   PRODUCT, VENDOR
WHERE  PRODUCT.V_CODE = VENDOR.V_CODE AND VENDOR.V_STATE = 'FL';

Furthermore, let’s assume that the database statistics indicate that:

 The PRODUCT table has 7,000 rows.


 The VENDOR table has 300 rows.
 Ten vendors are located in Florida.
 One thousand products come from vendors in Florida.

It’s important to point out that only the first two items are available to the optimizer. The second two items are
assumed to illustrate the choices that the optimizer must make. Armed with the information in the first two
items, the optimizer would try to find the most efficient way to access the data. The primary factor in
determining the most efficient access plan is the I/O cost. (Remember, the DBMS always tries to minimize I/O
operations.) Table shows two sample access plans for the previous query and their respective I/O costs.

To make the example easier to understand, the I/O Operations and I/O Cost columns in Table estimate only the
number of I/O disk reads the DBMS must perform. For simplicity’s sake, it is assumed that there are no indexes
and that each row read has an I/O cost of 1. For example, in step A1, the DBMS must calculate the Cartesian
product of PRODUCT and VENDOR. To do that, the DBMS must read all rows from PRODUCT (7,000) and
all rows from VENDOR (300), yielding a total of 7,300 I/O operations. The same computation is done in all
steps. In the table, you can see how plan A has a total I/O cost that is almost 30 times higher than plan B. In
this case, the optimizer will choose plan B to execute the SQL.

Optimizer hints are special instructions for the optimizer that are embedded inside the SQL command text.
Table summarizes a few of the most common optimizer hints used in standard SQL.
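
The table of hints is not reproduced here. As one illustration only, an Oracle-style hint embedded in the SQL text
(the index name PX_PRICE is hypothetical) looks like this:

SELECT /*+ INDEX(PRODUCT PX_PRICE) */ P_CODE, P_PRICE
FROM   PRODUCT
WHERE  P_PRICE > 100;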

Chapter 10: Transaction Management
1. Transactions
A transaction is a program including a collection of database operations, executed as a logical unit of data
processing. The operations performed in a transaction include one or more of database operations like insert,
delete, update or retrieve data. It is an atomic process that is either performed into completion entirely or is not
performed at all. A transaction involving only data retrieval without any data update is called read-only
transaction. Each high level operation can be divided into a number of low level tasks or operations. For
example, a data update operation can be divided into three tasks –

 read_item() − reads data item from storage to main memory.


 modify_item() − change value of item in the main memory.
 write_item() − write the modified value from main memory to storage.

Database access is restricted to read_item() and write_item() operations. Likewise, for all transactions, read and
write forms the basic database operations.

2. Transaction Operations
The low level operations performed in a transaction are –

 begin_transaction − A marker that specifies start of transaction execution.


 read_item or write_item − Database operations that may be interleaved with main memory operations as
a part of transaction.
 end_transaction − A marker that specifies end of transaction.
 commit − A signal to specify that the transaction has been successfully completed in its entirety and will
not be undone.
 rollback − A signal to specify that the transaction has been unsuccessful and so all temporary changes in
the database are undone. A committed transaction cannot be rolled back.

3. Transaction States
A transaction may go through a subset of five states, active, partially committed, committed, failed and aborted.

 Active − The initial state where the transaction enters is the active state. The transaction remains in this
state while it is executing read, write or other operations.
 Partially Committed − The transaction enters this state after the last statement of the transaction has
been executed.
 Committed − The transaction enters this state after successful completion of the transaction and system
checks have issued commit signal.
 Failed − The transaction goes from partially committed state or active state to failed state when it is
discovered that normal execution can no longer proceed or system checks fail.
 Aborted − This is the state after the transaction has been rolled back after failure and the database has
been restored to its state that was before the transaction began.

4. Properties of Transaction
Each individual transaction must display atomicity, consistency, isolation, and durability. These properties are
sometimes referred to as the ACID test. In addition, when executing multiple transactions, the DBMS must

schedule the concurrent execution of the transaction’s operations. The schedule of such transaction’s operations
must exhibit the property of serializability. Let’s look briefly at each of the properties.

 Atomicity requires that all operations (SQL requests) of a transaction be completed; if not, the
transaction is aborted. If a transaction T1 has four SQL requests, all four requests must be successfully
completed; otherwise, the entire transaction is aborted. In other words, a transaction is treated as a
single, indivisible, logical unit of work.
 Consistency indicates the permanence of the database’s consistent state. A transaction takes a database
from one consistent state to another consistent state. When a transaction is completed, the database must
be in a consistent state; if any of the transaction parts violates an integrity constraint, the entire
transaction is aborted.
 Isolation means that the data used during the execution of a transaction cannot be used by a second
transaction until the first one is completed. In other words, if a transaction T1 is being executed and is
using the data item X, that data item cannot be accessed by any other transaction (T2 ... Tn) until T1
ends. This property is particularly useful in multiuser database environments because several users can
access and update the database at the same time.
 Durability ensures that once transaction changes are done (committed), they cannot be undone or lost,
even in the event of a system failure.
 Serializability ensures that the schedule for the concurrent execution of the transactions yields consistent
results. This property is important in multiuser and distributed databases, where multiple transactions
are likely to be executed concurrently. Naturally, if only a single transaction is executed, serializability
is not an issue.

A single-user database system automatically ensures serializability and isolation of the database because only
one transaction is executed at a time. The atomicity, consistency, and durability of transactions must be
guaranteed by the single-user DBMSs. Multiuser databases are typically subject to multiple concurrent
transactions. Therefore, the multiuser DBMS must implement controls to ensure serializability and isolation of
transactions—in addition to atomicity and durability—to guard the database’s consistency and integrity.

5. Schedules and Conflicts


In a system with a number of simultaneous transactions, a schedule is the total order of execution of operations.
Given a schedule S comprising n transactions, say T1, T2, T3, ..., Tn, for any transaction Ti the operations in Ti
must execute as laid down in the schedule S.

Types of Schedules
There are two types of schedules –

 Serial Schedules − In a serial schedule, at any point of time, only one transaction is active, i.e. there is
no overlapping of transactions. This is depicted in the following graph –
 Parallel Schedules − In parallel schedules, more than one transaction is active simultaneously, i.e. the
transactions contain operations that overlap in time. This is depicted in the following graph –

Conflicts in Schedules
In a schedule comprising multiple transactions, a conflict occurs when two active transactions perform non-
compatible operations. Two operations are said to be in conflict when all of the following three conditions
exist simultaneously –

 The two operations are parts of different transactions.


 Both the operations access the same data item.
 At least one of the operations is a write_item() operation, i.e. it tries to modify the data item.
A serializable schedule of ‘n’ transactions is a parallel schedule which is equivalent to a serial schedule
comprising the same ‘n’ transactions. A serializable schedule retains the correctness of a serial schedule while
achieving the better CPU utilization of a parallel schedule.

Equivalence of Schedules
Equivalence of two schedules can be of the following types –

 Result equivalence − Two schedules producing identical results are said to be result equivalent.
 View equivalence − Two schedules that perform similar action in a similar manner are said to be view
equivalent.
 Conflict equivalence − Two schedules are said to be conflict equivalent if both contain the same set of
transactions and have the same order of conflicting pairs of operations.

6. Concurrency controlling techniques


These techniques ensure that multiple transactions are executed simultaneously while maintaining the ACID
properties of the transactions and serializability in the schedules.

a. Locking Based Concurrency Control Protocols


Locking-based concurrency control protocols use the concept of locking data items. A lock is a variable
associated with a data item that determines whether read/write operations can be performed on that data item.
Generally, a lock compatibility matrix is used which states whether a data item can be locked by two
transactions at the same time. At the most fundamental level, locks can be classified into (in increasingly
restrictive order) shared, update, and exclusive locks. A shared lock signifies that another transaction can take an
update or another shared lock on the same piece of data. Shared locks are used when data is read (usually in
pessimistic locking mode). An update lock ensures that another transaction can take only a shared lock on the
same data. Update locks are held by transactions that intend to change data (not just read it). If a transaction
locks a piece of data with an exclusive lock, no other transaction may take a lock on the data. For example, a
transaction with an isolation level of read uncommitted does not result in any locks on the data read by the
transaction, and a transaction with repeatable read isolation can take only a shared lock on data it has read.

Locking-based concurrency control systems can use either one-phase or two-phase locking protocols.

One-phase Locking Protocol


In this method, each transaction locks an item before use and releases the lock as soon as it has finished using it.
This locking method provides for maximum concurrency but does not always enforce serializability.

Two-phase Locking Protocol


In this method, all locking operations precede the first lock-release or unlock operation. The transaction
comprises two phases. In the first phase, a transaction only acquires the locks it needs and does not release
any lock. This is called the expanding or growing phase. In the second phase, the transaction releases its
locks and cannot request any new locks. This is called the shrinking phase. Every transaction that follows the
two-phase locking protocol is guaranteed to be serializable. However, this approach provides low parallelism
between two conflicting transactions.

For applications, the implications of 2PL are that long-running transactions will hold locks for a long time.
When designing applications, lock contention should be considered. In order to reduce the probability of
deadlock and achieve the best level of concurrency possible, the following guidelines are helpful.

 When accessing multiple databases, design all transactions so that they access the files in the same
order.
 If possible, access your most hotly contested resources last (so that their locks are held for the shortest
time possible).
 If possible, use nested transactions to protect the parts of your transaction most likely to deadlock.
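
As a small sketch of the first guideline (the ACCOUNT and ORDERS tables are hypothetical; SELECT ... FOR
UPDATE acquires row locks in most SQL dialects), every transaction that touches both tables should lock them
in the same order:

-- every transaction locks ACCOUNT rows first, then ORDERS rows
SELECT * FROM ACCOUNT WHERE ACC_NUM = 1001 FOR UPDATE;
SELECT * FROM ORDERS  WHERE ACC_NUM = 1001 FOR UPDATE;
-- ... perform the updates ...
COMMIT;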

b. Time stamping Concurrency Control


The time stamping approach to scheduling concurrent transactions assigns a global, unique time stamp to each
transaction. The time stamp value produces an explicit order in which transactions are submitted to the DBMS.
Time stamps must have two properties: uniqueness and monotonicity. Uniqueness ensures that no equal time
stamp values can exist, and monotonicity ensures that time stamp values always increase.

All database operations (Read and Write) within the same transaction must have the same time stamp. The
DBMS executes conflicting operations in time stamp order, thereby ensuring serializability of the transactions.
If two transactions conflict, one is stopped, rolled back, rescheduled, and assigned a new time stamp value. The
disadvantage of the time stamping approach is that each value stored in the database requires two additional
time stamp fields: one for the last time the field was read and one for the last update. Time stamping thus
increases memory needs and the database’s processing overhead. Time stamping demands a lot of system
resources because many transactions might have to be stopped, rescheduled, and restamped.

Some of timestamp based concurrency control algorithms are –

 Basic timestamp ordering algorithm.


 Conservative timestamp ordering algorithm.
 Multiversion algorithm based upon timestamp ordering.

Timestamp based ordering follow three rules to enforce serializability –

 Access Rule − When two transactions try to access the same data item simultaneously, for conflicting
operations, priority is given to the older transaction. This causes the younger transaction to wait for the
older transaction to commit first.
 Late Transaction Rule − If a younger transaction has written a data item, then an older transaction is not
allowed to read or write that data item. This rule prevents the older transaction from committing after
the younger transaction has already committed.
 Younger Transaction Rule − A younger transaction can read or write a data item that has already been
written by an older transaction.

7. Transaction Management with SQL


The American National Standards Institute (ANSI) has defined standards that govern SQL database transactions.
Transaction support is provided by two SQL statements: COMMIT and ROLLBACK. The ANSI standards
require that when a transaction sequence is initiated by a user or an application program, the sequence must
continue through all succeeding SQL statements until one of the following four events occurs:

 A COMMIT statement is reached, in which case all changes are permanently recorded within the
database. The COMMIT statement automatically ends the SQL transaction.
 A ROLLBACK statement is reached, in which case all changes are aborted and the database is rolled
back to its previous consistent state.

 The end of a program is successfully reached, in which case all changes are permanently recorded
within the database. This action is equivalent to COMMIT.
 The program is abnormally terminated, in which case the changes made in the database are aborted and
the database is rolled back to its previous consistent state. This action is equivalent to ROLLBACK.
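
As a minimal, hedged illustration of these rules at the program level, the sketch below uses Python's sqlite3 module (the table and amounts are invented for the example): the program commits when the whole sequence succeeds and rolls back to the previous consistent state when any statement fails.

# Minimal sketch of program-controlled COMMIT and ROLLBACK (table and values are illustrative).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100), (2, 50)")
conn.commit()

try:
    # Transfer 30 from account 1 to account 2 as a single transaction.
    conn.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
    conn.execute("UPDATE account SET balance = balance + 30 WHERE id = 2")
    conn.commit()      # all changes become permanent
except sqlite3.Error:
    conn.rollback()    # undo every change made since the last COMMIT

print(conn.execute("SELECT * FROM account").fetchall())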

The coordination of the simultaneous execution of transactions in a multiuser database system is known as
concurrency control. The objective of concurrency control is to ensure the serializability of transactions in a
multiuser database environment. Concurrency control is important because the simultaneous execution of
transactions over a shared database can create several data integrity and consistency problems. The three main
problems are lost updates, uncommitted data, and inconsistent retrievals.

4. Deadlocks
The three basic techniques to control deadlocks are:

a. Deadlock prevention
A transaction requesting a new lock is aborted when there is the possibility that a deadlock can occur. If the
transaction is aborted, all changes made by this transaction are rolled back and all locks obtained by the
transaction are released. The transaction is then rescheduled for execution. Deadlock prevention works because
it avoids the conditions that lead to deadlocking.

One of the most popular deadlock prevention methods is pre-acquisition of all the locks. In this method, a
transaction acquires all the locks before starting to execute and retains the locks for the entire duration of
transaction. If another transaction needs any of the already acquired locks, it has to wait until all the locks it
needs are available. Using this approach, the system is prevented from being deadlocked since none of the
waiting transactions are holding any lock.
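
A minimal sketch of the pre-acquisition idea follows; the lock table and function names are assumptions for illustration. The transaction either receives every lock it asked for or none at all, so a waiting transaction never holds a lock and a deadlock cycle cannot form.

# Minimal sketch of deadlock prevention by pre-acquiring all locks (illustrative).
import threading

lock_table = {}                 # data item -> id of the transaction holding its lock
table_guard = threading.Lock()  # protects the lock table itself

def lock_all(txn_id, items):
    # Grant either every requested lock or none, so a waiting transaction holds nothing.
    with table_guard:
        if any(item in lock_table for item in items):
            return False        # at least one item is held: the caller must wait and retry
        for item in items:
            lock_table[item] = txn_id
        return True

def unlock_all(txn_id):
    with table_guard:
        for item in [i for i, owner in lock_table.items() if owner == txn_id]:
            del lock_table[item]

print(lock_all("T1", ["x", "y"]))   # True: T1 acquires both locks before executing
print(lock_all("T2", ["y", "z"]))   # False: T2 waits without holding any lock
unlock_all("T1")
print(lock_all("T2", ["y", "z"]))   # True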

b. Deadlock detection
The DBMS periodically tests the database for deadlocks. If a deadlock is found, one of the transactions (the
“victim”) is aborted (rolled back and restarted) and the other transaction continues.

The deadlock detection and removal approach runs a deadlock detection algorithm periodically and removes
deadlock in case there is one. It does not check for deadlock when a transaction places a request for a lock.
When a transaction requests a lock, the lock manager checks whether it is available. If it is available, the
transaction is allowed to lock the data item; otherwise the transaction is allowed to wait. Since there are no
precautions while granting lock requests, some of the transactions may be deadlocked. To detect deadlocks, the
lock manager periodically checks if the wait-for graph has cycles. If the system is deadlocked, the lock manager
chooses a victim transaction from each cycle. The victim is aborted and rolled back; and then restarted later.
Some of the methods used for victim selection are –

 Choose the youngest transaction.


 Choose the transaction with fewest data items.
 Choose the transaction that has performed least number of updates.
 Choose the transaction having least restart overhead.
 Choose the transaction which is common to two or more cycles.

This approach is primarily suited for systems with a low rate of transactions, where a fast response to lock requests is
needed. A minimal sketch of the cycle check follows.
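
The sketch below illustrates the periodic check described above: the wait-for graph is represented as a dictionary in which an edge T1 → T3 means that T1 waits for an item locked by T3, and a depth-first search reports one cycle if any exists. The representation and names are assumptions for illustration.

# Minimal sketch of deadlock detection on a wait-for graph (illustrative).
def find_cycle(wait_for):
    # Return one cycle (a list of transactions) if the graph has one, else None.
    visited, on_path = set(), []

    def dfs(node):
        if node in on_path:
            return on_path[on_path.index(node):]   # cycle found
        if node in visited:
            return None
        visited.add(node)
        on_path.append(node)
        for nxt in wait_for.get(node, []):
            cycle = dfs(nxt)
            if cycle:
                return cycle
        on_path.pop()
        return None

    for start in wait_for:
        cycle = dfs(start)
        if cycle:
            return cycle
    return None

graph = {"T1": ["T3"], "T3": ["T2"], "T2": ["T1"]}
print(find_cycle(graph))   # ['T1', 'T3', 'T2']: abort one victim from this cycle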

c. Deadlock avoidance
The transaction must obtain all of the locks it needs before it can be executed. This technique avoids the rolling
back of conflicting transactions by requiring that locks be obtained in succession. However, the serial lock
assignment required in deadlock avoidance increases action response times.

The method can be briefly stated as follows. Transactions start executing and request data items that they need
to lock. The lock manager checks whether the lock is available. If it is available, the lock manager allocates the
data item and the transaction acquires the lock. However, if the item is locked by some other transaction in
incompatible mode, the lock manager runs an algorithm to test whether keeping the transaction in waiting state
will cause a deadlock or not. Accordingly, the algorithm decides whether the transaction can wait or one of the
transactions should be aborted. There are two algorithms for this purpose, namely wait-die and wound-wait (a small sketch of both appears after the list). Let
us assume that there are two transactions, T1 and T2, where T1 tries to lock a data item which is already locked
by T2. The algorithms are as follows –

 Wait-Die − If T1 is older than T2, T1 is allowed to wait. Otherwise, if T1 is younger than T2, T1 is
aborted and later restarted.
 Wound-Wait − If T1 is older than T2, T2 is aborted and later restarted. Otherwise, if T1 is younger than
T2, T1 is allowed to wait.
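
A minimal sketch of the two decision rules follows, using transaction timestamps where a smaller timestamp means an older transaction; the function and parameter names are illustrative.

# Minimal sketch of the wait-die and wound-wait decisions (smaller timestamp = older).
def wait_die(requester_ts, holder_ts):
    # The requester may wait only if it is older than the lock holder; otherwise it dies.
    return "wait" if requester_ts < holder_ts else "abort requester, restart later"

def wound_wait(requester_ts, holder_ts):
    # An older requester wounds (aborts) the holder; a younger requester waits.
    return "abort holder, restart later" if requester_ts < holder_ts else "wait"

# T1 (timestamp 10, older) requests an item held by T2 (timestamp 20, younger).
print(wait_die(10, 20))     # wait
print(wound_wait(10, 20))   # abort holder, restart later
# T1 (timestamp 30) is younger than the holder T2 (timestamp 20).
print(wait_die(30, 20))     # abort requester, restart later
print(wound_wait(30, 20))   # wait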

The choice of the best deadlock control method to use depends on the database environment. For example, if the
probability of deadlocks is low, deadlock detection is recommended. However, if the probability of deadlocks is
high, deadlock prevention is recommended. If response time is not high on the system’s priority list, deadlock
avoidance might be employed. All current DBMSs support deadlock detection in transactional databases, while
some DBMSs use a blend of prevention and avoidance techniques for other types of data, such as data
warehouses or XML data.

Chapter 11: Distributed DBMS
A distributed database is a set of interconnected databases that is distributed over the computer network or
internet. A Distributed Database Management System (DDBMS) manages the distributed database and provides
mechanisms so as to make the databases transparent to the users. In these systems, data is intentionally
distributed among multiple nodes so that all computing resources of the organization can be optimally used. A
distributed database is a collection of multiple interconnected databases, which are spread physically across
various locations that communicate via a computer network.

1. Features of Distributed database


 Databases in the collection are logically interrelated with each other. Often they represent a single
logical database.
 Data is physically stored across multiple sites. Data in each site can be managed by a DBMS
independent of the other sites.
 The processors in the sites are connected via a network. They do not have any multiprocessor
configuration.
 A distributed database is not a loosely connected file system.
 A distributed database incorporates transaction processing, but it is not synonymous with a transaction
processing system.

A distributed database management system (DDBMS) is a centralized software system that manages a
distributed database in a manner as if it were all stored in a single location.

2. Features of Distributed DBMS


 It is used to create, retrieve, update and delete distributed databases.
 It synchronizes the database periodically and provides access mechanisms by virtue of which the
distribution becomes transparent to the users.
 It ensures that the data modified at any site is universally updated.
 It is used in application areas where large volumes of data are processed and accessed by numerous
users simultaneously.
 It is designed for heterogeneous database platforms.
 It maintains confidentiality and data integrity of the databases.

3. Factors Encouraging DDBMS


 Distributed Nature of Organizational Units − Most organizations in the current times are subdivided into
multiple units that are physically distributed over the globe. Each unit requires its own set of local data.
Thus, the overall database of the organization becomes distributed.
 Need for Sharing of Data − The multiple organizational units often need to communicate with each
other and share their data and resources. This demands common databases or replicated databases that
should be used in a synchronized manner.
 Support for Both OLTP and OLAP − Online Transaction Processing (OLTP) and Online Analytical
Processing (OLAP) work upon diversified systems which may have common data. Distributed database
systems aid both types of processing by providing synchronized data.
 Database Recovery − One of the common techniques used in DDBMS is replication of data across
different sites. Replication of data automatically helps in data recovery if database in any site is
damaged. Users can access data from other sites while the damaged site is being reconstructed. Thus,
database failure may become almost inconspicuous to users.
 Support for Multiple Application Software − Most organizations use a variety of application software
each with its specific database support. DDBMS provides a uniform functionality for using the same
data among different platforms.

4. Advantages of Distributed Databases


 Modular Development − If the system needs to be expanded to new locations or new units, in
centralized database systems, the action requires substantial efforts and disruption in the existing
functioning. However, in distributed databases, the work simply requires adding new computers and
local data to the new site and finally connecting them to the distributed system, with no interruption in
current functions.
 More Reliable − In case of database failures, the total system of centralized databases comes to a halt.
However, in distributed systems, when a component fails, the functioning of the system continues,
possibly at reduced performance. Hence a DDBMS is more reliable.
 Better Response − If data is distributed in an efficient manner, then user requests can be met from local
data itself, thus providing faster response. On the other hand, in centralized systems, all queries have to
pass through the central computer for processing, which increases the response time.
 Lower Communication Cost − In distributed database systems, if data is located locally where it is
mostly used, then the communication costs for data manipulation can be minimized. This is not feasible
in centralized systems.

5. Disadvantages of Distributed Databases


 Need for complex and expensive software − DDBMS demands complex and often expensive software
to provide data transparency and co-ordination across the several sites.
 Processing overhead − Even simple operations may require a large number of communications and
additional calculations to provide uniformity in data across the sites.
 Data integrity − The need for updating data in multiple sites poses problems of data integrity.
 Overheads for improper data distribution − Responsiveness of queries is largely dependent upon proper
data distribution. Improper data distribution often leads to very slow response to user requests.

6. Distributed Database Vs Centralized Database


 In a centralized DBMS the database is stored at only one site, while in a distributed DBMS the database
is stored at different sites and is accessed with the help of a network.
 In a centralized DBMS the data is stored at a single computer site, which can be used by multiple users,
while in a distributed DBMS the database and DBMS software are distributed over many sites connected
by a computer network.
 The database is maintained at one site in a centralized DBMS, while it is maintained at a number of
different sites in a distributed DBMS.
 If the centralized system fails, the entire system is halted, while if one site of a distributed system fails,
the system continues to work with the other sites.
 A centralized DBMS is less reliable; a distributed DBMS is more reliable.

7. Types of Distributed Databases
Distributed databases can be broadly classified into homogeneous and heterogeneous distributed database
environments

a. Homogeneous Distributed Databases


In a homogeneous distributed database, all the sites use identical DBMS and operating systems. Its properties
are –

 The sites use very similar software.


 The sites use identical DBMS or DBMS from the same vendor.
 Each site is aware of all other sites and cooperates with other sites to process user requests.
 The database is accessed through a single interface as if it is a single database.

There are two types of homogeneous distributed databases –

Autonomous:
Each database is independent and functions on its own. The databases are integrated by a controlling application and use
message passing to share data updates.

Non-autonomous
Data is distributed across the homogeneous nodes and a central or master DBMS co-ordinates data updates
across the sites.

b. Heterogeneous Distributed Databases


In a heterogeneous distributed database, different sites have different operating systems, DBMS products and
data models. Its properties are –

 Different sites use dissimilar schemas and software.


 The system may be composed of a variety of DBMSs like relational, network, hierarchical or object
oriented.
 Query processing is complex due to dissimilar schemas. Transaction processing is complex due to
dissimilar software.
 A site may not be aware of other sites and so there is limited co-operation in processing user requests.

Types of Heterogeneous Distributed Databases

Federated
The heterogeneous database systems are independent in nature and integrated together so that they function as a
single database system.

Un-federated
The database systems employ a central coordinating module through which the databases are accessed.

8. Distributed DBMS Architectures


DDBMS architectures are generally developed depending on three parameters –

 Distribution − It states the physical distribution of data across the different sites.
 Autonomy − It indicates the distribution of control of the database system and the degree to which each
constituent DBMS can operate independently.
 Heterogeneity − It refers to the uniformity or dissimilarity of the data models, system components and
databases.

a. Client - Server Architecture for DDBMS


This is a two-level architecture where the functionality is divided into servers and clients. The server functions
primarily encompass data management, query processing, optimization and transaction management. Client
functions mainly include the user interface, although clients may also carry out some functions such as
consistency checking and transaction management. The idea is to distinguish the functionality and divide these
functions into two classes, server functions and client functions.

The server does most of the data management work: query processing, data management, optimization,
transaction management, and so on. The client handles the application and the user interface, and runs the DBMS client module.

The two different client-server architectures are –

Single Server, Multiple Clients
A single server is accessed by multiple clients. A client-server architecture has a number of clients and a few
servers connected in a network. A client sends a query to one of the servers; the earliest available server solves
it and replies. A client-server architecture is simple to implement and execute because of the centralized server system.

b. Peer- to-Peer Architecture for DDBMS


In these systems, each peer acts both as a client and a server for imparting database services. The peers share
their resource with other peers and co-ordinate their activities. This architecture generally has four levels of
schemas –

 Individual internal schema definition at each site, local internal schema


 The enterprise view of data is described by the global conceptual schema.
 Local organization of data at each site is described in the local conceptual schema.
 User applications and user access to the database are supported by external schemas.

Local conceptual schemas are mappings of the global schema onto each site. Databases are typically designed in
a top-down fashion, and, therefore all external view definitions are made globally.

Major Components of a Peer-to-Peer System
User Processor

 User-interface handler is responsible for interpreting user commands and formatting the result data.
 Semantic data controller checks whether the user query can be processed.
 Global Query optimizer and decomposer determine an execution strategy and translate global queries
into local ones.
 Distributed execution coordinates the distributed execution of the user request.

Data processor

 Local query optimizer acts as the access path selector and is responsible for choosing the best access path.
 Local Recovery Manager makes sure the local database remains consistent.
 Run-time support processor is the interface to the operating system and contains the database buffer,
which is responsible for maintaining the main memory buffers and managing data access.

c. Multi - DBMS Architectures


This is an integrated database system formed by a collection of two or more autonomous database systems.
Multi-DBMS can be expressed through six levels of schemas –

 Multi-database View Level − Depicts multiple user views comprising subsets of the integrated
distributed database.
 Multi-database Conceptual Level − Depicts the integrated multi-database that comprises global logical
multi-database structure definitions.
 Multi-database Internal Level − Depicts the data distribution across different sites and multi-database to
local data mapping.
 Local database View Level − Depicts public view of local data.
 Local database Conceptual Level − Depicts local data organization at each site.
 Local database Internal Level − Depicts physical data organization at each site.

There are two design alternatives for multi-DBMS − Model with multi-database conceptual level and Model
without multi-database conceptual level.

9. Query Optimization
The process of choosing the most appropriate execution strategy for query processing is called query
optimization.

a. Query Optimization Issues in DDBMS


In a DDBMS, query optimization is a crucial task. The complexity is high since the number of alternative strategies
may increase exponentially due to the following factors –

 The presence of a number of fragments.


 Distribution of the fragments or tables across various sites.
 The speed of communication links.
 Disparity in local processing capabilities.

Hence, in a distributed system, the target is often to find a good execution strategy for query processing rather
than the best one. The time to execute a query is the sum of the following components (a trivial sketch of this sum follows the list) –

 Time to communicate queries to databases.


 Time to execute local query fragments.
 Time to assemble data from different sites.
 Time to display results to the application.
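
The sum can be written out as a trivial sketch; the function name and the example figures are invented purely to illustrate that the cost model is additive.

# Illustrative additive time estimate for a distributed query (names and numbers are assumptions).
def query_time(communication, local_execution, assembly, display):
    # Total elapsed time = communication + local execution + data assembly + display.
    return communication + local_execution + assembly + display

# Example: 0.2 s to ship query fragments, 1.5 s of local work,
# 0.8 s to assemble data from the sites, 0.1 s to display the result.
print(query_time(0.2, 1.5, 0.8, 0.1))   # 2.6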

b. Query Processing
Query processing is the set of all activities starting from query placement to displaying the results of the query.
Semi-join strategies are a technique for query processing in distributed database systems, used for reducing
communication cost. A semi-join between two tables returns rows from the first table where one or more
matches are found in the second table. The difference between a semi-join and a conventional join is that rows
in the first table will be returned at most once: even if the second table contains two matches for a row in the
first table, only one copy of the row will be returned. Semi-joins are written using EXISTS or IN.

A Simple Semi-Join Example: “Give a list of departments with at least one employee.” Query written with a
conventional join:

SELECT D.deptno, D.dname FROM dept D, emp E

WHERE E.deptno = D.deptno ORDER BY D.deptno;

 A department with N employees will appear in the list N times.


 We could use a DISTINCT keyword to get each department to appear only once.

A Simple Semi-Join Example “Give a list of departments with at least one employee.” Query written with a
semi-join:

SELECT D.deptno, D.dname FROM dept D

WHERE EXISTS (SELECT 1 FROM emp E WHERE E.deptno = D.deptno)

ORDER BY D.deptno;

 No department appears more than once.


 Oracle stops processing each department as soon as the first employee in that department is found.
10. Distributed Transactions
A distributed transaction is a database transaction in which two or more network hosts are involved. Usually,
hosts provide transactional resources, while the transaction manager is responsible for creating and managing a
global transaction that encompasses all operations against such resources. Each site has a local transaction
manager (LTM) which is capable of implementing local transactions. The relationship between distributed
transaction management and local transaction management is represented in the reference model. At the bottom
level we have the local transaction managers, which do not need communication between them. The LTMs
implement the interface local_begin, local_commit, and local_abort. At the next higher level we have the
distributed transaction manager (DTM). The DTM is by its nature a distributed level; it is implemented by a set
of local DTM agents which exchange messages between them. The DTM implements the interface
begin_transaction, commit, abort, and create. At the highest level we have the distributed transaction itself,
constituted by the root agent and the other agents.

a. Commit Protocols

The 2-phase commit protocol


Distributed two-phase commit reduces the vulnerability of one-phase commit protocols. The steps performed in
the two phases are as follows (a sketch of the coordinator's logic appears after the lists) –

Phase 1: Prepare Phase

 After each slave has locally completed its transaction, it sends a “DONE” message to the controlling
site. When the controlling site has received “DONE” message from all slaves, it sends a “Prepare”
message to the slaves.
 The slaves vote on whether they still want to commit or not. If a slave wants to commit, it sends a
“Ready” message.
 A slave that does not want to commit sends a “Not Ready” message. This may happen when the slave
has conflicting concurrent transactions or there is a timeout.

Phase 2: Commit/Abort Phase

 After the controlling site has received “Ready” message from all the slaves –
o The controlling site sends a “Global Commit” message to the slaves.
o The slaves apply the transaction and send a “Commit ACK” message to the controlling site.

o When the controlling site receives “Commit ACK” message from all the slaves, it considers the
transaction as committed.
 After the controlling site has received the first “Not Ready” message from any slave –
o The controlling site sends a “Global Abort” message to the slaves.
o The slaves abort the transaction and send a “Abort ACK” message to the controlling site.
o When the controlling site receives “Abort ACK” message from all the slaves, it considers the
transaction as aborted.
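
The controlling site's decision logic for the two phases can be sketched as below. The slave objects, method names and vote strings are assumptions for illustration; a real protocol would also exchange the DONE, Prepare and ACK messages over the network and log each step.

# Minimal sketch of the controlling site's side of two-phase commit (illustrative).
def two_phase_commit(slaves):
    # Phase 1: after all slaves report DONE, ask every slave to vote.
    votes = [s.prepare() for s in slaves]      # each vote is "Ready" or "Not Ready"

    # Phase 2: global commit only if every vote is Ready; otherwise global abort.
    if all(v == "Ready" for v in votes):
        for s in slaves:
            s.commit()      # a real coordinator would wait for Commit ACK messages
        return "committed"
    for s in slaves:
        s.abort()           # a real coordinator would wait for Abort ACK messages
    return "aborted"

class FakeSlave:
    def __init__(self, vote):
        self.vote = vote
    def prepare(self):
        return self.vote
    def commit(self):
        pass
    def abort(self):
        pass

print(two_phase_commit([FakeSlave("Ready"), FakeSlave("Ready")]))      # committed
print(two_phase_commit([FakeSlave("Ready"), FakeSlave("Not Ready")]))  # aborted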

Reliability: Reliability is defined as a measure of the success with which the system conforms to some
authoritative specification of its behavior. When the behavior deviates from what is specified for it, this is
called a failure. The reliability of the system is inversely related to the frequency of failures.

Termination protocol for 2-phase-commitment: The termination protocol for the 2-phase-commitment
protocol must allow the transactions to be terminated at all operational sites when a failure of the coordinator
site occurs. This is possible in the following two cases:

1. At least one of the participants has received the command. In this case, the other participants can be told
by this participant of the outcome of the transactions and can terminate it.
2. None of the participants has received the command, and only the coordinator site has crashed, so that all
participants are operational. In this case, the participants can elect a new coordinator and resume the
protocol.

In the above cases, the transactions can be correctly terminated at all operational sites. Termination is impossible
when no operational participant has received the command and at least one participant has failed, because the
operational participants cannot know what the failed participant has done and cannot take an independent decision.
So, if the coordinator fails in this situation, termination is impossible. This problem can be eliminated by modifying
the 2-phase-commitment protocol into the 3-phase-commitment protocol.

The 3-phase-commitment protocol


In this protocol, the participants do not directly commit the transactions during the second phase of
commitment, instead they reach in this phase a new prepared-to-commit (PC) state. So an additional third phase
is required for actually committing the transactions. This protocol eliminates the blocking problem of the 2-
phase-commitment protocol because,

1. If one of the operational participants has received the command and the command was ABORT, then
the operational participants can abort the transaction. The failed participant will abort the transaction at
restart if it has not done so already.
2. If one of the operational participants has received the command and the command was ENTER-
PREPARED-STATE, then all the operational participants can commit the transaction, terminating the
second phase if necessary and performing the third phase.
3. If none of the operational participants has received the ENTER-PREPARED-STATE command, the 2-
phase-commitment protocol cannot be terminated. But with this new protocol, the operational
participants can abort the transaction, because the failed participants have not committed. The failed
participants therefore abort the transaction at restart.

This new protocol requires three phases for committing a transaction and two phases for aborting it.

Termination protocol for 3-phase-commitment: “If at least one operational participant has not entered the
prepared-to-commit state, then the transactions can be aborted. If at least one operational participant has entered
the prepared-to-commit state, then the transactions can be safely committed.” Since the above two conditions are
not mutually exclusive, in several cases the termination protocol can decide whether to commit or abort the
transactions. The protocol which always commits the transactions when both cases are possible is called
progressive. The simplest termination protocol is the centralized, non-progressive protocol. First a new
coordinator is elected by the operational participants. Then the new coordinator behaves as follows (a small sketch follows the list):

1. If the new coordinator is in the prepared-to-commit state, it issues to all operational participants the
command to enter also in this state. When it has received all the OK messages, it issues the COMMIT
command.
2. If the new coordinator is in commit state, i.e. it has committed the transactions; it issues the COMMIT
command to all participants.
3. If the new coordinator is in the abort state, it issues the ABORT command to all participants.
4. Otherwise, the new coordinator orders all participants to go back to a state previous to the prepared-to-
commit and after it has received all the acknowledgements, it issues the ABORT command.
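
A minimal sketch of that decision rule follows; the state names come from the description above, while everything else is an assumption made only for illustration.

# Minimal sketch of the centralized, non-progressive termination rule for 3PC (illustrative).
def terminate(new_coordinator_state):
    # Decide the fate of the transaction from the newly elected coordinator's state.
    if new_coordinator_state == "prepared-to-commit":
        return "order prepared-to-commit everywhere, then issue COMMIT"
    if new_coordinator_state == "commit":
        return "issue COMMIT to all participants"
    if new_coordinator_state == "abort":
        return "issue ABORT to all participants"
    # Otherwise roll everyone back to a state prior to prepared-to-commit, then abort.
    return "roll participants back, then issue ABORT"

for state in ("prepared-to-commit", "commit", "abort", "initial"):
    print(state, "->", terminate(state))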

b. Concurrency Control in Distributed Systems

Distributed Two-phase Locking Algorithm


The basic principle of distributed two-phase locking is the same as that of the basic two-phase locking protocol. However,
in a distributed system there are sites designated as lock managers. A lock manager controls lock acquisition
requests from transaction monitors. In order to enforce coordination between the lock managers in various sites,
at least one site is given the authority to see all transactions and detect lock conflicts. Depending upon the
number of sites who can detect lock conflicts, distributed two-phase locking approaches can be of three types –

 Centralized two-phase locking − In this approach, one site is designated as the central lock manager. All
the sites in the environment know the location of the central lock manager and obtain lock from it
during transactions.
 Primary copy two-phase locking − In this approach, a number of sites are designated as lock control
centers. Each of these sites has the responsibility of managing a defined set of locks. All the sites know
which lock control center is responsible for managing lock of which data table/fragment item.
 Distributed two-phase locking − In this approach, there are a number of lock managers, where each lock
manager controls locks of data items stored at its local site. The location of the lock manager is based
upon data distribution and replication.

Distributed Timestamp Concurrency Control


In a centralized system, timestamp of any transaction is determined by the physical clock reading. But, in a
distributed system, any site’s local physical/logical clock readings cannot be used as global timestamps, since
they are not globally unique. So, a timestamp comprises a combination of the site ID and that site’s clock
reading. For implementing timestamp ordering algorithms, each site has a scheduler that maintains a separate
queue for each transaction manager. During transaction, a transaction manager sends a lock request to the site’s
scheduler. The scheduler puts the request to the corresponding queue in increasing timestamp order. Requests
are processed from the front of the queues in the order of their timestamps, i.e. the oldest first. Deadlock is a
state of a database system having two or more transactions, when each transaction is waiting for a data item that
is being locked by some other transaction. A deadlock can be indicated by a cycle in the wait-for-graph. This is
a directed graph in which the vertices denote transactions and the edges denote waits for data items.
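
A minimal sketch of such globally unique timestamps is shown below: each site pairs its local clock (here just a counter standing in for the physical or logical clock) with its site ID, and the pairs compare lexicographically, so equal clock readings are still ordered by site. The class and names are assumptions for illustration.

# Minimal sketch of globally unique timestamps as (clock reading, site id) pairs.
import itertools

class TimestampGenerator:
    def __init__(self, site_id):
        self.site_id = site_id
        self.counter = itertools.count(1)   # stands in for the site's local clock

    def next_timestamp(self):
        return (next(self.counter), self.site_id)

site_a, site_b = TimestampGenerator("A"), TimestampGenerator("B")
t1 = site_a.next_timestamp()   # (1, 'A')
t2 = site_b.next_timestamp()   # (1, 'B')
print(t1, t2, t1 < t2)         # unique and totally ordered even with equal clock readings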

For example, consider a wait-for graph in which transaction T1 is waiting for data item X, which is locked by T3;
T3 is waiting for Y, which is locked by T2; and T2 is waiting for Z, which is locked by T1. A waiting cycle
is formed, and none of the transactions can proceed with execution.

c. Deadlock Handling in Distributed Systems


Transaction processing in a distributed database system is also distributed, i.e. the same transaction may be
processing at more than one site. The two main deadlock handling concerns in a distributed database system that
are not present in a centralized system are transaction location and transaction control. Once these concerns are
addressed, deadlocks are handled through any of deadlock prevention, deadlock avoidance or deadlock detection
and removal.

Transaction Location: Transactions in a distributed database system are processed in multiple sites and use
data items in multiple sites. The amount of data processing is not uniformly distributed among these sites. The
time period of processing also varies. Thus the same transaction may be active at some sites and inactive at
others. When two conflicting transactions are located in a site, it may happen that one of them is in inactive
state. This condition does not arise in a centralized system. This concern is called transaction location issue.
This concern may be addressed by Daisy Chain model. In this model, a transaction carries certain details when it
moves from one site to another. Some of the details are the list of tables required, the list of sites required, the
list of visited tables and sites, the list of tables and sites that are yet to be visited and the list of acquired locks
with types. After a transaction terminates by either commit or abort, the information should be sent to all the
concerned sites.

Transaction Control: Transaction control is concerned with designating and controlling the sites required for
processing a transaction in a distributed database system. There are many options regarding the choice of where
to process the transaction and how to designate the center of control, like –

 One server may be selected as the center of control.


 The center of control may travel from one server to another.
 The responsibility of controlling may be shared by a number of servers.

Distributed Deadlock Prevention


Just like in centralized deadlock prevention, in distributed deadlock prevention approach, a transaction should
acquire all the locks before starting to execute. This prevents deadlocks. The site where the transaction enters is
designated as the controlling site. The controlling site sends messages to the sites where the data items are
located to lock the items. Then it waits for confirmation. When all the sites have confirmed that they have
locked the data items, transaction starts. If any site or communication link fails, the transaction has to wait until
they have been repaired. Though the implementation is simple, this approach has some drawbacks –

 Pre-acquisition of locks requires a long time for communication delays. This increases the time required
for transaction.
 In case of site or link failure, a transaction has to wait for a long time so that the sites recover.
Meanwhile, in the running sites, the items are locked. This may prevent other transactions from
executing.
 If the controlling site fails, it cannot communicate with the other sites. These sites continue to keep the
locked data items in their locked state, thus resulting in blocking.

Distributed Deadlock Avoidance
As in a centralized system, distributed deadlock avoidance handles deadlock prior to occurrence. Additionally, in
distributed systems, the transaction location and transaction control issues need to be addressed. Due to the
distributed nature of the transaction, the following conflicts may occur –

 Conflict between two transactions in the same site.


 Conflict between two transactions in different sites.

In case of conflict, one of the transactions may be aborted or allowed to wait as per distributed wait-die or
distributed wound-wait algorithms. Let us assume that there are two transactions, T1 and T2. T1 arrives at Site P
and tries to lock a data item which is already locked by T2 at that site. Hence, there is a conflict at Site P. The
algorithms are as follows –

Distributed Wait-Die

 If T1 is older than T2, T1 is allowed to wait. T1 can resume execution after Site P receives a message
that T2 has either committed or aborted successfully at all sites.
 If T1 is younger than T2, T1 is aborted. The concurrency control at Site P sends a message to all sites
where T1 has visited to abort T1. The controlling site notifies the user when T1 has been successfully
aborted in all the sites.

Distributed Wound-Wait

 If T1 is older than T2, T2 needs to be aborted. If T2 is active at Site P, Site P aborts and rolls back T2
and then broadcasts this message to other relevant sites. If T2 has left Site P but is active at Site Q, Site
P broadcasts that T2 has been aborted; Site Q then aborts and rolls back T2 and sends this message to all
sites.
 If T1 is younger than T2, T1 is allowed to wait. T1 can resume execution after Site P receives a message
that T2 has completed processing.

Distributed Deadlock Detection


Just like centralized deadlock detection approach, deadlocks are allowed to occur and are removed if detected.
The system does not perform any checks when a transaction places a lock request. For implementation, global
wait-for-graphs are created. Existence of a cycle in the global wait-for-graph indicates deadlocks. However, it is
difficult to spot deadlocks since transaction waits for resources across the network. Alternatively, deadlock
detection algorithms can use timers. Each transaction is associated with a timer which is set to a time period in
which a transaction is expected to finish. If a transaction does not finish within this time period, the timer goes
off, indicating a possible deadlock. Another tool used for deadlock handling is a deadlock detector. In a
centralized system, there is one deadlock detector. In a distributed system, there can be more than one deadlock
detector. A deadlock detector can find deadlocks for the sites under its control. There are three alternatives for
deadlock detection in a distributed system, namely –

 Centralized Deadlock Detector − One site is designated as the central deadlock detector.
 Hierarchical Deadlock Detector − A number of deadlock detectors are arranged in hierarchy.
 Distributed Deadlock Detector − All the sites participate in detecting deadlocks and removing them.

Chapter 12: Object Oriented Database
An object-oriented database is a database that subscribes to a model with information represented by objects.
Object-oriented databases are a niche offering in the relational database management system (RDBMS) field
and are not as successful or well-known as mainstream database engines. As the name implies, the main feature
of object-oriented databases is allowing the definition of objects, which are different from normal database
objects. Objects, in an object-oriented database, reference the ability to develop a product, then define and name
it. The object can then be referenced, or called later, as a unit without having to go into its complexities. This is
very similar to objects used in object-oriented programming. A real-life parallel to objects is a car engine. It is
composed of several parts: the main cylinder block, the exhaust system, intake manifold and so on. Each of
these is a standalone component; but when machined and bolted into one object, they are now collectively
referred to as an engine. Examples of object-oriented database engines include db4o, Smalltalk and Cache.

1. Characteristics of Object oriented database


 It keeps up a direct relation between real-world and database objects, so that objects do not lose their
integrity and identity.
 OODBs provide a system-generated object identifier for each object so that an object can easily be
identified and operated upon.
 OODBs are extensible, which means new data types and the operations to be performed on them can be
defined.
 OODBs provide encapsulation, a feature by which the data representation and the method implementation
are hidden from external entities.
 OODBs also provide inheritance, by which an object inherits the properties of other objects.

2. Key DIFFERENCE between OODBMS and RDBMS


The main difference between these two systems of database management is the way they access and process
information. In a relational database management system, data is accessed in a relational way: each table that
stores data has a key field that identifies each row, and data tables are connected through these keys to transfer
information. (In network and hierarchical databases, accessing information is performed differently.) In an
object oriented database management system we have an entirely different approach, where the information is
represented as objects. If you are familiar with object-oriented programming, you will recognize the pattern.
The main difference between an object oriented database management system and the relational model is their
approach to representing information and the programming language used. When data is stored in an object-
oriented database system, it is in the form of an object. Each object consists of two elements: one is a piece of
data (sound, text, video, etc.) and the other is instructions for the software. The instructions determine how the
information will be transferred to another data file, and the piece of data regulates where a specific type of this
information will be heading. In this more complex system of managing data, it is not enough to simply know a
specific language; one must understand the commands as well.

3. Object Oriented Data Model(OODM)
Object oriented data models are logical data models that capture the semantics of objects supported in object
oriented programming. OODMs implement conceptual models directly and can represent complexities that are
beyond the capabilities of relational systems. OODBs have adopted many of the concepts that were developed
originally for object oriented programming languages. An object oriented database is a collection of objects
defined by an object oriented data model. An object oriented database can extend the existence of objects so that
they are stored permanently; therefore, the objects persist beyond program termination and can be retrieved
later and shared by other programs.

4. Components of Object Oriented Data Model:


The OODBMS is based on three major components, namely object structure, object classes, and object
identity. These are explained below.

a. Object Structure:
The structure of an object refers to the properties that an object is made up of. These properties of an object are
referred to as attributes. Thus, an object is a real-world entity with certain attributes that make up the object
structure. An object also encapsulates the data and code into a single unit, which in turn provides data
abstraction by hiding the implementation details from the user. The object structure is further composed of three
types of components: messages, methods, and variables. These are explained below and illustrated in the small sketch after this subsection.

Messages
A message provides an interface or acts as a communication medium between an object and the outside world.
A message can be of two types:

 Read-only message: If the invoked method does not change the value of a variable, then the invoking
message is said to be a read-only message.
 Update message: If the invoked method changes the value of a variable, then the invoking message is
said to be an update message.

Methods
When a message is passed, the body of code that is executed is known as a method. Every time a
method is executed, it returns a value as output. A method can be of two types:

 Read-only method: When the value of a variable is not affected by a method, then it is known as read-
only method.
 Update-method: When the value of a variable changes by a method, then it is known as an update
method.

Variables
It stores the data of an object. The data stored in the variables makes the object distinguishable from one
another.
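
A minimal sketch of the three components follows: the variables hold the object's state, get_balance is a read-only method invoked by a read-only message, and deposit is an update method invoked by an update message. The class and names are illustrative only.

# Minimal sketch of an object with variables, methods and messages (illustrative).
class Account:
    def __init__(self, owner, balance):
        # Variables: the state that distinguishes one object from another.
        self._owner = owner
        self._balance = balance

    def get_balance(self):
        # Read-only method: does not change any variable.
        return self._balance

    def deposit(self, amount):
        # Update method: changes the value of a variable.
        self._balance += amount
        return self._balance

acct = Account("Alice", 100)
print(acct.get_balance())   # read-only message
print(acct.deposit(50))     # update message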

b. Object Classes:
An object, which is a real world entity, is an instance of a class. Hence we first need to define a class, and then
objects are made which differ in the values they store but share the same class definition. The objects in turn
correspond to the various messages and variables stored in them. An OODBMS also supports inheritance in an
extensive manner, as in a database there may be many classes with similar methods, variables and messages.
Thus, the concept of class hierarchy is maintained to depict the similarities among various classes. The concept
of encapsulation, that is, data or information hiding, is also supported by the object oriented data model. The
data model also provides the facility of abstract data types apart from the built-in data types like char, int and
float. ADTs are user defined data types that hold values and can also have methods attached to them.
Thus, an OODBMS provides numerous facilities to its users, both built-in and user defined. It incorporates the
properties of an object oriented data model with a database management system, and supports the concept of
programming paradigms like classes and objects along with other concepts like encapsulation, inheritance and
user defined ADTs (abstract data types). When database techniques are combined with object oriented concepts,
the result is an object oriented database management system (OODBMS). Today’s trend in programming
languages is to utilize objects, thereby making an OODBMS ideal for object-oriented programmers because they
can develop the product, store it as objects, and can replicate or modify existing objects to make new objects
within the OODBMS. Object databases based on persistent programming acquired a niche in application areas
such as engineering and spatial databases, telecommunications, and scientific areas such as high energy physics
and molecular biology.

5. Object, Attributes and Identity


 Attributes : The attributes are the characteristics used to describe objects. Attributes are also known as
instance variables. When attributes are assigned values at a given time, it is assumed that the object is in
a given state at that time.
 Object : An object is an abstract representation of the real world entity which has a unique identity,
embedded properties, and the ability to interact with other objects and itself.
 Identity : The identity is an external identifier- the object ID- maintained for each object. The Object ID
is assigned by the system when the object is created, and cannot be changed. It is unlike the relational
database, for example, where a data value stored within the object is used to identify the object.

6. Object oriented methodologies


There are certain object oriented methodologies used in OODBs (a minimal sketch follows the list). These are:

 Class: A class is a group of objects with the same or similar attributes and behavior.
 Encapsulation: It is the property that the attributes and methods of an object are hidden from outside
world. A published interface is used to access an object’s methods.
 Inheritance: It is the property which, when classes are arranged in a hierarchy, each class assumes the
attributes and methods of its ancestors. For example, class students are the ancestor of undergraduate
students and post graduate students.
 Polymorphism: It allows several objects to respond to the same message in different ways. In the
object oriented database model, complex objects are modeled more naturally and easily.
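
The class, inheritance and polymorphism ideas can be sketched with the student example from the list above; the class and method names are assumptions chosen only for illustration.

# Minimal sketch of class, inheritance and polymorphism (illustrative).
class Student:
    def __init__(self, name):
        self.name = name                    # attribute shared by all students

    def description(self):
        return self.name + ": student"

class UndergraduateStudent(Student):        # inherits the attributes and methods of Student
    def description(self):                  # polymorphism: the same message, different behaviour
        return self.name + ": undergraduate student"

class PostgraduateStudent(Student):
    def description(self):
        return self.name + ": postgraduate student"

for s in (UndergraduateStudent("Asha"), PostgraduateStudent("Ben")):
    print(s.description())                  # each class responds to the same message in its own way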

8. Advantages of Object oriented data model over Relational model


The OODBMS, which is an abbreviation for object oriented database management system, is a data model in
which data is stored in the form of objects, which are instances of classes. These classes and objects together make
up an object oriented data model. When compared with the relational model, the object oriented data model has the
following advantages.

 Reusability: generic objects can be defined and then reused in numerous applications.
 Complex data types: Can manage complex data such as document, graphics, images, voice messages,
etc.
 Distributed databases: Due to mode of communication between objects, OODBMS can support
distribution of data across networks more easily.
9. Advantages of OODB over RDBMS
 Objects do not require assembly and disassembly, saving the coding time and execution time needed to
assemble or disassemble objects.
 Reduced paging
 Easier navigation
 Better concurrency control
 Data model is based on the real world
 Works well for distributed architectures
 Less code required when applications are object oriented.

An RDBMS might commonly handle such data with plain SQL statements, and most current SQL databases
allow the crafting of custom functions, so the query could call a user-defined function instead of repeating the
expression inline.

In an object-relational database, one might instead see user-defined datatypes and expressions such as
BirthDay() used directly in the query.

The object-relational model can offer another advantage in that the database can make use of the relationships
between data to easily collect related records. In an address book application, an additional table would be added
to hold zero or more addresses for each customer. Using a traditional RDBMS, collecting
information for both the user and their address requires a join between the two tables.

Chapter 13: Database Security and Access Control
Database security involves legal and ethical issues, policy issues, and system-related issues.

A DBMS typically includes a database security and authorization subsystem that is responsible for ensuring the
security of portions of the database against unauthorized access.

1. Two types of database security mechanisms:


a. Discretionary security mechanisms
The typical method of enforcing discretionary access control in a database system is based on the granting and
revoking of privileges.

b. Mandatory security mechanisms


In many applications, an additional security policy is needed that classifies data and users based on security
classes. This approach, known as mandatory access control, would typically be combined with the discretionary
access control mechanisms.

2. Security Issues in Databases


 The security mechanism of a DBMS must include provisions for restricting access to the database as a
whole; this function is called access control and is handled by creating user accounts and passwords to
control the login process in the DBMS.
 Another security problem associated with databases is that of controlling access to a statistical database,
which is used to provide statistical information or summaries of values based on various criteria. The
countermeasures to the statistical database security problem are called inference control measures.
 A further security issue is flow control, which prevents information from flowing in such a way that it
reaches unauthorized users. Channels that are pathways for information to flow implicitly in ways that
violate the security policy of an organization are called covert channels. A final security issue is data
encryption, which is used to protect sensitive data (such as credit card numbers) that is being transmitted
via some type of communication network.
 The data is encoded using some coding algorithm. An unauthorized user who accesses the encoded data
will have difficulty deciphering it, but authorized users are given decoding or decrypting algorithms (or
keys) to decipher the data.

3. Database Security and the DBA


The database administrator (DBA) is the central authority for managing a database system. The DBA’s
responsibilities include granting privileges to users who need to use the system and classifying users and data in
accordance with the policy of the organization. The DBA has a DBA account in the DBMS, sometimes called a
system or superuser account, which provides powerful capabilities:

 Account creation
 Privilege granting
 Privilege revocation
 Security level assignment

The DBA is responsible for the overall security of the database system. Account creation is a form of access
control; privilege granting and revocation are discretionary authorization; and security level assignment is used
to control mandatory authorization. Whenever a person or a group of persons needs to access a database system,
the individual or group must first apply for a user account. The DBA will then create a new account number and
password for the user if there is a legitimate need to access the database.
If any tampering with the database is suspected, a database audit is performed, which consists of reviewing the
log to examine all accesses and operations applied to the database during a certain time period. A database log
that is used mainly for security purposes is sometimes called an audit trail.

4. Discretionary Privileges
a. The account level:
At this level, the DBA specifies the particular privileges that each account holds independently of the relations
in the database. The privileges at the account level apply to the capabilities provided to the account itself and
can include the CREATE SCHEMA or CREATE TABLE privilege, to create a schema or base relation; the
ALTER privilege, to apply schema changes such as adding or removing attributes from relations; the DROP
privilege, to delete relations or views; the MODIFY privilege, to insert, delete, or update tuples; and the
SELECT privilege, to retrieve information from the database by using a SELECT query.

b. The relation (or table level):


At this level, the DBA can control the privilege to access each individual relation or view in the database. To
control the granting and revoking of relation privileges, each relation R in a database is assigned an owner
account, which is typically the account that was used when the relation was created in the first place. The owner
of a relation is given all privileges on that relation. In SQL2, the DBA can assign an owner to a whole schema
by creating the schema and associating the appropriate authorization identifier with that schema, using the
CREATE SCHEMA command. The owner account holder can pass privileges on any of the owned relation to
other users by granting privileges to their accounts.

In SQL the following types of privileges can be granted on each individual relation R:

 SELECT (retrieval or read) privilege on R: Gives the account retrieval privilege. In SQL this gives the
account the privilege to use the SELECT statement to retrieve tuples from R.
 MODIFY privileges on R: This gives the account the capability to modify tuples of R. In SQL this
privilege is further divided into UPDATE, DELETE, and INSERT privileges to apply the corresponding
SQL command to R. In addition, both the INSERT and UPDATE privileges can specify that only
certain attributes can be updated by the account.

Note that to create a view, the account must have SELECT privilege on all relations involved in the view
definition. The mechanism of views is an important discretionary authorization mechanism in its own right.

For example, if the owner A of a relation R wants another account B to be able to retrieve only some fields of R,
then A can create a view V of R that includes only those attributes and then grant SELECT on V to B. The same
applies to limiting B to retrieving only certain tuples of R; a view V’ can be created by defining the view by
means of a query that selects only those tuples from R that A wants to allow B to access.

5. Granting and Revoking of Privileges


The SQL standard includes the privileges SELECT, INSERT, UPDATE, and DELETE. The privilege ALL
PRIVILEGES can be used as a short form for all the allowable privileges. A user who creates a new relation is
given all privileges on that relation automatically. For example, the owner of a relation may want to grant the
SELECT privilege to a user for a specific task and then revoke that privilege once the task is completed. Hence,
a mechanism for revoking privileges is needed. In SQL, a REVOKE command is included for the purpose of
canceling privileges. As an example of the GRANT and REVOKE commands in SQL, suppose that the DBA
creates four accounts -- A1, A2, A3, and A4 -- and wants only A1 to be able to create base relations; then the
DBA must issue the following GRANT command in SQL: GRANT CREATETAB TO A1;
In SQL2 the same effect can be accomplished by having the DBA issue a CREATE SCHEMA command as:

CREATE SCHEMA EXAMPLE AUTHORIZATION A1;

GRANT privilege_name ON object_name TO {user_name | PUBLIC | role_name} [WITH GRANT OPTION];

For example: GRANT SELECT ON EMPLOYEE, DEPARTMENT TO A3 WITH GRANT OPTION;

GRANT INSERT, DELETE ON EMPLOYEE, DEPARTMENT TO A2;

The REVOKE command removes user access rights or privileges to the database objects. The Syntax for the
REVOKE command is:

REVOKE privilege_name ON object_name FROM {user_name | PUBLIC | role_name};

REVOKE SELECT ON EMPLOYEE FROM A3;

Privilege for Views: Suppose that A1 wants to give back to A3 a limited capability to SELECT from the
EMPLOYEE relation and wants to allow A3 to be able to propagate the privilege. The limitation is to retrieve
only the NAME, BDATE, and ADDRESS attributes and only for the tuples with DNO = 5. A1 then creates the
view:

CREATE VIEW A3EMPLOYEE AS SELECT NAME, BDATE, ADDRESS

FROM EMPLOYEE WHERE DNO = 5;

After the view is created, A1 can grant SELECT on the view A3EMPLOYEE to A3 as follows:

GRANT SELECT ON A3EMPLOYEE TO A3 WITH GRANT OPTION;

6. Access Control
The purpose of access control is to limit the actions or operations that a legitimate user of a computer system
can perform. Access control constrains what a user can do directly, as well as what programs executing on
behalf of the user are allowed to do. In this way access control seeks to prevent activity that could lead to a
breach of security.

a. Discretionary access control (DAC)


The discretionary access control techniques of granting and revoking privileges on relations have traditionally
been the main security mechanism for relational database systems. They specify the rules under which subjects
can, at their discretion, create and delete objects, and grant and revoke authorizations for accessing objects to
others. Discretionary models were applied in relational database systems beginning with System R.

b. MANDATORY ACCESS CONTROL (MAC)
MAC security policies govern access on the basis of the classifications of subjects and objects in the system.
Objects are the passive entities storing information, for example relations, tuples in a relation, etc. Subjects are
active entities that access the objects, usually active processes operating on behalf of users.

Discretionary access control is an all-or-nothing method: a user either has or does not have a certain privilege.
In many applications, an additional security policy is needed that classifies data and users based on security
classes. This approach, known as mandatory access control, would typically be combined with the discretionary
access control mechanisms.

Typical security classes are Top secret (TS), secret (S), confidential (C), and unclassified (U), where TS is the
highest level and U the lowest.

The commonly used model for multilevel security is known as the Bell-LaPadula model. It classifies each subject
(user, account, program) and object (relation, tuple, column, view, operation) into one of the security
classifications TS, S, C, or U. We refer to the clearance (classification) of a subject S as class(S) and to the
classification of an object O as class(O). Two restrictions are enforced on data access based on the subject/object
classifications (a small sketch of these checks follows this discussion):

 A subject S is not allowed read access to an object O unless class(S) ≥ class(O). This is known as the
simple security property.
 A subject S is not allowed to write an object O unless class(S) ≤ class(O). This is known as the star
property (or * property).
 To incorporate multilevel security notions into the relational database model, it is common to consider
attribute values and tuples as data objects. Hence, each attribute A is associated with a classification
attribute C in the schema, and each attribute value in a tuple is associated with a corresponding security
classification.
 In addition, in some models, a tuple classification attribute TC is added to the relation attributes to
provide a classification for each tuple as a whole. Hence, a multilevel relation schema R with n
attributes would be represented as R (A1,C1,A2,C2, ..., An,Cn,TC) where each Ci represents the
classification attribute associated with attribute Ai.

The value of the TC attribute in each tuple t — which is the highest of all attribute classification values within t
— provides a general classification for the tuple itself, whereas each Ci provides a finer security classification
for each attribute value within the tuple. A multilevel relation will appear to contain different data to subjects
(users) with different clearance levels. In some cases, it is possible to store a single tuple in the relation at a
higher classification level and produce the corresponding tuples at a lower-level classification through a process
known as filtering. In general, the entity integrity rule for multilevel relations states that all attributes that are
members of the apparent key must not be null and must have the same security classification within each
individual tuple.
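For illustration (a hypothetical instance, not drawn from a figure in these notes), a multilevel EMPLOYEE relation and one of its tuples might look as follows, where each classification attribute follows the data attribute it protects:

EMPLOYEE(Name, C1, Salary, C2, JobPerformance, C3, TC)
('Smith', U,  40000, C,  'Fair', S,  S)

Here TC = S because S is the highest classification appearing in the tuple. A subject with clearance C would see a filtered tuple ('Smith', U, 40000, C, NULL, C) with TC = C, since the S-classified value is hidden from it.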

Comparing Discretionary Access Control and Mandatory Access Control


 Discretionary Access Control (DAC) policies are characterized by a high degree of flexibility, which
makes them suitable for a large variety of application domains. The main drawback of DAC models is
their vulnerability to malicious attacks, such as Trojan horses embedded in application programs.
 By contrast, mandatory policies ensure a high degree of protection; in effect, they prevent any illegal
flow of information. Mandatory policies have the drawback of being too rigid, and they are applicable
only in limited environments.

In many practical situations, discretionary policies are preferred because they offer a better trade-off between
security and applicability.
c. Role-Based Access Control
Role-based access control (RBAC) emerged rapidly in the 1990s as a proven technology for managing and
enforcing security in large-scale, enterprise-wide systems. In role-based access control, permissions are
associated with roles, and users are made members of appropriate roles. This greatly simplifies the management of
permissions. Roles are closely related to the concept of user groups in access control. However, a role brings
together a set of users on one side and a set of permissions on the other, whereas user groups are typically
defined as a set of users only.

Roles can be created using the CREATE ROLE and DESTROY ROLE commands. The GRANT and REVOKE
commands discussed under DAC can then be used to assign and revoke privileges from roles.

CREATE ROLE role_name [WITH ADMIN {CURRENT_USER | CURRENT_ROLE}]

With the above syntax, a role named role_name is created, and administrative authority over it is given either to the
current user or to the currently active role, so that the role can be passed on to other users. The default is WITH ADMIN CURRENT_USER.
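As a rough sketch (the role and user names here are hypothetical, and the exact syntax for assigning a role to a user varies slightly across DBMSs), privileges can be granted to a role and the role then granted to users:

CREATE ROLE payroll_clerk;
GRANT SELECT, UPDATE ON EMPLOYEE TO payroll_clerk;
GRANT payroll_clerk TO A4;

Any user who activates the payroll_clerk role then holds all privileges granted to that role.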

 RBAC appears to be a viable alternative to traditional discretionary and mandatory access controls; it
ensures that only authorized users are given access to certain data or resources.
 Many DBMSs have allowed the concept of roles, where privileges can be assigned to roles.
 Role hierarchy in RBAC is a natural way of organizing roles to reflect the organization’s lines of
authority and responsibility.
 Another important consideration in RBAC systems is the possible temporal constraints that may exist
on roles, such as time and duration of role activations, and timed triggering of a role by an activation of
another role.
 Using an RBAC model is a highly desirable goal for addressing the key security requirements of Web-based applications.

In contrast, discretionary access control (DAC) and mandatory access control (MAC) models lack the capabilities
needed to support the security requirements of emerging enterprises and Web-based applications.

Chapter 14: XML and Web Services
The Extensible Markup Language (XML) was not designed for database applications. In fact, like the Hyper-
Text Markup Language (HTML) on which the World Wide Web is based, XML has its roots in document
management, and is derived from a language for structuring large documents known as the Standard
Generalized Markup Language (SGML). However, unlike SGML and HTML, XML is designed to represent
data. It is particularly useful as a data format when an application must communicate with another application,
or integrate information from several other applications. When XML is used in these contexts, many database
issues arise, including how to organize, manipulate, and query the XML data.

For the family of markup languages that includes HTML, SGML, and XML, the markup takes the form of tags
enclosed in angle brackets, <>. Tags are used in pairs, with <tagname> and </tagname> delimiting the beginning and the end of the
portion of the document to which the tag refers. For example, the title of a document might be marked up as
follows: <title> XML </title>, <slide> Introduction … </slide>

The ability to specify new tags and to create nested tag structures makes XML a great way to exchange data, not
just documents. Much of the use of XML has been in data-exchange applications rather than as a replacement for
HTML. Tags make data (relatively) self-documenting.

E.g.
<university>
<department>
<dept_name>Comp. Sci.</dept_name>
<building>Taylor</building>
<budget>100000</budget>
</department>
<course>
<course_id>CS-101</course_id>
<title>Intro. to Computer Science</title>
<dept_name>Comp. Sci.</dept_name>
<credits>4</credits>
</course>

</university>

1. Comparison with Relational Data


Compared to storage of data in a relational database, the XML representation may be inefficient, since tag
names are repeated throughout the document. However, in spite of this disadvantage, an XML representation
has significant advantages when it is used to exchange data between organizations, and for storing complex
structured information in files:

 First, the presence of the tags makes the message self-documenting; that is, a schema need not be
consulted to understand the meaning of the text. We can readily read the fragment above, for example.
 Second, the format of the document is not rigid. For example, if some sender adds additional
information, such as a tag last accessed noting the last date on which an account was accessed, the
recipient of the XML data may simply ignore the tag. As another example, in Figure 23.3, the item with
identifier SG2 has a tag called unit-of-measure specified, which the first item does not. The tag is
required for items that are ordered by weight or volume, and may be omitted for items that are simply
ordered by number. The ability to recognize and ignore unexpected tags allows the format of the data to
evolve over time, without invalidating existing applications. Similarly, the ability to have multiple
occurrences of the same tag makes it easy to represent multivalued attributes.
 Third, XML allows nested structures. The purchase order shown in Figure 23.3 illustrates the benefits of
having a nested structure. Each purchase order has a purchaser and a list of items as two of its nested
structures. Each item in turn has an item identifier, description and a price nested within it, while the
purchaser has a name and address nested within it.
 Finally, since the XML format is widely accepted, a wide variety of tools are available to assist in its
processing, including programming language APIs to create and to read XML data, browser software,
and database tools.

2. Structure of XML Data


Tag: a label for a section of data. Element: a section of data beginning with <tagname> and ending with the
matching </tagname>. Elements must be properly nested.

Proper nesting <course> … <title> …. </title> </course>

Improper nesting <course> … <title> …. </course> </title>

Formally: every start tag must have a unique matching end tag that is in the context of the same parent element.
Every document must have a single top-level element. Elements can have attributes; an element may have
several attributes, but each attribute name can occur only once.

<course course_id= “CS-101”>

<title> Intro. to Computer Science</title>

<dept_name> Comp. Sci. </dept_name>

<credits> 4 </credits>

</course>
Distinction between subelement and attribute: In the context of documents, attributes are part of markup,
while subelement contents are part of the basic document contents. In the context of data representation, the
difference is unclear and may be confusing. Same information can be represented in two ways

<course course_id= “CS-101”> … </course>

<course>
<course_id>CS-101</course_id> …
</course>

Suggestion: use attributes for identifiers of elements, and use subelements for contents

Namespaces: XML data often has to be exchanged between organizations. The same tag name may have different
meanings in different organizations, causing confusion in exchanged documents. Specifying a unique string as
an element name avoids confusion; a better solution is to use unique-name:element-name. To avoid using long unique
names all over the document, XML namespaces can be used:

<university xmlns:yale=“http://www.yale.edu”>

……

<yale:course>

<yale:course_id> CS-101 </yale:course_id>

<yale:title> Intro. to Computer Science</yale:title>

<yale:dept_name> Comp. Sci. </yale:dept_name>

<yale:credits> 4 </yale:credits>

</yale:course>

</university>

Elements without subelements or text content can be abbreviated by ending the start tag with a /> and deleting
the end tag

<course course_id=“CS-101” title=“Intro. to Computer Science” dept_name=“Comp. Sci.” credits=“4” />

To store string data that may contain tags, without the tags being interpreted as subelements, use a CDATA section, as in:
<![CDATA[<course> … </course>]]>

Here, <course> and </course> are treated as just strings; CDATA stands for “character data”.

3. XML Document Schema


Databases have schemas, which are used to constrain what information can be stored in the database and to
constrain the data types of the stored information. In contrast, by default, XML documents can be created
without any associated schema: an element may then have any subelement or attribute. While such freedom may
occasionally be acceptable given the self-describing nature of the data format, it is not generally useful when

XML documents must be processed automatically as part of an application, or even when large amounts of
related data are to be formatted in XML.

a. Document Type Definition


The document type definition (DTD) is an optional part of an XML document. The main purpose of a DTD is
much like that of a schema: to constrain and type the information present in the document. However, the DTD
does not in fact constrain types in the sense of basic types like integer or string. Instead, it constrains only the
appearance of subelements and attributes within an element. The DTD is primarily a list of rules for what
pattern of subelements may appear within an element.

In an element declaration, a comma-separated list of subelement names specifies that those subelements must appear in that order, while the + operator specifies “one or more.” Although not shown here, the ∗ operator is used to specify “zero or
more,” while the ? operator is used to specify an optional element (that is, “zero or one”).
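The DTD itself is not reproduced in these notes; the following is only a sketch of what such a DTD might look like for the university example used earlier (it is not the textbook's Figure 23.9):

<!DOCTYPE university [
<!ELEMENT university ( (department | course | instructor)+ )>
<!ELEMENT department ( dept_name, building, budget )>
<!ELEMENT course     ( course_id, title, dept_name, credits )>
<!ELEMENT instructor ( IID, name, dept_name, salary )>
<!ELEMENT dept_name  (#PCDATA)>
<!-- the remaining text elements (building, budget, course_id, title, credits, IID, name, salary) are declared as (#PCDATA) in the same way -->
]>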

The keyword #PCDATA indicates text data; it derives its name, historically, from “parsed character data.” Two
other special type declarations are EMPTY, which says that the element has no contents, and ANY, which says that
there is no constraint on the subelements of the element; that is, any elements, even those not mentioned in the
DTD, can occur as subelements of the element. The absence of a declaration for an element is equivalent to
explicitly declaring the type as ANY.

The allowable attributes for each element are also declared in the DTD. Unlike subelements, no order is
imposed on attributes. Attributes may be specified to be of type CDATA, ID, IDREF, or IDREFS; the type
CDATA simply says that the attribute contains character data, while the other three are not so simple; they are
explained in more detail shortly. For instance, the following lines from a DTD specify that element course has
an attribute course_id, and that a value must be present for this attribute:

<!ATTLIST course course_id CDATA #REQUIRED>

or

<!ATTLIST course course_id   ID     #REQUIRED
                 dept_name   IDREF  #REQUIRED
                 instructors IDREFS #IMPLIED >

ID and IDREFS: An attribute of type ID provides a unique identifier for the element; a value that occurs in an
ID attribute of an element must not occur in any other element in the same document. At most one attribute of
an element is permitted to be of type ID. (We renamed the attribute ID of the instructor relation to IID in the
XML representation, in order to avoid confusion with the type ID.) An attribute of type IDREF is a reference to

an element; the attribute must contain a value that appears in the ID attribute of some element in the document.
The type IDREFS allows a list of references, separated by spaces.

Here are some of the limitations of DTDs as a schema mechanism:

 Individual text elements and attributes cannot be typed further. For instance, the element balance cannot
be constrained to be a positive number. The lack of such constraints is problematic for data processing
and exchange applications, which must then contain code to verify the types of elements and attributes.
 It is difficult to use the DTD mechanism to specify unordered sets of subelements. Order is seldom
important for data exchange (unlike document layout, where it is crucial). While the combination of
alternation (the | operation) and the ∗ or the + operation as in Figure 23.9 permits the specification of
unordered collections of tags, it is much more difficult to specify that each tag may only appear once.
 There is a lack of typing in IDs and IDREFS. Thus, there is no way to specify the type of element to
which an IDREF or IDREFS attribute should refer. As a result, the DTD in Figure 23.10 does not
prevent the dept_name attribute of a course element from referring to other courses, even though this
makes no sense.

b. XML Schema
An effort to redress the deficiencies of the DTD mechanism resulted in the development of a more sophisticated
schema language, XML Schema. XML Schema defines a number of built-in types such as string, integer,
decimal, date, and boolean. In addition, it allows user-defined types; these may be simple types with added
restrictions, or complex types constructed using constructors such as complexType and sequence.

The first thing to note is that schema definitions in XML Schema are themselves specified in XML syntax, using
a variety of tags defined by XML Schema. To avoid conflicts with user-defined tags, we prefix the XML
Schema tag with the namespace prefix “xs:”; this prefix is associated with the XML Schema namespace by the
xmlns:xs specification in the root element:
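The schema itself is not shown in these notes; as a rough sketch (element names follow the university example, and only the department element is spelled out), an XML Schema document has this shape:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="university">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="department" minOccurs="0" maxOccurs="unbounded"/>
        <!-- course, instructor, and teaches are referenced in the same way -->
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="department">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="dept_name" type="xs:string"/>
        <xs:element name="building"  type="xs:string"/>
        <xs:element name="budget"    type="xs:decimal"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>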

Note the use of ref to specify the occurrence of an element defined earlier. XML Schema can define the minimum and
maximum number of occurrences of subelements by using minOccurs and maxOccurs. The default for both
minimum and maximum occurrences is 1, so these have to be specified explicitly to allow zero or more
department, course, instructor, and teaches elements.

In SQL, a primary-key constraint or unique constraint ensures that the attribute values do not recur within the
relation. In the context of XML, we need to specify a scope within which values are unique and form a key. The
selector is a path expression that defines the scope for the constraint, and field declarations specify the elements
or attributes that form the key. To specify that dept_name forms a key for department elements under the root
university element, we add the following constraint specification to the schema definition:
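A sketch of such a specification (placed inside the xs:element declaration for university, after its complexType; the constraint names deptKey and courseDeptRef are arbitrary):

<xs:key name="deptKey">
  <xs:selector xpath="department"/>
  <xs:field xpath="dept_name"/>
</xs:key>

<xs:keyref name="courseDeptRef" refer="deptKey">
  <xs:selector xpath="course"/>
  <xs:field xpath="dept_name"/>
</xs:keyref>

The xs:key declaration makes dept_name a key for department elements; the xs:keyref declaration makes the dept_name of each course a foreign-key-like reference to it.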

Note that the refer attribute specifies the name of the key declaration that is being referenced, while the field
specification identifies the referring attributes. XML Schema offers several benefits over DTDs, and is widely
used today. Among the benefits that we have seen in the examples above are these:

 It allows the text that appears in elements to be constrained to specific types, such as numeric types in
specific formats or complex types such as sequences of elements of other types.
 It allows user-defined types to be created.
 It allows uniqueness and foreign-key constraints.
 It is integrated with namespaces to allow different parts of a document to conform to different schemas.

In addition to the features we have seen, XML Schema supports several other features that DTDs do not, such as
these:

 It allows types to be restricted to create specialized types, for instance by specifying minimum and
maximum values.
 It allows complex types to be extended by using a form of inheritance.

4. Querying and Transformation


Given the increasing number of applications that use XML to exchange, mediate, and store data, tools for
effective management of XML data are becoming increasingly important. In particular, tools for querying and
transformation of XML data are essential to extract information from large bodies of XML data, and to convert
data between different representations (schemas) in XML. Just as the output of a relational query is a relation,
the output of an XML query can be an XML document. As a result, querying and transformation can be
combined into a single tool. In this section, we describe the XPath and XQuery languages:

 XPath is a language for path expressions and is actually a building block for XQuery.
 XQuery is the standard language for querying XML data. It is modeled after SQL but is significantly
different, since it has to deal with nested XML data. XQuery also incorporates XPath expressions. The
XSLT language is another language designed for transforming XML.

a. Tree Model of XML


A tree model of XML data is used in all these languages. An XML document is modeled as a tree, with nodes
corresponding to elements and attributes. Element nodes can have child nodes, which can be subelements or
attributes of the element. Correspondingly, each node (whether attribute or element), other than the root
element, has a parent node, which is an element. The order of elements and attributes in the XML document is
modeled by the ordering of children of nodes of the tree. The terms parent, child, ancestor, descendant, and
siblings are used in the tree model of XML data.

b. XPath
XPath is used to address (select) parts of documents using path expressions. A path expression is a sequence
of steps separated by “/”; think of file names in a directory hierarchy.

Result of path expression: set of values that along with their containing elements/attributes match the
specified path. E.g. /university-3/instructor/name evaluated on the university-3 data we saw earlier returns

<name>Srinivasan</name>
<name>Brandt</name>

E.g. /university-3/instructor/name/text( ) returns the same names, but without the enclosing tags.

The initial “/” denotes root of the document (above the top-level tag)

 Path expressions are evaluated left to right, which means each step operates on the set of instances
produced by the previous step.
 Selection predicates may follow any step in a path, in [ ]. E.g. /university-3/course[credits >= 4]
returns course elements with a credits value greater than or equal to 4;
/university-3/course[credits] returns course elements containing a credits subelement.
 Attributes are accessed using “@”. E.g. /university-3/course[credits >= 4]/@course_id returns the
course identifiers of courses with credits >= 4.

XPath provides several functions

 The function count() at the end of a path counts the number of elements in the set generated by the path
E.g. /university-2/instructor[count(./teaches/course)> 2]
Returns instructors teaching more than 2 courses (on university-2 schema)
Also function for testing position (1, 2, ..) of node w.r.t. siblings
 Boolean connectives and and or and function not() can be used in predicates
 IDREFs can be referenced using function id()
id() can also be applied to sets of references such as IDREFS and even to strings containing multiple
references separated by blanks E.g. /university-3/course/id(@dept_name)
returns all department elements referred to from the dept_name attribute of course elements.

Features

 Operator “|” used to implement union


E.g. /university-3/course[@dept_name=“Comp. Sci.”] | /university-3/course[@dept_name=“Biology”]
gives the union of Comp. Sci. and Biology courses. However, “|” cannot be nested inside other operators.
 “//” can be used to skip multiple levels of nodes: E.g. /university-3//name
finds any name element anywhere under the /university-3 element, regardless of the element in which it
is contained.
 A step in the path can go to parents, siblings, ancestors and descendants of the nodes generated by the
previous step, not just to the children
“//”, described above, is a short form for specifying “all descendants”
“..” specifies the parent.
 doc(name) returns the root of a named document. doc(“university.xml”)/university/department

c. XQuery
The World Wide Web Consortium (W3C) has developed XQuery as the standard query language for XML.
FLWOR Expressions
XQuery queries are modeled after SQL queries, but differ significantly from SQL. They are organized into five
sections: for, let, where, order by, and return.
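The queries discussed below are not reproduced in these notes; the following is only a sketch of a FLWOR query of the kind the text describes, assuming the university-3 data used in the XPath examples (course elements with title and credits subelements):

for $x in /university-3/course
let $title := $x/title
where $x/credits >= 4
order by $title
return <course_title> { $title/text() } </course_title>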

Items in the return clause are XML text unless enclosed in {}, in which case they are evaluated as expressions. The let clause is not
really needed in this query, and the selection can be done in XPath; an equivalent query may have only for and
return clauses:
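For example (same assumptions as above):

for $x in /university-3/course[credits >= 4]
return <course_title> { $x/title/text() } </course_title>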

We could modify the above query to return an element with tag course, with the course identifier as an attribute,
by replacing the return clause with one of the following form:
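For instance (again a sketch, assuming a course_id attribute on course elements):

return <course course_id="{$x/@course_id}" />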

Joins
Joins are specified in a manner very similar to SQL. For example:
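A sketch of such a join (assuming a university document with flat course, instructor, and teaches elements, where teaches carries course_id and IID subelements; the file name is hypothetical):

for $c in doc("university.xml")/university/course,
    $t in doc("university.xml")/university/teaches,
    $i in doc("university.xml")/university/instructor
where $t/course_id = $c/course_id and $t/IID = $i/IID
return <course_instructor> { $c/title, $i/name } </course_instructor>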

The same query can be expressed with the selections specified as XPath selections:
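Under the same assumptions, the join conditions can be folded into XPath predicates:

for $c in doc("university.xml")/university/course,
    $t in doc("university.xml")/university/teaches[course_id = $c/course_id],
    $i in doc("university.xml")/university/instructor[IID = $t/IID]
return <course_instructor> { $c/title, $i/name } </course_instructor>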

Path expressions in XQuery are the same as path expressions in XPath 2.0. Path expressions may return a single
value or element, or a sequence of values or elements. In the absence of schema information, it may not be
possible to infer whether a path expression returns a single value or a sequence of values. Such path expressions
may participate in comparison operations such as = and !=.

Nested Queries
XQuery FLWOR expressions can be nested in the return clause, in order to generate element nestings that do
not appear in the source document.

$c/* denotes all the children of the node to which $c is bound, without the enclosing top-level tag. While
XQuery does not provide a group by construct, aggregate queries can be written by using the aggregate
functions on path or FLWOR expressions nested within the return clause. For example, the following query on
the university XML schema finds the total salary of all instructors in each department:
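A sketch of such a query (assuming instructor elements with dept_name and salary subelements; this is not the textbook's exact query):

for $d in doc("university.xml")/university/department
return
  <department_total_salary>
    <dept_name> { $d/dept_name/text() } </dept_name>
    <total_salary> { fn:sum(
        doc("university.xml")/university/instructor[dept_name = $d/dept_name]/salary) } </total_salary>
  </department_total_salary>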

Sorting in XQuery
Results can be sorted in XQuery by using the order by clause. For instance, this query outputs all instructor
elements sorted by the name subelement:
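A sketch of such a query (assuming instructor elements with a name subelement):

for $i in doc("university.xml")/university/instructor
order by $i/name
return <instructor> { $i/* } </instructor>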

To sort in descending order, we can use order by $i/name descending. Sorting can be done at multiple levels of
nesting. For instance, we can get a nested representation of university information with departments sorted in
department name order, with courses sorted by course identifiers, as follows:

Functions and Types


XQuery provides a variety of built-in functions, such as numeric functions and string matching and
manipulation functions. In addition, XQuery supports user-defined functions. The following user-defined
function takes as input an instructor identifier, and returns a list of all courses offered by the department to
which the instructor belongs:
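The function is not reproduced here; a sketch of a user-defined function along those lines (the function name and document URI are assumptions, and instructors are assumed to carry IID and dept_name subelements):

declare function local:instructor_courses($iid as xs:string) as element(course)*
{
  let $i := doc("university.xml")/university/instructor[IID = $iid]
  return doc("university.xml")/university/course[dept_name = $i/dept_name]
};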

Types are optional for function parameters and return values. The * (as in decimal*) indicates a sequence of
values of that type. Universal and existential quantification can be used in where-clause predicates:

 some $e in path satisfies P


 every $e in path satisfies P
 Add “and fn:exists($e)” to prevent an empty $e from trivially satisfying the every clause

XQuery also supports if-then-else clauses, as illustrated in the sketch below.
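A sketch combining these constructs (schema assumptions as before): the query returns the names of departments all of whose courses carry at least three credits, using fn:exists to skip departments with no courses at all:

for $d in doc("university.xml")/university/department
return
  if (fn:exists(doc("university.xml")/university/course[dept_name = $d/dept_name])
      and (every $c in doc("university.xml")/university/course[dept_name = $d/dept_name]
           satisfies $c/credits >= 3))
  then <dept> { $d/dept_name/text() } </dept>
  else ()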

5. Storing XML in Relational Databases
Since relational databases are widely used in existing applications, there is a great benefit to be had in
storing XML data in relational databases, so that the data can be accessed from existing applications

a. Storing as string
Small XML documents can be stored as string (CLOB) values in tuples in a relational database. Large XML
documents with the top-level element having many children can be handled by storing each child element as a
string in a separate tuple, e.g., account, customer, and depositor relations, each with a string-valued attribute to
store the element. While this representation is easy to use, the database system does not know the schema
of the stored elements. As a result, it is not possible to query the data directly.

Store the values of subelements/attributes to be indexed as extra fields of the relation, and build indices on these
fields, e.g., department_name or account_number. Thus, a query that requires department elements with a
specified department name can be answered efficiently with this representation. Some database systems support
function indices, which use the result of a function as the key value; the function should return the value of the
required subelement/attribute.

The above approaches have the drawback that a large part of the XML information is stored within strings. It is
possible to store all the information in relations in one of several ways that we examine next.

b. Tree Representation
Arbitrary XML data can be modeled as a tree and stored using a single relation: nodes(id, parent_id, type, label,
value). Each element and attribute in the XML data is given a unique identifier. A tuple is inserted in the nodes
relation for each element and attribute, with its identifier (id), the identifier of its parent node (parent_id), the type
of the node (attribute or element), the name of the element or attribute (label), and the text value of the element
or attribute (value). If the order of elements and attributes must be preserved, an extra attribute position
can be added to the nodes relation to indicate the relative position of the child among the children of the parent.
This representation has the advantage that all XML information can be represented directly in relational form,
and many XML queries can be translated into relational queries and executed inside the database system.
However, it has the drawback that each element gets broken up into many pieces, and a large number of joins
are required to reassemble sub-elements into an element.
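As a rough sketch of the nodes relation described above (column names follow the text; the types are assumptions a real system would tune):

CREATE TABLE nodes (
  id        INT PRIMARY KEY,
  parent_id INT REFERENCES nodes(id),
  type      VARCHAR(10),     -- 'element' or 'attribute'
  label     VARCHAR(100),    -- element or attribute name
  value     VARCHAR(4000)    -- text value, if any
);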

c. Map to Relations
A relation is created for each element type (including subelements) whose schema is known and whose type is a
complex type (that is, contains attributes or subelements).
 An id attribute to store a unique id for each element
 A relation attribute corresponding to each element attribute
 A parent_id attribute to keep track of the parent element, as in the tree representation; position
information (ith child) can be stored too

All subelements that occur only once can become relation attributes

 For text-valued subelements, store the text as attribute value


 For complex subelements, can store the id of the subelement

Subelements that can occur multiple times are represented in a separate table, similar to the handling of multivalued
attributes when converting ER diagrams to tables.

Applying the above ideas to department elements in the university-1 schema, with nested course elements, we get
department(id, dept_name, building, budget) and course(parent_id, course_id, dept_name, title, credits).

Publishing is the process of converting relational data to an XML format, and shredding is the process of converting an
XML document into a set of tuples to be inserted into one or more relations. XML-enabled database systems
support automated publishing and shredding. Some relational databases support native storage of XML. Such
systems store XML data as strings or in more efficient binary representations, without converting the data to
relational form. A new data type xml is introduced to represent XML data, although the CLOB and BLOB data
types may provide the underlying storage mechanism. XML query languages such as XPath and XQuery are
supported to query XML data. Special internal data structures and indices are used for efficiency.

6. SQL Extension
While XML is used widely for data interchange, structured data is still widely stored in relational databases.
There is often a need to convert relational data to XML representation. Figure 23.15 shows the SQL/XML
representation of (part of) the university data from above, containing the relations department and course.
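The figure itself is not reproduced in these notes; as a sketch of the kind of queries the next paragraph describes (using the standard SQL/XML functions xmlelement, xmlattributes, and xmlagg; the exact syntax accepted varies slightly across systems):

SELECT xmlelement(NAME "course",
         xmlattributes(c.course_id AS course_id, c.dept_name AS dept_name),
         xmlelement(NAME "title",   c.title),
         xmlelement(NAME "credits", c.credits))
FROM course c;

SELECT xmlelement(NAME "department",
         xmlattributes(d.dept_name AS dept_name),
         (SELECT xmlagg(xmlelement(NAME "course_id", c.course_id))
          FROM course c
          WHERE c.dept_name = d.dept_name))
FROM department d;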

The first query above creates an XML element for each course, with the course identifier and department name
represented as attributes, and title and credits as subelements. In the second, xmlagg creates a forest of XML elements, collecting the course identifiers of each department inside a single department element.

A stylesheet stores formatting options for a document, usually separately from the document; e.g., an HTML style
sheet may specify font colors and sizes for headings. The XML Stylesheet Language (XSL) was originally
designed for generating HTML from XML. XSLT is a general-purpose transformation language that can
translate XML to XML, and XML to HTML. XSLT transformations are expressed using rules called templates.
Templates combine selection using XPath with construction of results, as in the sketch below.
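A minimal illustrative template (not taken from the source text), which turns each course title into an HTML list item:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="course">
    <li><xsl:value-of select="title"/></li>
  </xsl:template>
</xsl:stylesheet>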

7. Application Program Interface


There are two standard application program interfaces to XML data:

SAX (Simple API for XML): based on a parser model; the user provides event handlers for parsing events, e.g., start
of element, end of element.

DOM (Document Object Model): XML data is parsed into a tree representation, and a variety of functions are provided
for traversing the DOM tree. E.g., the Java DOM API provides a Node class with methods getParentNode( ),
getFirstChild( ), getNextSibling( ), getAttribute( ), getData( ) (for text nodes), getElementsByTagName( ),
and so on. It also provides functions for updating the DOM tree.

8. XML Applications
XML has several applications for storing and communicating (exchanging) data and for accessing Web services
(information resources).

a. Storing and exchanging data with complex structures


XML-based representations are now widely used for storing documents, spreadsheet data, and other data that are
part of office application packages, e.g., the Open Document Format (ODF) standard for storing OpenOffice documents
and the Office Open XML (OOXML) standard for storing Microsoft Office documents. XML is also used to
represent data with complex structure that must be exchanged between different parts of an application. For
example, a database system may represent a query execution plan (a relational-algebra expression with extra
information on how to execute operations) by using XML. XML-based standards for representation of data have
been developed for a variety of specialized applications, ranging from business applications such as banking and
shipping to scientific applications such as chemistry and molecular biology. Numerous other standards for a
variety of applications: ChemML, MathML.

b. Web Services
Applications often require data from outside of the organization, or from another department in the same
organization that uses a different database. In many such situations, the outside organization or department is

not willing to allow direct access to its database using SQL, but is willing to provide limited forms of
information through predefined interfaces. When the information is to be used directly by a human,
organizations provide Web-based forms, where users can input values and get back desired information in
HTML form. However, there are many applications where such information needs to be accessed by software
programs, rather than by end users. Providing the results of a query in XML form is a clear requirement. In
addition, it makes sense to specify the input values to the query also in XML format. In effect, the provider of
the information defines procedures whose input and output are both in XML format.

The HTTP protocol is used to communicate the input and output information, since it is widely used and can go
through firewalls that institutions use to keep out unwanted traffic from the Internet. The Simple Object Access
Protocol (SOAP) defines a standard for invoking procedures, using XML for representing the procedure input
and output. SOAP defines a standard XML schema for representing the name of the procedure, and result status
indicators such as failure/error indicators. The procedure parameters and results are application-dependent XML
data embedded within the SOAP XML headers. Typically, HTTP is used as the transport protocol for SOAP,
but a message-based protocol (such as email over the SMTP protocol) may also be used. The SOAP standard is
widely used today. For example, Amazon and Google provide SOAP-based procedures to carry out search and
other activities. These procedures can be invoked by other applications that provide higher-level services to
users. The SOAP standard is independent of the underlying programming language, and it is possible for a site
running one language, such as C#, to invoke a service that runs on a different language, such as Java. A site
providing such a collection of SOAP procedures is called a Web service. Several standards have been defined to
support Web services.

The Web Services Description Language (WSDL) is a language used to describe a Web service’s capabilities.
WSDL provides facilities that interface definitions (or function definitions) provide in a traditional
programming language, specifying what functions are available and their input and output types. In addition
WSDL allows specification of the URL and network port number to be used to invoke the Web service.

There is also a standard called Universal Description, Discovery, and Integration (UDDI) that defines how a
directory of available Web services may be created and how a program may search in the directory to find a
Web service satisfying its requirements. The following example illustrates the value of Web services. An airline
may define a Web service providing a set of procedures that can be invoked by a travel Web site; these may
include procedures to find flight schedules and pricing information, as well as to make flight bookings. The
travel Web site may interact with multiple Web services, provided by different airlines, hotels, and other
companies, to provide travel information to a customer and to make travel bookings. By supporting Web
services, the individual companies allow a useful service to be constructed on top, integrating the individual
services. Users can interact with a single Web site to make their travel bookings, without having to contact
multiple separate Web sites. To invoke a Web service, a client must prepare an appropriate SOAP XML
message and send it to the service; when it gets the result encoded in XML, the client must then extract
information from the XML result. There are standard APIs in languages such as Java and C# to create and
extract information from SOAP messages

c. Data Mediation
XML provides a common data representation format to bridge different systems. Comparison shopping is an example of a
mediation application, in which data about items, inventory, pricing, and shipping costs are extracted from a
variety of Web sites offering a particular item for sale. The resulting aggregated information is significantly
more valuable than the individual information offered by a single site.
