
1 : Data Management Evolution

IT3306 – Data Management


Level II - Semester 3



Overview

• This lesson discusses the gradual evolution of data management techniques.
• Here we look into Object, XML and NoSQL databases in detail.
• Finally, we present a comparison between relational databases
concepts and non-relational databases.



Intended Learning Outcomes

At the end of this lesson, you will be able to:


• Describe how and why different data management
techniques evolved.
• Identify the requirements for Object Databases.
• Explain the key concepts incorporated from OOP.
• Recognise the importance of NoSQL databases.
• Describe different data models available in NoSQL.
• Analyse the difference between NoSQL and relational
Databases.



List of subtopics

1.1. Major concepts of object-oriented, XML, and NoSQL databases


1.1.1. Object Databases
1.1.1.1. Overview of Object Database Concepts
1.1.1.1.1 Introduction to object-oriented concepts and features
1.1.1.1.2 Object Identity and Literals
1.1.1.1.3 Encapsulation of Operations
1.1.1.1.4 Persistence of Objects
1.1.1.1.5 Type Hierarchies and Inheritance
1.1.1.1.6 Polymorphism and Multiple Inheritance



List of subtopics

1.1.2. XML Databases


1.1.2.1. Reason for the origination of XML
1.1.2.2. Structured, Semi-structured, and Unstructured Data: Structured data, Storage in relational databases, Semi-structured data, Directed graph model, Unstructured data
1.1.2.3. XML Hierarchical (Tree) Data Model: Basic objects, Element, Attribute, Document types, Data-centric and Document-centric, Hybrid



List of subtopics

1.1.3. NoSQL Databases


1.1.3.1. Origins of NoSQL Impedance Mismatch, Problem of
clusters, Common characteristics of NoSQL databases, Important
rise of NoSQL with Polyglot Persistence.
1.1.3.2. Data models in NoSQL
1.1.3.2.1. Introduction to Aggregate data models, Reason for
using Aggregate data models
1.1.3.2.2. Key-Value Model and suitable Use Cases
1.1.3.2.3. Document Data Model and suitable Use Cases
1.1.3.2.4. Column-Family Stores and suitable Use Cases
1.1.3.2.5. Data model for complex relationship structures
(Graph database model)



List of subtopics

1.2 Contrast and compare relational database concepts and non-relational databases.
1.2.1. Object databases and Relational databases
1.2.2. XML and Relational databases
1.2.3. NoSQL and Relational databases
1.2.3.1. Data modelling difference, Modeling for Data Access
1.2.3.2. Aggregate oriented vs aggregate ignorant
1.2.3.3. Schemalessness in NoSQL
1.2.3.4. Overview of Materialised views



Object Databases - Overview of Object Database
Concepts

Overview of Object Database Concepts


• Just as relational database systems are built on the relational data model, object databases (also known as Object-Oriented Databases, OODBs) are built on the object data model.
• A major advantage of object databases is the flexibility to define both the structure and the relevant operations of objects.
• In the early days, business requirements were handled with traditional data models such as the network, hierarchical, and relational models.



Object Databases - Overview of Object Database
Concepts

Overview of Object Database Concepts


• However, real-time applications and information systems that require high performance and complex computations, such as telecommunications, architectural design, the biological sciences, and Geographic Information Systems (GIS), face shortcomings when using traditional data models because of their rigid structures.
• Object databases were developed to serve these new business requirements and are frequently used in the aforementioned domains.



Object Databases - Overview of Object Database
Concepts

Overview of Object Database Concepts


• The popularity of object-oriented programming languages is the second factor that brought about object databases.
• Applications developed with object-oriented languages (e.g., C++, Java) sometimes run into conflicts when used with traditional databases.
• Object databases, however, facilitate seamless integration with applications developed using object-oriented languages.
• Some RDBMSs include features of object databases; these are known as object-relational DBMSs (ORDBMSs).
• Owing to their popularity, relational and object-relational database systems are much more widely used than object databases.
Object Databases - Overview of Object Database
Concepts

Overview of Object Database Concepts


• The Orion system developed by the Microelectronics and Computer Technology Corporation, OpenOODB by Texas Instruments, the Iris system by Hewlett-Packard (HP) Laboratories, the Ode system by AT&T Bell Labs, and the ENCORE/ObServer project by Brown University are some examples of experimental object database prototypes.
• GemStone Object Server by GemStone Systems, ONTOS DB by Ontos, Objectivity/DB by Objectivity Inc., Versant Object Database and FastObjects by Versant Corporation (and Poet), ObjectStore by Object Design, and Ardent Database by Ardent are commercially available object database systems.



Object Databases - Overview of Object Database
Concepts

Introduction to object-oriented concepts and features


• The term Object-Oriented, abbreviated O-O, originated from Object-Oriented Programming Languages (OOPLs).
• Later, the concepts used in OOPLs were introduced to other areas, such as databases, software engineering, computer systems, and knowledge bases.
• Most of the concepts originally developed for OOPLs have been adopted by object databases.



Object Databases - Overview of Object Database
Concepts

Introduction to object-oriented concepts and features


• The major components of an object are:
  ⁻ State (value): can have a complex data structure
  ⁻ Behavior (operations)

[Figure: an object encapsulates the state and behavior of a real-world entity.]
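As a rough illustration only (plain Python, not tied to any particular ODBMS; the BankAccount class and its attributes are hypothetical), an object bundles state with behavior:

    # A minimal, hypothetical sketch: an object couples state (attribute values)
    # with behavior (the operations that manipulate that state).
    class BankAccount:
        def __init__(self, owner: str, balance: float) -> None:
            self.owner = owner          # state
            self.balance = balance      # state (may be a complex structure)

        def deposit(self, amount: float) -> None:   # behavior
            self.balance += amount

    account = BankAccount("Shantha", 1000.0)
    account.deposit(250.0)
    print(account.balance)   # 1250.0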

Object Databases - Overview of Object Database
Concepts

OOPLs have two object categories based on their existence:


⁻ Transient objects - objects which only exist while the
program is running.
⁻ Persistent objects - objects which exist even after the
termination of the program.

Object Databases - Overview of Object Database
Concepts

Introduction to object-oriented concepts and features


• Object-Oriented (O-O) databases extend the existence of an object by storing it permanently in a database.
• O-O databases store persistent objects in secondary storage, from which they can be retrieved later.
• Data stored inside OO databases can be shared among different applications and programs.

Object Databases - Overview of Object Database
Concepts

• Object Identity
• Database objects need to correspond to real-world objects in order to preserve their integrity and identity; this makes objects easy to identify and operate on. Therefore, a unique identity, known as the Object Identifier (OID), is assigned to each independent object stored in the database.
• The OID is generally system generated.
• The value of the OID may be hidden from external users; however, it is used internally by the system to identify each object uniquely and to create and manage inter-object references.

Object Databases - Overview of Object Database
Concepts
• Object Identifier (OID)
• Main properties of OID,
‒ Immutable - value of OID does not change
‒ Unique - used only once. OID of deleted object will
not be assigned to a new object in the future.
• According to these two properties of OID, OID is independent
from any attribute value of an object. (because attribute value
may change over time)
• Object database systems must have a mechanism for generating OIDs and preserving their immutability. Since an object retains its identifier over its lifetime, the object remains the same despite changes in its state.
• OID is similar to the primary key attribute in relational
databases, which is used to uniquely identify the tuples.
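As a rough analogy only (Python's built-in id() is not a database OID, but for in-memory objects it behaves similarly): the identifier is system generated, immutable for the object's lifetime, and independent of attribute values.

    # Rough analogy: id() acts like a system-generated, immutable identifier
    # that is independent of the object's (possibly changing) attribute values.
    class Employee:
        def __init__(self, name: str) -> None:
            self.name = name

    e1 = Employee("Kamal")
    e2 = Employee("Kamal")
    print(e1.name == e2.name)   # True: identical state...
    print(id(e1) == id(e2))     # False: ...but distinct identities

    e1.name = "Nimal"
    print(id(e1))               # unchanged even though the state changed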
Object Databases - Overview of Object Database
Concepts

Literals
• The Object Model supports different literal types, which are
considered as attribute values.
• Literals are embedded inside objects, and the object model makes it possible to define complex structured literals within an object.
• Literals do not have identifiers (OIDs) and, therefore, cannot be
individually referenced like objects from other objects.

Object Databases - Overview of Object Database
Concepts

Literals
The literal types supported by the Object Model are
– Single-valued or atomic types where each value of the
type is considered as an atomic (indivisible) single value.
– Struct (or tuple) constructor which is used to create
standard structured types, such as the tuples (record types)
in the basic relational model.
– Collection (or multivalued) type constructors which
include the set(T), list(T), bag(T), array(T), and
dictionary(K,T) type constructors.

Object Databases - Overview of Object Database
Concepts

Single-valued or atomic types


• This includes the basic built-in data types such as integer,
string, char, floating-point number, date, enumerated type
and Boolean.
• For example, some atomic attributes of an Employee type could be defined as given below:
Fname: string;
Lname: string;
Empid: char(05);
Birth_date: DATE;
Address: string;
Gender: char;
Salary: float;

Object Databases - Overview of Object Database
Concepts

Struct (or tuple) constructor


• A structured type is made up of several components and is similar to the tuple/record types in the basic relational model, so a structured type is sometimes referred to as a compound or composite type.
• The struct constructor is a type generator, because many different structured types can be created.
• For example, the following structured type can be created for the composite attribute EmpName with three components FirstName, MiddleName and LastName, or for the Date attribute with components such as Year, Month and Day:

struct EmpName<FirstName: string, MiddleName: string, LastName: string>
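As a rough Python analogy (not ODMG/ODL syntax; the field names simply mirror the EmpName and Date examples above), each NamedTuple plays the role of a structured (composite) literal type:

    # Hypothetical analogy for the struct (tuple) constructor.
    from typing import NamedTuple

    class EmpName(NamedTuple):
        first_name: str
        middle_name: str
        last_name: str

    class Date(NamedTuple):
        year: int
        month: int
        day: int

    name = EmpName("Martin", "", "Wickramasinghe")
    born = Date(1950, 1, 1)   # illustrative values only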

Object Databases - Overview of Object Database
Concepts

Collection (or multivalued) type constructors

This enables to represent a collection of elements, which


themselves could be of any literal or object type. The collection
literal types supported by the Object Model include set, bag, list,
array, and dictionary.

A set is an unordered collection of elements {i1, i2, … , in} of the


same type without any duplicates.

A bag (also called a multiset) is an unordered collection of


elements similar to a set except that the elements in a bag may
contain duplicates.

Object Databases - Overview of Object Database
Concepts

Collection (or multivalued) type constructors (contd..)

In contrast to sets and bags, a list is an ordered collection of


elements of the same type. A list is similar to a bag except that the
elements in a list are ordered.

An array is a dynamically sized ordered collection of elements that


can be located by position. The difference between array and list
is that a list can have an arbitrary number of elements whereas an
array typically has a maximum size.

A dictionary is an unordered collection of key-value pairs (K, V) without duplicate keys, in which the key K can be used to retrieve the corresponding value V.
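The following Python snippet gives rough analogies for these collection types (illustration only; an object database would use its own set, bag, list, array, and dictionary constructors):

    # Rough Python analogies for the Object Model's collection literal types.
    from collections import Counter

    locations_set  = {"Colombo", "Kandy"}                # set(T): unordered, no duplicates
    skills_bag     = Counter(["SQL", "SQL", "Java"])     # bag(T): unordered, duplicates allowed
    phases_list    = ["design", "build", "test"]         # list(T): ordered, arbitrary length
    monthly_array  = [0.0] * 12                          # array(T): ordered, fixed maximum size
    emp_dictionary = {"E001": "Kamal", "E002": "Nimal"}  # dictionary(K,T): key K retrieves value V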

Object Databases - Overview of Object Database
Concepts
Complex Object
• The literal types supported by the object model enable complex objects to be defined, which may consist of other objects (atomic values, tuples, sets, and lists nested inside one another), as illustrated by the following DEPARTMENT example.

define type DEPARTMENT
   tuple (  Dname:      string;                            (* atomic type *)
            Dnumber:    integer;
            Mgr:        tuple ( Manager:    EMPLOYEE;       (* tuple type *)
                                Start_date: DATE; );
            Locations:  set(string);                        (* set type *)
            Employees:  set(EMPLOYEE);
            Projects:   set(PROJECT); );
Activity

State the correct answer by filling the provided spaces.

1. The major components of an object are _____, _______ .


2. OOPL has two object categories based on their existence,
namely ________ and ________.
3. In ____________ each value of the type is considered as an
atomic (indivisible) single value.
4. _______ constructor is used to create standard structured types.
5. Collection or ______ type constructors which include the set(T),
list(T), bag(T), array(T), and dictionary(K,T) type constructors.

Object Databases - Overview of Object Database
Concepts

Encapsulation
• The concept of encapsulation is applied to database objects
and it is the mechanism that binds together code and the data it
manipulates.
• Thus, encapsulation defines the behavior of a type of object
based on the operations that can be externally applied to
objects of that type.
• Encapsulation provides a form of data and operation independence. It is supported by defining an operation in two parts, namely the signature and the method.

Object Databases - Overview of Object Database
Concepts

Encapsulation
‒ Signature/Interface of the operation - Specifies the name of
the operations and its arguments (parameters).
‒ Method/Body - Specifies the implementation of the
operation.
• External programs pass messages to objects to invoke operations; a message includes the operation name and the parameters.
• Thus, encapsulation restricts direct access to the values of an object, since the values must be accessed through the predefined methods.

Object Databases - Overview of Object Database
Concepts

Encapsulation
• For database applications, the requirement that all objects be
completely encapsulated is too strict.
• This requirement is relaxed by dividing the structure of an object
into visible and hidden attributes (instance variables).
• Visible attributes can be seen by and are directly accessible to
the database users and programmers via the query language.
• The hidden attributes of an object are completely encapsulated
and can be accessed only through predefined operations.
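A minimal Python sketch of the visible/hidden split just described (names are hypothetical; the leading underscore is only a convention for a hidden attribute):

    # Hidden state is reachable only through the predefined operations
    # (signature + method); a visible attribute can be read directly.
    class Department:
        def __init__(self, dname: str, budget: float) -> None:
            self.dname = dname          # visible attribute
            self._budget = budget       # hidden attribute (encapsulated)

        def increase_budget(self, amount: float) -> None:   # operation
            self._budget += amount

        def get_budget(self) -> float:                       # predefined accessor
            return self._budget

    d = Department("Finance", 500000.0)
    d.increase_budget(25000.0)          # message: operation name + parameters
    print(d.dname, d.get_budget())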

Object Databases - Overview of Object Database
Concepts

Class

• The term class is often used to refer to a type definition, along


with the definitions of the operations for that type.
• The relevant operations are declared for each class, and the
signature (interface) of each operation is included in the class
definition.
• A method (implementation) for each operation must be defined
elsewhere using a programming language.

Object Databases - Overview of Object Database
Concepts

Class
• The operations which would be defined in a class may include
the
– object constructor operation (often called new), which is
used to create a new object
– destructor operation, which is used to destroy (delete) an
object.
– A number of object modifier operations can also be
declared to modify the states (values) of various attributes of
an object.
– Additional operations can retrieve information about the
object.

Object Databases - Overview of Object Database
Concepts
The following example shows how the type definitions can be extended with operations to define classes. An operation is applied to an object by using the dot notation. For example, if d is a reference to a DEPARTMENT object, an operation such as no_of_emps can be invoked by writing d.no_of_emps.

define class DEPARTMENT
   type tuple (  Dname:      string;
                 Dnumber:    integer;
                 Mgr:        tuple ( Manager:    EMPLOYEE;
                                     Start_date: DATE; );
                 Locations:  set(string);
                 Employees:  set(EMPLOYEE);
                 Projects:   set(PROJECT); );
   operations    no_of_emps: integer;
                 create_dept: DEPARTMENT;
                 delete_dept: boolean;
                 assign_emp(e: EMPLOYEE): boolean;
                 (* adds an employee to the department *)
                 remove_emp(e: EMPLOYEE): boolean;
                 (* removes an employee from the department *)
end DEPARTMENT;
Object Databases - Overview of Object Database
Concepts
• Encapsulation of Operations

define class Student
   type tuple (  Firstname:  string;
                 Lname:      string;
                 NIC:        string;
                 Birth_date: DATE;
                 Address:    string;
                 Gender:     char;
                 Dept:       DEPARTMENT; );
   operations    age:         integer;
                 create_stu:  Student;
                 destroy_stu: boolean;
end Student;
Activity

Define a class to store following details about a vehicle.

Registration_No: String, Color: String, Engine_capacity: integer,


No_of_seats: integer, No_of_wheels

Operations - Start: boolean, Speed: float, Stop: Boolean

Object Databases - Overview of Object Database
Concepts

• Persistence of Objects
• Transient objects - in an OOPL, objects that exist only while the program is running are known as transient objects.
• Persistent objects - objects that exist even after the termination of the program.
• There are two mechanisms for making an object persistent:
  i) Naming mechanism
  ii) Reachability mechanism

Object Databases - Overview of Object Database
Concepts

Persistence of Objects
i) Naming Mechanism
• This mechanism assigns the object a name that is unique within the database. An operation or a statement can be used to specify the name.
• Users and applications perform database access through
the named persistent objects which are used as entry points
to the database.
• However, it is not practical to name all objects in a large
database that includes thousands of objects. Therefore,
most objects are made persistent by using the second
mechanism, called reachability.

Object Databases - Overview of Object Database
Concepts

Persistence of Objects
ii) Reachability Mechanism
• The reachability mechanism works by making the object
reachable from some other persistent object.

• An object B is said to be reachable from an object A if a


sequence of references in the database lead from object A to
object B.

Object Databases - Overview of Object Database
Concepts

For example, to make the DEPARTMENT object persistent the


following steps are to be followed:
• create a class DEPARTMENT_SET whose objects are of
type set(DEPARTMENT)
• create an object of type DEPARTMENT_SET and give it a persistent name, for example ALL_DEPARTMENTS. ALL_DEPARTMENTS is a named object that defines a persistent collection of objects of class DEPARTMENT. In the object model standard, ALL_DEPARTMENTS is called the extent.
• Any DEPARTMENT object that is added to ALL_DEPARTMENTS by using the add_dept operation becomes persistent by virtue of being reachable from ALL_DEPARTMENTS.

Class DEPARTMENT_SET is defined as a collection (set) of DEPARTMENT objects. The named object ALL_DEPARTMENTS is a persistent collection of objects of class DEPARTMENT, known as the extent. A DEPARTMENT object d becomes persistent when it is added to ALL_DEPARTMENTS using the add_dept operation.

define class DEPARTMENT_SET
   type set (DEPARTMENT);
   operations   add_dept(d: DEPARTMENT): boolean;
                (* adds a department to the DEPARTMENT_SET object *)
                remove_dept(d: DEPARTMENT): boolean;
                (* removes a department from the DEPARTMENT_SET object *)
                create_dept_set: DEPARTMENT_SET;
                destroy_dept_set: boolean;
end DEPARTMENT_SET;
…
persistent name ALL_DEPARTMENTS: DEPARTMENT_SET;
(* ALL_DEPARTMENTS is a persistent named object of type DEPARTMENT_SET *)
…
d := create_dept;
(* create a new DEPARTMENT object in the variable d *)
…
b := ALL_DEPARTMENTS.add_dept(d);
(* the DEPARTMENT object d is added to ALL_DEPARTMENTS and thereby becomes persistent *)

(The class DEPARTMENT, with its no_of_emps, create_dept, destroy_dept, assign_emp, and remove_emp operations, is as defined earlier.)
Object Databases - Overview of Object Database
Concepts

• It is necessary to understand the difference between relational databases and ODBs with respect to persistence.
• In relational databases, all objects are assumed to be
persistent. Hence, when a table such as DEPARTMENT is
created in a relational database, it represents both the type
declaration for DEPARTMENT and a persistent set of all
DEPARTMENT records (tuples).
• In the OO approach, a class declaration of DEPARTMENT specifies only the type and operations for a class of objects. Object persistence must be defined separately, e.g., through a class of type set(DEPARTMENT) whose value is the collection of references (OIDs) to all persistent DEPARTMENT objects.

Object Databases - Overview of Object Database
Concepts

• Thus, the object model allows transient and persistent objects to follow the same type and class declarations.
• Moreover, it is possible to define several persistent collections for the same class definition, if needed.
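A very rough sketch of the two persistence mechanisms using Python's standard shelve module as a stand-in for an object store (the store name "company_db" and the key "ALL_DEPARTMENTS" are illustrative assumptions):

    # Naming: "ALL_DEPARTMENTS" is a named entry point into the store.
    # Reachability: the Department object d becomes persistent because it is
    # reachable from that named, persistent collection.
    import shelve

    class Department:
        def __init__(self, dname: str) -> None:
            self.dname = dname

    d = Department("Finance")                      # transient object
    with shelve.open("company_db") as db:
        all_departments = db.get("ALL_DEPARTMENTS", [])
        all_departments.append(d)
        db["ALL_DEPARTMENTS"] = all_departments    # d is now stored (persistent)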

Object Databases - Overview of Object Database
Concepts
Type Hierarchies and Inheritance
• Type/class hierarchies and inheritance are two key features of object databases; inheritance leads to type/class hierarchies.
• Inheritance is a key concept of the object model which allows a class to inherit the structure and/or operations of previously defined classes.
• Inheritance promotes the reuse of existing type definitions and eases the incremental development of data types.

[Figure: a Shape class (Colour: String) with subclasses Rectangle (Length: int, Width: int) and Triangle (Base: int, Height: int); the Rectangle and Triangle classes acquire the properties of Shape through inheritance.]
Object Databases - Overview of Object Database
Concepts

Type Hierarchies and Inheritance


• Functions - at the basic level of inheritance, the term "functions" is used to refer to both attributes and operations, since attributes resemble functions with zero arguments.
• Because attributes and operations are handled in a similar manner at this basic level, an attribute value or a value returned by an operation is referred to using a function name.

Object Databases - Overview of Object Database
Concepts

Type Hierarchies and Inheritance


• With the function concept a type is specified as given below:
Type_Name:Function1, Function2, Function3,..
Student: Name, NIC, Birth_date, Age, Gender
• The functions Name, NIC, Birth_date, and Gender can be implemented as stored attributes, while the function Age can be implemented as an operation that calculates the age from the value of the Birth_date attribute and the current date.

Object Databases - Overview of Object Database
Concepts

Type Hierarchies and Inheritance


• Subtypes & Supertypes: The concept of subtype is useful
when it is necessary to create a new type/class that is similar
but not identical to an already defined type/class.
• Suppose that the type PERSON is already defined as follows:
PERSON: NIC, Name, Address, Birthdate
• Consider that two new types EMPLOYEE and STUDENT are
to be defined as given below:
EMPLOYEE: NIC, Name, Address, Birthdate, Empid,
Salary, Hire_date
STUDENT: NIC, Name, Address, Birthdate, RegNo,
IndexNo, Gpa
Object Databases - Overview of Object Database
Concepts
[Figure: type hierarchy with supertype PERSON (NIC, Name, Address, Birthdate) and subtypes STUDENT (inheriting those attributes and adding RegNo, IndexNo, Gpa) and EMPLOYEE (inheriting those attributes and adding Empid, Salary, Hire_date).]
Object Databases - Overview of Object Database
Concepts

Type Hierarchies and Inheritance


• It is possible to define EMPLOYEE and STUDENT as
subtypes of PERSON and the type PERSON is referred to as
the supertype.
• The subtype then inherits all the functions of the supertype and may have some additional functions as well.
• Thus, EMPLOYEE and STUDENT can be declared as given
below:
EMPLOYEE subtype-of PERSON: Empid, Salary,
Hire_date
STUDENT subtype-of PERSON: RegNo, IndexNo, Gpa
• Hence, it is possible to create a type hierarchy to show the
supertype/subtype relationships among all the types declared
in the system.
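A hedged Python sketch of the PERSON/EMPLOYEE/STUDENT hierarchy just described (attribute names follow the slide; everything else is illustrative):

    # Subtypes inherit all functions of the supertype and add their own.
    class Person:
        def __init__(self, nic: str, name: str, address: str, birthdate: str) -> None:
            self.nic, self.name = nic, name
            self.address, self.birthdate = address, birthdate

    class Employee(Person):   # EMPLOYEE subtype-of PERSON: Empid, Salary, Hire_date
        def __init__(self, nic, name, address, birthdate,
                     empid: str, salary: float, hire_date: str) -> None:
            super().__init__(nic, name, address, birthdate)
            self.empid, self.salary, self.hire_date = empid, salary, hire_date

    class Student(Person):    # STUDENT subtype-of PERSON: RegNo, IndexNo, Gpa
        def __init__(self, nic, name, address, birthdate,
                     regno: str, indexno: str, gpa: float) -> None:
            super().__init__(nic, name, address, birthdate)
            self.regno, self.indexno, self.gpa = regno, indexno, gpa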
Object Databases - Overview of Object Database
Concepts

Type Hierarchies and Inheritance


Geometric_Obj : Name, Colour, Area, Cal_area
Circle_Obj subtype-of Geometric_Obj : radius
Rectangle_Obj subtype-of Geometric_Obj : width, height
• In the above example, Geometric_Obj is the supertype. Circle_Obj and Rectangle_Obj are subtypes that inherit all the functions of Geometric_Obj and add their own additional functions.

Object Databases - Overview of Object Database
Concepts

[Figure: supertype Geometric_Obj (Name, Colour, Area, Cal_area) with subtypes Circle_Obj (Radius, GetRadius(rad: float)) and Rectangle_Obj (width, height, GetWidth(w: float), GetHeight(h: float)).]
Activity

Toy: product_ID, colour, age_limit, battery, toy_type, price.


Create two subtypes, Bear and Car, of the Toy supertype. Type Bear has greet, sing, and size as additional functions, and type Car has a controller function.

Object Databases - Overview of Object Database
Concepts

Multiple Inheritance
• Multiple inheritance takes place when a subtype inherits from two or more supertypes. In such cases, the subtype may inherit all the functions of all its supertypes.
• For example, ENGINEERING_MANAGER can be a subtype of both MANAGER and ENGINEER.
• Multiple inheritance results in a type lattice rather than a type hierarchy.

Object Databases - Overview of Object Database
Concepts

[Figure: type lattice — ENGINEER, MANAGER, and SALARIED-EMPLOYEE are subtypes of EMPLOYEE, and ENGINEERING-MANAGER is a subtype of both ENGINEER and MANAGER.]
Object Databases - Overview of Object Database
Concepts

Problems with multiple inheritance


• Ambiguity - a subtype inherits from two supertypes that both have a function with the same name.
• Such cases present an ambiguity as to which implementation of that function should be inherited.
• For example, both MANAGER and ENGINEER may have a function called Salary. If MANAGER and ENGINEER have different implementations of the Salary function, an ambiguity exists as to which of the two is inherited by the subtype ENGINEERING_MANAGER.

Object Databases - Overview of Object Database
Concepts

Problems with multiple inheritance


Overcome Ambiguity
Given below are techniques for dealing with ambiguity in
multiple inheritance.
• Check for ambiguity during the creation of the subtype and let the user/developer specify which function should be inherited.
• Use a system default.
• Check for name ambiguity and deny multiple inheritance if
there is an ambiguity present. It is possible to request the
user to change function names in supertypes to eliminate
ambiguity.

Object Databases - Overview of Object Database
Concepts

Selective inheritance
• Selective inheritance enables a subtype to inherit only some of the functions of a supertype; the other functions are not inherited.
• The functions of a supertype that are not to be inherited by the subtype are listed in an EXCEPT clause.
• The mechanism of selective inheritance is typically not provided in ODBs, but it is used more frequently in artificial intelligence applications.

Object Databases - Overview of Object Database
Concepts

Polymorphism of operations / operator overloading


• The OO concept of providing one operation in multiple forms is known as polymorphism. This concept is also known as operator overloading.
• Polymorphism is derived from two Greek words: "poly", meaning many, and "morphs", meaning forms. Thus, polymorphism means "many forms".
• The implementation varies based on the object type that
operation is applied to.
• E.g: an operator to calculate area of a geometric
object can be applied to different object types such as
triangles, rectangles and circles. Implementation
differs based on the object type.

Object Databases - Overview of Object Database
Concepts

Polymorphism of operations / operator overloading


• The Cal_area function for calculating the area of a circle or a rectangle has two different implementations, because the area calculation for a circle is different from the area calculation for a rectangle: Circle and Rectangle are two different types.
  For Circle_Obj:    Cal_area (radius: float)
  For Rectangle_Obj: Cal_area (width: float, height: float)
  In this example the Cal_area operation is overloaded with different implementations, as sketched below.
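A hedged Python sketch of the overloaded Cal_area operation (class and method names mirror the slide; the dispatch shown is Python's late binding):

    import math

    class GeometricObj:
        def cal_area(self) -> float:          # one operation name...
            raise NotImplementedError

    class CircleObj(GeometricObj):
        def __init__(self, radius: float) -> None:
            self.radius = radius
        def cal_area(self) -> float:          # ...implementation for circles
            return math.pi * self.radius ** 2

    class RectangleObj(GeometricObj):
        def __init__(self, width: float, height: float) -> None:
            self.width, self.height = width, height
        def cal_area(self) -> float:          # ...implementation for rectangles
            return self.width * self.height

    for shape in (CircleObj(2.0), RectangleObj(3.0, 4.0)):
        print(shape.cal_area())               # the object's type selects the implementation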

Object Databases - Overview of Object Database
Concepts

[Figure: Geometric_Obj (Name, Colour, Area, Cal_area) with subtypes Circle_Obj and Rectangle_Obj, each providing its own implementation of the overloaded operation: Cal_area(radius: float) for circles and Cal_area(width: float, height: float) for rectangles.]
Object Databases - Overview of Object Database
Concepts

Polymorphism of operations / operator overloading


• Static (early) binding: the appropriate method is determined at compile time. Example: for area calculation, the type of the object is known before the program is compiled.
• Dynamic (late) binding: the object type is checked during program execution and the appropriate method is then called. Example: for area calculation, the appropriate Cal_area implementation is called based on the object's type.

Activity

Match the correct phrase from the given list.


(Ambiguity, Naming mechanism, overloading, Type lattice, hidden
attributes, late binding)

1. A problem with multiple inheritance


2. A result of multiple inheritance
3. Call the appropriate method during the runtime
4. Attributes are encapsulated and access through predefined
operations
5. A mechanism to make an object persistence

XML Databases - Reason for the Origination of XML

• XML - Extensible Markup Language


• Just as desktop applications connect to local databases in order to run, web applications have interfaces that connect them to data sources.
• Data sources can be used to access information in web applications.
• Web pages are formatted using hypertext documents.
  • Eg: HTML

XML Databases - Reason for the Origination of XML

• HTML - Not a good method to represent structured data


• Languages that can structure and exchange data on web
applications;
• XML
• JSON
• XML - describes, within the text document itself, the meaning of the data and how the data is structured in web pages (self-describing documents).
  • Eg: attribute names, attribute values

XML Databases - Reason for the Origination of XML

• Basic HTML produces static web pages.
• Dynamic web pages take user input and produce different, interactive output for different inputs.
• XML is used to transfer self-describing text documents in dynamic web pages.
Activity

State whether the following statements are true or false.


1. HTML is a good method to represent structured data.

2. A self describing document mentions both the meaning


of data and how they are structured.

3. XML supports only the functioning of static web pages.

4. Generating different outputs that match the different


inputs is done with static data.

5. Generating specific order values for different customers in e-commerce is an instance of dynamic data.

XML Databases - Structured, Semi - Structured and
Unstructured Data

Structured Data
• Information in Relational Databases - Structured Data
• Each Data Record in the database follows the same
structure

• Structure is decided by the database schema

XML Databases - Structured, Semi - Structured and
Unstructured Data
Semi - Structured Data
• Some data is collected in unplanned situations. Therefore,
all the data may not have the same format.
• Data may have a certain structure. However, not all the
data has the same structure
E.g: Some have additional attributes; Some
attributes maybe present only in some data.
• No predefined schema
• Data model to represent semi - structured data
- Tree data structure
- Graph data structure

XML Databases - Structured, Semi - Structured and
Unstructured Data

• Semi-Structured Data as a Directed Graph
  [Figure: a directed graph whose leaf nodes hold the data values "Pick & Ride", "P001", and "Colombo".]
XML Databases - Structured, Semi - Structured and
Unstructured Data

Semi - Structured Data


• Schema information and data values are represented as
mixed.
i.e.: Names of attributes, Relationships etc. are mixed
with data values because the different entities will have
different attributes.

XML Databases - Structured, Semi - Structured and
Unstructured Data

Semi-Structured Data as a Directed Graph
[Figure: the same directed graph annotated — internal nodes represent objects, labels/tags appear on the edges, and the leaf nodes hold the data values "Pick & Ride", "P001", and "Colombo".]
XML Databases - Structured, Semi - Structured and
Unstructured Data

Semi - Structured Data as a Directed Graph


• Semi structured data is represented as a directed graph as
given below:
– Labels/Tags represent Schema Names (names of
attributes, object types, relationships).
– Internal nodes represent Objects or Composite attributes
– Leaf Nodes represent actual data values

Hierarchical (Tree) ● Internal Nodes represent complex


Data Model elements.
● Leaf Nodes represent data values.

XML Databases - Structured, Semi - Structured and
Unstructured Data

Unstructured Data
• Unstructured data has almost no specification of the type of data, e.g., web pages designed using HTML.
• <p></p> - outputs whatever data is inside the two tags, regardless of its meaning.
• <p><b></b></p> - data formatting is also mixed with the data values.

XML Databases - Structured, Semi - Structured and
Unstructured Data

Unstructured Data example: [Figure: an HTML document containing unstructured data.]
XML Databases - Structured, Semi - Structured and
Unstructured Data

• Unstructured Data
• HTML documents are harder to analyze and interpret
using software since they do not have schema information
on the type of data.
• However, as human activities have shifted to online environments, it has become necessary to interpret the data presented online correctly and to exchange that data.
• This need gave rise to the use of XML in online data-manipulation environments.

Activity

Select the correct type of data for the following.


(Structured data, Semi structured data, Unstructured data)

1. Geo Spatial and Weather data at different conditions


2. Customer transaction database of a retail outlet
3. Email data
4. XML Data
5. Data generated by robots at intense environments
6. Student details database of a school management system

XML Databases - XML Hierarchical (Tree) Data Model

Constructing an XML Document


• An XML document is composed of XML objects.
• The structure of an XML document includes:
  – Elements
  – Attributes

<book>
   <title>Gamperaliya</title>
   <author id="ABC">Martin Wickramasinghe</author>
</book>
<book>
   <title>Siyalanga Ruu Soba</title>
   <author id="ABC">J. B. Dissanayake</author>
</book>

In this example, <title> and <author> are elements, and id is an attribute.
• As in HTML, each XML element has a start tag and a matching end tag; attributes appear inside the start tag.
• The tag name is given inside the angle brackets.
XML Databases - XML Hierarchical (Tree) Data Model

• A major difference between XML and HTML is that HTML


tags define how the text is to be displayed whereas XML tag
names describe the meaning of the data elements in the
document.
• This makes it possible to process the data elements in the
XML document automatically by computer programs.
• Also, the XML tag (element) names can be defined in another
document, known as the schema document, to give a
semantic meaning to the tag names that can be exchanged
among multiple programs and users.
• In HTML, all tag names are predefined and fixed; that is why
they are not extendible.

XML Databases - XML Hierarchical (Tree) Data Model
• Constructing an XML Document: [Figure: a sample XML document.]
XML Databases - XML Hierarchical (Tree) Data Model

Constructing an XML Document
• Simple elements include data values.
• Complex elements are created from other elements, according to the hierarchy they represent.
• Tag names are chosen so that the meaning of the data inside the tag is conveyed (in HTML, the tag name describes formatting instead).
• A separate document, called the schema document, is prepared to convey the meanings of the created tags.
• Schemaless XML documents - XML documents that do not have predefined element names are called schemaless XML documents; they do not follow a predefined hierarchical tree structure. (A Python sketch of elements and attributes follows below.)
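A small Python sketch (standard library only; the <library> wrapper element is an assumption added to make the snippet well formed) that parses the book example from the earlier slide, showing complex elements, simple elements, and an attribute:

    import xml.etree.ElementTree as ET

    doc = """<library>
      <book>
        <title>Gamperaliya</title>
        <author id="ABC">Martin Wickramasinghe</author>
      </book>
    </library>"""

    root = ET.fromstring(doc)                    # <library> and <book> are complex elements
    for book in root.findall("book"):
        title = book.find("title").text          # <title> is a simple element (a data value)
        author = book.find("author")
        print(title, author.text, author.attrib["id"])   # id is an attribute of <author>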

XML Databases - XML Hierarchical (Tree) Data Model

• Constructing an XML Document: [Figure: an XML tree in which the nested elements are complex elements and the leaf elements holding data values are simple elements.]
XML Databases - XML Hierarchical (Tree) Data Model

There are three main types of XML documents:
• Data-centric XML documents: include many small data items, transferred over the web with a predefined structure.
• Document-centric XML documents: include large amounts of text data with little structure.
• Hybrid XML documents: include both text and structured data.
XML Databases - XML Hierarchical (Tree) Data Model

Data-centric XML documents


• Data-centric XML documents usually have a predefined schema
that defines the tag names.
• These documents contain many small data items that follow a
specific structure and hence may be extracted from a structured
database. Thus, data-centric XML is used to mark up highly
structured information such as the textual representation of
relational data from databases.
• Since data-centric XML documents represent structured
information in its textual representation, these documents are
used to exchange data over the Web. Therefore, web services
are about data-centric uses of XML.

XML Databases - XML Hierarchical (Tree) Data Model

Data-centric XML documents


• The XML includes many
different types of tags and
there is no long-running
text.
• The tags are organized in a
highly structured manner
and the order and
positioning of tags matter,
relative to other tags. For
example, <Hours> is
defined under <Employee>,
which should be defined
under <Project>. The
<Projects> tag is used only
once in the document.

XML Databases - XML Hierarchical (Tree) Data Model

Document-centric XML documents


• These documents have large amounts of text, such as news
articles or books.
• There are few or no structured data elements in these
documents.
• The usage rules for tags are very loosely defined and they could
appear pretty much anywhere in the document:

XML Databases - XML Hierarchical (Tree) Data Model

Document-centric XML documents


<H1>Types of XML documents</H1>
<P>It is possible to characterize <B>three main types </B> of XML
documents</P>
<LIST>
<ITEM>Data-centric XML documents.</ITEM>
<ITEM>Document-centric XML documents</ITEM>
<ITEM>Hybrid XML documents</ITEM>
</LIST>
<P>If an XML document conforms to a predefined XML schema or
DTD then the document can be considered as <LINK
HREF="structuredXML.xml"> structured data </LINK> </P>

XML Databases - XML Hierarchical (Tree) Data Model

Hybrid XML documents


• These documents may consist of both structured data and
textual or unstructured data.
• They may or may not have a predefined schema.

Activity

Select the correct term for the description.


(Composite elements, Leaf nodes, Document centric XML
documents, Schema document, Schemaless XML documents)

Represents data values in the hierarchical


(tree) model
Document prepared for XML tag definition
Includes large volumes of text XML data
Elements created from other elements
Does not have a data definition for XML tags

NoSQL Databases - Origins of NoSQL

• Relational database management systems (RDBMSs) provide many advantages and are therefore widely used.
• However, there are a number of issues, listed below, that RDBMSs are not capable of handling well. This gave rise to the search for alternative ways of managing data.
  i) Impedance mismatch (data modeling issues)
  ii) Drawbacks of shared database integration (data modeling issues)
  iii) Difficulties in scaling up to manage the growth in data volume (scalability and availability issues)

NoSQL Databases - Origins of NoSQL

Impedance Mismatch
• Impedance mismatch is the term used to refer to the
dissimilarity between the relational database model and the
programming language model (data structures in-memory).
• Although relational data model represents data as relations
in tabular format which is simple, it introduces certain
limitations. In particular, tabular representation cannot
contain any structure, such as a nested record or a list.

NoSQL Databases - Origins of NoSQL

Impedance Mismatch
There are no such limitations for in-memory data structures,
which can take on much richer structures than relations.
Consequently, a richer in-memory data structure must be translated into a relational representation to store it on disk. These two different representations cause the impedance mismatch, which requires translation from one representation to the other.

NoSQL Databases - Origins of NoSQL
[Figure: impedance mismatch — a single in-memory order aggregate (Orders ID: 0001, Customer: Kamal, Card: HSBC, Cc No. 1234, Expiry: 05/25, with order lines 343, 344, 345, 346 and their quantities and prices) has to be split across several relational tables: Customers, Orders, Order Lines, and Card Details.]
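A hedged Python sketch of the mismatch illustrated above (field names and values are taken from the figure; the flat tuples are only one possible relational layout):

    # The nested in-memory structure has to be flattened into several
    # relational rows (plus foreign keys) before it can be stored.
    order = {
        "id": "0001",
        "customer": "Kamal",
        "card": {"issuer": "HSBC", "cc_no": "1234", "expiry": "05/25"},
        "order_lines": [
            {"line_id": 343, "qty": 2, "price": 6590},
            {"line_id": 344, "qty": 3, "price": 76554},
        ],
    }

    orders_rows      = [("0001", "Kamal")]
    order_line_rows  = [(343, "0001", 2, 6590), (344, "0001", 3, 76554)]
    card_detail_rows = [("0001", "HSBC", "1234", "05/25")]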
NoSQL Databases - Origins of NoSQL

Impedance Mismatch
• The impedance mismatch is made much easier to deal with by object-relational mapping frameworks, such as Hibernate, that implement well-known mapping patterns.
• Although these mapping frameworks remove a lot of tedious work, the mapping problem remains an issue: how the in-memory data structures are represented is not considered from the database's perspective, which can hinder query performance.

NoSQL Databases - Origins of NoSQL

Drawbacks of Shared Database Integration


• The database acts as an integration database with multiple
applications, developed by separate teams, storing their data
in a common database.
• To this end, the primary factor is the role of SQL as an
integration mechanism between applications.
• Although shared database integration improves
communication enabling all the applications to operate on a
consistent set of persistent data there are certain drawbacks
of database integration.

NoSQL Databases - Origins of NoSQL
Drawbacks of Shared Database Integration
The drawbacks of Database Integration are as given below:
• A structure that is designed to integrate many applications often becomes more complex than any single application needs.
• If an application requires to make changes to its data storage, it
needs to coordinate with all the other applications using the
database.
• Since different applications have different structural and
performance needs, an index required by one application may
hinder the performance of insert operations of another
application.
• Usually, a separate team is responsible for each application.
This means that the database cannot trust applications to
update the data in a way that preserves database integrity.
Thus, the responsibility of preserving database integrity is
within the database itself.
NoSQL Databases - Origins of NoSQL

Drawbacks of Shared Database Integration

• A different approach is to treat the database as an application


database which is directly accessed by only a single
application codebase that is controlled by a single team.

• With an application database, only the team using the


application needs to know about the database structure, which
makes it much easier to maintain and evolve the schema.

• Since the application team takes care of both the database and
the application code, the responsibility for database integrity
can be passed onto the application code.
• Web services (where applications would communicate over
HTTP) enabled a new form of a widely used communication
mechanism which challenged the use of SQL with shared
databases.
NoSQL Databases - Origins of NoSQL

Shared Database Integration vs. Application Database

[Figure: with shared database integration, the sales system and the inventory system both operate on one common database; with application databases, the sales system and the inventory system each have their own application database and communicate through web services.]
NoSQL Databases - Origins of NoSQL

Advantages of Application Database


• Using web services as an integration mechanism alongside application databases resulted in more flexibility for the structure of the data being exchanged.
• Communication with SQL, requires the data to be structured
as relations. However, use of a web service, enables data
structures with nested records and lists to be used.
• These data are usually represented as documents in XML or,
more recently, JSON.

NoSQL Databases - Origins of NoSQL

Scalability & Availability issues


• As large volumes of data were generated and data traffic grew, it became clear that more computing resources were needed to store the data.
• One suggestion was to scale up: buy bigger storage machines with higher processing power and more memory to handle this kind of growth.
• However, as machines get bigger, they become disproportionately expensive.
• Alternatively, a set of small machines working in parallel as a cluster can be used.

NoSQL Databases - Origins of NoSQL

Scalability & Availability issues


• A collection of small machines working as a cluster is comparatively cheaper than buying one bigger machine.
• It also provides high reliability and thus high availability, since even if a single machine fails, the cluster remains up and running.

NoSQL Databases - Origins of NoSQL
NoSQL databases emerged mainly to provide the following
advantages
a) Flexible data modeling: NoSQL emerged as a solution to the
impedance mismatch problem between relational data models and
object-oriented data models. NoSQL covers four different data
organization models as given below which are highly-customizable
to different businesses' needs.
i) Document databases: store data as documents similar to
JSON (JavaScript Object Notation) objects. Each document has
pairs of fields and values.
ii) Key-Value databases: represent a simpler type of database
where each item contains keys and values. A value can only be
retrieved by referencing its key and thus querying for a specific
key-value pair is simple.
iii) Column-Family databases: store data in tables, rows, and
in dynamic columns. This data model provides a lot of flexibility
over relational databases because each row is not required to
have the identical columns.
NoSQL Databases - Origins of NoSQL

iv) Graph databases: store data in nodes and edges


representing a graph structure. Nodes for example store
information about people, places, and things while edges store
information about the relationships between the nodes.
The details of these data models are explained in the next slides.

b) High Scalability: NoSQL enables to manage growing data


volumes since it is cluster-friendly. This feature makes it easily
scalable, as large data sets can easily be split across clusters, with
or without replication.
c) High Availability: NoSQL assures 100% database uptime. For
businesses that process huge numbers of transactions, the goal of
accessibility (which is related with the speed) is to have the
data/page available quickly, even if it is not 100% accurate. This is
due to the fact that if a data request does not produce a visible result
fast, the user will navigate away, assuming something is wrong.
NoSQL Databases - Origins of NoSQL

Important rise of NoSQL with Polyglot Persistence


• Different kinds of data could best be dealt with different data
storage methods. In short, it means picking the right data
storage for the right use case.
• Rather than using Relational databases for different data
types, it is possible to select a data storage based upon
– the nature of data
– the ways of manipulating it.
• Polyglot Persistence is to use many data storage
technologies to suit the data requirement.

NoSQL Databases - Origins of NoSQL

Polyglot Persistence example: an e-commerce platform can use a key-value store for shopping cart and session data, an RDBMS for inventory and item prices, a document store for completed orders, and a graph store for the customer social graph.

Functionality    | Considerations                                                                               | Database type
User sessions    | Rapid access for reads and writes; no need to be durable.                                   | Key-value
Shopping cart    | High availability across multiple locations; can merge inconsistent writes.                 | Document (possibly key-value)
Financial data   | Needs transactional updates; tabular structure fits the data.                               | RDBMS
POS data         | Depends on size and rate of ingest; lots of writes, infrequent reads, mostly for analytics. | RDBMS (if modest), key-value or document (if ingest is very high), or column-family if analytics is key
Recommendations  | Rapidly traverse links between friends, product purchases, and ratings.                     | Graph (column-family if simple)
Product catalog  | Lots of reads, infrequent writes; products make natural aggregates.                         | Document

Reference: https://www.jamesserra.com/archive/2015/07/what-is-polyglot-persistence/
NoSQL Databases - Origins of NoSQL

Common characteristics of NoSQL databases


• NoSQL databases do not use SQL. However, some NoSQL databases provide query languages equivalent to SQL, and those languages are easy to learn.
• Most NoSQL projects are open source.
• The majority of NoSQL databases are able to run on clusters.
• The capability of running on clusters has consequences for the data model as well as for the degree of consistency provided.

NoSQL Databases - Origins of NoSQL

Common characteristics of NoSQL databases


• In relational databases, ACID properties are preserved and
ensure the consistency across the whole database.
• However, the scenario in a clustered environment is different.
In NoSQL environment, there are several levels for consistency
and a range of options for distribution.
• There are certain NoSQL databases which do not facilitate in
running on clusters.
• Graph database is one of the NoSQL data models that offer a
distribution similar to relational databases which is more
suitable for the data with complex relationships.
• NoSQL databases are schemaless, which means fields can be added to records freely, without first changing a predefined structure.
NoSQL Databases - Data models in NoSQL

Introduction to Aggregate data models


• Relational model: every operation is performed on tuples.
• Aggregate orientation: operations are performed on data in units (aggregates), which are more complex structures than tuples.
• Such a unit can be thought of as a record in which other structures can be nested.

NoSQL Databases - Data models in NoSQL

Introduction to Aggregate data models


• An aggregate is a set of related objects that is treated as a unit.
• Aggregates form a natural unit for replication and sharding, which makes it easy to operate on clusters.
• Aggregates also make life easier for application programmers, who manipulate data through aggregates.

NoSQL Databases - Data models in NoSQL

Data model for complex relationship structures


• Aggregates are helpful as they put together data that is
accessed together.
• However, related data may be accessed in different ways. Consider the connection between a customer and the orders placed by that customer:
  • Some applications need to see the order history whenever they access the customer, so joining the customer with the order history into a single aggregate makes sense. Other applications might need to handle orders as individual aggregates.

NoSQL Databases - Data models in NoSQL

Data model for complex relationship structures


• The mechanism of handling updates is important when
studying about relationships between aggregates.
• Aggregates are treated as the unit in the data retrieval
process when it comes to the Aggregate-oriented
databases.
• This treatment allows the property of Atomicity within the
contents of aggregates.
• In relational databases, we can modify a set of records in the same transaction, with ACID guarantees on those modifications.
• NoSQL databases came into the picture because of the need to run on clusters. This paved the way for aggregate-oriented data models, which have a large number of records but simple connections.
NoSQL Databases - Data models in NoSQL

E.g: - Suppose we are going to implement an e-commerce portal.


We will be storing data about orders, users, payments, product
catalog, and a set of addresses. We will be using this data model in
a relational environment.

NoSQL Databases - Data models in NoSQL: Sample
of data stored in the tables
Customer (Id, Name): (1, Shantha)
Order (Id, CustomerId, ShippingAddressId): (99, 1, 77)
Product (Id, Name): (27, Laptop)
Billing Address (Id, CustomerId, AddressId): (55, 1, 77)
Order Item (Id, OrderId, ProductId, Price): (100, 99, 27, 145000)
Address (Id, Name): (77, Kandy)
Order Payment (Id, OrderId, Card Number, BillingAddressId, TxnId): (33, 99, 34-886-89, 55, act-564)
NoSQL Databases - Data models in NoSQL

• The same model can be represented in an aggregate-oriented environment as two nested documents, as sketched below.
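A hedged Python sketch of that aggregate view (field names are illustrative; the values match the sample tables above):

    # The customer and order aggregates, each holding its nested data.
    customer = {
        "id": 1,
        "name": "Shantha",
        "billing_addresses": [{"city": "Kandy"}],
    }

    order = {
        "id": 99,
        "customer_id": 1,
        "order_items": [{"product_id": 27, "name": "Laptop", "price": 145000}],
        "shipping_address": {"city": "Kandy"},
        "order_payments": [{
            "card_number": "34-886-89",
            "txn_id": "act-564",
            "billing_address": {"city": "Kandy"},
        }],
    }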
NoSQL Databases - Data models in NoSQL

• The two main aggregates here are:
  • customer
  • order
• The aggregation structure is represented with the UML composition symbol (a black diamond).
• The properties held by each entity are:
  • Customer -> billing addresses
  • Order -> order items, a shipping address, payments
  • Payment -> a billing address
NoSQL Databases - Data models in NoSQL

• In the example data, a single logical address record appears three times.
• Instead of referring to it by ID, the aggregate-oriented representation copies the same value in several places.
• This suits a scenario in which we want to be sure the recorded addresses never change.
• In a relational database, when we want to keep an address from changing, we can add a new row; when using aggregates, a copy of the entire address structure can be used wherever we prefer.
NoSQL Databases - Data models in NoSQL

Reason for using Aggregate data models


• Aggregate orientation facilitates running on clusters.
• When creating data clusters, it is necessary to consider the number of nodes that must be accessed when retrieving data.
• Aggregates also have an important consequence for transactions.
• In a relational database, any combination of rows from different tables can be manipulated in one transaction.
NoSQL Databases - Data models in NoSQL

• Key-Value Model and suitable Use Cases


• Lookup based on keys can be used to access and
aggregate in a key-value store.
• Queries to the database can be submitted based on the
fields in the aggregate in a document database. Database
creates indexes based on the content of an aggregate.
Moreover, part of an aggregate can be retrieved.
• The aggregate being stored could be any data structure; it
does not have to be a domain object. (eg: Redis)
• Redis supports storing lists, sets, hashes. It can also
perform operations such as range, diff, union, and
intersection.
• Because of these features, Redis can be used in many ways beyond a general key-value store.
© e-Learning Centre, UCSC 1
NoSQL Databases - Data models in NoSQL

Key-Value Model and suitable Use Cases


• There are many key-value databases.
• Riak, Redis, Oracle NoSQL are examples for key-value
databases.
• In the Riak database, keys are segmented by storing them in separate buckets.
• Consider a scenario where we want to store several kinds of information, such as user session data, shopping cart data, and location data.
• We can store all of that information in one bucket as a single key and a single value object.
• Now we have a single object with all the data that is stored
in a single bucket.
1
© e-Learning Centre, UCSC 1
NoSQL Databases - Data models in NoSQL

• Key-Value Model and suitable Use Cases

Bucket = userData
  Key   = sessionID
  Value = object
    UserProfile
    SessionData
    Shopping Cart
      CartItem
      CartItem

1
© e-Learning Centre, UCSC 1
NoSQL Databases - Data models in NoSQL

Key-Value Model and its Use Cases


• Session Information Use Case
• We know that web sessions are unique and each of those
has a unique session- id value.
• Almost all applications store this session-id, and applications that use key-value stores gain extra benefits from it.
• This is because all session details can be inserted with one PUT request and fetched with one GET.
• Processing becomes faster since all the session data is stored in a single object and handled in a single operation.

1
© e-Learning Centre, UCSC 1
NoSQL Databases - Data models in NoSQL

Key-Value Model Example

(Figure: key-value model example - a book record with title 'Sinhala Natyaye Wikashaya', isbn 111-444-3333 and year 1867 is stored in a key-value store using the ISBN as the key; further ISBN keys such as 222-555-7777 are shown with year values (1867, 1899, 1920) as scores.)
© e-Learning Centre, UCSC
NoSQL Databases - Data models in NoSQL

Document Data Model and suitable Use Cases


• By adding an ID field to each document, a document database can be treated as a key-value lookup on that ID.
• A key-value database allows values with different structures to be stored.
• The main item stored in a document database is the document.
• It stores and retrieves documents such as XML, JSON, and BSON.
• The features of these documents are:
– self-describing
– hierarchical data structures which consist of different
types of values.
1
© e-Learning Centre, UCSC 1
NoSQL Databases - Data models in NoSQL

Document Data Model and suitable Use Cases


• The documents stored in a collection are similar to each other.
• In document databases, the document is stored as the value and its ID is stored as the key.
• This is similar to a key-value store in which the value part can be examined.
• Different documents can have different schemas, whereas in a relational database every row of a given table must have the same schema.
• Nevertheless, such differently structured documents can still belong to the same collection.

1
© e-Learning Centre, UCSC 2
NoSQL Databases - Data models in NoSQL

Document Data Model and suitable Use Cases


• Use case of Logging of events
• Different event logging needs of applications can be stored
in a Document databases.
• Document database will then become the central store of
event data and it is important to be kept in dynamic
situations.
• Name of the application, type of event can be used when
you want to share the event.

1
© e-Learning Centre, UCSC 2
NoSQL Databases - Data models in NoSQL
Document Data Model Example

Relational model     Document model
Tables               Collections
Rows                 Documents
Columns              Key/value pairs
Joins                Can be applied where necessary
© e-Learning Centre, UCSC
NoSQL Databases - Data models in NoSQL

Column-Family Stores and suitable Use Cases


• Column stores can be thought of as databases whose data model resembles one big table.
• Storing data row by row helps write operations, but there are situations where writes are infrequent while reads frequently touch a subset of columns across many rows.
• In such cases it makes sense to use groups of columns for all rows as the fundamental unit of storage, hence the name "column (column-family) databases".

1
© e-Learning Centre, UCSC 2
NoSQL Databases - Data models in NoSQL

Column-Family Stores and suitable Use Cases


• The column-family model can be thought of as an aggregate structure that spans two levels.
• As in key-value stores, the first key is a row identifier that uniquely selects the row aggregate of interest.
• This row aggregate is itself a map of more detailed values.
• The values at this second level are what we call columns.

1
© e-Learning Centre, UCSC 2
NoSQL Databases - Data models in NoSQL
Column-Family Example

Row key: 6783
  Column family "Profile":
    name           : Kawshi      (column key "name", column value "Kawshi")
    billingAddress : data...
    payment        : data...
  Column family "Orders":
    ORDER1 : data...
    ORDER2 : data...
    ORDER3 : data...
    ORDER4 : data...
© e-Learning Centre, UCSC 2
NoSQL Databases - Data models in NoSQL

Column-Family Stores and suitable Use Cases


• Content Management Systems, Platforms used for blogs
Use Case
• Column families allow us to:
‒ Record blog entries with features such as tags, classifications, connections, and tracing options.
‒ Store comments either in the same row or, if preferred, in a different keyspace.
• Separate column families can be used to record the blog's users and the actual blog entries.

1
© e-Learning Centre, UCSC 2
NoSQL Databases - Data models in NoSQL

Graph Data Stores


• Graph databases use a different model: many small records with complex interconnections.
• Graph databases specialise in capturing interconnected information, and querying those interconnections is much easier than tracing them on a diagram.
• They are best used for extracting complex connections, such as those in social networks, users' preferences for products, or eligibility rules.
• A graph database's data model consists of nodes connected by edges (also called arcs).

1
© e-Learning Centre, UCSC 2
NoSQL Databases - Data models in NoSQL

Graph Data Stores


• Graph databases enable fast traversal of the joins (relationships).
• The relationship between nodes is not calculated at query time; it is persisted as a relationship itself.
• This is because traversing persisted relationships is more efficient than performing the calculation for each individual query.
• There could be different types of relationships between
nodes.

1
© e-Learning Centre, UCSC 2
NoSQL Databases - Data models in NoSQL

Graph Data Stores

Relational model     Graph model
Tables               Vertices and edges set
Rows                 Vertices
Columns              Key/value pairs
Joins                Edges

(Figure: a small social graph whose vertices are people - Krishna, Sithara, Ana, Kevin, Fathima and Chamal - connected by edges.)
© e-Learning Centre, UCSC 2 2
Activity

Match the most relevant description, with the data models given.

Data Model            Description
Document              Store all key-value pairs together in a single namespace.
Key-value stores      Have a row key and, within that, a combination of columns that fit together.
Column family         Enables to traverse the joins fast.
Graph data stores     Organise documents into groups called collections.

1
© e-Learning Centre, UCSC 3
Activity

You have given a set of NoSQL databases and data models. Drop the
databases in left hand side to its relevant data model in right hand side.

Databases:   HBase, Neo4j, HyperTable, Redis, CouchDB, MongoDB, Riak, FlockDB, Cassandra, Infinite Graph
Data models: Key-value, Document, Column-family, Graph
1
© e-Learning Centre, UCSC 3
Activity

Sandun is a software Developer who is planning to implement a system


on network and IT operations. He has to decide on what is the most
suitable NoSQL database model for his system. The database model he
selects should be able to represent the connection of IT managers,
catalog assets and their deployments. With the application, network
managers can do analyses, answering the following questions:
• Which applications or services do particular customers rely upon?
(Top-down analysis)
• In case of a failure in a network element, which applications and
services, and customers will be affected? (Bottom-up analysis)

What is the most suitable NoSQL database model for the above use
case?

© e-Learning Centre, UCSC


Activity

Rehana is an architect working on a software project on Event


logging. It has the following features.
• Simple setup and maintenance
• Very high velocity of random read & writes
• Less secondary index needs
• Wide column requirements.
What is the most suitable NoSQL database model for the
above software product?

© e-Learning Centre, UCSC


Activity

Mohammad wants to build an e-commerce website to have a


shopping cart. He wants the shopping carts to have the capability to
be accessible across browsers, and sessions. His friend, Karthigai
suggested adding the user-id and all the shopping information as two
What is the most suitable NoSQL database model for the
above software product?

© e-Learning Centre, UCSC


Activity

Drag and drop the correct answer to the blanks from the given list.
(Consistency, Availability, Partition Tolerance, master-slave,
Replication, replica sets, column, memtable)

The CAP theorem states that we can ensure only two features
of_______, _________, and ________________. In Document
databases, availability is improved by data replication using the
______ setup. With __________, we can access data stored in
multiple nodes and the clients do not have to worry about the failure
of the primary node since data is available in other nodes as well. In
MongoDB, availability is achieved using ___________. Cassandra
is a column-family database which uses _____ as the basic unit of storage. The procedure Cassandra follows when receiving a write is as follows: data is stored in memory only after it has been written to a commit log. The term used for that in-memory structure is ________.

© e-Learning Centre, UCSC


Activity
Drag and drop the correct answer to the blanks from the given list.
(inter-aggregate, intra-aggregate, node and edge, schemaless,
materialised views, map-reduce)

It is difficult to handle _________ relationships than ___________


relationships.
____________ are more suitable for the application which have
connected relationships.
One of the main advantages we have in __________databases is
the ability of adding fields to records freely.
In graph databases, __________ are provided as directed
connections of two node entities.

© e-Learning Centre, UCSC


Activity

Drag and drop the correct answer to the blanks from the given list.
(aggregate , unit, ACID, aggregate-oriented, clusters)

A data collection that we consider as a unit is known as an


_________.
___________ properties guarantee that once a transaction is
complete, its data is consistent and stable on disk.
Key-value, document, and column-family are different forms of ___________ database.

Aggregates make the database easier to handle in _________.

© e-Learning Centre, UCSC


Object databases and Relational databases

• Handling Relationships
• In ODBs, features of a relationship which is also called
reference attributes are used to manage relationships
between the objects. In relational DBs, attributes with
matching values are used to specify the relationship
among the tuples(records). Foreign keys are used for
referencing relation in relational DBs.
• Single reference or collection of references can be used
in ODBs, but the basic relational model only support
single valued references. Hence, representation of many
to many relationships are not straight forward in relational
model, a separate relation should be created to represent
M:N relationships.
• However, mapping of binary relationships in ODBs is not
a direct process. Therefore, designer should specify
which side should possess the attributes.
1
© e-Learning Centre, UCSC 3
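For instance, the following is a hedged SQL sketch of how an M:N relationship (here, between assumed Student and Course relations) must be represented as a separate relation in the relational model; the table and column names are illustrative assumptions.

-- Students and Courses are related M:N, so a separate Enrolment relation is created.
CREATE TABLE Enrolment (
    Sid CHAR(05),
    Cid CHAR(05),
    PRIMARY KEY (Sid, Cid),
    FOREIGN KEY (Sid) REFERENCES Student(Sid),
    FOREIGN KEY (Cid) REFERENCES Course(Cid)
);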
Object databases and Relational databases

• Handling Inheritance
• In ODBs, a construct for handling inheritance is built into the model itself, whereas the basic relational model has no such option.
• Specifying Operations
• In ODBs, operations must be specified at design time as part of the class specification. The relational model does not require the designer to specify operations during the design phase.
• The relational model supports ad-hoc queries, whereas ad-hoc queries would violate encapsulation in ODBs.

1
© e-Learning Centre, UCSC 3
XML and Relational databases
• Relationships
• XML databases follow a hierarchical tree structure with
simple and composite elements and attributes
representing relationships. But relational databases have
relationships among tables where one table is the parent
table, and the other table is the dependent table.
• Self - Describing Data
• In XML databases, the tags define both the meaning and
explanation of the data items together with the data
values. Hence different data types are possible within a
single document. In relational databases, data in a single
column should be of the same type and column definition
defines the data definition.

1
© e-Learning Centre, UCSC 4
XML and Relational databases

Inherent Ordering
• In XML databases, the order of the data items is defined
by the ordering of the tags. The document does not
include any other type of ordering. But in relational
databases, unless an order by clause is given, the
ordering of the data rows is according to the row ordering
inside tables.

1
© e-Learning Centre, UCSC 4
NoSQL and Relational databases

Data modelling difference


• A data model is the means by which we perceive and interact with the database and the data stored in it; this is not the same as a storage model.
• A storage model describes how the database stores and manipulates the data internally.
• In contrast, aggregate orientation recognises the need to operate on data that has a more complex structure than a simple set of records.

1
© e-Learning Centre, UCSC 4
NoSQL and Relational databases

Data modelling difference


• All NoSQL data models make use of a complex record that
lets us nest lists and other structures inside it.
• Domain-Driven Design gave birth to the term aggregate.
• There, an aggregate is defined as a collection of related objects that are treated as a unit.

1
© e-Learning Centre, UCSC 4
NoSQL and Relational databases

Data modelling difference

1
© e-Learning Centre, UCSC 4
NoSQL and Relational databases

Modelling for Data Access


• Aggregates can be used for analysis purposes. Denormalising the data lets us get to the data items of interest quickly.
• Keeping column information ordered and following naming conventions makes it easy to find the most frequently accessed columns fast.
• However, when modelling data with column families, we need to model on the basis of the queries we run rather than on how we write the data.
• The general rule is to make querying easy and to denormalise the data at the point of write.

1
© e-Learning Centre, UCSC 4
NoSQL and Relational databases

Modelling for Data Access

(Figure: a graph model with nodes Customer, Order, Product, Address and OrderPayment, connected by relationships such as PURCHASED, BELONGS_TO, BILLED_TO, PAID_WITH, SHIPPED_TO and PART_OF.)
1
© e-Learning Centre, UCSC 4
NoSQL and Relational databases

Modelling for Data Access


• In the figure above, to find all the Customers who PURCHASED the product named Refactoring Databases, we query the product node for Refactoring Databases and follow all the incoming PURCHASED relationships to the Customers.
• Clearly, graph databases are convenient for this kind of relationship traversal.
• They are also convenient when the information is used to generate recommendations or to discover patterns.
• When modelling data with graph databases, all objects are modelled as nodes and all relations as relationships.
• Relationships have two features: types and directions.
© e-Learning Centre, UCSC 4
NoSQL and Relational databases

Aggregate-oriented vs aggregate ignorant


• Inter-aggregate relationships are more complex to handle than intra-aggregate relationships.
• Although schemaless databases allow fields to be added to records without restriction, there is still an implicit schema that the users of the data need to know.
• Aggregate-oriented databases can compute materialised views, which organise the data differently from the primary aggregates; the mechanism typically used is map-reduce computation.

1
© e-Learning Centre, UCSC 4
NoSQL and Relational databases

Schemalessness in NoSQL
• NoSQL databases have no fixed schema, which is useful when we have to work with nonuniform data.
• In a key-value store, data is stored against a key, and any value can be put under that key.
• A document database achieves the same: there are no restrictions on the structure of the documents we store.
• Column-family databases let us store data under any columns we choose.

1
© e-Learning Centre, UCSC 4
NoSQL and Relational databases

Overview of Materialised Views


• In relational databases, the absence of aggregates (aggregate ignorance) lets users access the data in many different ways.
• Therefore, users can look at the data through different views.
• Although a view is similar to a relational table, it is not created physically in the database; it is defined on top of the base tables.
• Views can present derived data as well as data available in the base tables, without exposing the underlying source of the data to the client.
• Materialised views are views that are computed in advance and cached.
• They are beneficial for applications that perform many read operations.
© e-Learning Centre, UCSC 5
NoSQL and Relational databases
Overview of Materialised Views cont.
• NoSQL databases have no views as such, but they do have precomputed, cached queries that can be reused; these are called "materialised views".
• In an aggregate-oriented database, there can be queries that do not fit the aggregate structure well.
• We can also use materialised views inside the same aggregate.
• For example, suppose an order document includes an order summary. When querying only the order summary data, there is no need to transfer the full order document.
• In column-family databases, separate column families can be used for materialised views.
• The benefit of this approach is the ability to update the materialised view within a single atomic operation. (A hedged SQL sketch of a relational materialised view follows after this slide.)
© e-Learning Centre, UCSC 5
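As a hedged, PostgreSQL-style illustration of the relational counterpart described above, a materialised view can be declared and refreshed roughly as follows; the table and column names are assumptions based on the earlier e-commerce example.

-- Precompute an order summary so that read-heavy queries avoid repeating the joins.
CREATE MATERIALIZED VIEW OrderSummary AS
SELECT o.Id AS order_id, c.Name AS customer, SUM(oi.Price) AS total
FROM   Orders o
JOIN   Customer  c  ON c.Id       = o.CustomerId
JOIN   OrderItem oi ON oi.OrderId = o.Id
GROUP  BY o.Id, c.Name;

-- Recompute the cached data when it becomes stale.
REFRESH MATERIALIZED VIEW OrderSummary;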
Activity

State whether the following statements are true or false.

When an ad-hoc query for searching through the database needs to be executed, relational databases will give a better result than object databases.

In relational databases all the data items need to be of the same data
type while in XML databases, different data types are allowed.

Foreign key concept is available in both relational and XML


databases.

If the functionalities of a certain entity in a database are not known at the time of creating the entities, it is better to use the relational model.

Both Object databases and relational databases support object


oriented concepts at the creation of the database.
1
© e-Learning Centre, UCSC 5
Activity

Select the correct option from True/False column.

Statement True/False
A view is not like a relational table
Since NoSQL databases don’t have views, they
cannot have precomputed and cached queries
Graph databases organise data into node and
edge graphs
Storage model describes how the database stores
and manipulates the data internally
An aggregate is a unit for data manipulation and
management of consistency.

1
© e-Learning Centre, UCSC 5
Summary
Major concepts of object-oriented, XML, and NoSQL
databases

Object Databases Overview of Object Database concepts

Reasons for the origin of XML


XML Databases Structured, Semi structured, Unstructured data
XML hierarchical data model

Origins of NoSQL
NoSQL Databases
Data models in NoSQL

1
© e-Learning Centre, UCSC 5
Summary
Contrast and compare relational databases concepts
and non-relational databases

Object DB and
Relational DB

XML and Relational


DB

Data modelling difference, Aggregate oriented


NoSQL and Relational vs aggregate ignorant, Schemalessness in
DB NoSQL, Overview of Materialised views

1
© e-Learning Centre, UCSC 5
2 : Database Constraints and Triggers

IT3306 – Data Management Systems


Level II - Semester 3

© e-Learning Centre, UCSC


Overview

• Relational Model Constraints


• Specifying Constraints in SQL
• Constraints in Databases as Assertions
• Specifying Actions in Databases as Triggers

© e-Learning Centre, UCSC 2


Intended Learning Outcomes

• At the end of this lesson, you will be able to:


• Understand what relational model constraints are
• Identify constraints violations
• Write constraints in SQL
• Understand what assertions are
• Define what a trigger is
• Write trigger statements

© e-Learning Centre, UCSC 3


List of Sub topics
2.1. Relational Model Constraints
2.1.1. Categories of Constraints
2.1.2. Domain Constraints
2.1.2.1. Key Constraints and Constraints on NULL Values
2.1.2.2. Entity Integrity and Referential Integrity
2.1.2.3. Other Types of Constraints
2.1.2.4. Insert, Delete and Update Operations Dealing with
Constraint Violations

© e-Learning Centre, UCSC 4


List of Sub topics

2.2. Specifying Constraints in SQL


2.2.1. Specifying Key and Referential Integrity Constraints
2.2.2. Specifying Constraints on Tuples Using CHECK
2.2.3. Specifying Names to Constraints
2.3. Constraints in Databases as Assertions
2.4. Specifying Actions in Databases as Triggers
2.4.1. Introduction to Triggers and Create Trigger Statement
2.4.2. Active Databases and Triggers

© e-Learning Centre, UCSC 5


Relational Model Constraints

• Categories of Constraints
• Domain Constraints
• Key Constraints and Constraints on NULL Values
• Entity Integrity and Referential Integrity
• Other Types of Constraints
• Handling Constraint Violations for Insert, Delete and Update
Operations

© e-Learning Centre, UCSC 6


Categories of Constraints

• In this section, we discuss what a constraint is and categories of


constraints.
• Constraint is a condition that specifies restrictions on the
database state.
• Constraints maintain the data integrity of a database. That is, constraints ensure that the values entered into the database are accurate and valid.
• For example, if the NIC column should hold unique values, then when you insert or update values in this column, duplicate values should not be allowed. The constraint defined on the NIC column maintains this uniqueness with respect to data manipulation operations. This example highlights only one type of constraint; constraint types in the relational model can be divided into three main categories.

© e-Learning Centre, UCSC 7


Categories of Constraints cont.

• Three constraint categories are:


• Inherent model-based constraints or implicit constraints
• They are the constraints which are inherent in relational
model itself.
• Schema-based constraints or explicit constraints
• These constraints apply when you use Data Definition
Language (DDL) to specify a schema.
• Application-based or semantic constraints or business rules
• These are the constraints that cannot directly be defined in
the schema and are enforced through application programs or
triggers. E.g. The salary of an employee should not exceed
the salary of the employee’s supervisor or the maximum
number of hours that an employee can work on all projects
per week is 40.
© e-Learning Centre, UCSC 8
Activity

• Map each statement into the correct column.


• Primary key (eno)
• Salary of an employee should not be greater than his or her
manager
• Foreign Key (dept_no)
• Employee’s NIC should not contain NULL values
• No duplications for tuples

Implicit constraints Explicit constraints Business rules

© e-Learning Centre, UCSC 9


Implicit or Inherent model-based constraints
These constraints are assumed to hold by the definition of the
relational model (i.e., built into the system and not specified by a
user).

• Inherent constraints
■ A relation consists of a certain number of simple attributes.
■ An attribute value is atomic
■ No duplicate tuples are allowed

© e-Learning Centre, UCSC


10
Implicit or Inherent model-based constraints
• Each attribute value in a tuple should have an atomic value; that is,
attribute value is not divisible into components within the relational
model.
• Hence, composite and multivalued attributes cannot be naturally
represented.
• This model is known as the flat relational model.
• The theory behind the relational model was developed based on
the first normal form assumption.
• As a result, multivalued attributes should be represented by
separate relations, and composite attributes are represented only
by their simple component attributes in the basic relational model.

© e-Learning Centre, UCSC


11
Explicit or Schema-based Constraints

• Explicit or schema-based constraints are specified directly in the schema by defining them using the Data Definition Language (DDL).
• There are different types of schema-based constraints. They are:
• Domain constraints
• Key constraints
• Constraints on NULLs
• Entity integrity constraints
• Referential integrity constraints.

© e-Learning Centre, UCSC 12


Domain Constraints

• Domain constraints ensure that the data value entered for a


particular column matches with the pre-defined data type of that
column. The pre-defined data types are Integers, Real numbers,
Characters, Booleans, Fixed-length strings, Variable-length
strings, Date, time etc. In the below query, each column is created
with a data type. When data is entered to each column, it only
allows values of the defined data type.
• Following is the SQL to create Student table.
CREATE TABLE STUDENT(
STU_NO CHAR(05),
STU_NAME VARCHAR(35) ,
STU_DOB DATE,
EXAM_FEE INT,
STU_ADDRESS VARCHAR(35) ,
PRIMARY KEY (STU_NO));
© e-Learning Centre, UCSC 13
Domain Constraints

• Domains can also be described by a subrange of values from a


data type or as an enumerated data type in which all possible
values are explicitly listed. This is facilitated through CHECK
constraint in SQL as illustrated below.
CREATE TABLE STUDENT(
    ENROLL_NO   CHAR(05) NOT NULL,
    STU_NAME    VARCHAR(35),
    STU_ADDRESS VARCHAR(35),
    STU_AGE     INT CHECK (STU_AGE >= 18),
    GENDER      VARCHAR(06)
                CHECK (GENDER IN ('Male', 'Female')),   -- checks whether the gender is Male or Female
    PRIMARY KEY (ENROLL_NO)
);
© e-Learning Centre, UCSC 14
Domain Constraints
• UNIQUE Constraint enforces a column or set of columns to have
unique values. Therefore, that specific column cannot contain
duplicate values. When you define the PRIMARY KEY, then by
default it maintains having unique values.

CREATE TABLE STUDENT(


ENROLL_NO CHAR(05) NOT NULL,
NIC CHAR(10) UNIQUE,
STU_NAME VARCHAR (35) NOT NULL,
STU_AGE INT NOT NULL,
STU_ADDRESS VARCHAR (35),
PRIMARY KEY (ENROLL_NO));

© e-Learning Centre, UCSC 15


Domain Constraints

• The DEFAULT constraint is used when there is no value to insert


as the column value of a table, instead it provides a default value.

CREATE TABLE STUDENT(
    ENROLL_NO   CHAR(05) NOT NULL,
    STU_NAME    VARCHAR(35) NOT NULL,
    STU_DoB     DATE NOT NULL,
    EXAM_FEE    INT DEFAULT 10000,   -- EXAM_FEE is set to 10000, the default value, when no value is supplied
    STU_ADDRESS VARCHAR(35),
    PRIMARY KEY (ENROLL_NO)
);

© e-Learning Centre, UCSC 16


Activity

• Using SQL create UnderGrad_Student table for the given relation.


Where DEFAULT Reg_course is ‘Computer Science’.
UnderGrad_Student (Sid, Enroll_Year, Name, Address, Age,
Reg_course)

© e-Learning Centre, UCSC 17


Key Constraints
• No two tuples should have the same combination of values for their
attributes. The value of a key attribute can be used to identify each
tuple uniquely in the relation.
• This property is time-invariant.
• If a relation has more than one key, they are called candidate keys.
• In general, a candidate key with a fewer number of attributes is
selected as the primary key.

CREATE TABLE Employee (


Emp_id CHAR(05) PRIMARY KEY,
Emp_name VARCHAR(55) NOT NULL,
Hire_date DATE NOT NULL,
NIC CHAR(10) NOT NULL,
Salary DECIMAL (9,2) NOT NULL );
© e-Learning Centre, UCSC 18
Specifying Key and Referential Integrity Constraints:
Primary Key Cont.

• Primary key of a relation can be a combination of more than one


attribute which is known as a composite key. In this situation, you
cannot state the primary key when declaring attributes. That is, the
primary key has to be defined separately.

CREATE TABLE DEPENDENT(


EMP_NO CHAR(05),
DEPENDENTREF CHAR(05),
DEPENDENT_NAME VARCHAR (35),
AGE INT NOT NULL,
PRIMARY KEY (EMP_NO, DEPENDENTREF));

© e-Learning Centre, UCSC 19


Key Constraints

• In a relation, there are subsets of attributes that, when taken together, enable unique identification of each tuple. Such a subset of attributes is known as a superkey.
• A superkey is therefore a set of attributes that can be used to identify a tuple uniquely. A candidate key is a minimal set of attributes required to identify a tuple; it is also known as a minimal superkey.
• It is minimal in the sense that you cannot remove any attribute from it and still preserve the uniqueness.

© e-Learning Centre, UCSC 20


Key Constraints

Employee

Eid Ename Address Salary NIC DoB


E1001 Amal Kandy 200000 751234567V 1/1/1989
E1002 Sunil Colombo 150000 772345678V 13/4/1980
E1003 Nimal Matara 175000 822131412V 23/2/1975

• In the above relation, possible superkeys are: {Eid, Ename},


{Eid, Ename, Address}
• In the above relation, possible candidate keys are: Eid and NIC

© e-Learning Centre, UCSC 21


Entity Integrity Constraints

• The entity integrity constraint states that no primary key value


can be NULL.
• This is because the primary key value is used to identify individual
tuples uniquely in a relation.
• Having NULL values for the primary key implies that it is not
possible to identify tuples uniquely in the database.

© e-Learning Centre, UCSC 22


Activity
Match the correct answers.

• Constraint ensures that allows values of the


defined data type

• Domain constraint subset of attributes

• Super key of a relation is a schema-based


constraints

• Referential key is the values entered to a


database is accurate and
valid

© e-Learning Centre, UCSC 23


NOT NULL Constraints

• Other than the Primary Key, there can be other attributes which
cannot contain NULL values.
• For an example, if the columns Empname, Hire_date, NIC and
salary cannot contain NULL values, when the table is created, you
can state it as follows.

CREATE TABLE Employee (


Emp_id CHAR(05) PRIMARY KEY,
Emp_name VARCHAR(55) NOT NULL,
Hire_date DATE NOT NULL,
NIC CHAR(10) NOT NULL,
Salary DECIMAL (9,2) NOT NULL );

© e-Learning Centre, UCSC


24
NOT NULL Constraints

• NOT NULL constraint makes sure that a column does not hold
NULL value. When we cannot give a value for a particular column
while inserting a record into a table, default value taken by it is
NULL. By specifying NOT NULL constraint, we ensure that a
particular column(s) does (do) not contain NULL values.

CREATE TABLE STUDENT(
    ENROLL_NO   CHAR(05) NOT NULL,      -- as per this SQL, ENROLL_NO and
    STU_NAME    VARCHAR(35) NOT NULL,   -- STU_NAME cannot contain NULL values
    STU_DoB     DATE,
    EXAM_FEE    INT,
    STU_ADDRESS VARCHAR(35),
    PRIMARY KEY (ENROLL_NO));

© e-Learning Centre, UCSC 25


Candidate Keys
• If a relation has more than one key, they are called candidate keys.
• In general, we select a candidate key with a fewer number of attributes
as the primary key. The remaining candidate keys are nominated as
unique keys.
• When an attribute contains UNIQUE values and a NOT NULL
constraint, then NOT NULL and UNIQUE combination would make that
attribute to be a candidate key.
CREATE TABLE STUDENT(
    ENROLL_NO   CHAR(05) NOT NULL,
    INDEX_NO    CHAR(05) NOT NULL UNIQUE,   -- INDEX_NO is a candidate key, being UNIQUE and NOT NULL
    NIC         CHAR(10) UNIQUE,
    STU_NAME    VARCHAR(35) NOT NULL,
    STU_DOB     DATE NOT NULL,
    STU_ADDRESS VARCHAR(35),
    PRIMARY KEY (ENROLL_NO));

© e-Learning Centre, UCSC 26


Activity
• Employee table is comprised of following attributes.
• Empid, Salary, NIC, Name, Contact_No, Gender
• The above attributes contain below constraints.
• Check whether the Salary is greater than 20000
• Check whether the Gender is Male or Female
• NIC is a candidate key
• Empid is the Primary Key of the relation
• Name and Contact_No cannot contain NULL values
• Write a SQL statement to create Employee table adhering to the
above constraints.

© e-Learning Centre, UCSC 27


Specifying Names to Constraints

• Name to a constraint is given to identify a constraint uniquely in


the system. Therefore, the constraint name should be unique.
• When you want to remove a constraint from a relation, you can
drop the constraint. However, giving a constraint name is optional.
• Following is the syntax for specifying a name to a constraint.

CONSTRAINT <constraint name> <constraint type>

© e-Learning Centre, UCSC 28


Examples of Specifying Names to Constraints:
CHECK

CREATE TABLE STUDENT (
    ENROLL_NO   CHAR(05) NOT NULL,
    STU_NAME    VARCHAR(50) NOT NULL,
    STU_DOB     DATE NOT NULL,
    STU_ADDRESS VARCHAR(35),
    GENDER      VARCHAR(06),
    PRIMARY KEY (ENROLL_NO),
    CONSTRAINT UG_Students
        CHECK (GENDER IN ('Male', 'Female'))
);

© e-Learning Centre, UCSC 29


Examples of Specifying Names to Constraints:
PRIMARY KEY

CREATE TABLE STUDENT (


STDID CHAR(05),
NAME VARCHAR(20),
ADDRESS VARCHAR(35),
DoB DATE,
CONSTRAINT PK_STDID PRIMARY KEY (STDID)
);

© e-Learning Centre, UCSC 30


Activity

• Using constraint names


• Add default value to Reg_course as ‘Computer Science’.
• Add check constraint to Age where Age is between 19 and 26.
for the table UnderGrad_Student (Sid, Name, Address, Age,
Reg_course)

© e-Learning Centre, UCSC 31


Referential Integrity

• Tables in a database are normally not independent and there are


links between tables.
• Referential constraints are introduced to maintain referential
integrity of data that is linked across tables.
• A referential constraint is defined for a specific column (called a
foreign key) when a table is defined.
• The table in which a referential constraint and a foreign key are
defined is called the referencing table.
• The table that is referenced from a referencing table with a
foreign key is called the referenced table.
• The primary key that is referenced by the foreign key must be
pre-defined in the referenced table.

© e-Learning Centre, UCSC


32
Referential Integrity
• Referential integrity allows the consistency of values across related
tables. Referential integrity between the following two tables is
defined when the foreign Key (stdid) of the StudentMarks table is
created.
StudentMarks
(Referencing Table) Students (Referenced Table)
stdid course_id grade student_id name age
53666 CS100 C 53666 Amal 18
53667 IT101 B 53667 Shiva 18
53668 IS102 A 53668 Saman 19
53669 CS103 B 53669 Fathima 20

• The referential integrity constraint verifies the values of the stdid column against the corresponding student_id column in the Students table. This ensures that any stdid value entered into the StudentMarks table is an existing value in the Students table.
© e-Learning Centre, UCSC 33
Referential Integrity
• In enforcing referential integrity constraint, the foreign key should
be in the same domain as the Primary Key.
• Foreign key is not allowed to have NULL if it is part of the primary
key of the referencing table as illustrated in the previous example.
However, there can be situations where foreign key could contain
NULL values.

Employee (Referencing Table)                     Department (Referenced Table)
Empid   ename           address   did            Dept_id   dname
E1001   Amal Silva      Kandy     002            001       HR
E1002   Shiva Kumar     Colombo   001            002       Finance
E1003   Fathima Siyam   Ampara    NULL           003       Research
                                                 004       Marketing

Since Employee(did) is not part of the primary key, it can be allowed to have NULL.
© e-Learning Centre, UCSC
Referential Integrity

• Referential integrity is declared in the table definition using foreign


key constraint as given below.

CREATE TABLE STUDENTMARKS (
    STDID     CHAR(05),
    COURSE_ID CHAR(05),
    GRADE     CHAR(01),
    PRIMARY KEY (STDID, COURSE_ID),
    CONSTRAINT FK_GRADE
        FOREIGN KEY (STDID) REFERENCES STUDENTS(STUDENT_ID),
    CONSTRAINT FK_COURSE
        FOREIGN KEY (COURSE_ID) REFERENCES COURSE(COURSE_ID)
);
© e-Learning Centre, UCSC 35
Activity

• Identify which relations can be enforced with referential integrity.


constraints.

Relation Yes/No
Student
Professor
Course
Transcript
Teaching
Department
© e-Learning Centre, UCSC 36
Activity
• Following two tables illustrate rows that a user tries to enter. Identify
what would happen when each of the tuple is inserted into the two
tables given below. Assume that the course table already has the
rows relevant to CS100, CS101 and CS103 courses.
Students(student_id CHAR(05), name VARCHAR(50), age INT)
StudentMarks(stdid CHAR(05),course_id CHAR(05),grade CHAR)

Relation Tuple Violation


Students (53666, ‘Amal’, 18)
(primary key(student_id)) (53667, ‘Anne’,
‘eighteen’)
(53668, ‘Saman’, 19)
StudentMarks (53666, ‘CS100’, ‘C’)
(primary key (stdid, (53669, ‘CS101’, ‘B’)
course_id))
(NULL, ‘CS103’, ‘B’)

© e-Learning Centre, UCSC 37


Referential Constraint Actions

• Referential constraint actions define alternate processing options


for the referencing table in the event a referenced row is deleted, or
referenced columns are updated when there are existing matching
rows.
• Referential actions are specified as given below with an update
operation (ON UPDATE) or a delete operation (ON DELETE), or
both, in any sequence:
CASCADE, SET NULL, SET DEFAULT, RESTRICT, or NO ACTION.
• The ON UPDATE and ON DELETE have the following syntax to
enforce the referential actions:
ON UPDATE {CASCADE | SET NULL | RESTRICT | NO ACTION}
or
ON DELETE {CASCADE | SET NULL | RESTRICT | NO ACTION}

© e-Learning Centre, UCSC 38


Referential Constraint Actions: ON UPDATE CASCADE

• A foreign key with UPDATE CASCADE means that if value of the


primary key of the parent/referenced table is changed, the
corresponding value of foreign key in the child/referencing table is
also changed.
Students (Referenced Table)                  StudentMarks (Referencing Table; stdid is the foreign key)
student_id       name      age               stdid            Course_id   grade
53666            Amal      18                53666            CS100       C
53667            Shiva     18                53667            CS101       B
53668            Saman     19                53668            IT102       A
53669 -> 53670   Fathima   20                53669 -> 53670   CS103       B
                                             53669 -> 53670   CS100       A

SQL code is executed to update 53669 to 53670 in the referenced table. Referential constraints are
checked and, if there is a FK whose value is 53669, it is updated to 53670 in the referencing table.
© e-Learning Centre, UCSC 39
Referential Constraint Actions: ON UPDATE CASCADE

UPDATE CASCADE is specified as given below


• In StudentMarks Table
CONSTRAINT Student_ID_FK
FOREIGN KEY (stdid)
REFERENCES Students (student_id)
ON UPDATE CASCADE ;

UPDATE Students SET student_id = 53670


WHERE student _id = 53669;
Updating a student_id will result in changing it in the StudentMarks
table.
© e-Learning Centre, UCSC 40
Referential Constraint Actions: ON DELETE CASCADE

• A foreign key with ON DELETE CASCADE means if value of


primary key of the parent/referenced table is deleted, the
corresponding value of foreign key in the child/referencing table is
also deleted.
StudentMarks
(Referencing Table) Students (Referenced Table)
Foreign
Key stdid Course_id grade student_id name age
53666 100 C 53666 Amal 18
53667 101 B 53667 Shiva 18
53668 102 A 53668 Saman 19
53669 103 B 53669 Fathima 20

SQL code is executed to delete 53669 from the referenced table. Referential constraints are checked and, if there is a FK whose value is 53669, the matching rows in the referencing table are deleted too.
© e-Learning Centre, UCSC 41
Referential Constraint Actions: ON DELETE CASCADE

When DELETE CASCADE is specified


• In StudentMarks Table
CONSTRAINT Student_ID_FK
FOREIGN KEY (stdid) REFERENCES Students
(student_id)
ON DELETE CASCADE;

DELETE FROM Students


WHERE student _id = 53669;
Deleting a student_id will result in deleting it in the StudentMarks
table.
© e-Learning Centre, UCSC 42
Referential Constraint Actions: NO ACTION/ RESTRICT

• NO ACTION/RESTRICT is the default behavior of returning an error


in attempting to delete or update a row in the referenced table with
matching rows in the referencing table.
• This action can also be explicitly defined as:
ON DELETE RESTRICT (or ON DELETE NO ACTION)
ON UPDATE RESTRICT (or ON UPDATE NO ACTION)

© e-Learning Centre, UCSC 43


Referential Constraint Actions: ON DELETE RESTRICT

• The following example illustrates how NO ACTION/RESTRICT


prevents deleting a row of the parent/referenced if there are rows
with the matching foreign key in the child/referencing table.

StudentMarks Students (Referenced Table)


(Referencing Table) student_id name age
Foreign
Key stdid Course_id grade
53666 Amal 18
53666 100 C
53667 Shiva 18
53667 101 B
53668 Saman 19
53668 102 A
53669 Fathima 20
53669 103 B

SQL code is executed to delete 53669 from the referenced table. Referential constraints are checked and, if there is a FK whose value is 53669, the delete action cannot be performed.
© e-Learning Centre, UCSC
44
Referential Constraint Actions: ON DELETE RESTRICT

When DELETE RESTRICT is specified


• In StudentMarks Table
CONSTRAINT Student_ID_FK
FOREIGN KEY (stdid) REFERENCES Students
(student_id) ON DELETE RESTRICT;

DELETE FROM Students


WHERE student _id = 53669;
Deleting a student_id will not be performed.

© e-Learning Centre, UCSC 45


Referential Constraint Actions: SET DEFAULT/
SET NULL

• The other two actions are SET DEFAULT and SET NULL.
• With these actions, when you update or delete a value in the
referenced table you can set a default value or null for the
referencing value.

CONSTRAINT Student_ID_FK
FOREIGN KEY (stdid) REFERENCES Students (student_id)
ON DELETE SET NULL ON UPDATE SET DEFAULT

© e-Learning Centre, UCSC 46


Referential Constraint Actions

• Given below is an example where referential actions for both delete


and update operations are used together.

Student (Sid, Name, Address, Age,) Course (Cid, CourseName, Credits)

Marks (Sid, Cid, Mark)


CREATE TABLE Marks (
    Sid  CHAR(05),
    Cid  CHAR(05),
    Mark CHAR(01),
    PRIMARY KEY (Sid, Cid),
    FOREIGN KEY (Sid) REFERENCES Student(Sid)
        ON DELETE SET NULL ON UPDATE SET NULL,
    FOREIGN KEY (Cid) REFERENCES Course(Cid)
        ON UPDATE CASCADE ON DELETE RESTRICT);
© e-Learning Centre, UCSC 47
Activity

Create tables for the given schema enforcing suitable referential


integrity constraints and referential actions where necessary.

UnderGrad_Student (Sid, Cid, Name, Address, Age, Reg_course)

Course (Cid, CourseName, Credits)

© e-Learning Centre, UCSC


48
Insert, Delete and Update Operations Dealing with
Constraint Violations

• When the following data manipulation operations are performed, it


is important to maintain the integrity of the data.
- Insert : insert one or more rows to a relation
- Delete : delete an existing row in a table
- Update : modify values of an existing row

• That is, performing these operations should not violate constraints


of a relation.

© e-Learning Centre, UCSC 49


Insert Operations Dealing with Constraint Violations

• Insert operation adds values with new tuples to a table. When we


perform this operation, there are four types of constraints that
could get violated. They are:
• Entity integrity violation – Insert a row with NULL for the
primary key
• Key constraint violation – Insert a new tuple with an existing
primary key
• Referential integrity violation – Insert a new tuple but the
referenced tuple does not exist
• Domain constraint violation – Insert a tuple with
inappropriate data type
• If an insert violates a constraint, then by default the insertion is rejected (the sketch after this slide illustrates each case).

© e-Learning Centre, UCSC 50
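The following hedged sketch illustrates each violation, assuming the Students(student_id, name, age) and StudentMarks(stdid, course_id, grade) tables used in the earlier examples; the inserted values are illustrative only.

INSERT INTO Students VALUES (NULL, 'Sunil', 19);           -- entity integrity violation: NULL primary key
INSERT INTO Students VALUES ('53666', 'Nuwan', 20);        -- key constraint violation: 53666 already exists
INSERT INTO StudentMarks VALUES ('99999', 'CS100', 'A');   -- referential integrity violation: no student 99999
INSERT INTO Students VALUES ('53670', 'Amali', 'nine');    -- domain constraint violation: age is not an integer

By default, each of these statements is rejected by the DBMS.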


Delete Operations Dealing with Constraint Violations

• Performing a delete operation violates referential integrity. If you


delete the primary key of a relation that is referenced by another
tuple, then it violates referential integrity.
• How can you prevent this violation? (The options below are illustrated in the sketch after this slide.)
• Restrict – Reject performing delete operation
• Cascade – delete tuples that refer the primary key
• Set NULL or Set DEFAULT – when you delete a primary key,
set NULL or DEFAULT value to the referencing tuples.

© e-Learning Centre, UCSC 51
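A hedged sketch of these options, assuming StudentMarks.stdid references Students(student_id) and that student 53666 has rows in StudentMarks:

DELETE FROM Students WHERE student_id = '53666';
-- RESTRICT / NO ACTION   : the delete is rejected
-- CASCADE                : the matching StudentMarks rows are deleted as well
-- SET NULL / SET DEFAULT : stdid in the matching StudentMarks rows is set to NULL or to a default value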


Update Operations Dealing with Constraint
Violations

• Update operations change the values of attributes. Updating an attribute that is neither a primary key nor a foreign key normally causes no problem, apart from a possible domain violation if the data type of the new value is inappropriate. Hence the possible violations are (illustrated in the sketch after this slide):
• Primary key violation – Update the primary key value of a
tuple by giving an existing value
• Referential integrity violation – Update the primary key value
of a tuple without updating that value in a referencing table.
Or, Update foreign key value without checking whether that
value exists in the referenced table.

© e-Learning Centre, UCSC 52
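A hedged sketch of these violations, again assuming the Students and StudentMarks tables and sample rows used earlier; the new values are illustrative only.

UPDATE Students SET student_id = '53667' WHERE student_id = '53666';   -- primary key violation: 53667 already exists
UPDATE StudentMarks SET stdid = '88888' WHERE stdid = '53668';         -- referential integrity violation: no student 88888
UPDATE Students SET age = 'twenty' WHERE student_id = '53668';         -- domain constraint violation: age is an INT column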


Activity
• Match the correct answers.

Entity integrity violation by default it rejects the


insertion

would prevent
If an insert violates a
referential integrity
constraint
violation

Set NULL or Set DEFAULT can violate integrity of


the data

Performing modification May insert a row with


operations NULL for the primary key

© e-Learning Centre, UCSC 53


Activity
• StudentMarks (Primary key (stdid, course_id)) and Student
(primary key (stdid)) tables have 3 tuples for each. You want to
perform operations given in the bottom table. Identify whether you
can perform the following operations for the given relations.
StudentMarks Students
stdid Course_id grade Student_id name age
53666 CS100 C 53666 Amal 18
53667 CS101 B 53667 Anne 18
53668 CS103 B 53668 Saman 19
Operation Action
Insert <53669, CS104, ‘A’> into StudentMarks
Assume that the course table already has the rows relevant to
CS104.
Insert <53669, ‘Nimal’, 21> into Students
Insert <NULL, ‘Sunil’, 19> into Students
Update Students Set age = ‘Twenty’ Where Student_id=53666
© e-Learning Centre, UCSC 54
Drop Table

• Dropping a table deletes the table. Two types of actions can be taken when dropping tables.
• RESTRICT – if there is a constraint (FK / View) then do not
drop the table
• CASCADE - drop all the other constraints & views that
refer the table

DROP TABLE Employee [RESTRICT|CASCADE]

© e-Learning Centre, UCSC 55


ADD or REMOVE Constraints

• Drop the primary key constraint of a table


- ALTER TABLE Employees
DROP CONSTRAINT PK_Employees;
Or for some DBMS the following statement is possible
- ALTER TABLE Student Drop PRIMARY KEY;
• Drop a unique, foreign key, or check constraint
- ALTER TABLE Employee Drop Constraint FK_EmpDept;
• Add a new constraint
- ALTER TABLE PassStudents ADD Constraint avg_Marks
CHECK (marks >= 50 );

© e-Learning Centre, UCSC 56


Other Types of Constraints

• Previous constraints discussed in this lesson were included in the


Data Definition Language (DDL). The other types of constraints
are semantic integrity constraints.
• If you need to enforce a constraint such as "The salary of an employee should not exceed the salary of the employee's supervisor", it cannot be done directly in DDL; you need to do it through an application program.
• For this purpose, we use triggers and assertions in SQL. In the
later part of the lesson, we discuss how to create triggers.

© e-Learning Centre, UCSC


57
Constraints in Databases as Assertions

• An assertion is a condition that must always hold, ensuring that only valid values are entered into the tables. It can be enforced across any number of other tables, or across any number of rows of the same table.
• Assertions differ from CHECK constraint since Assertions are
defined on schema level. CHECK constraints relate to one single
row only and hence are defined on a column or a table level.
• Enforcing assertions are complex in nature. Therefore, most of the
DBMSs do not support it.

CREATE ASSERTION <constraint name>
CHECK ( <search condition> );

© e-Learning Centre, UCSC 58


Example

Employee (eid, ename, address, Salary, Ssn, Dno)


Department (Dnumber, dname, Mgr_ssn)
Example, you want to add a constraint where “No employee should
have a salary greater than their manager”. Therefore:

CREATE ASSERTION SALARY_CONSTRAINT      -- constraint name
CHECK ( NOT EXISTS (                    -- condition
    SELECT *
    FROM  EMPLOYEE E, EMPLOYEE M, DEPARTMENT D
    WHERE E.Salary > M.Salary AND E.Dno = D.Dnumber
          AND D.Mgr_ssn = M.Ssn ) );
© e-Learning Centre, UCSC 59
Activity

• Write an Assertion to the following.


• No student can be graduated if his/her average Mark for all
courses is less than 30.
Marks (Sid, Cid, Mark)

© e-Learning Centre, UCSC 60


Activity

• Write an Assertion to ensure that every mortgage customer who


has a mortgage should keep a minimum of Rs. 500 in their bank
account.

Bank_Account (Account_no, BranchID, OwnerName, OwnerNIC,


OwnerAddress, mortgageID, balance)
Loan_Details(mortgageID,amount,interest_rate, NoOfYears)

© e-Learning Centre, UCSC 61


Specifying Actions in Databases as Triggers

• Introduction to Triggers and Create Trigger Statement


• Active Databases and Triggers

© e-Learning Centre, UCSC 62


Active Databases

• Active database is where a database can perform actions based


on events and conditions that take place in the system.
• ECA model or event-condition-action model is used to state active
database rules.
• Active databases perform active rules to notify conditions. E.g.
indicate when pressure level of a pipe is exceeding the danger
level.
• It can also apply integrity constraints and evaluate business rules.
• Active rules maintain derived data when values of tuples change.
• TRIGGERS are used to define automatic actions that need to be
executed when events and conditions happen.

© e-Learning Centre, UCSC 63


Introduction to Triggers

• A trigger is a statement that the system executes automatically as


a side effect of a modification to the database.
• A trigger has an event that causes the trigger to be checked and a
condition that must be satisfied for trigger execution to proceed.
• A trigger has an action that has to be taken when the trigger
executes.
• The following example illustrates the reason for using Triggers
• A production organisation maintains a warehouse. It
maintains a minimum inventory level. When the inventory level
goes down it is possible for a trigger to alert the manager with
respect to the reorder level.

© e-Learning Centre, UCSC 64


Design A Trigger

• To design a trigger, you require to identify;


• When should the trigger be executed?
• Event – a cause to check the trigger
• Condition – logic to be satisfied to execute the trigger
• Action - The action that should be taken when
the trigger is executed
• This model of a trigger is also known as the event-condition-action model.

© e-Learning Centre, UCSC 65


Event-Condition-Action model

• Event-Condition-Action model comprises of


• on Event
• Event occurs in database
E.g. insertion of new row, deletion of row
• if Condition
• Condition is checked
E.g. is salary < 10000 ? Has student passed?
• then Action
• Actions are executed if condition is satisfied
E.g. give a salary increment, congratulate student

© e-Learning Centre, UCSC 66


Types of Triggers

• Row Triggers
• Trigger is fired once for each row in a transaction.
• If an update statement modifies multiple tuples of a table, a
row trigger is fired once for each tuple which are affected by
the update query. If there is no tuple affected by the query,
then the row trigger is not executed.
• Have access to :new (new values) and :old (old values).
E.g. update the total salary of a department when the salary
value of an employee tuple is changed.

© e-Learning Centre, UCSC 67


Types of Triggers

• Statement Triggers
• Trigger is fired once for each transaction.
• If a DELETE statement deletes several rows from a table, a
statement-level DELETE trigger is fired only once, regardless
of how many rows are deleted from the table.
• Does not have access to :new and :old values.
E.g. Delete a row relevant to an employee who has completed the contract period. (A hedged sketch of a statement-level trigger follows after this slide.)

© e-Learning Centre, UCSC 68
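As referenced above, here is a hedged Oracle-style sketch of a statement-level trigger; the AuditLog table and its columns are assumptions for illustration.

CREATE TRIGGER log_employee_deletes
AFTER DELETE ON Employee
-- no FOR EACH ROW clause, so this statement-level trigger fires once per DELETE statement,
-- regardless of how many rows that statement removes
BEGIN
    INSERT INTO AuditLog (log_time, description)
    VALUES (SYSDATE, 'A DELETE statement ran against Employee');
END;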


Trigger Timing

• Before Trigger
• Runs before any change is made to the database.
E.g. Before withdrawing money account balance is required to
be checked.
• After Trigger
• Runs after changes are made to the database
E.g. After withdrawing money update the account balance
• Instead of
• INSTEAD OF triggers are used to modify views that cannot be
modified directly through UPDATE, INSERT, and DELETE
statements.

© e-Learning Centre, UCSC 69


Syntax for Specifying a Trigger
CREATE [OR REPLACE] TRIGGER trigger_name
{BEFORE | AFTER | INSTEAD OF}                 -- trigger time
{INSERT [OR] | UPDATE [OR] | DELETE}          -- trigger event
[OF col_name]
ON table_name
[REFERENCING OLD AS o NEW AS n]
[FOR EACH ROW | STATEMENT]                    -- type of trigger
WHEN (condition)
DECLARE
    declaration-statements
BEGIN                                         -- trigger action
    executable-statements
EXCEPTION
    exception-handling-statements
END;
© e-Learning Centre, UCSC 70
Example

When inserting values for the Employee table ensure whether the
salary is >= 20000.
Employee(Empid, Ename, Salary, Dno).

CREATE TRIGGER check_salary
BEFORE INSERT ON Employee
FOR EACH ROW
BEGIN
    IF (:new.Salary < 20000) THEN
        -- reject the insert when the salary is below 20000
        RAISE_APPLICATION_ERROR(-20001, 'Wrong Salary');
    END IF;
END;

© e-Learning Centre, UCSC 71


OLD and NEW references

• OLD references old values for DML statements. NEW references


new values for DML operations. Delete, Update, Insert are DML
operations.
• OLD and NEW references are not available for table-level triggers,
and it can be used for record-level triggers.
• Table-level triggers cannot fire a trigger for each row of a table.
Record-level or row-level triggers can fire a trigger for each row of
a table.

© e-Learning Centre, UCSC 72


Example of OLD and NEW references

• Following trigger is an example of updating total salary (delete old


salary and add new salary of employees) of a particular
department

(Elmasri and Navathe, 2015)

© e-Learning Centre, UCSC 73
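The trigger in the original figure is reproduced from Elmasri and Navathe (2015); a hedged Oracle-style sketch of such a trigger, assuming Employee has Salary and Dno columns and Department has a Total_sal column, is:

CREATE TRIGGER Total_sal_update
AFTER UPDATE OF Salary ON Employee
FOR EACH ROW
WHEN (NEW.Dno IS NOT NULL)
BEGIN
    -- remove the old salary from the department total and add the new one
    UPDATE Department
    SET    Total_sal = Total_sal - :OLD.Salary + :NEW.Salary
    WHERE  Dno = :NEW.Dno;
END;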


Activity

CREATE TRIGGER derive_commission_trg


BEFORE UPDATE OF salary ON Employee
FOR EACH ROW
WHEN (new.job = 'Salesman')
BEGIN
:new.comm := :old.comm * (:new.salary/:old.salary);
END;

In the above query:


• What is the event?
• What is the Condition?
• What is the action?

© e-Learning Centre, UCSC 74


Activity

• Create a trigger that accepts insertion into the student table and
checks the GPA. If the GPA of the inserted student is greater than 3.3 and less than or equal to 3.6, that student will be automatically
applying for Computer Science stream. Otherwise, the student has
to apply for Software Engineering stream (Stream is Computer
Science, Biology etc; enrollment is number of students enrolled for
the college, decision is yes or no in getting selected for a stream)

Student(sID, sName, GPA)


Apply(sID, cName, stream, decision)
College(cName, state, enrollment)

© e-Learning Centre, UCSC 75


Activity

Create a trigger that simulates the behavior of cascaded delete


when there is a referential integrity constraint from the student ID in
the Apply table, to the student ID in the student table. Make it
activate on delete cascade when there is a deletion from the Student
table.
College(cName, state, enrollment);
Student(sID, sName, GPA);
Apply(sID, cName, major, decision);

© e-Learning Centre, UCSC 76


Summary

Relational Model • Categories of Constraints


Constraints • Domain Constraints

• Specifying Key and Referential Integrity


Specifying Constraints in Constraints
SQL • Specifying Constraints on Tuples Using CHECK
• Specifying Names to Constraints

Constraints in Databases • Introduction to Assertions


as Assertions

Specifying Actions in • Introduction to Triggers and Create Trigger


Databases as Triggers Statement
• Active Databases and Triggers

© e-Learning Centre, UCSC 77


3 : Database Indexing and Tuning

IT3306 – Data Management


Level II - Semester 3

© e-Learning Centre, UCSC


Overview

• This lesson on Database Indexing and Tuning discusses the gradual development of the computer memory hierarchy and, hence, the evolution of indexing methods.

• Here we look into types of computer memory and types of


indexes in detail.

© e-Learning Centre, UCSC 2


Intended Learning Outcomes

• At the end of this lesson, you will be able to;


• Describe how the different computer memories evolved.
• Identify the different types of computer memory and the storage
organizations of databases
• Recognize the importance of implementing indexes on databases
• Explain the key concepts of the different types of database indexes.

© e-Learning Centre, UCSC 3


List of subtopics

3.1 Disk Storage and Basic File Structures


3.1.1 Computer memory hierarchy
3.1.2 Storage organization of databases
3.1.3 Secondary storage mediums
3.1.4 Solid State Device Storage
3.1.5 Placing file records on disk (types of records)
3.1.6 File Operations
3.1.7 Files of unordered records (Heap Files) and ordered
records (Sorted Files)
3.1.8 Hashing techniques for storing database records: Internal
hashing, external hashing

© e-Learning Centre, UCSC 4


List of subtopics

3.2 Introduction to indexing


Introduce index files, indexing fields, index entry (record pointers
and block pointers)
3.3 Types of Indexes

3.3.1 Single Level Indexes: Primary, Clustering and Secondary


indexes
3.3.2 Multilevel indexes: Overview of multilevel indexes
3.4 Indexes on Multiple Keys
3.5 Other types of Indexes
Hash indexes, bitmap indexes, function based indexes
3.6 Index Creation and Tuning
3.7 Physical Database Design in Relational Databases

© e-Learning Centre, UCSC 5


3.1 Disk Storage and Basic File Structures
3.1.1. Computer Memory Hierarchy

Computer Memory

Directly accessible to the Not directly accessible to


CPU the CPU

Secondary
Primary Storage Tertiary Storage
Storage

© e-Learning Centre, UCSC 6


3.1 Disk Storage and Basic File Structures
3.1.1. Computer Memory Hierarchy

• The data collected via a computational database


should be stored in a physical storage medium.
• Once stored in a storage medium, the database
management software can execute functions on that
to retrieve, update and process the data.
• In the current computer systems, data is stored and
moved across a hierarchy of storage media.
• As for the memory organization, the memory with the
highest speed is the most expensive option and it also
has the lowest capacity.
• When it comes to lowest speed memory, they are the
options with the highest available storage capacity.

© e-Learning Centre, UCSC 7


3.1 Disk Storage and Basic File Structures

3.1.1. Computer Memory Hierarchy


Now let’s explain the hierarchy given in the previous slide
(slide number 7).

1. Primary Storage
This operates directly in the computer’s Central
Processing Unit.
Eg: Main Memory, Cache Memory.
• Provides fast access to data.
• Limited storage capacity.
• Contents of primary storage will be deleted when the
computer shuts down or in case of a power failure.
• Comparatively more expensive.

© e-Learning Centre, UCSC 8


3.1 Disk Storage and Basic File Structures

3.1.1. Computer Memory Hierarchy

Primary Storage - Static RAM

• Static Random Access Memory (SRAM) retains its contents as long as power is provided.
• The cache memory in the CPU is typically Static RAM.
• Data is kept as bits in its memory.
• The most expensive type of memory.
• Using techniques like prefetching and pipelining, the
Cache memory speeds up the execution of program
instructions for the CPU.

© e-Learning Centre, UCSC 9


3.1 Disk Storage and Basic File Structures

3.1.1. Computer Memory Hierarchy

Primary Storage - Dynamic RAM

• Dynamic Random Access Memory (DRAM) is the


CPU's space for storing application instructions and
data.
• The main memory of the computer is typically DRAM.
• The advantage of DRAM is its low cost.
• However, its speed is lower than that of Static RAM.

1
© e-Learning Centre, UCSC
0
3.1 Disk Storage and Basic File Structures
3.1.1. Computer Memory Hierarchy

2. Secondary Storage
Operates external to the computer’s main memory.
Eg: Magnetic Disks, Flash Drives, CD-ROM
• The CPU cannot process data in secondary storage
directly. It must first be copied into primary storage
before the CPU can handle it.
• Mostly used for online storage of enterprise
databases.
• With regards to enterprise databases, the magnetic
disks have been used as the main storage medium.
• Recently there is a trend to use flash memory for the
purpose of storing moderate amounts of permanent
data.
• Solid State Drive (SSD) is a form of memory that can
be used instead of a disk drive.
1
© e-Learning Centre, UCSC
1
3.1 Disk Storage and Basic File Structures

3.1.1. Computer Memory Hierarchy

2. Secondary Storage
• Least expensive type of storage media.
• The storage capacity is measured in:
- kilobytes(kB)
- Megabytes(MB)
- Gigabytes(GB)
- Terabytes(TB)
- Petabytes (PB)

1
© e-Learning Centre, UCSC
2
3.1 Disk Storage and Basic File Structures

3.1.1. Computer Memory Hierarchy

3. Tertiary Storage
Operates external to the computer’s main memory.
Eg: CD - ROMs, DVDs
• The CPU cannot process data in tertiary storage
directly. It must first be copied into primary storage
before the CPU can handle it.
• Removable media that can be used as offline storage
falls in this category.
• Large capacity to store data.
• Comparatively less cost.
• Slower access to data than primary storage media.

1
© e-Learning Centre, UCSC
3
3.1 Disk Storage and Basic File Structures

3.1.1. Computer Memory Hierarchy

4. Flash Memory
• Popular type of memory with its non-volatility.
• Use the technique of EEPROM (Electronically
Erasable and Programmable Read Only Memory)
• High performance memory.
• Fast access.
• One disadvantage is that the entire block must be
erased and written simultaneously.
• Two Types:
- NAND Flash Memory
- NOR Flash Memory
• Common examples:
- Devices in Cameras, MP3/MP4 Players,
Cellphones, USB Flash Drives
1
© e-Learning Centre, UCSC
4
3.1 Disk Storage and Basic File Structures

3.1.1. Computer Memory Hierarchy

5. Optical Drives
• Most popular type of Optical Drives are CDs and DVDs.
• Capacity of a CD is 700-MB and DVDs have capacities
ranging from 4.5 to 15 GB.
• CD-ROMs are read using laser technology and cannot be overwritten.
• CD-R (compact disk recordable) and DVD-R discs allow data to be written once and then read as many times as required.
• Currently this type of storage is declining in popularity compared to magnetic disks.

1
© e-Learning Centre, UCSC
5
3.1 Disk Storage and Basic File Structures

3.1.1. Computer Memory Hierarchy

6. Magnetic Tapes
• Used for archiving and as a backup storage of data.
• Note that Magnetic Disks (400 GB–8TB) and Magnetic
Tapes (2.5TB–8.5TB) are two different storage types.

1
© e-Learning Centre, UCSC
6
Activity

Categorize the following devices as Primary, Secondary or


Tertiary Storage Media.
1. Random Access Memory
2. Hard Disk Drive
3. Flash Drive
4. Tape Libraries
5. Optical Jukebox
6. Magnetic Tape
7. Main Memory

1
© e-Learning Centre, UCSC
7
3.1 Disk Storage and Basic File Structures
3.1.2. Storage Organization of Databases
• Usually databases have Persistent data. This means
large volumes of data stored over long periods of
time.
• These persistent data are continuously retrieved and
processed in the storage period.
• The place where the databases are stored
permanently in the computer memory is the
secondary storage.
• Magnetic disks are widely used here since:
- If the database is too large, it will not fit in the
main memory.
- Secondary storage is non-volatile, but the
main memory is volatile.
- The cost of storage per unit of data is lesser in
secondary storage.
1
© e-Learning Centre, UCSC
8
3.1 Disk Storage and Basic File Structures

3.1.2. Storage Organization of Databases


• Solid State Drive (SSD) is one of the latest
technologies identified as an alternative for
magnetic storage disks.
• However, it is expected that the primary option for
the storage of large databases will continue to be
the magnetic disks.
• Magnetic tapes are also used for database backup
purposes due to their comparatively lower cost.
• But the data in them need to be loaded and read
before processing. Opposing to this, magnetic disks
can be accessed directly at anytime.

1
© e-Learning Centre, UCSC
9
3.1 Disk Storage and Basic File Structures

3.1.2. Storage Organization of Databases

• Physical Database Design is a process that entails


selecting the techniques that best suit the
application requirements from a variety of data
organizing approaches.
• When designing, implementing, and operating a
database on a certain DBMS, database designers
and DBAs must be aware of the benefits and
drawbacks of each storage medium.

2
© e-Learning Centre, UCSC
0
3.1 Disk Storage and Basic File Structures

3.1.2. Storage Organization of Databases

• The data on disk is grouped into Records or Files.


• These records include data about entities, attributes
and relationships.
• Whenever a certain portion of the data retrieved from
the DB for processing, it needs to be found on disk,
copied to main memory for processing, and then
rewritten to the disk if the data gets updated.
• Therefore, the data should be kept on disk in a way
that allows them to be quickly accessed when they are
needed.

2
© e-Learning Centre, UCSC
1
3.1 Disk Storage and Basic File Structures
3.1.2. Storage Organization of Databases

• Primary File Organization defines how the data is


stored physically in the disk and how they can be
accessed.

File Organization Description


Heap File No particular order in storing data.
Appends new records to the end.
Sorted File Maintains an order for the records by
sorting data on a particular field.
Hashed File Uses the hash function of a field to identify
the record’s place in the database.
B Trees Use Tree structures for record storing.
2
© e-Learning Centre, UCSC
2
Activity

State whether the following statement are true or false.


1. The place of permanently storing databases is the
primary storage.
2. A Heap File has a specific ordering criterion where the
new records are added at the end.
3. Upon retrieval of data from a file, it needs to be found
on disk and copied to main memory for processing.
4. The database administrators need to be aware of the
physical structuring of the database to identify whether
they can be sold to a client.
5. Solid State Drives are identified alternatives for
magnetic disks.

2
© e-Learning Centre, UCSC
3
3.1 Disk Storage and Basic File Structures
3.1.3. Secondary Storage Media

• The device that holds the magnetic disks is the Hard


Disk Drive (HDD).
• Basic unit of data on a HDD is the Bit. Bits together
make Bytes. One character is stored using a single
byte.
• Capacity of a disk is the number of bytes the disk
can store.
• Disks are composed of magnetic material in the
shape of a thin round disk, with a plastic or acrylic
cover to protect it.
• Single Sided Disk stores information on one of its
surfaces.
• Double Sided Disk stores information on both sides
of its surfaces.
• A few disks assembled together makes a Disk Pack
which has higher storage capacity.
2
© e-Learning Centre, UCSC
4
3.1 Disk Storage and Basic File Structures

3.1.3. Secondary Storage Media

• On a disk surface, information is stored in


concentric circles of small width, each with its own
diameter.
• Each of these circles is called a Track.
• A Cylinder is a group of tracks on different surfaces
of a disk pack that have the same diameter.
• Retrieval of data stored on the same Cylinder is
faster compared to data stored in different
Cylinders.
• A track is broken into smaller Blocks or Sectors
since it typically includes a vast amount of data.

2
© e-Learning Centre, UCSC
5
3.1 Disk Storage and Basic File Structures
3.1.3. Secondary Storage Media

Hardware components
on disk:

a) A single-sided disk
with read/write
hardware.

b) A disk pack with


read/write.

2
© e-Learning Centre, UCSC
6
3.1 Disk Storage and Basic File Structures

3.1.3. Secondary Storage Media

Different sector organizations


on disk:

(a) Sectors subtending a fixed


angle

(b) Sectors maintaining a


uniform recording density

2
© e-Learning Centre, UCSC
7
3.1 Disk Storage and Basic File Structures

3.1.3. Secondary Storage Media


• During disk formatting, the operating system divides a
track into equal-sized Disk Blocks (or pages). The
size of each block is fixed and cannot be adjusted
dynamically.
• Blocks are separated by Interblock Gaps, which are fixed in size and contain specially coded control information recorded during disk formatting.
• Hardware Address of a Block is the combination of a
cylinder number, track number (surface number inside
the cylinder on which the track is placed), and block
number (within the track).
• Buffer is one disk block stored in a reserved region in
primary storage.
• Read Command - Disk block is copied into the buffer
• Write Command - Contents of the buffer are copied
into the disk block.
2
© e-Learning Centre, UCSC
8
3.1 Disk Storage and Basic File Structures

3.1.3. Secondary Storage Media


• A collection of several shared blocks is called a
Cluster
• The hardware mechanism that reads or writes a
block of data is the Read / Write Head
• An electronic component is coupled to a mechanical
arm in a read/write head.
• Fixed Head Disks - The read/write heads on disk
units are fixed, with as many heads as there are
tracks.
• Movable Head Disks - Disk units with an actuator
connected to a second electrical motor that moves
the read/write heads together and accurately
positions them over the cylinder of tracks defined in
a block address.

2
© e-Learning Centre, UCSC
9
Activity

Match the description with the relevant technical term out


of the following.
[Capacity, Track, Buffer, Hardware Address of a Block,
Cluster]

1. The concentric circles on a disk where information is


stored.
2. Combination of a cylinder number, track number, and
block number
3. Number of Bytes that a disk can store
4. Collection of shared blocks
5. A disk block stored in a reserved location in primary
storage.

3
© e-Learning Centre, UCSC
0
3.1 Disk Storage and Basic File Structures

3.1.4. Solid State Device Storage

• Solid State Device (SSD) Storage is sometimes


known as Flash Storage.
• They have the ability to store data on secondary
storage without requiring constant power.
• A controller and a group of interconnected flash
memory cards are the essential components of an
SSD.
• SSDs can be plugged into slots already available for
mounting Hard Disk Drives (HDDs) on laptops and
servers by using form factors compatible with HDDs.
• SSDs are identified to be more durable, run silently,
faster in terms of access time, and delivers better
transfer rates than HDD because there are no
moving parts.

3
© e-Learning Centre, UCSC
1
3.1 Disk Storage and Basic File Structures

3.1.4. Solid State Device Storage

• As opposed to HDD, where Blocks and Cylinders


should be pre-assigned for storing data, any
address on an SSD can be directly addressed,
since there are no restrictions on where data can be
stored.
• With this direct access, data is less likely to be fragmented, and there is no need for restructuring (defragmentation).
• Dynamic Random Access Memory (DRAM)-based SSDs are also available in addition to flash memory.
• DRAM-based SSDs are more expensive than flash memory but provide faster access. However, they need an internal power source to retain their contents.
3
© e-Learning Centre, UCSC
2
Activity

State four key features of a Solid State Drive (SSD).


1.____________________
2.____________________
3.____________________
4.____________________

3
© e-Learning Centre, UCSC
3
3.1 Disk Storage and Basic File Structures

3.1.5. Placing File Records on Disk


The explanation can be found on the next slide.

Employee Relation (each column is a Field, each row is a Record, and each cell holds a Value):

Emp_No   Name      Date_Of_Birth    Position     Salary
0001     Nimal     1971 - 04 - 13   Manager      70,000
0005     Krishna   1980 - 01 - 25   Supervisor   50,000
© e-Learning Centre, UCSC
3.1 Disk Storage and Basic File Structures
3.1.5. Placing File Records on Disk

• As shows in the previous slide, columns of the table


are called fields; rows are called records; each cell
data item is called value.
• The Data Type is one of the standard data types that
are used in programming.
- Numeric (Integer, Long Integer, Floating Point)
- Characters / Strings (Fixed length, varying
length)
- Boolean (True or False and 0 or 1)
- Date, Time
• For a particular computer system, the number of bytes
necessary for each data type is fixed.

3
© e-Learning Centre, UCSC
5
3.1 Disk Storage and Basic File Structures

3.1.5. Placing File Records on Disk

An example of creating the Employee relation using MySQL, with data types:

Create table Employee
(
Emp_No Int,
Name Char (50),
Date_Of_Birth Date,
Position Char (50),
Salary Int
);

3
© e-Learning Centre, UCSC
6
Activity

Select the Data Type that best matches the description out of
the following.
[Integer, Floating Point, Date and Time, Boolean, Character]

1. NIC Number of Sri Lankans


2. The access time of users for the Ministry of Health website
within a week
3. The number of students in a class
4. Cash balance of a bank account
5. Response to the question by a set of students whether
they have received the vaccination for Rubella.

3
© e-Learning Centre, UCSC
7
3.1 Disk Storage and Basic File Structures

3.1.5. Placing File Records on Disk

• File is a sequence of records. Usually all records in a


file belong to the same record type.
• If the size of each record in the file is the same (in
bytes) the file is known to be made up of Fixed
Length Records.
• Variable Length Records means that different
records of the file are of different sizes.
• Reasons to have variable length records in a file:
- One or more fields are of different sizes.
- For individual records, one or more of the fields
may have multiple values (Repeating Group/
Field)
- One or more fields are optional (Optional Fields)
- File includes different record types (Mixed File)
3
© e-Learning Centre, UCSC
8
3.1 Disk Storage and Basic File Structures

3.1.5. Placing File Records on Disk

• In a Fixed Length Record;


• The system can identify the starting byte location
of each field relative to the starting position of the
record since each record has equal fields and
field lengths. This makes it easier for programs
that access such files to locate field values.
• However, variable length records can also be
stored as fixed length records.
• By assigning “Null” to optional fields where data values are not available.
• By allocating space for the maximum possible number of values in each repeating group.
• In each of these cases, space is wasted.

3
© e-Learning Centre, UCSC
9
3.1 Disk Storage and Basic File Structures

3.1.5. Placing File Records on Disk

• In a Variable Length Record;


• Each record has a value for each field, but the exact length of some field values is not known in advance.
• Special separator characters, which do not appear in any field value, can be used to terminate variable-length fields; alternatively, the number of bytes in the field value can be stored before the value itself.
• Separator characters that can be used are: ?, $, %

4
© e-Learning Centre, UCSC
0
3.1 Disk Storage and Basic File Structures

3.1.5. Placing File Records on Disk

Record Storage Format 1


Eg: A fixed-length record with four fields and size of 44 bytes.

4
© e-Learning Centre, UCSC
1
3.1 Disk Storage and Basic File Structures

3.1.5. Placing File Records on Disk

Record Storage Format 2


Eg: A record with two variable-length fields (Name and
Department) and two fixed-length fields (NIC and Job_Code ).
Separator Character is used to mark the record separation.

4
© e-Learning Centre, UCSC
2
3.1 Disk Storage and Basic File Structures

3.1.5. Placing File Records on Disk

Record Storage Format 3


Eg: A variable-field record with three types of separator
characters

4
© e-Learning Centre, UCSC
3
3.1 Disk Storage and Basic File Structures

3.1.5. Placing File Records on Disk


• In a record with Optional fields:
• If the total number of fields for the record type is large but only a few of those fields occur in a typical record, each record can store a series of <Field-Name, Field-Value> pairs instead of a value for every field.
• It is even more practical to assign a short Field-Type code to each field and include in each record a series of <Field-Type, Field-Value> pairs.
• In a record with a Repeating Field:
• One separator character can be used to separate the field's repeated values, and another separator character can be used to mark the end of the field.

4
© e-Learning Centre, UCSC
4
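As an illustration of the separator-based record formats described above, the minimal Python sketch below encodes one record with two fixed-length fields (NIC, Job_Code) and two variable-length fields (Name, Department); the field values and the choice of '$' and '%' as separators are example assumptions only.

FIELD_SEP = '$'      # terminates each variable-length field value (example choice)
RECORD_SEP = '%'     # marks the end of the record (example choice)

def encode_record(nic, name, department, job_code):
    # NIC (10 chars) and Job_Code (4 chars) are fixed length; Name and Department vary
    return nic + name + FIELD_SEP + department + FIELD_SEP + job_code + RECORD_SEP

record = encode_record('851234567V', 'Nimal Perera', 'Accounts', '0007')
print(record)         # 851234567VNimal Perera$Accounts$0007%
print(len(record))    # length of this variable-length record (bytes, for ASCII data)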
Activity
Fill in the blanks in the following statements.
1. A file where the sizes of records in it are different in size
is called a _______________.

2. A _________________ includes different types of


records inside it.

3. In a file, the records belong to _________ record type.

4. A ___________ length record can be made by assigning


“Null” for optional fields where data values are not
available

5. To determine and terminate variable lengths special


characters named as __________ can be used. 4
© e-Learning Centre, UCSC
5
3.1 Disk Storage and Basic File Structures
3.1.5. Placing File Records on Disk
• Block is a unit of data transfer between disk and
memory.
• When the block size exceeds the record size, each
block will contain several records, however, certain
files may have exceptionally large records that cannot
fit in a single block.
• Blocking Factor (bfr) is the number of records stored per block.
• If Block Size > Record Size, bfr can be calculated using the equation below.

bfr = floor(B / R)
Block Size = B bytes
Record Size = R bytes 4
© e-Learning Centre, UCSC
6
3.1 Disk Storage and Basic File Structures

3.1.5. Placing File Records on Disk


• In calculating the bfr a floor function rounds down the
number to the nearest integer.
• But, when the bfr is calculated, there may be some
additional space remaining in each block.
• The unused space can be calculated with the equation
given below.

Unused space in bytes = B - (bfr * R)

where B is the block size and (bfr * R) is the space occupied by the bfr records stored in the block.
4
© e-Learning Centre, UCSC
7
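To make the two formulas above concrete, here is a short Python sketch; the 4,096-byte block size and 100-byte record size are only example values chosen for illustration.

import math

B = 4096   # block size in bytes (example value)
R = 100    # record size in bytes (example value)

bfr = math.floor(B / R)    # blocking factor: records that fit in one block
unused = B - (bfr * R)     # unused space left in each block, in bytes

print(bfr, unused)         # 40 records per block, 96 bytes unused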
3.1 Disk Storage and Basic File Structures

3.1.5. Placing File Records on Disk

• Upon using the unused space, to minimize waste of


space, a part of a record can be stored in one block
and the other part can be stored in another block.
• If the next block on disk is not the one holding the
remainder of the record, a Pointer at the end of the
first block refers to it.
• Spanned Organization of Records - One record
spanning to more than one block.
• Used when a record is larger than the block size.
• Unspanned Organization of Records - Not allowing
records to span into more than one block.
• Used with fixed length records.

4
© e-Learning Centre, UCSC
8
3.1 Disk Storage and Basic File Structures

3.1.5. Placing File Records on Disk

Let’s look at the representation of Spanned and Unspanned


Organization of Records.

4
© e-Learning Centre, UCSC
9
3.1 Disk Storage and Basic File Structures

3.1.5. Placing File Records on Disk

• A spanned or unspanned organization can be utilized


in variable-length records.
• If it is a spanned organization, each block may store a
different number of records.
• Here the bfr would be the average number of records
per block.
• Hence the number of blocks b needed for a file of r records is,

b = ceiling(r / bfr)

5
© e-Learning Centre, UCSC
0
3.1 Disk Storage and Basic File Structures

3.1.5. Placing File Records on Disk

Example of Calculation
There is a disk with block size B=256 bytes. A file has
r=50,000 STUDENT records of fixed-length. Each
record has the following fields:
NAME (55 bytes), STDID (4 bytes),
DEGREE(2 bytes), PHONE(10 bytes),
SEX (1 byte).

(i) Calculate the record size in Bytes.

Record Size R = (55 + 4 + 2 + 10 + 1) = 72 bytes


5
© e-Learning Centre, UCSC
1
3.1 Disk Storage and Basic File Structures

3.1.5. Placing File Records on Disk

Example of Calculation Continued…

(ii) Calculate the blocking factor (bfr)

Blocking factor bfr = floor (B/R)


= floor(256/72)
= 3 records per block

Floor Function = Rounds the value


down to the previous integer.
5
© e-Learning Centre, UCSC
2
3.1 Disk Storage and Basic File Structures

3.1.5. Placing File Records on Disk


Example of Calculation Continued...
(iii) Calculate the number of file blocks (b) required to
store the STUDENT records, assuming an unspanned
organization.

Number of blocks needed for file = ceiling(r/bfr)


= ceiling(50000/3)
= 16667

Ceiling Function = Rounds the


value up to the next integer.

5
© e-Learning Centre, UCSC
3
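The STUDENT calculation above can be checked with a few lines of Python; the numbers below are exactly the ones used in the worked example (unspanned organization), so nothing new is assumed.

import math

B = 256                    # block size in bytes
r = 50000                  # number of STUDENT records
R = 55 + 4 + 2 + 10 + 1    # record size = 72 bytes

bfr = math.floor(B / R)    # 3 records per block
b = math.ceil(r / bfr)     # 16667 blocks for an unspanned organization

print(R, bfr, b)           # 72 3 16667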
3.1 Disk Storage and Basic File Structures
3.1.5. Placing File Records on Disk

• A File Header, also known as a File Descriptor,


includes information about a file that is required by
the system applications which access the file
records.
• For fixed-length unspanned records, the header
contains information to determine the disk addresses
of the blocks, as well as record format descriptions,
which may include field lengths and the order of
fields within a record, and field type codes, separator
characters, and record type codes for variable-length
records.
• One or more blocks are transferred into main
memory buffers to search for a record on disk.
• The search algorithms must do a Linear Search
over the file blocks if the address of the block
containing the requested record is unknown.
5
© e-Learning Centre, UCSC
4
Activity

State the answer for the following calculations.


Consider a disk with block size B=512 bytes. A file
has r=30,000 EMPLOYEE records of fixed-length.
Each record has the following fields: NAME (30 bytes),
NIC (9bytes), DEPARTMENTCODE (9 bytes),
ADDRESS (40 bytes), PHONE (9 bytes),BIRTHDATE
(8 bytes), SEX (1 byte), JOBCODE (4 bytes), SALARY
(4 bytes, real number). An additional byte is used as a
deletion marker.

(i) Calculate the record size in Bytes.


(ii) Calculate the blocking factor (bfr)
(iii) Calculate the number of file blocks (b) required to
store the EMPLOYEE records, assuming an
unspanned organization.
5
© e-Learning Centre, UCSC
5
3.1 Disk Storage and Basic File Structures

3.1.6 File Operations

Operations on Files

Retrieval Operations: do not change any data in the file, but locate certain records based on a selection / filtering condition.

Update Operations: change the file by insertion, deletion, or modification of certain records based on a selection / filtering condition.

© e-Learning Centre, UCSC


3.1 Disk Storage and Basic File Structures

3.1.6. File Operations

Emp_No Name Date_Of_Birth Position Salary

0001 Nimal 1971 - 04 - 13 Manager 70,000

0005 Krishna 1980 - 01 - 25 Supervisor 50,000

• Simple Selection Condition


Search for the record where Emp_No = “0005”
• Complex Selection Condition
Search for the record where Salary>60,000

5
© e-Learning Centre, UCSC
7
3.1 Disk Storage and Basic File Structures

3.1.6. File Operations

• When several file records meet a search criterion,


the first record in the physical sequence of file
records is identified and assigned as the Current
Record. Following search operations will start with
this record and find the next record in the file that
meets the criterion.

• The actual procedures for identifying and retrieving


file records differ from one system to the next.

5
© e-Learning Centre, UCSC
8
3.1 Disk Storage and Basic File Structures
3.1.6. File Operation

The following are the File Access Operations.

Operation Description
Open Allows to read or write to a file. Sets the file
pointer to the file's beginning.
Reset Sets the file pointer of an open file to the
beginning of the file.
Find (Locate) The first record that meets a search
criterion is found. The block holding that
record is transferred to a main memory
buffer. The file pointer is set to the buffer
record, which becomes the current record.

5
© e-Learning Centre, UCSC
9
3.1 Disk Storage and Basic File Structures
3.1.6. File Operations

Operation Description
Read (Get) Copies the current record from the buffer to
a user-defined program variable. The
current record pointer may also be
advanced to the next record in the file using
this command.
FindNext Searches the file for the next entry that
meets the search criteria. The block holding
that record is transferred to a main memory
buffer.
Delete The current record is deleted, and the file
on disk is updated to reflect the deletion.

6
© e-Learning Centre, UCSC
0
3.1 Disk Storage and Basic File Structures

3.1.6. File Operations

Operation Description
Modify Modifies some field values for the current
record and the file on disk is updated to
reflect the modification.

Insert Locates the block where the record is to be


inserted and transfers that block into a main
memory buffer to insert a new record in the
file and the file on disk is updated to reflect
the insertion.
Close Releases the buffers and does any other
necessary cleaning actions to complete the
file access.
6
© e-Learning Centre, UCSC
1
3.1 Disk Storage and Basic File Structures

3.1.6. File Operations

• The following is called “Record at a time” operation


since it is applied to a single record.

Operation Description
Scan Scan returns the initial record if the file
has just been opened or reset;
otherwise, it returns the next record.

6
© e-Learning Centre, UCSC
2
3.1 Disk Storage and Basic File Structures

3.1.6. File Operations

• The following are called “Set at a time” operations


since they are applied to the file in full.

Operation Description
FindAll Locates all the records in the file that
satisfy a search condition.
FindOrdered Locates all the records in the file in a
specified order condition.
Reorganize Starts the reorganization process. (In
cases such as ordering the records)

6
© e-Learning Centre, UCSC
3
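To make the record-at-a-time operations concrete, here is a toy in-memory Python sketch (a stand-in for a real file manager, not an actual DBMS API): the records live in a Python list and a current-record position plays the role of the file pointer.

class ToyFile:
    def __init__(self, records):
        self.records = list(records)   # stands in for the file's records on disk
        self.current = -1              # current record position (before the first record)

    def reset(self):
        self.current = -1              # set the pointer back to the beginning of the file

    def find(self, condition):
        self.reset()                   # Find: locate the first record satisfying the condition
        return self.find_next(condition)

    def find_next(self, condition):
        # FindNext: locate the next record after the current one that satisfies the condition
        for i in range(self.current + 1, len(self.records)):
            if condition(self.records[i]):
                self.current = i
                return self.records[i]
        return None

f = ToyFile([{'Emp_No': '0001', 'Salary': 70000},
             {'Emp_No': '0005', 'Salary': 50000}])
print(f.find(lambda rec: rec['Salary'] > 40000))        # first matching record
print(f.find_next(lambda rec: rec['Salary'] > 40000))   # next matching record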
3.1 Disk Storage and Basic File Structures

3.1.6. File Operations

• File Organization - The way a file's data is


organized into records, blocks, and access
structures, including how records and blocks are
put on the storage media and interconnected.

• Access Methods - A set of operations that may be


applied to a file is provided. In general, a file
structured using a specific organization can be
accessed via a variety of techniques.

6
© e-Learning Centre, UCSC
4
3.1 Disk Storage and Basic File Structures

3.1.6. File Operations

• Static Files - The files on which modifications are


rarely done.

• Dynamic Files - The files on which modifications


are frequently done.

• Read Only File - A file where modifications cannot


be done by the end user.

6
© e-Learning Centre, UCSC
5
Activity

Match the following descriptions with the relevant file


operation out of the following.

[Find, Reset, Close, Scan, FindAll]

1. Returns the initial record if the file has just been


opened or reset; otherwise, returns the next record.
2. Releases the buffers and does any other necessary
cleaning actions
3. Sets the file pointer of an open file to the beginning of
the file
4. The first record that meets a search criterion is found
5. Locates all the records in the file that satisfy a search
condition.
6
© e-Learning Centre, UCSC
6
3.1 Disk Storage and Basic File Structures

3.1.7. Files of unordered records (Heap Files) and


ordered records (Sorted Files)

Files of Unordered Records (Heap Files)

• Records are entered into the file in the order in


which they are received, thus new records are
placed at the end.
• Inserting a new record is quick and efficient. The
file's last disk block is transferred into a buffer,
where the new record is inserted before the block is
overwritten to disk. Then the final file block's
address is saved in the file header.
• Searching for a record is done by the Linear
Search.
6
© e-Learning Centre, UCSC
7
3.1 Disk Storage and Basic File Structures

3.1.7. Files of unordered records (Heap Files) and


ordered records (Sorted Files)

Files of Unordered Records (Heap Files)

• If just one record meets the search criteria, the


program will typically read into memory and search
half of the file blocks before finding the record. Here,
on average, searching (b/2) blocks for a file of b
blocks is required.
• If the search criteria is not satisfied by any records or
there are many records, the program must read and
search all b blocks in the file.

6
© e-Learning Centre, UCSC
8
3.1 Disk Storage and Basic File Structures

3.1.7. Files of unordered records (Heap Files) and


ordered records (Sorted Files)

Files of Unordered Records (Heap Files)

• Deleting a Record.
• A program must first locate its block, copy the block
into a buffer, remove the record from the buffer, and
then rewrite the block back to the disk to delete a
record.
• This method of deleting a large number of data
results in waste of storage space.

6
© e-Learning Centre, UCSC
9
3.1 Disk Storage and Basic File Structures

3.1.7. Files of unordered records (Heap Files) and


ordered records (Sorted Files)

Files of Unordered Records (Heap Files)

• Deleting a Record cont.


• Deletion Marker - An extra byte or bit stored with
every record whereas the deletion marker will get
a certain value when the record is deleted. This
value is not similar to the value that the deletion
marker holds when there is data available in the
record.
• Using the space of deleted records to store data
can also be used. But it includes additional work.

7
© e-Learning Centre, UCSC
0
3.1 Disk Storage and Basic File Structures

3.1.7. Files of unordered records (Heap Files) and


ordered records (Sorted Files)

Files of Unordered Records (Heap Files)

• Modifying a Record.
• Because the updated record may not fit in its
former space on disk, modifying a variable-length
record may require removing the old record and
inserting the modified record.

7
© e-Learning Centre, UCSC
1
3.1 Disk Storage and Basic File Structures

3.1.7. Files of unordered records (Heap Files) and


ordered records (Sorted Files)

Files of Unordered Records (Heap Files)

• Reading a Record.
• A sorted copy of the file is produced to read all
entries in order of the values of some field.
Because sorting a huge disk file is a costly task,
specific approaches for external sorting are
employed.

7
© e-Learning Centre, UCSC
2
3.1 Disk Storage and Basic File Structures

3.1.7. Files of unordered records (Heap Files) and


ordered records (Sorted Files)

Files of Ordered Records (Sorted Files)

• The values of one of the fields of a file's records,


called the Ordering Field can be used to physically
order the data on disk. It will generate an ordered or
sequential file.
• Ordered records offer a few benefits over files that are
unordered.
• The benefits are listed in the next slide.

7
© e-Learning Centre, UCSC
3
3.1 Disk Storage and Basic File Structures

3.1.7. Files of unordered records (Heap Files) and


ordered records (Sorted Files)

Files of Ordered Records (Sorted Files)

• Benefits of Ordered records:


• Because no sorting is necessary, reading the records in order of the ordering key values is highly efficient.
• Because the next record is usually in the same block as the current one, finding the next record in order of the ordering key typically requires no additional block accesses.
• When the binary search technique is used, a search condition on the value of the ordering key field results in faster access.
7
© e-Learning Centre, UCSC
4
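A two-line Python comparison of the average cost of a linear search on an unordered (heap) file against a binary search on an ordered file, using the 16,667-block STUDENT file from the earlier example:

import math

b = 16667                               # number of file blocks (STUDENT example)
heap_average = b / 2                    # average blocks read by a linear search
ordered_cost = math.ceil(math.log2(b))  # blocks read by a binary search on the ordered file

print(heap_average, ordered_cost)       # 8333.5 15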
3.1 Disk Storage and Basic File Structures

3.1.8. Hashing techniques for storing database


records: Internal hashing, external hashing

• Another kind of primary file structure is Hashing,


which allows for extremely quick access to
information under specific search conditions.
• The equality requirement on a single field, termed
the Hash Field, must be used as the search
condition.
• The hash field is usually also a key field of the file,
in which case it is referred to as the hash key.
• The concept behind hashing is to offer a function h,
also known as a Hash Function or randomizing
function, that is applied to a record's hash field
value and returns the address of the disk block
where the record is stored.
7
© e-Learning Centre, UCSC
5
Activity
Fill in the blanks with the correct technical term.

1. The _____________________ is an extra byte that is


stored with a record which will get updated when a record
is deleted.

2. A field which can generate an ordered or sequential file by


physically ordering the records is called ______________.

3. The function which calculates the Hash value of a field is


called ______________.

4. Searching for a record in a Heap file is done by the


____________.
7
© e-Learning Centre, UCSC
6
3.1 Disk Storage and Basic File Structures
3.1.8. Hashing techniques for storing database records:
Internal hashing, external hashing

• Internal Hashing.
When it comes to internal files, hashing is usually
done with a Hash Table and an array of records.
• Method 1 for Internal Hashing
• If the array index range is 0 to m – 1, there are m slots
with addresses that correspond to the array indexes.
• Then a hash function is selected that converts the
value of the hash field into an integer between 0 and
m-1.
• The record address is then calculated using the given
function.
• h(K) = Hash Function of K Value
• K = Field Value h(K) = K Mod m
7
© e-Learning Centre, UCSC
7
3.1 Disk Storage and Basic File Structures

3.1.8. Hashing techniques for storing database records:


Internal hashing, external hashing
• Internal Hashing.

Internal Hashing Data Structure - Array of m positions to use in


internal hashing
7
© e-Learning Centre, UCSC
8
3.1 Disk Storage and Basic File Structures

3.1.8. Hashing techniques for storing database records:


Internal hashing, external hashing
• Internal Hashing.

• Method 2 for Internal Hashing


• By using algorithms that calculate the Hash Function

temp ← 1;
for i ← 1 to 20 do temp ← temp * code(K[i ] ) mod M ;
hash_address ← temp mod M;

Hashing Algorithm in applying the mod hash function to a


character string K.
7
© e-Learning Centre, UCSC
9
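The pseudocode above can be written as runnable Python; here code(K[i]) is taken to be the character's integer code (Python's ord) and K is padded or truncated to 20 characters, since the algorithm assumes a 20-character string. M = 10 is only an example slot count.

M = 10                         # number of slots (example value)

def hash_address(K):
    K = K.ljust(20)[:20]       # pad/truncate the string to 20 characters
    temp = 1
    for ch in K:
        temp = (temp * ord(ch)) % M   # temp <- temp * code(K[i]) mod M
    return temp % M            # hash_address <- temp mod M

print(hash_address('Silva, Nimal'))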
3.1 Disk Storage and Basic File Structures

3.1.8. Hashing techniques for storing database records:


Internal hashing, external hashing
• Internal Hashing.

• Method 3 for Internal Hashing


• Folding - To compute the hash address, an arithmetic
function such as addition or a logical function such as
Exclusive OR (XOR) is applied to distinct sections of
the hash field value.

8
© e-Learning Centre, UCSC
0
3.1 Disk Storage and Basic File Structures

3.1.8. Hashing techniques for storing database records:


Internal hashing, external hashing
• Internal Hashing.
• Collision - When the hash field value of a record that
is being inserted hashes to an address that already
holds another record.
• Because the hash address is already taken, the new
record must be moved to a different location.
• Collision Resolution - The process of finding another
location.
• There are several methods for collision resolution.

8
© e-Learning Centre, UCSC
1
3.1 Disk Storage and Basic File Structures
3.1.8. Hashing techniques for storing database
records: Internal hashing, external hashing
• Internal Hashing.
• Methods of Collision Resolution
• Open Addressing - The program scans the
subsequent locations in order until an unused
(empty) position is discovered, starting with the
occupied position indicated by the hash address.
• Chaining - Changing the pointer of the occupied
hash address location to the address of the new
record in an unused overflow location and putting
the new record in an unused overflow location.
• Multiple Hashing - If the first hash function fails,
the program uses a second hash function. If a new
collision occurs, the program will utilize open
addressing or a third hash function, followed by
open addressing if required.

8
© e-Learning Centre, UCSC
2
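As an in-memory illustration of chaining (the second method above), the sketch below keeps a small list of records per slot, so keys that collide simply end up in the same chain; the slot count and key values are example choices, and this is only an analogue of the pointer-based chaining used on disk.

m = 10
table = [[] for _ in range(m)]        # one (initially empty) chain per slot

def insert(K, record):
    slot = K % m                      # hash address of the key: h(K) = K mod m
    table[slot].append((K, record))   # chaining: append to this slot's overflow chain

def search(K):
    for key, record in table[K % m]:  # walk the chain for this slot
        if key == K:
            return record
    return None

insert(23, 'record A')
insert(33, 'record B')                # 33 collides with 23 (both hash to slot 3)
print(search(33))                     # record B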
3.1 Disk Storage and Basic File Structures
3.1.8. Hashing techniques for storing database
records: Internal hashing, external hashing

• External Hashing.
• Hashing for disk files is named as External
Hashing.
• The target address space is built up of Buckets,
each of which stores many records, to match the
properties of disk storage.
• A bucket is a continuous group of disk blocks or a
single disk block.
• Rather than allocating an absolute block address to
the bucket, the hashing function translates a key to a
relative bucket number.
• The bucket number is converted into the matching
disk block address via a table in the file header.

8
© e-Learning Centre, UCSC
3
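A small Python sketch of this idea: the hash function produces a relative bucket number, and a header table (a plain list standing in for the table in the file header) maps that bucket number to a disk block address. The bucket count and block addresses here are made-up example values.

M = 4                                       # number of buckets (example value)
bucket_to_block = ['blk-120', 'blk-121',    # file-header table: relative bucket number ->
                   'blk-305', 'blk-306']    # disk block address (made-up values)

def bucket_number(K):
    return K % M                            # key -> relative bucket number

def block_address(K):
    return bucket_to_block[bucket_number(K)]

print(block_address(9))                     # 9 mod 4 = 1, so bucket 1 -> blk-121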
3.1 Disk Storage and Basic File Structures
3.1.8. Hashing techniques for storing database records:
Internal hashing, external hashing
• External Hashing.
The following diagram shows matching bucket
numbers (0 to M -1) to disk block addresses.

8
© e-Learning Centre, UCSC
4
3.1 Disk Storage and Basic File Structures
3.1.8. Hashing techniques for storing database
records: Internal hashing, external hashing

• External Hashing.
• Since a bucket can hold many records, several records can hash to the same bucket without causing problems, so the collision problem is less severe with buckets.
• When a bucket is full to capacity and a new record is
entered, a variant of chaining can be used in which a
pointer to a linked list of overflow records for the
bucket is stored in each bucket.
• Here, the linked list pointers should be Record
Pointers, which comprise a block address as well
as a relative record position inside the block.

8
© e-Learning Centre, UCSC
5
3.1 Disk Storage and Basic File Structures
3.1.8. Hashing techniques for storing database records:
Internal hashing, external hashing
• External Hashing.
Handling overflow for buckets by chaining

8
© e-Learning Centre, UCSC
6
Activity
Match the description with the correct term.

1. Applying an arithmetic function such as addition or a


logical function such as Exclusive OR (XOR) to distinct
sections of the hash field value.
2. The technique used for hashing where the program uses
a second hash function if first hash function fails.
3. The instance when the value of the hash field of a newly
inserted record hashes to an address that already
contains another record.
4. Starting with the occupied place given by the hash
address, the program examines the succeeding locations
in succession until an unused (empty) spot is located
when a collision has occurred.
5. A continuous group of disk blocks or a single disk block
which is comprising of the target address space.

8
© e-Learning Centre, UCSC
7
3.2 Introduction to indexing

• Indexes are used to speed up record retrieval in


response to specific search criteria.
• The index structures are extra files on disk that provide
secondary access pathways, allowing users to access
records in different ways without changing the physical
location of records in the original data file on disk.
• They make it possible to quickly access records using
the Indexing Fields that were used to create the index.
• Any field in the file can be used to generate an index,
and the same file can have numerous indexes on
separate fields as well as indexes on multiple fields.

8
© e-Learning Centre, UCSC
8
3.2 Introduction to indexing

• Some Commonly used Types of Indexes


• Single Level Ordered Indexes
• Primary Index
• Secondary Index
• Clustering Index
• Multi Level Tree Structured Indexes
• B Trees
• B+ Trees
• Hash Indexes
• Logical Indexes
• Multi Key Indexes
• Bitmap Indexes

8
© e-Learning Centre, UCSC
9
3.3 Types of Indexes

• Single Level Indexes: Primary, Clustering and


Secondary indexes
• Primary, Clustering and Secondary index are types
of single level ordered indexes.
• In some books, the last pages have ordered list of
words, which are categorized from A-Z. In each
category they have put the word, as well as the page
numbers where that particular word exactly appears.
These list of words are known as index.
• If a reader needs to find about a particular term,
he/she can go to the index and find the pages where
the term appears first and then can go through the
particular pages.
• Otherwise readers have to go through the whole
book, searching the term, which is similar to the
linear search.
9
© e-Learning Centre, UCSC
0
3.3 Types of Indexes

• Single Level Indexes: Primary, Clustering and


Secondary indexes
• Primary Index - defined for an ordered file of
records using the ordering key field.
• File records on a disk are physically ordered by the
ordering key field. This ordering key field holds
unique values for each record.
• Clustering index is applied when multiple records in
the file have same value for the ordering field; here
the ordering field is a non key field. In this scenario,
data file is referred as clustered file.

9
© e-Learning Centre, UCSC
1
3.3 Types of Indexes

• Single Level Indexes: Primary, Clustering and


Secondary indexes
• A file can have at most one physical ordering field. Therefore, a file can have either one primary index or one clustering index, but it cannot have both at once.
• In contrast, a file can have several secondary indexes in addition to its primary (or clustering) index.

9
© e-Learning Centre, UCSC
2
3.3 Types of Indexes

• Single Level Indexes: Primary indexes


• Primary indexes are access structures used to increase the efficiency of searching for and accessing the data records in a data file.
• The primary index is itself an ordered file with fixed-length records consisting of two fields.
• The first field is the ordering key field, which has the same data type as the primary key of the data file.
• The other field contains pointers to disk blocks.
• Hence, the index file contains one index entry (a.k.a. index record) for each block in the data file.

9
© e-Learning Centre, UCSC
3
3.3 Types of Indexes

• Single Level Indexes: Primary indexes


• As mentioned before, an index entry consists of two values:
i. the primary key field value of the first record in a data block, and
ii. a pointer to the data block that contains that record.

For index entry i, the two field values can be referred to as

<K(i), P(i)>

9
© e-Learning Centre, UCSC
4
3.3 Types of Indexes

• Single Level Indexes: Primary indexes

• Ex: Assuming that “name” is a unique field and the


“name” has been used to order the data file, we can
create index file as follows.
<K(1) = (Aaron, Ed), P(1) = address of block 1>
<K(2) = (Adams, John), P(2) = address of block 2>
<K(3) = (Alexander, Ed), P(3) = address of block 3>

The image given in the next slide illustrates the index file and
respective block pointers to the data file.

9
© e-Learning Centre, UCSC
5
3.3 Types of Indexes

Primary index on the


ordering key field

9
© e-Learning Centre, UCSC
6
3.3 Types of Indexes

• Single Level Indexes: Primary indexes

• In the given illustration of the previous slide,


number of index entries in the index file is
equal to the number of disk blocks in the data
file.
• Anchor record/Block anchor: for a given block in
an ordered data file, the first record in that block is
known as anchor record. Each block has an anchor
record.

9
© e-Learning Centre, UCSC
7
3.3 Types of Indexes

• Single Level Indexes: Primary indexes


• Dense index and Sparse index
i. Indexes that contain an index entry for every record in the data file (or every search key value) are referred to as dense indexes.
ii. Indexes that contain index entries for only some of the records in the data file are referred to as sparse indexes.
• Therefore, by definition, primary index falls into the
sparse (or the non - dense) index type since it does
not keep index entries for every record in the data
file. Instead, primary index keep index entries for
anchor records for each block which contains data
file.

9
© e-Learning Centre, UCSC
8
3.3 Types of Indexes

• Single Level Indexes: Primary indexes

• Generally, a primary indexing file takes smaller


space compared to the datafile due to two reasons.
i. Number of index entries are smaller than the
number of records in the data file.
ii. Index entry holds two fields which are
comparatively very short in size.
• Hence, performing a binary search on an index file
results in less number of block accesses when
compared to the binary search performed on a data
file.

9
© e-Learning Centre, UCSC
9
3.3 Types of Indexes

• Single Level Indexes: Primary indexes

• Block accesses for an ordered file with b blocks can


be calculated by using log2 b.
• Let's assume that we want to access a record whose primary key value is K and which resides in the block whose address is P(i), where K(i) ≤ K < K(i + 1).
• Since the physical ordering of the data file is based on the primary key, all records whose key values fall in this range reside in the ith block.
• Therefore, to retrieve the record corresponding to the given K value, a binary search is performed on the index file to find index entry i.
• Then we can get the block address P(i) and retrieve the record.
1
© e-Learning Centre, UCSC 0
3.3 Types of Indexes

• Single Level Indexes: Primary indexes


Ex: Let’s say we have an ordered file with its key field. File
records are of fixed size and are unspanned. Following
details are given and we are going to calculate the block
accesses require when performing a binary search,
number of records r = 300,000
block size B = 4,096 bytes
record length R = 100 bytes
We can calculate the blocking factor,
bfr = (B/R)= floor(4,096/100) = 40 records per block
Hence, the number of blocks needed to store all records
b = (r/bfr) = ceiling(300,000/40)= 7,500 blocks.
Block accesses required = log2 b
= ceiling(log2 7,500)= 13
1
© e-Learning Centre, UCSC 0
3.3 Types of Indexes
• Single Level Indexes: Primary indexes
Ex: For the previous scenario given, if we have a primary
index file with 9 bytes long ordering key field (V) and 6 bytes
long block pointer (P), the required block accesses can be
calculated as follows.
number of records r = 300,000
block size B = 4,096 bytes
index entry length Ri = (V+P)= 15
We can calculate the blocking factor for the index,
bfri = (B/Ri) = floor(4,096/15) = 273 index entries per block
The number of index entries required is equal to the number of blocks in the data file, i.e. ri = 7,500.
Hence, the number of blocks needed for the index file,
bi = (ri/bfri) = ceiling(7,500/273) = 28 blocks.
Go to the next slide for the rest of the calculation
1
© e-Learning Centre, UCSC 0
3.3 Types of Indexes

Block accesses required to search the index
= ceiling(log2 bi) = ceiling(log2 28) = 5

However, to access the record using the index, we have to perform the binary search on the index file plus one additional access to retrieve the record from the data file.

• Therefore, the total number of block accesses needed to access the record is
ceiling(log2 bi) + 1 = 5 + 1 = 6 block accesses.

1
© e-Learning Centre, UCSC 0
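The whole calculation can be reproduced with a short Python sketch; the figures (300,000 records, 4,096-byte blocks, 100-byte records, 15-byte index entries) are the ones from the example above, so nothing new is assumed.

import math

B, r, R, Ri = 4096, 300000, 100, 15

bfr = math.floor(B / R)           # 40 data records per block
b = math.ceil(r / bfr)            # 7,500 data blocks
bfri = math.floor(B / Ri)         # 273 index entries per block
bi = math.ceil(b / bfri)          # 28 index blocks (one entry per data block)

no_index = math.ceil(math.log2(b))          # 13 accesses: binary search on the data file
with_index = math.ceil(math.log2(bi)) + 1   # 5 + 1 = 6 accesses via the primary index

print(no_index, with_index)                 # 13 6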
3.3 Types of Indexes

• Single Level Indexes: Primary indexes

• Primary indexing causes problems when we add new records to, or delete existing records from, an ordered file.
• If a new record is inserted in its correct position according to the ordering, existing records in the data file may have to be moved to make space for the new record.
• Sometimes this also changes the block anchor records.
• Deletion of records causes the same kind of problem as insertion.

1
© e-Learning Centre, UCSC 0
3.3 Types of Indexes

• Single Level Indexes: Primary indexes

• An unordered overflow file can be used to reduce this problem.
• Adding a linked list of overflow records for each block in the data file is another way to address this issue.
• Deletion markers can be used to manage the issues caused by record deletion.

1
© e-Learning Centre, UCSC 0
3.3 Types of Indexes

• Single Level Indexes: Clustering indexes

• When a data file is ordered on a non-key field that does not contain unique values, the file is known as a clustered file, and the field used to order the file is known as the clustering field.
• A clustering index speeds up the retrieval of all records whose clustering field (the field used to order the data file) has the same value.
• In a primary index, unlike a clustering index, the ordering field consists of distinct values.

1
© e-Learning Centre, UCSC 0
3.3 Types of Indexes

• Single Level Indexes: Clustering indexes

• A clustering index also consists of two fields: one for the clustering field of the data file and the second for block pointers.
• In the index file, there is only one entry for each distinct value of the clustering field, with a pointer to the first block in which a record with that clustering field value appears.

1
© e-Learning Centre, UCSC 0
3.3 Types of Indexes

• Single Level Indexes: Clustering indexes

• Since the data file is physically ordered, inserting and deleting records still causes problems for the clustering index as well.
• A common method to address this problem is to reserve an entire block (or a set of neighbouring blocks) for each value of the clustering field.
• All records that have the same clustering field value are stored in the allocated block(s).
• This method eases the insertion and deletion of records.

1
© e-Learning Centre, UCSC 0
3.3 Types of Indexes

• Single Level Indexes: Clustering indexes

• This problem can be scaled down using an


unordered overflow file.
• Adding a linked list of overflow records for each
block in the data file is another way to address this
issue.
• Deletion markers can be used to manage the
issues with record deletion.
• The clustering index also falls into the sparse (non-dense) index type, since it contains an entry for each distinct value of the clustering field rather than for every record in the data file.

1
© e-Learning Centre, UCSC 0
3.3 Types of Indexes

Clustering Index

1
© e-Learning Centre, UCSC 1
3.3 Types of Indexes
Clustering Index with allocation of
blocks for distinct values in the
ordered key field.

1
© e-Learning Centre, UCSC 1
3.3 Types of Indexes
• Single Level Indexes: Clustering indexes
Ex: For the same ordered file with r = 300,000, B = 4,096
bytes, let’s say we have used a field “Zip code“ which is non
key field, to order the data file.
Assumption: Each Zip Code has equal number of records and
there are 1000 distinct values for Zip Codes (ri). Index entries
consist of 5-byte long Zip Code and 6-byte long block pointer.
Size of the record Ri = 5+6 = 11 bytes
Blocking factor bfri = B/Ri = floor(4,096/11)
= 372 index entries per
block
Hence, the number of blocks needed bi = (ri/bfri)
= ceiling(1,000/372) = 3 blocks.
Block accesses to perform a binary search,
= log2 (bi) = ceiling(log2 (3))= 2
1
© e-Learning Centre, UCSC 1
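The Zip Code example can be checked the same way in Python; the values (1,000 distinct Zip Codes, 11-byte index entries, 4,096-byte blocks) are the ones assumed above.

import math

B, distinct_zip_codes, Ri = 4096, 1000, 5 + 6   # 11-byte index entries

bfri = math.floor(B / Ri)                       # 372 index entries per block
bi = math.ceil(distinct_zip_codes / bfri)       # 3 index blocks
accesses = math.ceil(math.log2(bi))             # 2 block accesses for the binary search

print(bfri, bi, accesses)                       # 372 3 2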
3.3 Types of Indexes

• Single Level Indexes: Secondary indexes

• A secondary index provides an additional means of accessing a data file for which some primary access path already exists.
• The data file records can be ordered, unordered, or hashed; the index file itself is always ordered.
• The indexing field used to define a secondary index can be a candidate key, which has a unique value in every record, or a non-key field, in which duplicate values may occur.
• The first field of the index file has the same data type as the non-ordering field of the data file that is used as the indexing field.
• A block pointer or a record pointer is placed in the second field.
1
© e-Learning Centre, UCSC 1
3.3 Types of Indexes

• Single Level Indexes: Secondary indexes

• For a single file, few secondary Indexes (and


therefore indexing fields) can be created - Each of
these serves as an additional method of accessing
that file based on a specific field.
• For a secondary index created on a candidate key
(unique key/ primary key), which has unique values
for every record in the file, the secondary index will
get entries for every record in the data file.
• The reason to have entries for every record is, the
key attribute which is used to create secondary
index has distinct values for each and every record.
• In such scenarios, the secondary index will create a
dense index which holds key value and block
pointer for each record in the data file.
1
© e-Learning Centre, UCSC 1
3.3 Types of Indexes

• Single Level Indexes: Secondary indexes

• As in the primary index, the two fields of an index entry are referred to as <K(i), P(i)>.
• Since the index entries are ordered on the values of K(i), a binary search can be performed on the index.
• However, block anchors cannot be used, because the records of the data file are not physically ordered by the values of the secondary key field.
• This is the reason for creating an index entry for each data record, instead of using block anchors as in the primary index.

1
© e-Learning Centre, UCSC 1
3.3 Types of Indexes

• Single Level Indexes: Secondary indexes

• Due to the huge number of entries, a secondary


index requires much storage capacity when
compared to the primary index.
• But, on the other hand, secondary indexing gives
greater improvement in the search time for an
arbitrary record.
• The secondary index is more important here because, if there were no secondary index, we would have to perform a linear search of the data file.
• For a primary index, a binary search can be performed on the main file even if the index is not present.

1
© e-Learning Centre, UCSC 1
3.3 Types of Indexes
• Single Level Indexes: Secondary indexes
Ex: Take the same example as in the primary index and assume
we search on a non-ordering key field V = 9 bytes long, in a
file with 300,000 records of fixed length 100 bytes, with a
given block size B = 4,096 bytes.
We can calculate the blocking factor,
bfr = floor(B/R) = floor(4,096/100) = 40 records per block
Hence, the number of blocks needed,
b = ceiling(r/bfr) = ceiling(300,000/40) = 7,500 blocks
• If we perform a linear search on this file, the required
  number of block accesses on average = b/2
                                       = 7,500/2
                                       = 3,750 block accesses
1
© e-Learning Centre, UCSC 1
3.3 Types of Indexes
• Single Level Indexes: Secondary indexes
However, if we have a secondary index on that non-ordering
key field, with block pointers P = 6 bytes long,
Length of an index entry Ri = V + P
                            = 9 + 6 = 15 bytes
Blocking factor bfri = floor(B/Ri)
                     = floor(4,096/15) = 273
Since the secondary index is dense, the number of index
entries (ri) is the same as the number of records (300,000) in
the file.
• Therefore, the number of blocks required for the secondary
  index is,
  bi = ceiling(ri/bfri)
     = ceiling(300,000/273)
     = 1,099 blocks
© e-Learning Centre, UCSC 1
3.3 Types of Indexes
• Single Level Indexes: Secondary indexes
  • If we perform a binary search on this secondary index,
    the required number of block accesses is
    ceiling(log2(bi)) = ceiling(log2(1,099))
                      = 11 block accesses.
  • Since we need one additional block access to fetch the
    record from the data file using the index, the total
    number of block accesses required is
    11 + 1 = 12 block accesses.
1
© e-Learning Centre, UCSC 1
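The calculations above can be scripted. The following short Python sketch
(added for illustration; not part of the original slides) recomputes the
average linear-search cost and the secondary-index search cost for the
example values used on the previous slides.

    import math

    B, R, r = 4096, 100, 300_000   # block size, record length, number of records
    V, P = 9, 6                    # index field length and block pointer length

    bfr = B // R                   # records per block = 40
    b = math.ceil(r / bfr)         # data blocks = 7,500
    linear_search = b // 2         # average linear-search cost = 3,750

    Ri = V + P                     # index entry size = 15 bytes
    bfri = B // Ri                 # index entries per block = 273
    bi = math.ceil(r / bfri)       # index blocks (dense index) = 1,099
    index_search = math.ceil(math.log2(bi)) + 1   # binary search + 1 data block = 12

    print(linear_search, index_search)   # 3750 12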
3.3 Types of Indexes
• Single Level Indexes: Secondary indexes
  • Compared to the linear search, which required 3,750
    block accesses, the secondary index shows a big
    improvement with 12 block accesses. It is only
    slightly worse than the primary index, which
    needed 6 block accesses.
  • This difference is a result of the size of the primary
    index. The primary index is a sparse index and
    therefore has only 28 blocks.
  • The secondary index, being dense, requires 1,099
    blocks, which is much larger than the primary index.
1
© e-Learning Centre, UCSC 2
3.3 Types of Indexes
• Single Level Indexes: Secondary indexes
  • A secondary index retrieves records in the order of
    the field on which the secondary index was created,
    because a secondary index provides a logical
    ordering of the records.
  • In contrast, primary and clustering indexes assume
    that the physical ordering of the file matches the
    order of the indexing field.
1
© e-Learning Centre, UCSC 2
3.3.2 Multilevel indexes: Overview of multilevel
indexes
  • Since a single-level index is itself an ordered file, we
    can create a primary index on the index file itself.
  • Here the original index file is called the first-level
    index, and the index created on the original index is
    called the second-level index.
  • We can repeat the process, creating a third, fourth, ...,
    top level, until all entries of the top level fit in one
    disk block.
  • A multilevel index can be created for any type of first-
    level index (primary, secondary, clustering) as long as
    the first-level index occupies more than one disk
    block.
1
© e-Learning Centre, UCSC 2
3.3.2 Multilevel indexes: Overview of multilevel
indexes
  • As discussed in topic 3.3, an ordered index file is
    associated with the primary, clustering, and
    secondary indexing schemes.
  • A binary search is used on such an index, and the
    algorithm reduces the part of the index file to be
    searched by a factor of 2 at each step; hence the
    log to base 2 term, log2(bi).
  • Multilevel indexing makes this search faster by
    shrinking the search space more quickly, since the
    blocking factor of the index is greater than 2.
  • In multilevel indexing, the blocking factor of the
    index, bfri, is referred to as the fan-out and is
    denoted fo.
  • With a multilevel index, the number of block accesses
    required is approximately logfo(bi).
1
© e-Learning Centre, UCSC 2
3.3.2 Multilevel indexes: Overview of multilevel
indexes
  • If the first-level index has r1 entries, the blocking
    factor for the first level is bfr1 = fo.
  • The number of blocks required for the first level is
    therefore ceiling(r1 / fo).
  • Hence the number of entries in the second-level index
    is r2 = ceiling(r1 / fo), one entry per first-level block.
  • Similarly, r3 = ceiling(r2 / fo).
  • However, we need a second level only if the first
    level requires more than 1 block. Likewise, we add a
    further level only if the current level requires more
    than 1 block.
  • If the top level is level t, then
    t = ceiling(logfo(r1))
© e-Learning Centre, UCSC 2
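As a small added illustration (not in the original slides), the Python
sketch below computes the blocks per level and the number of levels t for
the earlier secondary-index example, where the first level has r1 = 300,000
entries and the fan-out is fo = 273.

    import math

    def multilevel_index(r1, fo):
        """Return (t, blocks_per_level) for r1 first-level entries and fan-out fo."""
        blocks_per_level = []
        entries = r1
        while True:
            blocks = math.ceil(entries / fo)
            blocks_per_level.append(blocks)
            if blocks == 1:
                break
            entries = blocks          # each block of this level needs one entry above
        return len(blocks_per_level), blocks_per_level

    t, levels = multilevel_index(300_000, 273)
    print(t, levels)                          # 3 [1099, 5, 1]
    print(math.ceil(math.log(300_000, 273)))  # 3, matching t = ceiling(logfo(r1))
    print(t + 1)                              # 4 block accesses (t index levels + 1 data block)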
3.4 Indexes on Multiple Keys
  • If a certain combination of attributes is used frequently,
    we can define an index on that combination of
    attributes for efficient access.
  • For example, suppose we have a file of students
    containing student_id, name, age, gpa, department_id,
    and department_name.
  • If we want to find students whose department_id = 1
    and gpa = 3.5, we can use the search strategies
    specified in the next slide.
1
© e-Learning Centre, UCSC 2
3.4 Indexes on Multiple Keys
  1. Assuming only department_id has an index, we can
     access the records with department_id = 1 using the
     index and then, among those, find the records that
     have gpa = 3.5.
  2. Alternatively, assuming only gpa has an index and
     department_id does not, we can access the records
     with gpa = 3.5 using the index and then find the
     records that have department_id = 1.
  3. If both the department_id and gpa fields have
     indexes, we can retrieve the records that meet each
     individual condition (department_id = 1 and gpa = 3.5)
     and then take the intersection of those record sets.
1
© e-Learning Centre, UCSC 2
3.4 Indexes on Multiple Keys
• All of the above methods eventually give the same set
  of records as the result.
• However, the number of records that satisfy only one of
  the conditions (either department_id = 1 or gpa = 3.5)
  is usually much larger than the number of records that
  satisfy both conditions (department_id = 1 and
  gpa = 3.5).
• Hence, none of the three methods above is efficient for
  the search we require.
• A multiple-key index on department_id and gpa would be
  more efficient in this case, because we can find the
  records that meet both requirements just by accessing
  the index file.
• We refer to keys containing multiple attributes as
  composite keys.
1
© e-Learning Centre, UCSC 2
3.4 Indexes on Multiple Keys
• Ordered Index on Multiple Attributes
  • For the file discussed previously, we can define a
    composite key field <department_id, gpa>.
  • A search key is then also a pair of values; for the
    previous example it is <1, 3.5>.
  • In general, if an index is created on attributes
    <A1, A2, A3, ..., An>, the search-key values are
    tuples of n values <v1, v2, v3, ..., vn>.
  • A lexicographic (alphabetical) ordering of these tuple
    values establishes an order on the composite search
    keys.
  • For example, all composite keys with department_id = 1
    precede those with department_id = 2.
  • When the department_id is the same, the composite
    keys are sorted in ascending order of gpa.
© e-Learning Centre, UCSC 2
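As a brief added illustration (not from the slides), Python tuples compare
lexicographically, so sorting <department_id, gpa> pairs produces exactly
the ordering described above.

    # Composite search keys <department_id, gpa>
    keys = [(2, 3.1), (1, 3.5), (1, 2.0), (2, 1.8), (1, 3.9)]

    # Tuples compare first by department_id, then by gpa
    print(sorted(keys))
    # [(1, 2.0), (1, 3.5), (1, 3.9), (2, 1.8), (2, 3.1)]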
3.4 Indexes on Multiple Keys
• Partitioned Hashing
  • Partitioned hashing is an extension of static external
    hashing (where, for a given search-key value, the
    hash function always computes the same address)
    that allows access on multiple keys.
  • It is suitable only for equality comparisons; it does
    not support range queries.
  • For a key consisting of n attributes, n separate hash
    addresses are generated. The bucket address is the
    concatenation of these n addresses.
  • It is then possible to search for a composite key by
    looking up the buckets that match the parts of the
    address we are interested in.
1
© e-Learning Centre, UCSC 2
3.4 Indexes on Multiple Keys
• Partitioned Hashing
  • For example, consider the composite search key
    <department_id, gpa>.
  • If department_id and gpa are hashed into 2-bit and
    6-bit addresses respectively, we get an 8-bit bucket
    address.
  • If department_id = 1 hashes to 01 and gpa = 3.5
    hashes to 100011, then the bucket address is
    01100011.
  • To search for all students with gpa = 3.5, we search
    the buckets 00100011, 01100011, 10100011 and
    11100011.
1
© e-Learning Centre, UCSC 3
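The idea can be sketched as follows (an added, minimal Python sketch; the
hash functions here are hypothetical stand-ins, so the bit patterns will not
match the 01/100011 values in the slide example).

    def h_dept(department_id):
        return format(hash(department_id) % 4, '02b')      # 2-bit part

    def h_gpa(gpa):
        return format(hash(gpa) % 64, '06b')               # 6-bit part

    def bucket_address(department_id, gpa):
        return h_dept(department_id) + h_gpa(gpa)          # 8-bit bucket address

    # Partial-match search on gpa only: enumerate all 2**2 dept-bit prefixes
    def buckets_for_gpa(gpa):
        suffix = h_gpa(gpa)
        return [format(i, '02b') + suffix for i in range(4)]

    print(bucket_address(1, 3.5))
    print(buckets_for_gpa(3.5))    # four candidate buckets, as in the slide example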
3.4 Indexes on Multiple Keys
• Partitioned Hashing
  • Advantages of partitioned hashing:
    i.  Ease of extension to any number of attributes.
    ii. Ability to design the bucket addresses so that
        frequently accessed attributes get the higher-order
        bits of the address (the higher-order bits are the
        leftmost bits).
    iii. No need to maintain a separate access structure
         for the individual attributes.

1
© e-Learning Centre, UCSC 3
3.4 Indexes on Multiple Keys
• Partitioned Hashing
  • Disadvantages of partitioned hashing:
    i.  Inability to handle range queries on any of the
        component attributes.
    ii. In most cases, records are not kept in the order of
        the key used by the hash function. Hence, using
        the lexicographic order of a combination of
        attributes as a key (e.g. <department_id, gpa>) to
        access the records would not be straightforward or
        efficient.

1
© e-Learning Centre, UCSC 3
3.4 Indexes on Multiple Keys
• Grid Files
  • A grid file is constructed using a grid array with one
    linear scale (or dimension) for each of the search
    attributes.
  • For the previous example of the student file, we can
    construct one linear scale for department_id and
    another for gpa.
  • The linear scales are chosen so that the records are
    distributed fairly uniformly across the cells of the grid
    for the indexed attributes.
  • Each cell points to a bucket address where the
    records corresponding to that cell are stored.

1
© e-Learning Centre, UCSC 3
3.4 Indexes on Multiple Keys
The following illustration shows a grid array for the Student file,
with one linear scale for department_id and another for the gpa
attribute.

[Figure: 4 x 4 grid array for the Student file, with cells indexed
0-3 on each dimension and cell (1, 3) highlighted.]

Linear scale for department_id        Linear scale for gpa
department_id   grid cell             gpa           grid cell
0               0                     < 0.9         0
1               1                     1.0 - 1.9     1
2               2                     2.0 - 2.9     2
3               3                     > 3.0         3
© e-Learning Centre, UCSC 3
3.4 Indexes on Multiple Keys
• Grid Files
  • When we query for department_id = 1 and gpa = 3.5,
    the query maps to cell (1, 3), as highlighted in the
    previous slide.
  • Records for this combination can be found in the
    corresponding bucket.
  • Due to the nature of this indexing, we can also perform
    range queries.
  • For example, for the range query gpa > 2.0 and
    department_id < 2, the bucket pool corresponding to
    cells (0, 2), (0, 3), (1, 2) and (1, 3) can be selected.
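The cell lookup can be sketched in Python as follows (an added illustration,
not from the slides; the gpa boundaries approximate the linear scale above,
and the bucket directory is just a dictionary).

    import bisect

    # Linear scale for gpa: cell 0 -> below 1.0, 1 -> 1.0-1.9, 2 -> 2.0-2.9, 3 -> 3.0 and above
    GPA_BOUNDS = [1.0, 2.0, 3.0]

    def grid_cell(department_id, gpa):
        dept_cell = department_id              # department_id 0-3 maps directly to its cell
        gpa_cell = bisect.bisect_right(GPA_BOUNDS, gpa)
        return (dept_cell, gpa_cell)

    # Each cell points to a bucket of records (kept here as a plain dict)
    buckets = {}
    buckets.setdefault(grid_cell(1, 3.5), []).append({'name': 'Kasun', 'gpa': 3.5})

    print(grid_cell(1, 3.5))    # (1, 3) - the cell highlighted in the example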
3.4 Indexes on Multiple Keys
• Grid Files
  • Grid files can be applied to any number of search
    keys.
  • If we have n search keys, we get a grid array of n
    dimensions.
  • Hence it is possible to partition the file along the
    dimensions of the search-key attributes.
  • Thus, grid files provide access by combinations of
    values along the dimensions of the grid array.
  • Space overhead and the additional maintenance cost of
    reorganizing dynamic files are some drawbacks of grid
    files.

1
© e-Learning Centre, UCSC 3
3.5 Other types of Indexes
• Hash Indexes
  • A hash index is a secondary structure that provides
    access to the file by means of hashing.
  • The search key is defined on an attribute other than
    the one used for organizing the primary data file.
  • Index entries consist of the hashed search-key value
    and a pointer to the record corresponding to that key.
  • The index file of a hash index can be organized as a
    dynamically expandable hash file.

1
© e-Learning Centre, UCSC 3
3.5 Other types of Indexes
• Hash Indexes
  [Figure: Hash-based indexing.]
1
© e-Learning Centre, UCSC 3
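As a rough added illustration (not the figure itself), a hash index can be
thought of as a hash table from search-key values to record pointers; a real
DBMS would use buckets with overflow chains, but a Python dictionary conveys
the idea.

    # A toy hash index on Emp_id: key -> (block number, record offset within block)
    hash_index = {}

    def insert_entry(emp_id, block_no, offset):
        hash_index[emp_id] = (block_no, offset)

    def lookup(emp_id):
        return hash_index.get(emp_id)   # one hashed lookup instead of scanning the file

    insert_entry(51024, block_no=17, offset=3)
    print(lookup(51024))                # (17, 3)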
3.5 Other types of Indexes
• Bitmap Indexes
  • Bitmap indexes are commonly used for querying on
    multiple keys.
  • They are generally used for relations that consist of a
    large number of rows.
  • A bitmap index can be created for every value, or
    range of values, in one or more columns.
  • However, the columns used to create a bitmap index
    should have a relatively small number of distinct
    values.

1
© e-Learning Centre, UCSC 3
3.5 Other types of Indexes
• Bitmap Indexes
  • Suppose we create a bitmap index on column C for a
    particular value V, and the file has n records.
  • The bitmap for value V then contains n bits.
  • For a record with record number i, if that record has
    the value V in column C, the ith bit is set to 1;
    otherwise it is set to 0.

1
© e-Learning Centre, UCSC 4
3.5 Other types of Indexes

• Bitmap Indexes
  • In the given table we have a column recording the gender
    of each employee.
  • The bitmap indexes for the two gender values are arrays
    of bits, as shown below.

  Row_id  Emp_id  Lname      Gender  M  F
  0       51024   Sandun     M       1  0
  1       23402   Kamalani   F       0  1
  2       62104   Eranda     M       1  0
  3       34723   Christina  F       0  1
  4       81165   Clera      F       0  1
  5       13646   Mohamad    M       1  0
  6       54649   Karuna     M       1  0
  7       41301   Padma      F       0  1

  M  10100110
  F  01011001
1
4
© e-Learning Centre, UCSC
1
3.5 Other types of Indexes
• Bitmap Indexes
  • According to the example given in the previous slide,
    • If we consider the value F in column Gender, bits 1,
      3, 4 and 7 are set to "1" because records 1, 3, 4 and
      7 have the value F in that column, while the bits for
      records 0, 2, 5 and 6 are set to "0".
  • A bitmap index is created on a set of records that are
    numbered from 0 to n−1, with a record id or row id
    that can be mapped to a physical address.
  • The physical address is formed from the block number
    and the record offset within the block.

1
© e-Learning Centre, UCSC 4
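The bitmaps above can be built and combined with a bitwise AND to answer
multi-key queries, as in this short added sketch (the department column here
is hypothetical, introduced only to show the AND step).

    genders = ['M', 'F', 'M', 'F', 'F', 'M', 'M', 'F']    # rows 0..7 from the table

    def build_bitmap(values, v):
        """Bit i is 1 if record i has value v (bit 0 shown leftmost, as in the slides)."""
        return ''.join('1' if x == v else '0' for x in values)

    bm_m = build_bitmap(genders, 'M')
    bm_f = build_bitmap(genders, 'F')
    print(bm_m, bm_f)                     # 10100110 01011001

    # Multi-key query: AND two bitmaps to find records satisfying both conditions
    dept = ['Sales', 'HR', 'Sales', 'Sales', 'HR', 'Sales', 'HR', 'Sales']
    bm_sales = build_bitmap(dept, 'Sales')
    both = ''.join('1' if a == b == '1' else '0' for a, b in zip(bm_f, bm_sales))
    print(both)                           # female AND in Sales -> 00010001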
3.5 Other types of Indexes
• Function based indexing
  • This method was introduced by commercial DBMS
    products such as the Oracle relational DBMS.
  • In function-based indexing, a function is applied to
    one or more columns, and the resulting value is the
    key of the index.
  • For example, we can create an index on the
    uppercase of the Lname field as follows:

    CREATE INDEX upper_lname
    ON Employee (UPPER(Lname));

  • The UPPER function is applied to the Lname field to
    create the index called upper_lname.
4
© e-Learning Centre, UCSC
3
3.5 Other types of Indexes
• Function based indexing
  • If we issue the following query, the DBMS will use the
    index created on UPPER(Lname) rather than searching
    the entire table.

    SELECT Emp_id, Lname
    FROM Employee
    WHERE UPPER(Lname) = 'SANDUN'
1
© e-Learning Centre, UCSC 4
3.6 Index Creation and Tuning
• Index Creation
  • An index is not an essential part of a data file;
    indexes can be created and removed dynamically.
  • Indexes are also known as access structures. We can
    create indexes based on the most frequently used
    search requirements.
  • A secondary index can be created regardless of the
    physical ordering of the data file.
  • A secondary index can therefore be used in conjunction
    with virtually any primary record organization.
  • It can be used in addition to a primary organization
    such as ordering, hashing, or mixed files.
1
© e-Learning Centre, UCSC 4
3.6 Index Creation and Tuning
• Index Creation
  • The following command is the general way of creating
    an index in an RDBMS:

    CREATE [ UNIQUE ] INDEX <index name>
    ON <table name> ( <column name> [ <order> ]
                      { , <column name> [ <order> ] } )
    [ CLUSTER ] ;

  • Keywords in square brackets are optional.
  • [ CLUSTER ] → sort the records in the data file on the
    indexing attribute.
  • <order> → ASC/DESC (default: ASC)

1
© e-Learning Centre, UCSC 4
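The basic form of the command can be tried out as follows (an added
Python/sqlite3 sketch; SQLite accepts CREATE [UNIQUE] INDEX with ASC/DESC,
but not the CLUSTER option).

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE Employee (Emp_id INTEGER PRIMARY KEY, Lname TEXT, Dno INTEGER)')

    # CREATE [UNIQUE] INDEX <index name> ON <table name> (<column> [<order>], ...)
    conn.execute('CREATE INDEX idx_emp_dno ON Employee (Dno ASC)')

    # The query planner can now use idx_emp_dno for selections on Dno
    plan = conn.execute('EXPLAIN QUERY PLAN SELECT * FROM Employee WHERE Dno = 5').fetchall()
    print(plan)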
3.6 Index Creation and Tuning
• Tuning Indexes
  • The indexes we have created may require modification
    for the following reasons:
    i.   Queries may run for a long time because a needed
         index is missing.
    ii.  Certain indexes may not get utilized at all.
    iii. Attributes used to create an index may be subject
         to frequent changes.
  • A DBMS provides options to view the execution order
    of a query. The indexes used and the number of disk
    accesses are included in this view, which is known as
    the query plan.
  • With the query plan, we can identify whether the above
    problems are occurring and update or remove indexes
    accordingly.

1
© e-Learning Centre, UCSC 4
3.6 Index Creation and Tuning
• Tuning Indexes
  • Database tuning is carried out with the goal of
    achieving the best overall performance. The
    requirements are evaluated continuously, and the
    organization of indexes and files is changed
    accordingly.
  • Changing a nonclustered index into a clustered index
    (or vice versa), and creating or dropping indexes, are
    some ways of improving performance.
  • Rebuilding an index may also improve performance by
    reclaiming the space wasted by many deletions.

1
© e-Learning Centre, UCSC 4
3.7 Physical Database Design in Relational
Databases
• Analyzing the Database queries and transactions
  • Before designing the physical structure, we should
    have a thorough idea of the intended use of the
    database and at least abstract knowledge of the
    queries that will be used.
  • The physical design of a database should provide an
    appropriate structure for storing the data and, at the
    same time, facilitate good performance.
  • The mix of queries, transactions, and applications that
    are expected to run on the database is one of the
    factors a database designer should consider before
    designing the physical structure.
  • Let's discuss each factor in detail.

1
© e-Learning Centre, UCSC 4
3.7 Physical Database Design in Relational
Databases
• Analyzing the Database queries and transactions
  • For a retrieval query, the information given below is
    important:
    i.   The relations that will be accessed by the query
    ii.  The attributes specified in the selection conditions
    iii. The type of selection condition (equality, inequality,
         range, etc.)
    iv.  The attributes used to link multiple tables (join
         conditions)
    v.   The attributes retrieved by the query
  • The attributes in (ii) and (iv) are candidates for index
    creation.

1
© e-Learning Centre, UCSC 5
3.7 Physical Database Design in Relational
Databases

• Analyzing the Database queries and transactions

• When it comes to the update operation or update


transaction, we should consider,
i. Files subject to update
ii. Whether it is an insert, delete or update
operation
iii. Which attributes are specified in the selection
condition, to update or delete.
iv. Attributes whose values are subject to change
by the update query.
• Attributes in iii are useful when creating an index.

1
© e-Learning Centre, UCSC 5
3.7 Physical Database Design in Relational
Databases

• Analyzing the Expected Frequency of Invocation of


Queries and Transactions

• We must consider how frequently we expect to call/


invoke a particular query.
• An aggregated list of expected frequencies for all
the queries and transactions along with their
attributes is prepared.

1
© e-Learning Centre, UCSC 5
3.7 Physical Database Design in Relational
Databases
• Analyzing the Time Constraints of Queries and
  Transactions
  • Some queries and transactions have strict time
    constraints. For example, in a stock exchange system,
    some queries are required to complete within
    milliseconds.
  • Generally, primary access structures provide the most
    efficient way of locating a record in a file. Hence,
    the selection attributes of queries with time
    constraints should be given high priority when
    creating primary access structures.

1
© e-Learning Centre, UCSC 5
3.7 Physical Database Design in Relational
Databases
• Analyzing the Expected Frequency of Update Queries
  • Updating the access paths themselves slows down
    update operations. Therefore, a minimal number of
    access paths should be specified for files that are
    subject to frequent updates.

1
© e-Learning Centre, UCSC 5
3.7 Physical Database Design in Relational
Databases
• Analyzing the Uniqueness Constraints on Attributes
  • Access paths should be defined on the primary key of
    a file and on any unique attributes that are candidate
    keys.
  • Having an index (access path) defined makes it easy
    to check uniqueness by searching the index.
  • This helps when inserting new records: if the value
    already exists, the database rejects the record because
    it violates the uniqueness constraint.

1
© e-Learning Centre, UCSC 5
3.7 Physical Database Design in Relational
Databases
• Design Decisions about indexing
  • Whether to index an attribute:
    • In general, indexes are created on attributes that are
      used as the unique key of the file, or that are used
      in selection conditions or join conditions of queries.
    • Multiple indexes may be defined so that some
      operations can be processed by scanning the indexes
      alone, without accessing the data file.

1
© e-Learning Centre, UCSC 5
3.7 Physical Database Design in Relational
Databases
• Design Decisions about indexing
  • What attribute or attributes to index on:
    • An index can be defined on a single attribute, or it
      can be a composite index created on multiple
      attributes. In a composite index, the order of the
      attributes should match the order in which they
      appear in the respective queries.
      Ex: A composite key on (department, subject) assumes
      that queries are based on subjects within a
      department.

1
© e-Learning Centre, UCSC 5
3.7 Physical Database Design in Relational
Databases
• Design Decisions about indexing
  • Whether to set up a clustered index:
    • A file cannot have both a primary index and a
      clustering index, because in both cases the data file
      must be physically ordered on the indexing field, and
      a file can have only one physical ordering. A
      clustering index is worthwhile if it helps answer the
      intended queries largely by accessing the index;
      otherwise there is little benefit in creating one. If
      multiple queries require clustering on different
      attributes, we should evaluate the gain from each and
      decide which attribute to cluster on.
  • Whether to use dynamic hashing for the file:
    • Dynamic hashing is suitable for files that expand and
      shrink frequently.

1
© e-Learning Centre, UCSC 5
Activity

1. What are the types of single level ordered indexes?


a. ________
b. ________
c. ________

1
© e-Learning Centre, UCSC 5
Activity

Fill in the blanks


1. A file can have _____ physical ordering field.
2. Primary, Clustering and Secondary index are types of
_______ level _____ indexes.
3. ________ search is possible on the index field since it has
________ values.
4. Indexing access structure is established on ______ ____.
5. Index file is usually _____ than the datafile.

1
© e-Learning Centre, UCSC 6
Activity

Mark whether each given statement is true or false.


1. An unordered file which consists two fields and limited
length records is known as a primary index file. ( t/f )
2. for a given block in an ordered data file, the first record
in that block is known as anchor record. ( t/f )
3. Indexes that contains index entries for some records in
the data file referred as non - dense index. ( t/f )
4. Ordering key field of the index file and the primary key
of the data file have same data type. ( t/f )
5. In primary index, index file contains one index
entry(a.k.a index record) for each record in the data
file.( t/f )

1
© e-Learning Centre, UCSC 6
Activity

1. You have a file with 600,000 records (r), which is


ordered by its key field and each record of this file is
fixed length and unspanned. Record length (R) is 100
bytes and block size(B) is 4096.
a. What is the blocking factor?
b. How many blocks required to store this file?
c. Calculate the number of block accesses required
when performing a binary search on this file and
access data.

1
© e-Learning Centre, UCSC 6
Activity

1. You have a file with 400,000 records (r), which is


ordered by its key field and each record of this file is
fixed length and unspanned. Record length (R) is 100
bytes and block size(B) is 4096.
a. What is the blocking factor?
b. How many blocks required to store this file?
c. Calculate the number of block accesses required
when performing a binary search on this file and
access data.

1
© e-Learning Centre, UCSC 6
Activity

1. You have a file with 400,000 records (r), which is


ordered by its key field and each record of this file is
fixed length and unspanned. Record length (R) is 100
bytes and block size(B) is 4096. If you have created a
primary index file with 9 bytes long ordering key field
(v) and 6 bytes long block pointer (p),
a. What is the blocking factor for index ?
b. How many blocks required for the index file?
c. Calculate the number of block accesses required
when performing a binary search on index fileand
access data.

1
© e-Learning Centre, UCSC 6
Activity

1. A data file with 400,000 records (r) is ordered by a non-


key field called “product_category”. The
product_category field has 750 distinct values. Record
length (R) is 100 bytes and block size(B) is 4096. If you
have created a primary index file on this non-key field
with 9 bytes long ordering key field (v) and 6 bytes long
block pointer (p),
a. What is the blocking factor for index ?
b. How many blocks required to store the index file?
c. Calculate the number of block accesses required
when performing a binary search on index file and
access data.

1
© e-Learning Centre, UCSC 6
Activity

1. Assume we search for a non-ordering key field V = 9


bytes long,in a file with 600,000 records with a fixed
length of 100 bytes. And given block size B =8,192
bytes.
a. What is the blocking factor?
b. What is the required number of blocks?
c. How many block accesses required for a linear
search?

1
© e-Learning Centre, UCSC 6
Activity

1. Assume we create a secondary index on non-ordering


key field V = 9 bytes long, with entries for the block
pointers P=6, in a file with 600,000 records with a fixed
length of 100 bytes. And given block size B =8,192
bytes.
a. What is the blocking factor for index ?
b. How many blocks required to store the index file?
c. Calculate the number of block accesses required
when performing a binary search on index file and
access data.

1
© e-Learning Centre, UCSC 6
Activity

1. Assume we have a file with multi-level indexing. In the


first level, number of blocks b1 = 1099 and blocking
factor (bfri) = 273.
a. Calculate the number of blocks required for second
level.
b. Calculate the number of blocks required for third
level.
c. what is the top level index(t)?
d. How many block accesses required to access a
record using this multi-level index?

1
© e-Learning Centre, UCSC 6
4 : Distributed Database Systems

IT3306 – Data Management


Level II - Semester 3

1
© e-Learning Centre, UCSC
Overview

• This lesson discusses the concepts, advantages, and
  different types of distributed database systems.
• Distributed database design techniques and concepts
  such as fragmentation, replication, and allocation will be
  discussed in detail.
• Thereafter, we will look at distributed database query
  optimization techniques.
• Finally, the NoSQL characteristics related to DDBs will be
  discussed.

© e-Learning Centre, UCSC 2


Intended Learning Outcomes

• At the end of this lesson, you will be able to;


• Describe the concepts in data distribution and
distributed data management.
• Analyze new technologies that have emerged to
manage and process big data.
• Explain the distributed solutions provided in NoSQL
databases.
• Describe different concepts and systems being used
for processing and analysis of big data.
• Describe cloud computing concepts.

© e-Learning Centre, UCSC 3


List of subtopics

4.1. Distributed Database Concepts, Components and Advantages


4.2. Types of Distributed Database Systems
4.3. Distributed Database Design Techniques
4.3.1. Fragmentation
4.3.2. Replication and Allocation
4.3.3. Distribution Models: Single Server , Sharding, Master-Slave,
Peer-to-Peer
4.4. Query Processing and Optimization in Distributed Databases
4.4.1 Distributed Query Processing
4.4.2 Data Transfer Costs of Distributed Query Processing
4.5. NoSQL Characteristics related to Distributed Databases and
Distributed Systems
© e-Learning Centre, UCSC 4
4.1 Distributed Database Concepts, Components and
Advantage

• A system that performs its assigned tasks with the help of
  several sites connected via a computer network is known
  as a distributed computing system.
• The goal of a distributed computing system is to partition a
  complex problem that requires large computational power
  into smaller pieces of work.
• Distributed database technology emerged as a result of the
  merger between database technology and distributed
  systems technology.

© e-Learning Centre, UCSC 5


4.1 Distributed Database Concepts, Components and
Advantage

• A distributed database (DDB) is a set of logically
  interrelated databases connected via a computer network.
• To manage the distributed database and make the
  distribution transparent to the user, we use software called
  a distributed database management system (DDBMS).
• The following are the minimum conditions that should be
  satisfied for a database to be distributed:
  - Multiple computers (nodes) connected over a network that
    can transmit data.
  - A logical relationship between the information stored at
    the different nodes.
  - The hardware, software, and data at each site need not
    be identical.

© e-Learning Centre, UCSC 6


4.1 Distributed Database Concepts, Components and
Advantage

• The nodes can either be at the same physical location,
  connected via a LAN (Local Area Network), or be
  geographically dispersed and connected via a WAN
  (Wide Area Network).
• Different network topologies can be used to establish
  communication between the sites.
• The topology selected directly affects the performance
  and the query processing of the distributed database.

© e-Learning Centre, UCSC 7


4.1 Distributed Database Concepts, Components and
Advantage

Transparency
• In general, transparency means hiding implementation
  details from the end user.
• Several types of transparency are introduced in the
  distributed database domain because the data is distributed
  over multiple nodes.
  i. Location transparency: the commands issued do not
     change according to the location of the data or of the
     node at which they are issued.
  ii. Naming transparency: once a name is associated with
      an object, the object can be accessed without giving
      additional details such as the location of the data.

© e-Learning Centre, UCSC 8


4.1 Distributed Database Concepts, Components and
Advantage

Transparency Cont.
iii. Replication transparency : User is not aware of the
replicas that are available in multiple nodes in order to
provide better performance, availability and reliability.
iv. Fragmentation transparency: User is not aware of
the fragments available.
v. Design transparency: User is unaware of the design
of the distributed database while he is performing the
transactions.
vi. Execution transparency: User is unaware of the
transaction execution details.

© e-Learning Centre, UCSC 9


4.1 Distributed Database Concepts, Components and
Advantage

Reliability and Availability
• Reliability is the probability that a system is up and
  running at a given point in time.
• Availability is the probability that a system is continuously
  available during a given time interval.
• Reliability and availability are directly related to the
  faults, errors, and failures of the database.
• If a system deviates from its specified behaviour, we call
  it a failure.
• An error is the subset of system states that can lead to a
  failure.
• The cause of an error is known as a fault.

1
© e-Learning Centre, UCSC
0
4.1 Distributed Database Concepts, Components and
Advantage

Reliability and Availability Cont.


• There are several approaches to making a system reliable.
• One approach is fault tolerance: faults are detected and
  removed before they can result in system failures.
• Another approach is ensuring that the system does not
  contain any faults, through quality control measures and
  testing.
• A reliable DDBMS should be able to keep processing user
  requests as long as database consistency is preserved.
• The recovery manager in a DDBMS deals with failures
  arising from transactions, hardware, and communication
  networks.

1
© e-Learning Centre, UCSC
1
4.1 Distributed Database Concepts, Components and
Advantage

Scalability and Partition Tolerance


• Scalability determines the extent to which the system can
  be expanded without disturbing ongoing operations.
• There are two main types of scalability, as follows.

  Horizontal Scalability              Vertical Scalability
  Expands the number of nodes in      Expands the capacity of the
  a distributed system.               individual nodes in the system,
  Makes it possible to distribute     e.g. expanding the storage
  some of the data and processing     capacity or the processing
  loads among old and new nodes.      power of a node.
© e-Learning Centre, UCSC
2
4.1 Distributed Database Concepts, Components and
Advantage

Scalability and Partition Tolerance Cont.


• When the number of nodes increases, the possibility of
  network failures also grows, which can partition the
  nodes into subgroups.
• In this situation, the nodes within a single subnetwork can
  communicate with each other, while communication among
  the partitions is lost.
• The ability of the system to keep operating even when the
  network is divided into separate groups is known as
  partition tolerance.

1
© e-Learning Centre, UCSC
3
4.1 Distributed Database Concepts, Components and
Advantage

Autonomy
• The extent to which a single node (database) has the
  capacity to operate independently is referred to as
  autonomy.
• Higher autonomy gives greater flexibility to the nodes.
• Autonomy applies in several respects, such as:
  - Design autonomy: independence in the choice of data
    model and transaction management techniques.
  - Communication autonomy: the extent to which each
    node can decide on sharing information with other
    nodes.
  - Execution autonomy: independence of users to operate
    as they prefer.
1
© e-Learning Centre, UCSC
4
4.1 Distributed Database Concepts, Components and
Advantage

Advantages of DDB
1. Improves the flexibility of application development
- The ability of carrying out application development
and maintenance from different physical locations.
2. Improve Availability
- Faults are isolated to the site of origin without
disturbing the other nodes connected.
- Even if a single node fails, the other nodes continue to
operate, so the entire system does not fail.
(However, in a centralized system, failure at a single
site makes the whole system unavailable to all
users). Therefore, availability is improved with a
DDB.

1
© e-Learning Centre, UCSC
5
4.1 Distributed Database Concepts, Components and
Advantage
Advantages of DDB Cont.
3. Improve performance
- Data items are stored closer to where they are needed
most. This reduces contention for the CPU and I/O
services required. The access delays involved in
wide area networks are also brought down.
- Since each node holds only a partition of the entire
DB, the number of transactions executed in each
site is smaller compared to the situation where all
transactions are submitted to a single centralized
database.
- Execution of queries in parallel by executing multiple
queries at different sites, or by splitting the query into
a number of subqueries also improves the
performance.
1
© e-Learning Centre, UCSC
6
4.1 Distributed Database Concepts, Components and
Advantage

Advantages of DDB Cont.


4. Easy expansion
- The system can be expanded by adding more nodes or
increasing the database size, so growth in data is
accommodated much more easily than in a centralized
system.

1
© e-Learning Centre, UCSC
7
Activity

Select the advantages of a distributed database over a


centralized database from the following features given.
• Less cost
• Slow responses
• Less complexity
• Improved performance
• Easier Scalability
• Availability improvement
• Maintainability
• Flexibility in application development

1
© e-Learning Centre, UCSC
8
Activity

Fill in the blanks with the most suitable words given.


(same,multiple,network,location,replication,execution, design,
horizontal, vertical, communication, same, fragmentation,
naming)
With _________ transparency, user is unaware about the
different locations, where the data is stored.
Making the user unaware of having multiple copies of the same
data item in different sites is referred to as __________
transparency.
The ability of increasing the number of nodes in a distributed
database is __________ scalability.
Increasing the storage capacity of nodes is known as
_____________ scalability.
1
© e-Learning Centre, UCSC
9
Activity

State whether the given statements are True or False.


1. A system with high transparency offers a lot of flexibility to
the application developer. ( True / False )
2. It is mandatory for all the nodes to be identical in terms of
data, hardware, and software. ( True / False )
3. A distributed Database should be connected via a local area
network. ( True / False )
4. In DDB systems, expanding the processing power of nodes
is not considered as a way of increasing scalability. ( True /
False )
5. With data localization, number of CPU and I/O services
required can be reduced. ( True / False )

2
© e-Learning Centre, UCSC
0
4.2 Types of Distributed Database Systems

• Distributed database management systems can be
  classified based on their degree of homogeneity.
  - Homogeneous system: all the sites (servers) in the
    DDB use identical software, and all the clients use
    identical software.
  - Heterogeneous system: different sites (servers) use
    different software, or the clients involved in the DDB
    use different software.

2
© e-Learning Centre, UCSC
1
4.2 Types of Distributed Database Systems

- The degree of local autonomy is another factor used,
  together with the degree of homogeneity, to classify
  these systems.
- If a local site cannot operate as an independent,
  standalone site, there is no local autonomy.
- If local transactions are permitted to access a server
  directly, the system has some degree of local
  autonomy.

2
© e-Learning Centre, UCSC
2
4.2 Types of Distributed Database Systems

• The classification of DDBMSs with regard to distribution,
  autonomy, and heterogeneity can be explained as follows.
  - Centralized DB: complete autonomy, but no distribution
    and no heterogeneity.
  - Pure distributed database systems: there is only one
    conceptual schema, and any site that is part of the
    DDBMS provides access to the entire system; therefore,
    no local autonomy exists.
  - Systems in between are classified by their level of
    autonomy: federated database systems and multidatabase
    systems. These are composed of independent, centralized
    DBMSs, each with its own local users, local transactions,
    and DBA, and therefore have a very high degree of local
    autonomy.
2
© e-Learning Centre, UCSC
3
4.2 Types of Distributed Database Systems

- Federated database systems: have a global view (schema)
  of the federation of databases that is shared by the
  applications.
- Multidatabase systems: have full local autonomy but do
  not have a global schema.
  e.g. a system with full local autonomy and full
  heterogeneity (a peer-to-peer database system).

2
© e-Learning Centre, UCSC
4
4.2 Types of Distributed Database Systems

The classification of distributed databases discussed in the
previous two slides can be seen in the following figure.

[Figure: classification of database systems along three axes -
distribution, autonomy, and heterogeneity - locating centralized
database systems, pure distributed database systems, federated
database systems, and multidatabase systems in that space.]
2
5
© e-Learning Centre, UCSC
4.3 Distributed Database Design Techniques

Fragmentation
● As the name implies, in distributed architecture, separate
portions of data should be stored in different nodes.
● Initially, we have to identify the basic logical unit of data.
In a relational database, relations are the simplest logical
unit.
● Fragmentation is a process of dividing the whole
database into various sub relations so that data can be
stored in different systems.

2
© e-Learning Centre, UCSC
6
4.3 Distributed Database Design Techniques
Example
● Suppose we have a relational database schema with three
tables (EMPLOYEE, DEPARTMENT, WORKS_ON) that we want to
partition in order to store the data at several nodes.
● Assume there are no replications allowed (data replication
allows storage of certain data in more than one place to
gain availability and reliability).
Employee
FNAME LNAME SSN BDATE ADDRESS

Works_on
ESSN DNO HOURS

Department
DNO DNAME LOCATION

2
© e-Learning Centre, UCSC
7
4.3 Distributed Database Design Techniques

Example - Approach 01
● One approach to data distribution is storing each complete
relation at a single site.
We can store each relation in one node. In the following
example, the Employee table is stored at Site 01, the
Department table at Site 02, and the Works_on table at Site 03.

Employee → Site 01    Department → Site 02    Works_on → Site 03

Data distribution technique 01: Storing each relation at one site.


2
© e-Learning Centre, UCSC
8
4.3 Distributed Database Design Techniques
Example - Approach 02
● Another approach is dividing a relation into smaller logical
units for distribution.
● For instance, think of a scenario where 3 different
departments are located in 3 separate places. Finance
department in Colombo, research department in
Rathnapura and headquarters in Kandy as given in the
below table.

DNO   DNAME          LOCATION
d4    Headquarters   Kandy
d3    Finance        Colombo
d8    Research       Rathnapura
2
© e-Learning Centre, UCSC
9
4.3 Distributed Database Design Techniques

Example - Approach 02 Cont.

● For the scenario given in previous slide, we can store data


relevant to each department in separate site.
● The details of finance department will be stored in one
site.
● Details of headquarters will be stored in another site.
● Details of research department will be stored in another
separate site.
● Dividing a relation into smaller logical units can be done
by horizontal fragmentation or vertical fragmentation
which will be discussed in coming slides.

3
© e-Learning Centre, UCSC
0
4.3 Distributed Database Design Techniques

Example - Approach 02 Cont.

Finance Department Details      → Site 01
Headquarters Department Details → Site 02
Research Department Details     → Site 03

Data distribution technique 02: Storing details of different
departments in each site.

© e-Learning Centre, UCSC


4.3 Distributed Database Design Techniques

Horizontal Fragmentation
• A subset of rows in a relation is known as horizontal
fragment or shard.
• Selection of the tuple subset is based on a condition of one
or more attributes.
• With horizontal fragmentation, we can divide a table
horizontally by creating subsets of tuples, where each subset
has a logical meaning of its own.
• Then these fragments are assigned to different nodes in the
distributed system.
• Each horizontal fragment on a relation R can be specified in
the relational algebra by σCi(R) operation.(Ci → condition,
R→ relation).
• Reconstruction of the original relation is done by taking the
union of all fragments.
3
© e-Learning Centre, UCSC
2
• Ex:- If we want to store sales employee details and marketing
employee details separately in 2 nodes, we can use horizontal
fragmentation.
Employee

Name Salary Department

Kasun 120000 Sales

Rishad 135000 Sales

Kirushanthi 45900 Marketing

Anna 47900 Marketing

Sales Employee Marketing Employee

Name Salary Department Name Salary Department

Kasun 120000 Sales Kirushanthi 45900 Marketing

Rishad 135000 Sales Anna 47900 Marketing

3
© e-Learning Centre, UCSC
3
4.3 Distributed Database Design Techniques

Explanation
• The original table (Employee) is divided into two subsets of
  rows.
• The first horizontal fragment (Sales_employee) consists of the
  details of employees who work in the sales department.
  Sales_employee ← σ Department = “Sales” (Employee)
• The second horizontal fragment (Marketing_employee) consists
  of the details of employees who work in the marketing
  department.
  Marketing_employee ← σ Department = “Marketing” (Employee)

3
© e-Learning Centre, UCSC
4
4.3 Distributed Database Design Techniques

Vertical Fragmentation
• With vertical fragmentation, we can divide the table by
columns.
• There can be situations where we do not need to store all
the attributes of a relation in a certain site.
• Therefore, with the technique of vertical fragmentation, we
can keep only required columns of a relation within a
single site.
• In vertical fragmentation, it is a must to include the primary
key or some unique key attribute in every vertical
fragment. Otherwise, we will not be able to create the
original table by putting the fragments together.

3
© e-Learning Centre, UCSC
5
4.3 Distributed Database Design Techniques

Vertical Fragmentation Cont.


• A vertical fragment on a relation R can be specified by a
π Ai (R) operation in the relational algebra. (Ai → attributes, R→
relation )
• The Outer Union on vertical fragments can generate the
original table.

3
© e-Learning Centre, UCSC
6
• Ex:- If we want to store employees’ pay details and department
details separately in 2 nodes, we can use vertical fragmentation.
Employee

Name Salary Department

Kasun 120000 Sales

Rishad 135000 Sales

Kirushanthi 45900 Marketing

Anna 47900 Marketing

pay data Dept. data

Name Salary Name Department

Kasun 120000 Kasun Sales


Rishad 135000 Rishad Sales
Kirushanthi 45900 Kirushanthi Marketing
Anna 47900
Anna Marketing
3
© e-Learning Centre, UCSC
7
4.3 Distributed Database Design Techniques

Explanation
• The original table (Employee) is divided into two subsets of
  columns.
• The first vertical fragment (Pay_data) consists of the salary
  details of employees.
  Pay_data ← π Name, Salary (Employee)
• The second vertical fragment (Dept_data) consists of the
  department details of employees.
  Dept_data ← π Name, Department (Employee)

3
© e-Learning Centre, UCSC
8
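To make the fragmentation operations concrete, here is a small added Python
sketch (illustrative only; relations are represented as lists of
dictionaries, and Name is used as the identifying attribute) of horizontal
fragmentation, vertical fragmentation, and reconstruction.

    employee = [
        {'Name': 'Kasun', 'Salary': 120000, 'Department': 'Sales'},
        {'Name': 'Rishad', 'Salary': 135000, 'Department': 'Sales'},
        {'Name': 'Kirushanthi', 'Salary': 45900, 'Department': 'Marketing'},
        {'Name': 'Anna', 'Salary': 47900, 'Department': 'Marketing'},
    ]

    # Horizontal fragmentation: sigma(condition) - subsets of rows
    sales = [t for t in employee if t['Department'] == 'Sales']
    marketing = [t for t in employee if t['Department'] == 'Marketing']
    reconstructed_h = sales + marketing     # union of the fragments

    # Vertical fragmentation: pi(attributes) - subsets of columns, key kept in both
    pay_data = [{'Name': t['Name'], 'Salary': t['Salary']} for t in employee]
    dept_data = [{'Name': t['Name'], 'Department': t['Department']} for t in employee]

    # Reconstruction: join the vertical fragments on the key attribute (Name)
    dept_by_name = {t['Name']: t['Department'] for t in dept_data}
    reconstructed_v = [{**t, 'Department': dept_by_name[t['Name']]} for t in pay_data]

    assert reconstructed_v == employee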
4.3 Distributed Database Design Techniques

Mixed Fragmentation
• Another fragmentation technique is hybrid (mixed)
  fragmentation, where a combination of both horizontal and
  vertical fragmentation is used.
• For example, take the EMPLOYEE table that we used
  before.
• The Employee table is first split vertically into payment
  data and department data (vertical fragmentation).
• The department-data fragment is then further separated by
  department (horizontal fragmentation).
• Relevant fragmentations with data can be seen in the next
slide.

3
© e-Learning Centre, UCSC
9
Employee

Name Salary Department

Kasun 120000 Sales

Rishad 135000 Sales

Kirushanthi 45900 Marketing

Anna 47900 Marketing

pay data
Name         Salary
Kasun        120000
Rishad       135000
Kirushanthi  45900
Anna         47900

Sales - Dept data
Name    Department
Kasun   Sales
Rishad  Sales

Marketing - Dept data
Name         Department
Kirushanthi  Marketing
Anna         Marketing
4
© e-Learning Centre, UCSC
0
Activity

1) Give horizontally fragmented relations (with data) for the Project


relation given below so that projects with budgets less than
150,000 are separated from projects with budgets greater than
or equal to150,000. Express the fragmentation conditions using
relational algebra for each fragment (with data).
2) Indicate how the original relation would be reconstructed.

ProjNo ProjName Budget Location

23 Books 100000 Colombo

4 Goods 50000 Galle

65 Furniture 75000 Jaffna

87 Clothes 200000 Matara

© e-Learning Centre, UCSC


Activity

1) Give a vertical fragmentation of the above Project relation into two


sub-relations (with data), so that one contains only the information
about project budgets (i.e. ProjNo, Budget), whereas the other
contains project names and locations (i.e. ProjNo, ProjName,
Location). Express the fragmentation condition using relational
algebra for each fragment.
2) Indicate how the original relation would be reconstructed.

ProjNo ProjName Budget Location

P1 Books 100,000 Colombo

P2 Goods 50,000 Galle

P3 Furniture 75,000 Colombo

P4 Clothes 200,000 Kandy

© e-Learning Centre, UCSC


Activity

Fill in the blanks with the most suitable word given.


(vertical, horizontal, mixed, union, outer join, projection, selection)
When we divide a relation based on columns, it is known as
__________ fragmentation while, the relation divided based on rows
know as _______________ fragmentation.
A combination of these 2 fragmentations is referred to as
__________ fragmentation.
The re-constructability of the relation from its fragments ensures that
constraints defined on the data in the form of dependencies are
preserved. A set of vertical fragments can be organized into the
original table using _________ operation.
With ________ operation, we can create the original relation from a
set of horizontal fragments.

© e-Learning Centre, UCSC


4.3 Distributed Database Design Techniques

Replication
• The main purpose of having data replicated in several
nodes is to ensure the availability of data.
• One extreme of data replication is having a copy of the
entire database at every node (full replication).
• The other extreme is not having replication at all. Here,
every data item is stored only at one site. (no replication)

4
© e-Learning Centre, UCSC
4
4.3 Distributed Database Design Techniques
Replication Cont.
• Full replication
  - With full replication, we can achieve a high degree of
    availability. The system can keep running even when
    only one site is up, because every site contains the
    whole DB.
  - Another advantage is improved performance of read
    queries, as results can be obtained by processing a
    query locally at the site where it is submitted.
  - However, there are drawbacks to full replication.
  - One is degraded write performance, because each update
    must be performed at every copy of the data to maintain
    consistency.
  - Full replication also makes concurrency control and
    recovery techniques more complex and expensive.
4
© e-Learning Centre, UCSC
5
4.3 Distributed Database Design Techniques

Replication Cont.
• No replication
  - When there is no replication, all fragments must be
    disjoint (no tuple of a relation R appears at more than
    one site), except that the primary key is repeated in
    vertical and mixed fragments.
  - This is also known as non-redundant allocation.
  - It is suitable for systems with high write traffic.
  - A lower degree of availability is the disadvantage of
    no replication.

4
© e-Learning Centre, UCSC
6
4.3 Distributed Database Design Techniques

Replication Cont.
• To get a balance between the pros and cons we
discussed, we can select a degree of replication suitable
for our application.
• Some fragments of the database may be replicated, and
others may not according to the requirements.
• It is also possible to have some fragments replicated in all
the nodes in the distributed system.
• Any way, all the replicas should be synchronized when an
update is taken place.

4
© e-Learning Centre, UCSC
7
4.3 Distributed Database Design Techniques

Allocation
• In a DDB, there cannot be any fragment (or copy of a
  fragment) that is not assigned to a site.
• The process of distributing data over the nodes is known
  as data allocation.
• The decisions about which site holds each fragment and
  how many replicas of each fragment are kept depend on
  the,
- Performance requirement of the system
- Types of transactions
- Availability goals
- Transaction frequency

4
© e-Learning Centre, UCSC
8
4.3 Distributed Database Design Techniques

Allocation Cont.
Consider the following scenarios and the suggested
allocation mechanisms:
• Requires high availability of the system with high number
of retrievals,
- Recommend to have a fully replicated database.
• Requires to retrieve a subsection of data frequently,
-Recommend to allocate the required fragment into
multiple sites.
• Requires to perform a higher number of updates,
-Recommend to have a less number of replicas.
However, It is hard to find an optimal solution to distributed
data allocation since it is a complex optimization problem.
4
© e-Learning Centre, UCSC
9
Activity

Select advantages of data replication in DDB.

1. Improves availability of data.


2. Improves performance of data retrieval.
3. Improves performance of data write operations.
4. Slow down update queries.
5. Hard recovery.
6. Expensive concurrency control.
7. Slow down select queries.
8. Easy notions used for data query.

© e-Learning Centre, UCSC


Activity

Mark the following statements as true (T) and false (F).


1. With data replication, we can have multiple copies of the same
data item in many sites. ( )
2. Data replication would slow down the read and write operations
of a database. ( )
3. There can be some data fragments which are replicated in all
nodes of the distributed database. ( )
4. The number of copies created for each fragment should be
equal. ( )
5. In a replicated system, there can be fragments which are not
replicated in another site. ( )
6. For a system with frequent updates, it is advised to use a larger
number of replications. ( )

© e-Learning Centre, UCSC


4.3 Distributed Database Design Techniques

Distribution Models
When the data volume increases, we can add more nodes
within our distributed database system to handle it. There are
different models for distributing data among these nodes.
1. Single server
• This is the minimum form of distribution and most often the
recommended option.
• Here, the database will be running in a single server without
any distribution.
• Since all read and write operations occur at a single node, it
would reduce the complexity by making the management
process easy.

5
© e-Learning Centre, UCSC
2
4.3 Distributed Database Design Techniques

Distribution Models Cont.


2. Sharding
• A database can get busy when several users access
different data in different parts of the database at the
same time.
• This can be solved by splitting data into several parts and
storing them in different nodes. This is called sharding.

[Figure: data items A, B, C, and D split across separate nodes
(shards).]

5
© e-Learning Centre, UCSC
3
4.3 Distributed Database Design Techniques

Distribution Models Cont.


• In the best-case scenario of sharding, different users will
access different parts of the database stored in separate
nodes, so that each user will only communicate with a
single node.
• This technique will help in load balancing.
• It is necessary to segregate data correctly, for this
technique to be effective. Data that are accessed together
should be stored in a single node.

5
© e-Learning Centre, UCSC
4
4.3 Distributed Database Design Techniques

Distribution Models Cont.


• The following considerations should be made when
segregating (or sharding) the data
- Location: Place data close to the physical location of
access.
- Load Balancing: Make sure that each node will get
approximately similar number of requests.
- Order of access: Aggregates that will be read in
sequence can be stored in a single node.

5
© e-Learning Centre, UCSC
5
4.3 Distributed Database Design Techniques

Distribution Models Cont.


• Auto-sharding is a feature given by most of the NoSQL
databases, where the responsibility of splitting and storing
data is given to the database itself, ensuring that data goes
to the correct shard.
• Sharding can improve read performance as well as write
  performance.
  - Replication and caching mainly improve read performance.
  - Sharding improves write performance by scaling writes
    horizontally across nodes.
• It is hard to achieve reliability only with the use of sharding.
• To improve reliability, it is necessary to use data replication
along with sharding. Otherwise, even though the data can
be accessed from different nodes, a failure of a node can
make the shard unavailable.

5
© e-Learning Centre, UCSC
6
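A very simplified added illustration of shard routing follows (hypothetical,
hash-based; real NoSQL stores use more sophisticated auto-sharding and
rebalancing).

    NODES = ['node-A', 'node-B', 'node-C', 'node-D']

    def shard_for(key: str) -> str:
        """Route an aggregate to a node by hashing its key."""
        return NODES[hash(key) % len(NODES)]

    def put(store, key, value):
        store.setdefault(shard_for(key), {})[key] = value

    def get(store, key):
        return store.get(shard_for(key), {}).get(key)

    cluster = {}
    put(cluster, 'customer:1001', {'name': 'Kasun', 'city': 'Colombo'})
    print(get(cluster, 'customer:1001'))
    print(shard_for('customer:1001'))   # the single node this key lives on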
4.3 Distributed Database Design Techniques

Distribution Models Cont.

3. Master-slave replication
• In this model, one node is selected as the master (primary)
  and is considered the authoritative source for the data.
• The master is the node responsible for processing updates.
• All the other nodes are treated as slaves (secondary).
• A synchronization process keeps the data on the slaves in
  sync with the master.

[Figure: a master node replicating to slave nodes; reads are done
on the master or the slaves.]

5
© e-Learning Centre, UCSC
7
4.3 Distributed Database Design Techniques
Distribution Models Cont.
• Master-slave model is suitable for a system with read-
intensive dataset.
• By adding more slaves, you can increase the efficiency of
read operations since the read requests can be processed
by any slave node.
• However, there is still a limitation on writes, because only
master can process the writes to the database.
• If the master fails, it should be recovered or a slave node
has to be appointed as the new master.

5
© e-Learning Centre, UCSC
8
4.3 Distributed Database Design Techniques
Distribution Models Cont.
• Appointment of the new master can be either an
  automatic or a manual process.
• The disadvantage of having replicated nodes is the
  inconsistency that may arise between nodes.
• If changes have not yet been propagated to all the slave
  nodes, different clients reading from different slave nodes
  may see different values.

5
© e-Learning Centre, UCSC
9
4.3 Distributed Database Design Techniques

Distribution Models Cont.


4. Peer-to-peer replication
• In master-slave model, the master is still a bottleneck and
a single point of failure.
• In peer-to-peer model, there is no master and all the
nodes are of the equal weight.
• All the replicas can accept writes. Due to this reason, there
will be no loss of access to data due to failure of a single
node.
• However, with this model, we have to accept the problem of
inconsistency.
• After you write on a node, two users who are accessing
that changed data item from different nodes may read
two different values until data propagation is completed.

6
© e-Learning Centre, UCSC
0
4.3 Distributed Database Design Techniques

Distribution Models Cont.


• One solution to this inconsistency problem is to coordinate the replicas, so that the nodes are synchronized whenever a write operation is performed.
• The other solution is to accept temporarily inconsistent writes and cope with them.

(Figure: peer-to-peer replication - reads and writes are done on all nodes.)

6
© e-Learning Centre, UCSC
1
4.3 Distributed Database Design Techniques

Distribution Models Cont.


Combining sharding and replication
• We can use both master-slave replication and sharding
together.
- In that approach, we have multiple masters. But
there is only one master for each data item.
• Also, we can combine peer-to-peer replication and
sharding.
- A common application of this can be seen in
column-family databases.

6
© e-Learning Centre, UCSC
2
Activity

Mark the following statements as true (T) and false (F).


1. Scaling up means adding larger data servers with higher storage capacity to cater for the increasing data storage requirement. ( )
2. Scaling out is the process of running the database on a cluster of servers. ( )
3. We can ensure reliability of a DDB by using the technique of sharding. ( )
4. Single server is the most recommended distribution model. ( )
5. Read resilience is one of the advantages of the master-slave replication model. ( )

© e-Learning Centre, UCSC


4.4 Query Processing and Optimization in
Distributed Databases

1. Query Mapping  →  2. Localization  →  3. Global Query Optimization  →  4. Local Query Optimization

These are the steps involved in distributed query processing. We will discuss each step in detail.

6
4

© e-Learning Centre, UCSC


4.4 Query Processing and Optimization in
Distributed Databases

Step 01: Query Mapping.


• The query inserted is specified in query language.
• Then it is translated into an algebraic query.
• The translation process is referred to global conceptual
schema; here it does not consider the replicas and
shards.
• The algebraic query is then normalized and analyzed for
semantic errors.
• This step is performed at a central control site.

6
© e-Learning Centre, UCSC
5
4.4 Query Processing and Optimization in
Distributed Databases

Step 02: Localization.


• In this phase, the distributed query in global schema is
mapped to separate queries on fragments.
• For this, data distribution and fragmentation details are
used.
• performed at a central control site.

6
© e-Learning Centre, UCSC
6
4.4 Query Processing and Optimization in
Distributed Databases

Step 03: Global Query Optimization.


• Optimization is selecting the optimal strategy from a list of
candidate strategies.
• These candidate strategies can be obtained by permuting
the order of operations generated in previous step.
• To measure the cost associated with each set of
operations, we use the execution time.
• The total cost is calculated using costs such as CPU cost,
I/O costs, and communication costs.
• Since the nodes are connected via network in a DDB, the
most significant cost is for the communication between
these nodes.

6
© e-Learning Centre, UCSC
7
4.4 Query Processing and Optimization in
Distributed Databases

Step 04: Local Query Optimization.


• This stage is common to all sites in the DDB.
• The techniques are similar to those used in centralized
systems.
• performed locally at each site.

6
© e-Learning Centre, UCSC
8
4.4 Query Processing and Optimization in
Distributed Databases

• In comparison to a centralized database system, in a


distributed database there is additional complexity
involved in query processing.
• One is the cost of transferring data among sites.
• Intermediate files or the final result set can be transferred
in between nodes via the network.
• Reducing the amount of data to be transferred among
nodes is considered as an optimization criteria in the
query optimization algorithms used in DDBMS.

6
© e-Learning Centre, UCSC
9
4.4 Query Processing and Optimization in
Distributed Databases
Example
Suppose Employee table and Department table are stored at node
01 and node 02 respectively. Results are expected to be presented
in node 03.

Node 01 - Employee relation: size of one record = 100 bytes, no. of records = 10,000
Node 02 - Department relation: size of one record = 35 bytes, no. of records = 100
Node 03 - the site at which the results are to be presented
7
© e-Learning Centre, UCSC
0
4.4 Query Processing and Optimization in
Distributed Databases
Example
According to the details given, let’s calculate the size of each
relation.

No. of records in Employee relation = 10000


size of 1 record in Employee relation= 100
Size of the Employee relation = 100*10000 = 1000000 bytes

No. of records in Department relation = 100


size of 1 record in Department relation= 35
Size of the Department relation = 100*35 =3500 bytes

7
© e-Learning Centre, UCSC
1
4.4 Query Processing and Optimization in
Distributed Databases
The sizes of attributes in Employee and Department relations are given
below.

EMPLOYEE (Fname, Lname, Ssn, Bdate, Address, Sex, Salary, Dno)

The Fname field is 15 bytes long, the Lname field is 15 bytes long, and the Address field is 10 bytes long.

DEPARTMENT (Dname, Dnumber, Mgr_ssn, Mgr_start_date)

The Dnumber field is 4 bytes long, the Dname field is 10 bytes long, and the Mgr_ssn field is 9 bytes long.

7
© e-Learning Centre, UCSC
2
4.4 Query Processing and Optimization in
Distributed Databases

Assume we want to write a query to retrieve the first name, last name, and department name of each employee.
We can represent it in relational algebra as follows. Let's call this query Q.

Q: π Fname, Lname, Dname (EMPLOYEE ⋈ Dno=Dnumber DEPARTMENT)

7
© e-Learning Centre, UCSC
3
4.4 Query Processing and Optimization in
Distributed Databases

Q: π Fname, Lname, Dname (EMPLOYEE ⋈ Dno=Dnumber DEPARTMENT)

We will discuss 3 strategies to execute this distributed query .


Method 1
Explanation Transfer data in the EMPLOYEE relation and
the DEPARTMENT relation into the result site (node 03). Then
perform the join operation at node 3.
Calculation
Total no. of bytes to be transferred= Size of the Employee
relation + Size of the Department relation
= 1,000,000 + 3,500
= 1,003,500 bytes
7
© e-Learning Centre, UCSC
4
4.4 Query Processing and Optimization in
Distributed Databases
Method 2
Explanation: Transfer the EMPLOYEE relation to site 2, execute the join at site 2, and send the result to site 3.
Calculation:
Total no. of bytes to be transferred = size of the Employee relation + size of the query result
= 1,000,000 + (40 * 10,000)
= 1,400,000 bytes

Note: One record of the result consists of Fname (15 bytes), Lname (15 bytes) and Dname (10 bytes), i.e. 40 bytes altogether. There are 10,000 records in the result, so the size of the result is 40 * 10,000 bytes.
7
© e-Learning Centre, UCSC
5
4.4 Query Processing and Optimization in
Distributed Databases

Method 3
Explanation Transfer the DEPARTMENT relation to
site 1. Execute the join at site 1. Send the result to site 3.
Calculation
Total no. of bytes to be transferred = Size of the Department
table + size of the query result

= 3,500 + (40 * 10,000)

= 403,500 bytes

7
© e-Learning Centre, UCSC
6
4.4 Query Processing and Optimization in
Distributed Databases

When considering the three methods we discussed,


Total no. of bytes to be transferred in method 1 = 1,003,500
Total no. of bytes to be transferred in method 2 = 1,400,000
Total no. of bytes to be transferred in method 3 = 403,500

The least amount of data transfer occurs in method 3.


Therefore, we choose method 3 as the optimal solution, since it
transfers the minimum amount of data.
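The comparison above can be reproduced with a short calculation; the relation and result sizes are the ones given in the example.

    # Data transferred (bytes) by the three strategies in the example.
    EMPLOYEE = 10_000 * 100            # 1,000,000 bytes
    DEPARTMENT = 100 * 35              # 3,500 bytes
    RESULT = 10_000 * (15 + 15 + 10)   # 400,000 bytes (Fname + Lname + Dname)

    methods = {
        "Method 1 (both relations to node 3)": EMPLOYEE + DEPARTMENT,
        "Method 2 (EMPLOYEE to node 2, result to node 3)": EMPLOYEE + RESULT,
        "Method 3 (DEPARTMENT to node 1, result to node 3)": DEPARTMENT + RESULT,
    }
    for name, cost in methods.items():
        print(f"{name}: {cost:,} bytes")
    print("Minimum data transfer:", min(methods, key=methods.get))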

7
© e-Learning Centre, UCSC
7
Activity

Suppose STUDENT table is stored in site 1 and COURSE table


is stored in site 2. The tables are not fragmented and the results
are stored in site 3. Every student is assigned to only one course.
STUDENT(Sid, StudentName, Address, Grade, CourseID)
1000 records, each record is 50 bytes long
Sid: 5 bytes, StudentName;10 bytes, Address: 20 bytes

COURSE( Cid, CourseName)


500 records, each record is 30 bytes long
Cid: 5 bytes, CourseName:10 bytes

Query: Retrieve the Student Name and Course Name which the
student is following.
Write the relational algebra for the above query.

© e-Learning Centre, UCSC


Activity
Suppose STUDENT table is stored in site 1 and COURSE table is
stored in site 2. The tables are not fragmented and the results
are stored in site 3. Every student is assigned to one course.
STUDENT(Sid, StudentName, Address, Grade, CourseID)
1000 records, each record is 50 bytes long
Sid: 5 bytes, StudentName;10 bytes, Address: 20 bytes

COURSE( Cid, CourseName)


500 records, each record is 30 bytes long
Cid: 5 bytes, CourseName:10 bytes

Query: Retrieve the Student Name and Course Name which the
student is following.
If we are to transfer STUDENT and COURSE relations into node
3 and perform join operation, how many bytes need to be
transferred? Explain your answer.
© e-Learning Centre, UCSC
Activity
Suppose STUDENT table is stored in site 1 and COURSE table is
stored in site 2. The tables are not fragmented and the results
are stored in site 3. Every student is assigned to one course.
STUDENT(Sid, StudentName, Address, Grade, CourseID)
1000 records, each record is 50 bytes long
Sid: 5 bytes, StudentName;10 bytes, Address: 20 bytes

COURSE( Cid, CourseName)


500 records, each record is 30 bytes long
Cid: 5 bytes, CourseName:10 bytes

Query: Retrieve the Student Name and Course Name which the
student is following.
If we are to transfer STUDENT table into site 2, and then execute
join and send result into site 3, how many bytes need to be
transferred? Explain your answer.
© e-Learning Centre, UCSC
Activity
Suppose STUDENT table is stored in site 1 and COURSE table is
stored in site 2. The tables are not fragmented and the results
are stored in site 3.
STUDENT(Sid, StudentName, Address, Grade, CourseID)
1000 records, each record is 50 bytes long
Sid: 5 bytes, StudentName;10 bytes, Address: 20 bytes

COURSE( Cid, CourseName)


500 records, each record is 30 bytes long
Cid: 5 bytes, CourseName:10 bytes

Query: Retrieve the Student Name and Course Name which the
student is following.
If we are to transfer COURSE table into site 1, and then execute
join and send result into site 3, how many bytes need to be
transferred? Explain your answer.
© e-Learning Centre, UCSC
4.5 NoSQL Characteristics related to Distributed
Databases and Distributed System

1. Scalability
• NoSQL databases are typically used in applications with
high data growth.
• Scalability is the potential of a system to handle a growing
amount of data.
• In Distributed Databases, there are two strategies for
scaling a system.
- Horizontal scalability: When the amount of data
increases, distributed system can be expanded by
adding more nodes into the system.
- Vertical scalability: Increasing the storage capacity of
existing nodes.
• It is possible to carry out horizontal scalability while the
system is on operation. We can distribute the data among
newly added sites without disturbing the operations of
system.
8
© e-Learning Centre, UCSC
2
4.5 NoSQL Characteristics related to Distributed
Databases and Distributed System
2. Availability, Replication and Eventual Consistency:
• Most of the applications that are using NoSQL DBs,
require availability.
• It is achieved by replicating data in several nodes.
• With this technique, even if one node fails, the other
nodes who have the replication of same data will
response to the data requests.
• Read performance is also improved by having replicas.
When the number of read operations are higher, clients
can access the replicated nodes without making a single
node busy.

8
© e-Learning Centre, UCSC
3
4.5 NoSQL Characteristics related to Distributed
Databases and Distributed System
Availability, Replication and Eventual Consistency Cont.:
• But having replications may not be effective for write
operations because after a write operation, all the nodes
having same data item should be updated in order to keep
the system consistent.
• Due to this requirement of updating all the nodes with the
same data item, the system can get slower.
• However, most of the NoSQL applications prefer eventual
consistency.
• Eventual consistency will be discussed in next slide.

8
© e-Learning Centre, UCSC
4
4.5 NoSQL Characteristics related to Distributed
Databases and Distributed System
Availability, Replication and Eventual Consistency Cont.:
• Eventual Consistency
This means that at any time there may be nodes with
replication inconsistencies but if there are no further updates,
eventually all the nodes will synchronise and will be updated
to the same value.

For example, if Kamal updates the value of Z to 10, it will be updated on node A. If Saman then reads the value of Z from node B, the value will not yet be 10, because the change has not propagated from node A to node B. After some time, when Saman reads Z from node B again, it will have the value 10. Saman thus eventually sees the change made by Kamal; this is eventual consistency.
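A toy sketch of the Kamal/Saman scenario above: node A accepts the write immediately, and node B only sees the new value after the (simulated) propagation step runs. The two-dictionary model is an assumption made purely for illustration.

    # Toy model of eventual consistency: two replicas and delayed propagation.
    node_a = {"Z": 5}
    node_b = {"Z": 5}
    pending = []                        # updates not yet propagated to node B

    def write_to_a(key, value):
        node_a[key] = value
        pending.append((key, value))    # will reach node B later

    def propagate():
        while pending:
            key, value = pending.pop(0)
            node_b[key] = value

    write_to_a("Z", 10)                 # Kamal's update hits node A
    print(node_b["Z"])                  # Saman still reads the old value: 5
    propagate()                         # replication catches up
    print(node_b["Z"])                  # now Saman eventually reads 10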
8
© e-Learning Centre, UCSC
5
4.5 NoSQL Characteristics related to Distributed
Databases and Distributed System
3. Replication Models:
• The main replication models that are used in NoSQL context
is master-slave and master-master replication.
- Master-slave replication: The primary node refers to as
master is responsible for all write operations. Then the
updates are propagated to slave nodes keeping the
eventual consistency. There can be different techniques
for read operation. One option is making all reads on
master node. Another option would be making all reads
on slave nodes. But with this second option, there is no
guarantee for all reads to have the same value on all
data item after accessing several nodes. (Because the
system gets consistent eventually)

8
© e-Learning Centre, UCSC
6
4.5 NoSQL Characteristics related to Distributed
Databases and Distributed System
Replication Models Cont.:
- Master-master replication: All the nodes are treated
similarly. Reads and writes can be performed on any
of the nodes. But it is not assured that all reads done
on different nodes see the same value. Since it is
possible for multiple users to write on a single data
item at the same time, system can be temporarily
inconsistent.

8
© e-Learning Centre, UCSC
7
4.5 NoSQL Characteristics related to Distributed
Databases and Distributed System

4. Sharding of Files:
• We have discussed the concept sharding in slide 55.
• In many NoSQL applications, there can be millions of data
records accessed by thousands of users concurrently.
• Effective responses can be provided by storing partitions of
data in several nodes.
• By using the technique called sharding (horizontal
partitioning), we can distribute the load across multiple
sites.
• Combination of sharding and replication improves load
balancing and data availability.

8
© e-Learning Centre, UCSC
8
4.5 NoSQL Characteristics related to Distributed
Databases and Distributed System
5. High-Performance Data Access:
• In many NoSQL applications, it might be necessary to find
a single data value or a file among billions of records.
• To achieve this, techniques such as hashing and range partitioning are used (a small lookup sketch is given below).
- Hashing: a hash function h(K) applied to a given key K gives the location of the object.
- Range partitioning: an object's location is determined by the range its key value falls into. For example, location i holds the objects whose key values K are in the range K_i_min ≤ K ≤ K_i_max.
• Other indexes can be used to locate objects based on conditions on attributes other than the key K.
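A minimal sketch of the two lookup techniques; the four locations and the key ranges are assumptions chosen for illustration.

    # Locating an object by its key: hashing vs. range partitioning.
    NUM_LOCATIONS = 4

    def hash_location(key: int) -> int:
        # h(K): the key alone determines the storage location.
        return key % NUM_LOCATIONS

    # Location i holds the keys K with K_i_min <= K <= K_i_max.
    RANGES = [(0, 249), (250, 499), (500, 749), (750, 999)]

    def range_location(key: int) -> int:
        for i, (k_min, k_max) in enumerate(RANGES):
            if k_min <= key <= k_max:
                return i
        raise ValueError("key falls outside every partition")

    print(hash_location(642))    # 2
    print(range_location(642))   # 2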

8
© e-Learning Centre, UCSC
9
Activity
Fill in the blanks with the most suitable word given.
(horizontal, vertical, eventual consistency, consistency, master,
slave, availability, usability)

__________ scalability can be performed while the system is on


operation.
A relaxed form of consistency preferred by most of the NoSQL
systems is known as _____________.
In master-slave replication model, ___________ is used as the
source of write operations.
Load balancing and ________ can be achieved in a system which
uses the combination of sharding and replication.

© e-Learning Centre, UCSC


Summary

• Distributed Database Concepts, Components and Advantages
• Types of Distributed Database Systems
• Distributed Database Design Techniques - Fragmentation, Replication and Allocation, Distribution Models

9
© e-Learning Centre, UCSC
1
Summary

• Query Processing and Optimization in Distributed Databases - Distributed Query Processing, Data Transfer Costs of Distributed Query Processing
• NoSQL Characteristics related to Distributed Databases and Distributed Systems

9
© e-Learning Centre, UCSC
2
5 : Consistency and Transaction Processing
Concepts
IT3306 – Data Management
Level II - Semester 3

© e-Learning Centre, UCSC


Overview

• This lesson on consistency and transaction processing


defines what a transaction is and its properties.
• Here we look at schedules and serializability.
• Finally, we explore transaction support in SQL and
maintaining consistency in NoSQL.

© e-Learning Centre, UCSC 2


Intended Learning Outcomes

At the end of this lesson, you will be able to;


• Understand what transaction processing is
• Define properties of transactions
• Understand schedules and serializability
• Identify different types of serializability techniques
• Explain transaction support in SQL
• Understand consistency in NoSQL

© e-Learning Centre, UCSC 3


List of subtopics
5.1. Introduction to Transaction Processing
5.1.1. Single-user systems, multi-user systems and
Transactions
5.1.2. Transaction states
5.1.3. Problems in concurrent transaction processing,
introduction to concurrency control, DBMS failures, introduction
to data recovery

5.2. Properties of Transactions


5.2.1. ACID properties, levels of isolation
5.3. Schedules
5.3.1. Schedules of Transactions
5.3.2. Schedules Based on Recoverability
5.4. Serializability
5.4.1. Serial, Nonserial, and Conflict-Serializable Schedules

© e-Learning Centre, UCSC 4


List of subtopics

5.4.2. Testing for Serializability of a Schedule


5.4.3. Using Serializability for Concurrency Control
5.4.4. View Equivalence and View Serializability
5.5. Transaction Support in SQL
5.6. Consistency in NoSQL
5.6.1. Update Consistency
5.6.2. Read Consistency
5.6.3. Relaxing Consistency
5.6.4. CAP theorem
5.6.5. Relaxing Durability and Quorums
5.6.6. Version Stamps

© e-Learning Centre, UCSC 5


5.1.1. Single-user systems, Multi-user systems and
Transactions

Databases can be classified based on the number of


concurrent users.
• Single - User Systems - Database can be accessed by
one user at a time. Most commonly these are used by
personal computer systems.
• Multi - User Systems - Database can be accessed by
many users at the same time. This is the concurrent use
of database. Database systems used in airline reservation
systems, supermarkets, hospitals, banks and stock
exchange systems are accessed by hundreds or
thousands of users at the same time.

© e-Learning Centre, UCSC 6


5.1.1. Single-user systems, Multi-user systems and
Transactions
• Multiprogramming is the concept behind this
simultaneous access of the database by several users. In
multiple programming, operating system of the computer
is allowed to execute multiple programmes at a time.
• In Central Processing Unit (CPU), only one process can
be executed at a time.
• Hence, in multiprogramming systems, CPU executes set
of commands from one process and then suspends it and
again executes a set of commands from another process.
• A suspended process will resume again from the point
where it was suspended when it gets the chance to use
the CPU again. This pattern continues to keep running
multiple processes.
• This way, the actual process of concurrent execution is
interleaved.
© e-Learning Centre, UCSC 7
5.1.1. Single-user systems, Multi-user systems and
Transactions

(Figure: (i) interleaved processing of two processes on a single CPU; (ii) parallel processing of processes C and D on separate CPUs.)

© e-Learning Centre, UCSC 8


5.1.1. Single-user systems, Multi-user systems and
Transactions
• When interleaving is not allowed, several problems may
occur as given below:
⁻ Some processes have to remain idle if the active
process wants to execute I/O operations, like reading
a block from disk. The reason is that it is not allowed
to switch CPU to execute another process.
⁻ Some delays could occur since some processes have
to wait until long processes finish execution.
• When there are multiple CPUs present in a computer,
parallel processing can take place as shown in C,D of
the figure (ii) in the previous diagram.
• In general, the theories on concurrency control of DBMS
are based on interleaved concurrency .

© e-Learning Centre, UCSC 9


5.1.1. Single-user systems, Multi-user systems and
Transactions

• A transaction is an executing program that forms a


logical unit of database processing. Therefore, a
transaction is defined as a logical unit of database
operations.
• A transaction may consist of one or more database
access operations. These include insertion, deletion,
modification (update), or retrieval operations.
• The database operations that form a transaction can
either be embedded within an application program or they
can be specified interactively via a high-level query
language such as SQL.

1
© e-Learning Centre, UCSC
0
5.1.1. Single-user systems, Multi-user systems and
Transactions

• The transaction boundaries can be specified with explicit


begin transaction and end transaction statements in an
application program.
• In this case, all database access operations between
these two statements are considered as forming one
transaction.
• If the database operations in a transaction do not update
the database but only retrieve data, the transaction is
called a read-only transaction; otherwise, it is known as a
read-write transaction.

1
© e-Learning Centre, UCSC
1
5.1.1. Single-user systems, Multi-user systems and
Transactions

• Concurrency control and recovery mechanisms are mainly


concerned with the database commands in a transaction.
• Transactions submitted by the various users may execute
concurrently and may access and update the same
database items.
• If the concurrent execution is uncontrolled, it may lead to
problems, such as an inconsistent database.

1
© e-Learning Centre, UCSC
2
5.1.2. Transaction States

• The system needs to keep track of when each transaction


starts, terminates, and commits/aborts for recovery
purposes.
• Therefore, following operations need to be tracked by the
recovery manager of the DBMS.
‒ BEGIN_TRANSACTION
Marks the start of executing a transaction.
‒ READ or WRITE
Defines read or write operations on the database
‒ END_TRANSACTION
Indicates the end of all READ/WRITE operations and
characterizes the end of transaction execution.

1
© e-Learning Centre, UCSC
3
5.1.2. Transaction States

• At the end of a transaction, it might be necessary to check


whether the transaction is committed or aborted.
• COMMIT_TRANSACTION
- Indicates successful completion of a transaction and
capability to safely commit the updates resulted by
the transaction to the database. These updates
made to the database will not be undone.
• ROLLBACK or ABORT
- Indicates the end of an unsuccessful transaction. Any
change/ update to the database made by the aborted
transaction must be undone.

1
© e-Learning Centre, UCSC
4
5.1.2. Transaction States

The following state transition diagram illustrates how a transaction moves through its execution states.

(Figure: state transition diagram. Begin Transaction → Active; Read/Write operations keep the transaction in the Active state; End Transaction → Partially Committed; Commit → Committed → Terminated; Abort from Active or from Partially Committed → Failed → Terminated.)

1
© e-Learning Centre, UCSC
5
5.1.2. Transaction States

• Just after the start of a transaction, it goes into active


state. At this state, the transaction executes its read and
write operations.
• Once the transaction ends, it shifts to partially committed
state. Some concurrency control protocols, and additional
checks might be applied to find out whether the
transaction can be committed or not.
• Further, recovery protocols are needed to ensure that a system failure will not result in an inability to record the changes of the transaction permanently in the database.

1
© e-Learning Centre, UCSC
6
5.1.2. Transaction States

• If the above checks are successful, it goes to the


committed state where all the updates are successfully
recorded into the database. Otherwise, the transaction will
be aborted and goes to the failed state, where all the
updates should be rolled back.
• Also, a transaction might go to the failed state if it was
aborted during the active state.
• In the terminated state, the transaction leaves the system.
• Failed or aborted transactions might start again
automatically or through a resubmission done by the user.
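A compact sketch (not from the slides) that encodes the transitions of the state diagram as a lookup table; the event names are assumptions chosen to mirror the labels on the arrows.

    # Transaction state transitions from the diagram above.
    TRANSITIONS = {
        "active":              {"end_transaction": "partially_committed",
                                "abort": "failed"},
        "partially_committed": {"commit": "committed",
                                "abort": "failed"},
        "committed":           {"terminate": "terminated"},
        "failed":              {"terminate": "terminated"},
    }

    def step(state: str, event: str) -> str:
        return TRANSITIONS[state][event]    # KeyError = illegal transition

    state = "active"
    for event in ["end_transaction", "commit", "terminate"]:
        state = step(state, event)
        print(event, "->", state)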

1
© e-Learning Centre, UCSC
7
5.1.3. Problems in Concurrent Execution

Example Transaction
Account balance of A (X) is 1000;
Account balance of B (Y) is 2000;
Transaction T1 - Rs.50 is withdrawn from A and deposited in B.
Transaction T2 - Rs.100 deposited to account A.
T1 (N = 50)                          T2 (M = 100)
read_item(X);    A = 1000            read_item(X);    A = 1000
X := X - N;      A - 50 = 950        X := X + M;      A + 100 = 1100
write_item(X);   A = 950             write_item(X);   A = 1100
read_item(Y);    B = 2000
Y := Y + N;      B = 2000 + 50
write_item(Y);   B = 2050

After T1 and T2 have completed without interleaving, the final values of A and B should be A = 1050 and B = 2050.
1
© e-Learning Centre, UCSC
8
5.1.3. Problems in Concurrent Execution

But when these two transactions execute in an interleaved fashion, the following problems can occur.

fashion, following problems can be occurred.

1.Lost Update Problem


2.Temporary Update (Dirty Read) Problem
3.Incorrect Summary Problem
4.Unrepeatable Read Problem

Let’s discuss these problems in detail.

1
© e-Learning Centre, UCSC
9
5.1.3. Problems in Concurrent Transaction Processing

The Lost Update Problem.


• When two interleaved transactions access the same item from
the database, it would result in an incorrect value for that item.
• Assume the two transactions T1, T2 (example in slide 18) have
been submitted in an interleaved fashion, as shown in the
table.
T1                      T2
read_item(X);  1000
X = X - N;
                        read_item(X);  1000
                        X = X + M;
write_item(X);  950
read_item(Y);  2000
                        write_item(X);  1100
Y = Y + N;
write_item(Y);  2050

Item X has an incorrect value because its update by T1 is lost (overwritten). Thus, the final value of X in this execution is 1100 instead of 1050.
2
© e-Learning Centre, UCSC
0
5.1.3. Problems in Concurrent Transaction Processing

The Lost Update Problem - Example


Suppose there are 2 trains X and Y, which has 80 reservations
for X and 100 reservations for Y. (Refer the next slide for
tabular representation)
• One person submits a cancellation of 8 seats (N=8) for X
train, and do a reservation of 8 seats on train Y. At the
same time another person submits a reservation of 2
seats (M = 2) for train X.
• At the end of these two processes, resulting reservations
should be X=(80-8+2)= 74 and Y=(100+8)=108.

2
© e-Learning Centre, UCSC
1
5.1.3. Problems in Concurrent Transaction Processing
The Lost Update Problem - Example

T1              T2              Value
READ(X)                         X = 80
X = X - N                       X = 80 - 8
                READ(X)         X = 80 (still 80, because the change done by T1 is not yet written)
WRITE(X)                        X = 72
                X = X + M       X = 80 + 2
                WRITE(X)        X = 82
READ(Y)                         Y = 100
Y = Y + N                       Y = 100 + 8
WRITE(Y)                        Y = 108

• After the execution, the results (X = 82 and Y = 108) do not match the expected values (X = 74 and Y = 108).
• The resulting X value is incorrect because the update done by T1 on X is lost: T2 read the old value of X directly from the database.
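A short sketch that replays this interleaving (X = 80, N = 8, M = 2) and reproduces the lost update; the plain dictionary standing in for the database is an assumption for illustration only.

    # Both transactions read X before either write is applied.
    db = {"X": 80}

    x_t1 = db["X"]        # T1: READ(X)  -> 80
    x_t1 = x_t1 - 8       # T1: X = X - N
    x_t2 = db["X"]        # T2: READ(X)  -> still 80
    db["X"] = x_t1        # T1: WRITE(X) -> 72
    x_t2 = x_t2 + 2       # T2: X = X + M
    db["X"] = x_t2        # T2: WRITE(X) overwrites T1's update
    print(db["X"])        # 82, although the correct value is 74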
© e-Learning Centre, UCSC
2
5.1.3. Problems in Concurrent Transaction Processing

The Temporary Update (or Dirty Read) Problem.

• A dirty read happens when one transaction updates a database item, but that transaction then fails before completing.
• In the meantime, some other transaction reads the same database item, which still holds the temporary update and has not yet been rolled back to its original value. This is a dirty read.

2
© e-Learning Centre, UCSC
3
5.1.3. Problems in Concurrent Transaction Processing
The Temporary Update (or Dirty Read) Problem
T1                      T2
read_item(X);
X = X - N;
write_item(X);
                        read_item(X);
                        X = X + M;
                        write_item(X);
read_item(Y);
rollback;

• In the given table, transaction T1 has updated the value of X, and then transaction T2 has read the updated value of X.
• However, at some point T1 fails, and by that time T2 has already read the temporarily updated value of X, which is then rolled back to its old value.
• The value accessed by T2 is a dirty read, because T2 has read a value modified by an incomplete transaction that is never committed.
© e-Learning Centre, UCSC
4
5.1.3. Problems in Concurrent Transaction Processing
The Temporary Update (or Dirty Read) Problem
Example
X = 80; Y = 100; N = 5; M = 4;

T1              T2              Value
READ(X);                        80
X = X - N;                      X = 80 - 5
WRITE(X);                       75
                READ(X);        75
                X = X + M;      X = 75 + 4
                WRITE(X);       79
READ(Y);                        100
ROLLBACK;

The value of X should be 80 once T1 is rolled back, but T2 has read X from the temporary update done by T1.
When T1 is rolled back, X becomes 80 again, yet T2 has read an incorrect value from an uncommitted transaction.
© e-Learning Centre, UCSC
5
5.1.3. Problems in Concurrent Transaction Processing

The Incorrect Summary Problem


• Take an instance where one transaction is getting the
aggregated summary of database items and another
transaction is updating the values of the same
database items. Both these transactions are running in
an interleaved manner.
• This results in some values being not updated yet and
some values being already updated when they are
getting read by the aggregate function.
• Hence, gives a wrong summary.

2
© e-Learning Centre, UCSC
6
5.1.3. Problems in Concurrent Transaction Processing

The Incorrect Summary Problem

T1                      T3
                        sum = 0;
                        read_item(A);
                        sum = sum + A;
read_item(X);
X = X - N;
write_item(X);
                        read_item(X);
                        sum = sum + X;
                        read_item(Y);
                        sum = sum + Y;
read_item(Y);
Y = Y + N;
write_item(Y);

T1 changes the value of X by subtracting N, and T3 reads X after N has been subtracted. However, T3 reads Y before N is added to it; T1 changes Y only afterwards. The change to Y made by T1 is therefore not reflected in the sum, so the resulting sum is wrong.

2
© e-Learning Centre, UCSC
7
5.1.3. Problems in Concurrent Transaction Processing
The Incorrect Summary Problem
X = 80; Y = 100; N = 5; M = 4; A = 5;

T1              T3              Value
                SUM = 0;        0
                READ(A);        5
                SUM += A;       5
READ(X);                        80
X = X - N;                      80 - 5
WRITE(X);                       75
                READ(X);        75
                SUM += X;       5 + 75
                READ(Y);        100
                SUM += Y;       80 + 100
READ(Y);                        100
Y = Y + N;                      100 + 5
WRITE(Y);                       105

T3 reads X after it is updated by T1, so the correct value of X is taken for the sum. But T3 reads Y before it is updated, and hence reads an incorrect value for the sum. The correct sum after reading Y should be 80 + 105, but instead it is 80 + 100, since Y is read as 100 instead of 105.
2
© e-Learning Centre, UCSC
8
5.1.3. Problems in Concurrent Transaction
Processing

The Unrepeatable Read Problem.


• This occurs when one transaction reads a particular
database item twice and get two different values.
• The reason is some other transaction has made an
update on the very same database item between the
two reads.

2
© e-Learning Centre, UCSC
9
5.1.3. Problems in Concurrent Transaction Processing

The Unrepeatable Read Problem - Example
X = 80

T1              T2              Value
READ(X)                         80
                READ(X)         80
                X = X - 5       80 - 5
                WRITE(X)        75
READ(X)                         75

T1 gets two different values when reading the same data item.

3
© e-Learning Centre, UCSC
0
5.1.3. DBMS Failures

• A Transaction is considered as committed, if all the


operations of the submitted transaction are successfully
executed and the effect of that particular transaction on
the database items is permanently recorded.
• If a transaction fails to complete successfully, it is
considered as aborted, where the database gets no
effect.
• When some operations fail to execute in the transaction,
the previous, successful operations relevant to that
transaction should be undone to make sure there is no
effect to the database.

3
© e-Learning Centre, UCSC
1
5.1.3. DBMS Failures

• A computer failure (System Crash).


Occurs due to hardware, software, or network error in
the computer system during transaction execution.
• A transaction or system error.
Occurs due to the errors in operation such as integer
overflow or division by zero. Some other reasons are
inaccurate parameters, logical programming errors and
interruptions from the user.

3
© e-Learning Centre, UCSC
2
5.1.3. DBMS Failures

• Local errors or exception conditions detected by the


transaction.
– Some exceptions in the programme may cause
cancellation of a transaction.
– For instance, the data might not be available to
complete the transaction or the existing values do not
meet the required conditions.
– As an example, we cannot withdraw money from an
account which does not have a sufficient balance.

3
© e-Learning Centre, UCSC
3
5.1.3. DBMS Failures

• Concurrency control enforcement.


- Failures may take place due to the enforcement of
concurrency control.
• Disk failure.
- Disk failures may occur while performing read/write
operations.
- Due to a read or write malfunction, data in some disk blocks may be lost.
- Sometimes this can happen as a result of a crash in
disk read/write head.

3
© e-Learning Centre, UCSC
4
5.1.3. DBMS Failures

• Physical problems and catastrophes.


- This includes numerous problems such as power loss,
failure in air-conditioning, natural disasters, theft,
sabotage and mistakenly overwriting disks or tapes etc.

3
© e-Learning Centre, UCSC
5
5.2. Properties of Transactions

ACID properties
ACID are the properties of transactions which are
imposed by concurrency control and recovery methods of
the DBMS.
ACID stands for
i) A – Atomicity
ii) C – Consistency
iii) I – Isolation
iv) D – Durability
A detailed description of each property is explained in the
upcoming slides.

3
© e-Learning Centre, UCSC
6
5.2. Properties of Transactions

i) Atomicity

• A transaction is an atomic unit with regards to transaction


processing. This infers that a transaction either ought to
be executed completely or not performed at all. Thus, the
atomicity property necessitates that a transaction
executes to its completion.
• The transaction recovery subsystem of a DBMS
guarantees atomicity. In the event that a transaction fails
to finish (for example, a system crash occurs amidst its
execution), the recovery strategy should fix any impacts of
the transaction on the database through undo/redo
operations. Write operations of a committed transaction
must be written to disk.

3
© e-Learning Centre, UCSC
7
5.2. Properties of Transactions

Consider the following transaction T1 as an example to


illustrate the properties of transactions.
Transfer 50 from account A to account B.
A = 1000; B =2000;
T1:
Begin read (A);
A=A-50;
write (A);
read (B);
B=B+50
write (B)
End;
3
© e-Learning Centre, UCSC
8
5.2. Properties of Transactions

Example for Atomicity


• Definition :- Either a transaction is performed in its entirety
or not performed at all.

• When considering T1(in previous slide):


If a transaction failure occurs after write (A), but before
write (B);
then A=950; B=2000; 50 is lost.
Data is now inconsistent with A+B = 2950 instead of
3000.

• Therefore, the transaction should either be fully executed,


or else data should reflect as if the transaction never
started at all.
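A minimal sketch of the same transfer using Python's sqlite3 module (an assumption; the slides do not prescribe any particular DBMS): the two updates either commit together or are both rolled back, so A + B stays 3000.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO account VALUES (?, ?)",
                     [("A", 1000), ("B", 2000)])
    conn.commit()

    try:
        conn.execute("UPDATE account SET balance = balance - 50 WHERE name = 'A'")
        conn.execute("UPDATE account SET balance = balance + 50 WHERE name = 'B'")
        conn.commit()                   # both updates become permanent together
    except Exception:
        conn.rollback()                 # on failure, neither update takes effect

    print(conn.execute("SELECT SUM(balance) FROM account").fetchone()[0])  # 3000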

3
© e-Learning Centre, UCSC
9
5.2. Properties of Transactions

ii) Consistency
• A transaction should be completely executed from
beginning to end without getting interfered by other
transactions to preserve the consistency. A transaction
leads the database from one consistent state to another.
• A database state is a collection of all the data values in the
database at a given point.
• The conservation of consistency is viewed as the
responsibility of the developers who compose the programs
and of the DBMS module that upholds integrity constraints.

4
© e-Learning Centre, UCSC
0
5.2. Properties of Transactions

ii) Consistency
• A consistent state of the database fulfils the requirements
indicated in the schema and other constraints on the
database that should hold.
• If a database is in a consistent state before executing the
transaction, then it will be in a consistent state after the
complete execution of the transaction (assuming that no
interference occurs with other transactions).

4
© e-Learning Centre, UCSC
1
5.2. Properties of Transactions

Example for Consistency


• Definition :- Take database from one consistent state to
another.

• When considering T1:


initially, A = 1000 and B = 2000.
A + B = 3000.

After the transaction T1, A = 950 and B = 2050


A + B = 3000.

Thus, the value of A+B = 3000 should be the same


before and after the transaction.

4
© e-Learning Centre, UCSC
2
5.2. Properties of Transactions

iii) Isolation
• During the execution of a transaction, it should appear
as if it is isolated from other transactions even though
there are many transactions happening concurrently.
• The execution of a transaction should not interfere
with other transactions executing simultaneously.
• The isolation property is authorized by the
concurrency control subsystem of the DBMS.
• In the event that each transaction doesn't make its
write updates apparent to other transactions until it is
submitted, one type of isolation is authorized that
takes care of the temporary update issue.

4
© e-Learning Centre, UCSC
3
5.2. Properties of Transactions

Example for Isolation


• Definition :- Updates not visible to other transactions until
committed

• When considering T1:


• Initially, A = 1000 and B = 2000.

• Between WRITE(A) and WRITE (B) of T1, if another


transaction performs READ(A) and READ(B)
operations, the values seen is inconsistent
(A+B=2950).

4
© e-Learning Centre, UCSC
4
5.2. Properties of Transactions

Levels of Isolation
Before talking about isolation levels, let’s discuss about
database locks.

Database Locks
A database lock is used to "lock" data in a database table so that
only one transaction/user/session may edit it. Database locks are
used to prevent two or more transactions from changing the same
piece of data at the same time.
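An analogy sketch (not a real DBMS lock manager): a mutual-exclusion lock lets only one concurrent writer change the shared row at a time, so no update is lost.

    import threading

    row = {"salary": 50_000}
    row_lock = threading.Lock()

    def add_to_salary(delta: int) -> None:
        with row_lock:                        # acquire the "database lock" on the row
            current = row["salary"]
            row["salary"] = current + delta   # no other writer can interleave here

    threads = [threading.Thread(target=add_to_salary, args=(100,)) for _ in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(row["salary"])                      # always 51000: no lost updates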

4
© e-Learning Centre, UCSC
5
5.2. Properties of Transactions

Levels of Isolation
There have been attempts to define the level of isolation
of a transaction.
• Level 0 (zero) isolation (known as Read
Uncommitted) - If a transaction does not overwrite the
dirty reads of higher-level transactions.
• Level 1 (one) isolation (known as Read Committed) -
If a transaction has no lost updates.
• Level 2 isolation (known as Repeatable Read) - If a
transaction has no lost updates and no dirty reads.
• Level 3 isolation / true isolation (known as Serializable Read) - If a transaction has no lost updates, no dirty reads, and repeatable reads (i.e., no unrepeatable reads).
4
© e-Learning Centre, UCSC
6
5.2. Properties of Transactions

Levels of Isolation
Example for Level 0 (zero) isolation
T1                                    T2
update employee
set salary = salary - 100
where emp_number = 25
                                      select sum(salary)
                                      from employee
                                      commit;
rollback;

T1 updates the salary of one employee by subtracting Rs. 100.


T2 requests the sum of salaries of all employees. Then T2 ends.
T1 rolls back, invalidating the results from T2, since T2 reads a value
updated by an uncommitted transaction.
4
© e-Learning Centre, UCSC
7
5.2. Properties of Transactions

Levels of Isolation

Example for Level 0 (zero) isolation

Therefore, in level 0 Isolation it allows a transaction to read


uncommitted changes. This is also known as a dirty read,
since the new transaction may display results that are later
rolled back with respect to the older transaction.

4
© e-Learning Centre, UCSC
8
5.2. Properties of Transactions
Levels of Isolation
Example for Level 1 isolation
T1                                    T2
update employee
set salary = salary - 100
where emp_number = 25;
                                      select sum(salary)
                                      from employee
                                      where emp_number < 50;
rollback
                                      commit

T1 updates the salary of employee with emp_number 25 by subtracting Rs.


100.
T1 does not commit its update. T2 queries to get the sum of salaries of the
set of employees with emp_number less than 50 in employee table.
However, T1’s update is not captured in this query as it is not committed.
This is because Level 1 isolation does not consider uncommitted
transactions. T1 rolls back. With Level 1 isolation, dirty reads are
prevented. 4
© e-Learning Centre, UCSC
9
5.2. Properties of Transactions
Levels of Isolation
Example for Level 2 isolation
Explanation of the example is given in the next slide.

T1                                    T2
select sum(salary)
from employee
where emp_number < 25
                                      update employee
                                      set salary = salary - 100
                                      where emp_number = 22
                                      commit transaction
select sum(salary)
from employee
where emp_number < 25
commit transaction

5
© e-Learning Centre, UCSC
0
5.2. Properties of Transactions

Levels of Isolation
Example for Level 2 isolation
In the example in previous slide;
T1 queries to get the sum of salaries of employees whose
emp_number is less than 25. T2 updates the salary of the
employee whose emp_number is 22. Then T1 executes the
same query again.
If transaction T2 modifies and commits the changes to the
employee table after the first query in T1, but before the second
one, the same two queries in T1 would produce different
results. Isolation level 2 blocks transaction T2 from executing. It
would also block a transaction that attempted to delete the
selected row. Thus, lost updates and dirty reads are avoided.

5
© e-Learning Centre, UCSC
1
5.2. Properties of Transactions

Phantoms
• If a database table includes a record which was not
present at the start of a transaction but is present at the
end then it is called a phantom record.
• For example, If transaction T2 enters a record to a table
that transaction T1 currently reads (the record also
satisfies the filtering conditions used in T1), then that
record is a phantom because it was not there when T1
started but is there when T1 ends.
• If the equivalent serial order is T1 followed by T2, then the
record should not be seen. But if it is T2 followed by T1,
then the phantom record should be in the result given to
T1.

5
© e-Learning Centre, UCSC
2
5.2. Properties of Transactions

Levels of Isolation
Example for Level 2 isolation
Consider the following example on phantom reads
T1                                    T2
select * from employee
where salary > 45000
                                      insert into employee
                                      (emp_number, salary)
                                      values (19, 50000)
                                      commit transaction
select * from employee
where salary > 45000
commit transaction

5
© e-Learning Centre, UCSC
3
5.2. Properties of Transactions
Levels of Isolation
Example for Level 2 isolation (Phantom reads in Level 2
Isolation)
In the example given in the previous slide;
T1 retrieves the rows from employee table where salaries are more than 45000.
Then T2 inserts a row that meets the criteria given in T1 (an employee whose
salary is greater than 45000) and commits. T1 issues the same query again.
The number of rows retrieved for the same select query in T1 are different when
the isolation level is 2.

The total number of records retrieved by the second select statement is one more than the total number retrieved by the first select statement.
This creates a phantom. Phantoms occur when one transaction reads a set of
rows that satisfy a search condition, and then a second transaction modifies
those data. If the first transaction repeats the read with the same search
conditions, it obtains a different set of rows.
In the above example, T1 sees a phantom row in the second select query.

5
© e-Learning Centre, UCSC
4
5.2. Properties of Transactions
Levels of Isolation
Example for Level 2 isolation (Phantom reads in Level 2
Isolation)

The issue of phantom reads in Level 2 isolation is prevented in Level 3 isolation, which is also known as Serializable Read.
Level 3 isolation has no lost updates, no dirty reads, and repeatable reads in transactions.

Let’s look at how this mitigation is done in Level 3 isolation.


Let’s utilise the same example we used in phantom reads to
discuss the mitigation mechanism.

5
© e-Learning Centre, UCSC
5
5.2. Properties of Transactions

Levels of Isolation
Example for Level 3 isolation
Explanation of the example is given in the next slide.

T1                                    T2
select * from employee
where salary > 45000
                                      insert into employee
                                      (emp_number, salary)
                                      values (19, 50000)
                                      commit transaction
select * from employee
where salary > 45000
commit transaction

5
© e-Learning Centre, UCSC
6
5.2. Properties of Transactions

Levels of Isolation
Example for Level 3 isolation
In the table shown in previous slide;

T1 retrieves a set of rows where salaries are more than 45000,


and holds a database lock. Then T2 inserts a row that meets
this criteria for the query in T1, but must wait until T1 releases
its lock (locked items are only accessed by the transaction
which holds the lock). Thereafter, T1 makes the same query
and gets same results (unlike what we discussed in slide 54).
Then, T1 ends and releases its lock. Now T2 gets its lock,
inserts new row, and ends.
This prevents phantoms.
In Level 3 isolation, database locks are utilized to avoid
phantom reads (slide 45).
5
© e-Learning Centre, UCSC
7
5.2. Properties of Transactions
Snapshot isolation

• Another kind of isolation is called snapshot isolation, which is


utilized in some commercial DBMSs. Several concurrency control
techniques depend on this.

• At the start of a transaction, it sees the data items that it reads


based on the committed values of the items in the database
snapshot (or database state).

• Due to this property, it ensures that the phantom read problem


does not occur (since the database transaction will only see the
records that were committed at the time the transaction starts).

• Therefore, any insertions, deletions, or updates that occur after


starting the transaction, will not be seen by it.

5
© e-Learning Centre, UCSC
8
5.2. Properties of Transactions

Snapshot isolation
Example for Snapshot isolation
T1                                    T2
SELECT *
FROM employee
ORDER BY empID;
                                      INSERT INTO employee
                                      (empID, empname)
                                      VALUES(600, 'Anura');
                                      COMMIT;
SELECT *
FROM employee
ORDER BY empID;
                                      INSERT INTO employee
                                      (empID, empname)
                                      VALUES(700, 'Arjuna');
COMMIT;
SELECT * FROM employee
ORDER BY empID;

Output of the first select query of T1:
empID    empname
100      Upul
200      Manjitha

The first select only returns the data that is available in T1's current snapshot.

5
© e-Learning Centre, UCSC
9
5.2. Properties of Transactions
Snapshot isolation

(T1 and T2 execute the same interleaved schedule as on the previous slide.)

Output of the second select query of T1:
empID    empname
100      Upul
200      Manjitha

The second select statement of T1 produces the same result as the first select, because T1 has not committed yet. The snapshot taken by T1 remains unchanged until it commits.

6
© e-Learning Centre, UCSC
0
5.2. Properties of Transactions

Snapshot isolation
(T1 and T2 execute the same interleaved schedule as on the previous slides.)

Output of the third select query of T1:
empID    empname
100      Upul
200      Manjitha
600      Anura

Since T1 has committed, a new snapshot is taken for the queries that follow. As the first insert of T2 is now committed, the inserted row for 'Anura' can be seen in the snapshot. However, 'Arjuna' is not visible, because that insert has not yet been committed by T2.
6
© e-Learning Centre, UCSC
1
5.2. Properties of Transactions
Snapshot isolation
(T1 and T2 execute the same interleaved schedule as on the previous slides.)

Output of the third select query of T1:
empID    empname
100      Upul
200      Manjitha
600      Anura

As explained in the previous slide, the third select is based on a new snapshot. The new snapshot contains all the commits up to this point, which includes the insertion of Anura; therefore it shows all three records. However, the insertion of Arjuna is not seen, because that second insert in T2 has not been committed.

6
© e-Learning Centre, UCSC
2
5.2. Properties of Transactions

iv) Durability
• Durability or permanency means, once the changes of a
transaction are committed to the database, those changes
must remain in the database and should not be lost.

• Therefore, this property ensures that once the transaction


has completed execution, the updates and modifications
to the database are stored in and written to disk and they
persist even if a system failure occurs.

• These updates now become permanent and are stored in


non-volatile memory. Therefore, effects of the transaction
are never lost.

6
© e-Learning Centre, UCSC
3
5.2. Properties of Transactions
Example for Durability
• Definition : Changes must never be lost because of
subsequent failures (eg: power failure)
• In the transaction T1, if transaction failure occurs after
write (A), but before write (B);
To recover the database,
i. We must remove changes of partially done transactions.
Therefore, the change done on A should be rolled back.
(before crash, A was 950. Then it needs to be rolled
back to 1000)
ii. We need to reconstruct completed transactions.
If the system fails after the commit operation of a
transaction, but before the data could be written on to
the disk, then that transaction needs to be
reconstructed.

The database should keep all its latest updates even if the
system fails. If a transaction commits after updating data, then
the database should have the modified data.
6
© e-Learning Centre, UCSC
4
Activity

Mark the following as true or false.

1. Single - User System databases can be accessed only by


one user at a time.
2. Multi - User Systems enables concurrent use of the
database.
3. In Central Processing Unit (CPU), many processes can be
executed at the same time.
4. Multiprogramming is the concept behind simultaneous
access of the database by several user.
5. In multiprogramming systems, CPU executes one
command from one process and then suspends it and
then executes a set of commands from another process.

6
© e-Learning Centre, UCSC
5
Activity

Identify the problem that would result in the following


transaction processing.
T1 T2

read (x)

x=x-n

read (x)

x=x+m

write (x)

read (y)

write (x)

y=y+n

write (y)

6
© e-Learning Centre, UCSC
6
Activity

Identify the problem that would result in the following


transaction processing.

T1 T2

read (x)

x=x-n

write (x)

read (x)

x=x+m

write (x)

commit

read (y)

abort

6
© e-Learning Centre, UCSC
7
Activity
T1 T2

sum = 0
Identify the problem that
would result in the given read (a)

transaction processing. sum = sum + a

read (x)

x=x-n

write (x)

read (x)

sum = sum + x

read (y)

sum = sum + y

read (y)

y=y+n

write (y)
6
© e-Learning Centre, UCSC
8
Activity

State the main problems in concurrent transaction processing.


a. ________________
b. ________________
c. ________________
d. ________________

6
© e-Learning Centre, UCSC
9
Activity

Fill in the blanks using the correct option.


1. A transaction is considered as ________ , if all the
operations of the submitted transactions are successfully
executed and the effect of that particular transaction on
the database items are permanently recorded.
2. If a transaction did not successfully complete, it is
considered as __________.
3. A transaction is a _________ unit of database operations.
4. When some operations get failed to execute in the
transaction, the previous successful operations should be
________ to make sure there is no effect to the ______ .

7
© e-Learning Centre, UCSC
0
Activity
Drag and drop the matching words for the sentence
1. Problems caused by hardware, software, or network error
that occurs in the computer system during transaction
execution. –
2. Occurs due to the errors in operation such as integer
overflow or division by zero. –
3. Occurs due to some exception in the programme cause
the cancellation of a transaction –
4. Occurs due to read or write malfunction and data in some
disk blocks may get lost –
5. Problems such as power loss, failure in air-conditioning,
natural disasters, theft, sabotage, mistakenly overwriting
disks or tape, and mounting of a wrong tape by the
operator –

7
© e-Learning Centre, UCSC
1
Activity
item table=>
item_no 1 2 3 4 5 6 7

price 5500 2500 4500 2000 3250 4900 2100

A list of item numbers and their prices are given in the above
table. After the two transactions T1, T2 were executed on the
above item table, the output was 14,500.
What can be the least possible isolation level used in T2?
T1                                        T2
update item set price = price - 1000
where item_no = 2;
                                          select sum(price) from item
                                          where item_no < 5;
rollback
                                          commit

7
© e-Learning Centre, UCSC
2
Activity

State the four main properties of a transaction.


1. ____________
2. ____________
3. ____________
4. ____________

7
© e-Learning Centre, UCSC
3
5.3 Schedules

Schedules of Transactions
• A schedule is an ordering of the operations of a set of transactions.
  S = T1, T2, T3, ..., Tn
• A schedule can be interleaved, executing operations from different transactions.
• But every transaction Ti that appears in schedule S must have its operations appear in the same order in which they execute when Ti runs in isolation.

7
© e-Learning Centre, UCSC
4
5.3 Schedules
Schedules of Transactions
• In this slide set, we will be using a set of notations for
the operations included in a transaction and to identify
the transaction number we will be adding a subscript.
• Following are the notations and their descriptions, that
we use in this slide set.

b begin_transaction

r read_item

w write_item

e end_transaction

c commit

a abort

7
© e-Learning Centre, UCSC
5
5.3 Schedules

Sa: r1(X); r2(X); w1(X); r1(Y); w2(X); w1(Y);

The above schedule can be arranged into a tabular form by separating the operations of the two transactions T1 and T2.

T1            T2
r1(X)
              r2(X)
w1(X)
r1(Y)
              w2(X)
w1(Y)

7
© e-Learning Centre, UCSC
6
5.3 Schedules

• Schedules of Transactions
If two operations in a schedule have all of the following properties, they are said to conflict:
1. The operations belong to different transactions.
2. They access the same data item.
3. At least one of the two operations is a write (insert, update, delete).

Ex: r1(X) and w2(X) -> conflict
    w1(X) and w2(X) -> conflict
    r1(X) and r2(X) -> not a conflict
    r1(X) and w2(Y) -> not a conflict
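A small sketch that encodes the three conflict conditions as a predicate; the (transaction, action, item) tuple format is an assumption for illustration.

    # Two operations conflict iff: different transactions, same item,
    # and at least one of them is a write.
    def conflicts(op1, op2) -> bool:
        txn1, action1, item1 = op1
        txn2, action2, item2 = op2
        return (txn1 != txn2
                and item1 == item2
                and "w" in (action1, action2))

    print(conflicts((1, "r", "X"), (2, "w", "X")))  # True  : r1(X), w2(X)
    print(conflicts((1, "w", "X"), (2, "w", "X")))  # True  : w1(X), w2(X)
    print(conflicts((1, "r", "X"), (2, "r", "X")))  # False : both are reads
    print(conflicts((1, "r", "X"), (2, "w", "Y")))  # False : different items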

7
© e-Learning Centre, UCSC
7
5.3 Schedules

• Schedules Based on Recoverability


• After a system failure, we have to recover the system.
• There are some schedules, which are easy to recover
and some of the schedules cannot be recovered.
• We are going to characterize the schedules based on
recoverability.
• Recoverable schedules:- Rolling back should not be
needed once a transaction is committed.

7
© e-Learning Centre, UCSC
8
5.3 Schedules

• Schedules Based on Recoverability


• A schedule with the following properties is known as a recoverable schedule.

1. No transaction T1 in the schedule S commits until every transaction T2 that has written a value read by T1 has committed.
2. T2 should not have aborted before T1 reads item X.
3. Between the time T2 writes X and T1 reads X, no other transaction should write X.

7
© e-Learning Centre, UCSC
9
5.3 Schedules

• Schedules Based on Recoverability


(Example)
S' = r1(X); w1(X); r2(X); r1(Y); w2(X); c2; a1;

T1            T2
r1(X)
w1(X)
              r2(X)
r1(Y)
              w2(X)
              c2
a1

S' is not recoverable.
Reason: T2 reads item X written by T1, but T2 commits before T1 commits. The problem arises when T1 aborts after the c2 operation: the value of X that T2 read is no longer valid, yet T2 cannot be rolled back because it has already committed. The schedule is therefore not recoverable.
For the schedule to be recoverable, the c2 operation in S' must be postponed until after T1 commits.

8
© e-Learning Centre, UCSC
0
5.3 Schedules

Schedules Based on Recoverability


• As the name implies, a cascading rollback is the rolling back of an uncommitted transaction because it read from a transaction that failed.

S'' = r1(X); w1(X); r2(X); r1(Y); w2(X); w1(Y); a1; a2;

In the above example, T2 must also be aborted once T1 aborts, because T2 has read the value of X written by T1.

• A schedule in which cascading rollback cannot occur is known as a cascadeless schedule.

5.3 Schedules

Schedules Based on Recoverability


• In a strict schedule, an item cannot be read or written by a transaction until the transaction that last wrote that item has committed (or aborted).
• In simple terms, in a strict schedule it is not possible to read or write a value that was written by an uncommitted transaction.

5.4 Serializability

Serial, Nonserial, and Conflict-Serializable Schedules


• Serializable schedules are interleaved schedules that are still considered correct.
• If every transaction in a schedule performs all of its operations consecutively, without interleaving, the schedule is known as serial.
S’ = r1(X); w1(X); r1(Y); w1(Y); c1; r2(X); w2(X); c2;
• In schedule S’, all the operations of T1 are completed before the operations of T2 start. Therefore, S’ is a serial schedule.
• Theoretically, a serial schedule is always correct, since each transaction starts only after the commit/abort of the previous transaction.

5.4 Serializability

Serial, Nonserial, and Conflict-Serializable Schedules


• But serial schedules have problems:
- They limit concurrent execution.
- They may waste time, because while one transaction waits for an I/O operation to complete, no other transaction is allowed to execute.
- If one transaction is long, the next transaction has to wait a considerable amount of time until the previous transaction completes.
• The solution is, instead of running transactions only in serial schedules, to also allow non-serial schedules that are equivalent to a serial schedule to be executed.

5.4 Serializability

T1: r(a)=90
T1: a = a - 3
T1: w(a)=87
T1: r(b)=90
T1: b = b + 3
T1: w(b)=93
T1: c
T2: r(a)=87
T2: a = a + 2
T2: w(a)=89
T2: c

Is this a serial schedule? Yes.
Why? T2 starts execution only after T1 is completed.
Initial values: a = 90 and b = 90.
What are the final values of a and b after completion of T1 and T2? a = 89, b = 93.
Is this a correct schedule? Yes.


5.4 Serializability

T1: r(a)=90
T1: a = a - 3
T2: r(a)=90
T2: a = a + 2
T1: w(a)=87
T1: r(b)=90
T2: w(a)=92
T2: c
T1: b = b + 3
T1: w(b)=93
T1: c

Is this a serial schedule? No.
Why? T2 starts execution before T1 is completed.
Initial values: a = 90 and b = 90.
What are the final values of a and b after completion of T1 and T2? a = 92, b = 93.
Is this a correct schedule? No, the final values are not correct: T2 reads a as 90 because the change made by T1 had not yet been written, so T1's update to a is lost.
5.4 Serializability

T1: r(a)=90
T1: a = a - 3
T1: w(a)=87
T2: r(a)=87
T2: a = a + 2
T2: w(a)=89
T2: c
T1: r(b)=90
T1: b = b + 3
T1: w(b)=93
T1: c

Is this a serial schedule? No.
Why? T2 starts execution before T1 is completed.
Initial values: a = 90 and b = 90.
What are the final values of a and b after completion of T1 and T2? a = 89, b = 93.
Is this a correct schedule? Yes, the final values are correct.
5.4 Serializability

Serial, Nonserial, and Conflict-Serializable Schedules


• As the examples in the previous three slides show, non-serial schedules can produce the expected correct result, but they can also produce erroneous results.
• We can use the concept of serializability to check whether a given schedule is correct or not.
• Definition of serializability: a schedule of n transactions is serializable if it is equivalent to some serial schedule of the same n transactions.
• For two schedules to be equivalent, the operations applied to each data item should be applied to that item in the same order in both schedules.

5.4 Serializability

Serial, Nonserial, and Conflict-Serializable Schedules


• Two schedules are conflict equivalent if the order of their conflicting operations is the same in both schedules.
• If the order of the conflicting operations in the two schedules is different, the effect on the database can be different. Hence, those two schedules are not conflict equivalent.

S1 = r1(X); w2(X);
S2 = w2(X); r1(X);
S1 and S2 are not conflict equivalent since the order of their conflicting operations is different.

5.4 Serializability
• A schedule S is serializable if it is conflict equivalent to a serial schedule S’.

Ex:
Schedule P (serial):
T1: r(a); a = a - 3; w(a); r(b); b = b + 3; w(b); c;
T2: r(a); a = a + 2; w(a); c;

Schedule Q (interleaved, time order):
T1: r(a)
T1: a = a - 3
T1: w(a)
T2: r(a)
T2: a = a + 2
T2: w(a)
T2: c
T1: r(b)
T1: b = b + 3
T1: w(b)
T1: c

• Schedule P is a serial schedule.
• Schedule Q performs all the conflicting operations in the same order as schedule P. Therefore, P and Q are conflict equivalent.
• Hence, Q is a serializable schedule.
5.4 Serializability

Testing for Serializability of a Schedule


• We use an algorithm to determine the conflict serializability
of a schedule by constructing a precedence graph.
• The algorithm looks at only the read_item and write_item
operations in a schedule.
• It is a directed graph G = (N, E) that consists of a set of nodes N = {T1, T2, … , Tn} and a set of directed edges E = {e1, e2, … , em}.
• The algorithm is explained in the next slide.

5.4 Serializability

Testing for Serializability of a Schedule


• Algorithm:-
a. For every transaction Ti in schedule S, create a
node labeled Ti in the precedence graph.
b. For each case in S where Tj executes a
read_item(X) after Ti executes a write_item(X),
create an edge (Ti → Tj) in the precedence graph.
c. For each case in S where Tj executes a
write_item(X) after Ti executes a read_item(X),
create an edge (Ti → Tj) in the precedence graph.
d. For each case in S where Tj executes a
write_item(X) after Ti executes a write_item(X),
create an edge (Ti → Tj) in the precedence graph.
e. The schedule S is serializable if and only if the
precedence graph has no cycles.
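As a hedged illustration (not part of the original slides), the algorithm can be sketched in Python, reusing the (transaction, action, item) tuple representation introduced earlier; the cycle test uses a simple depth-first search:

from itertools import combinations

def precedence_graph(schedule):
    # Build the set of edges (Ti, Tj) of the precedence graph.
    # Each operation is a (transaction, action, item) tuple, action in {'r', 'w'}.
    edges = set()
    for (t1, a1, x1), (t2, a2, x2) in combinations(schedule, 2):
        # combinations() keeps the original order, so the first operation of
        # each pair occurs before the second one in the schedule
        if t1 != t2 and x1 == x2 and 'w' in (a1, a2):
            edges.add((t1, t2))
    return edges

def has_cycle(edges):
    # Depth-first search cycle detection on the directed graph.
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set())
    state = {n: 'unvisited' for n in graph}
    def dfs(n):
        state[n] = 'in-progress'
        for m in graph[n]:
            if state[m] == 'in-progress':
                return True
            if state[m] == 'unvisited' and dfs(m):
                return True
        state[n] = 'done'
        return False
    return any(state[n] == 'unvisited' and dfs(n) for n in graph)

# Schedule of the step-by-step example that follows: r1(X), r1(Y), w2(X), w1(X), r2(Y)
S = [(1, 'r', 'X'), (1, 'r', 'Y'), (2, 'w', 'X'), (1, 'w', 'X'), (2, 'r', 'Y')]
E = precedence_graph(S)
print(E)               # two edges: T1 -> T2 and T2 -> T1
print(has_cycle(E))    # True -> S is not conflict serializable

Running the sketch on that example produces the same edges and the same "not serializable" conclusion reached on the following slides.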
5.4 Serializability

Constructing the precedence graph


Step-by-step example

S : r1(X), r1(Y), w2(X), w1(X), r2(Y)

Step 1 – The given schedule S has operations from two


transactions. Thus, make two nodes corresponding to the two
transactions T1 and T2.

T1 T2

5.4 Serializability

Constructing the precedence graph


Step-by-step example

S : r1(X), r1(Y), w2(X), w1(X), r2(Y)

Step 2 - For the conflicting pair r1(X) w2(X), where r1(X)


happens before w2(X), draw an edge from T1 to T2.

T1 --X--> T2

5.4 Serializability

Constructing the precedence graph


Step-by-step example

S : r1(X), r1(Y), w2(X), w1(X), r2(Y)

Step 3 - For the conflicting pair w2(X) w1(X), where w2(X)


happens before w1(X), draw an edge from T2 to T1.

T1 --X--> T2 and T2 --X--> T1 (edges in both directions)

5.4 Serializability

Constructing the precedence graph


Step-by-step example
S : r1(X), r1(Y), w2(X), w1(X), r2(Y)
Step 4 - Check whether the graph contains cycles.

T1 --X--> T2 and T2 --X--> T1 (a cycle between T1 and T2)

Since the graph is cyclic, we can conclude that the schedule


S is not serializable.

5.4 Serializability
Testing for Serializability of a Schedule

Let’s consider another example. According to the schedule S given in tabular form (operations of T1, T2 and T3, numbered 1–13 in execution order), we can find the following set of edges for the precedence graph.

Schedule S:
1. r(Z)   2. r(Y)   3. w(Y)   4. r(Y)   5. r(Z)   6. r(X)   7. w(X)
8. w(Y)   9. w(Z)   10. r(X)  11. r(Y)  12. w(Y)  13. r(X)

Edges:
Line no. 3 and 4:  T2 -> T3 (Y)
Line no. 1 and 9:  T2 -> T3 (Z)
Line no. 7 and 10: T1 -> T2 (X)
Line no. 3 and 11: T2 -> T1 (Y)
Line no. 8 and 12: T3 -> T1 (Y)
5.4 Serializability

Testing for Serializability of a Schedule

Based on the edges identified, we can draw the precedence graph:
T1 --X--> T2, T2 --Y--> T1, T2 --Y,Z--> T3, T3 --Y--> T1

The graph contains cycles (for example T1 -> T2 -> T1). Therefore, no equivalent serial schedule exists for the given schedule S. Hence, the schedule S is not serializable.

5.4 Serializability

Using Serializability for Concurrency Control


• A serial schedule may slow down the execution process, as
it does not utilize the processing time efficiently when
- executing long transactions in a serial schedule
- waiting for I/O operations
• However, serializable schedules allow concurrent execution without giving up correctness.
• In practical scenarios, it is difficult to test the serializability of schedules, as the interleaving of operations is determined by the operating system scheduler.
• It is also difficult to predetermine the order of operations in advance to ensure serializability.

5.4 Serializability

Using Serializability for Concurrency Control

• Most DBMSs enforce a set of rules (a concurrency control protocol) that all transactions must follow; as a result, every schedule that is produced is serializable.
• Rarely, some systems may allow non-serializable schedules to be executed in order to reduce transaction overhead.

5.4 Serializability

View Equivalence and View Serializability

• The idea behind view equivalence is that the write operations of the transactions will produce the same results in both schedules, as long as each read operation reads a value generated by the same write operation in both schedules.
• View equivalence offers a less restrictive definition of schedule equivalence than conflict equivalence does.
• To be view serializable, a schedule has to be view equivalent to a serial schedule.

5.4 Serializability
View Equivalence and View Serializability
• Criteria for two schedules S and S′ to be view equivalent is
as follows.

1. The same set of transactions participate in S & S′, and S


& S′ include the same operations of those transactions.
Simply, S and S’ should have the same transactions and operations.
2. For any operation ri(X) of Ti in S, if the value of X read by
the operation has been written by an operation wj(X) of
Tj (or if it is the original value of X before the schedule
started), the same condition must hold for the value of X
read by operation ri(X) of Ti in S′.
Simply, in both S and S’, each read should obtain its value from the same source (write) operation.
5.4 Serializability

View Equivalence and View Serializability

3. If the operation wk(Y) of Tk is the last operation to write


item Y in S, then wk(Y) of Tk must also be the last
operation to write item Y in S′.
Simply, the transaction that performs the last write of a particular data item should be the same in both schedules S and S’.

5.4 Serializability

View Equivalence and View Serializability


Consider the schedules S and P given below to illustrate view serializability.

Schedule S:
T1: r(a)
T1: w(a)
T2: r(a)
T2: w(a)
T1: r(b)
T1: w(b)
T2: r(b)
T2: w(b)

Schedule P:
T1: r(a)
T1: w(a)
T1: r(b)
T1: w(b)
T2: r(a)
T2: w(a)
T2: r(b)
T2: w(b)
5.4 Serializability

View Equivalence and View Serializability


• In both schedules S and P, we can see two transactions T1 and
T2. All the operations included in both T1 and T2 are also the
same.
• Therefore, the set of transactions and their operations are the
same in S and P.
• For every data item, T2 reads what T1 has written in schedule S.
In schedule P also we can see the same sequence of T2 reading
what T1 has written.
• In schedule S, the last write of every data item is performed by T2. Similarly, in schedule P the last write of every data item is also done by T2.
• Since all three properties are satisfied, we can conclude that S and P are view equivalent.
• P is a serial schedule. Therefore, S is a view serializable schedule.
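To make the three conditions concrete, here is a small illustrative Python sketch (an assumption-laden example, not from the slides) that computes, for a schedule given as (transaction, action, item) tuples, the reads-from relation and the final writer of each item, and compares them for two schedules. It assumes condition 1 (same transactions and operations) already holds:

from collections import defaultdict

def view_profile(schedule):
    # reads_from: for each transaction, the ordered list of (item, writer)
    # pairs it read, where writer is None for the initial database value.
    # last_writer: for each item, the transaction that wrote it last.
    last_writer = {}
    reads_from = defaultdict(list)
    for txn, action, item in schedule:
        if action == 'r':
            reads_from[txn].append((item, last_writer.get(item)))
        else:  # 'w'
            last_writer[item] = txn
    return dict(reads_from), last_writer

def view_equivalent(s1, s2):
    # Conditions 2 and 3 hold when both profiles are identical.
    return view_profile(s1) == view_profile(s2)

# Schedules S and P from the slides above
S = [(1, 'r', 'a'), (1, 'w', 'a'), (2, 'r', 'a'), (2, 'w', 'a'),
     (1, 'r', 'b'), (1, 'w', 'b'), (2, 'r', 'b'), (2, 'w', 'b')]
P = [(1, 'r', 'a'), (1, 'w', 'a'), (1, 'r', 'b'), (1, 'w', 'b'),
     (2, 'r', 'a'), (2, 'w', 'a'), (2, 'r', 'b'), (2, 'w', 'b')]
print(view_equivalent(S, P))   # True -> S is view equivalent to the serial schedule P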
5.5 Transaction Support in SQL

• The basic interpretation of an SQL transaction is the same as the concept of a transaction already defined: a transaction is a logical unit of work and it is guaranteed to be atomic.
• Consistent with this, an individual SQL statement is also atomic: either it completes its execution without any error, or it fails and the database remains unchanged.

5.5 Transaction Support in SQL

• In SQL, there isn’t any explicit “Begin_Transaction”


statement. When specific SQL statements are
encountered, transaction initiation is done implicitly.
• Every transaction must have an explicit end statement,
which is either a “COMMIT” or a “ROLLBACK”.
• Further, every transaction is characterized by some attributes. Those are:
‒ Access mode
‒ Diagnostic area size
‒ Isolation level
• In SQL, the “SET TRANSACTION” statement is used to specify the above characteristics.

5.5 Transaction Support in SQL

• The access mode can be specified as READ ONLY or


READ WRITE. The default is READ WRITE.
• READ WRITE allows select, update, insert, delete, and
create commands to be executed.
• READ ONLY is simply for data retrieval.
• Syntax → SET TRANSACTION READ ONLY ;
SET TRANSACTION READ WRITE ;

5.5 Transaction Support in SQL

• DIAGNOSTIC SIZE n is the SQL option for setting the diagnostics area size. n is an integer that specifies the number of conditions that can be held simultaneously in the diagnostics area. These conditions provide feedback information, such as errors and exceptions, about the n most recently executed SQL statements.

5.5 Transaction Support in SQL

SET TRANSACTION
READ ONLY,
ISOLATION LEVEL READ UNCOMMITTED,
DIAGNOSTIC SIZE 6;

This statement defines a transaction that has READ ONLY access mode and READ UNCOMMITTED isolation level, and provides feedback information about the 6 most recently executed statements.

5.5 Transaction Support in SQL

• ISOLATION LEVEL <isolation> is the SQL option for setting the isolation level of the transaction. The following values can be used for <isolation>:
- READ UNCOMMITTED
- READ COMMITTED
- REPEATABLE READ
- SERIALIZABLE (default isolation level)
• Here, the term “SERIALIZABLE” is used in the sense of preventing the violations that produce dirty reads, non-repeatable reads, and phantoms (discussed in 5.1.2).
• Thus, even if the transactions are executed
concurrently, serializable isolation makes sure that the
outcome of this concurrent execution would produce the
same effects as if they were executed serially.
5.5 Transaction Support in SQL

Possible Violations Based on Isolation Levels as Defined in


SQL.

Isolation Level    | Dirty Read   | Non-repeatable Read | Phantoms
Read Uncommitted   | Possible     | Possible            | Possible
Read Committed     | Not possible | Possible            | Possible
Repeatable Read    | Not possible | Not possible        | Possible
Serializable       | Not possible | Not possible        | Not possible

5.5 Transaction Support in SQL
Read Uncommitted: Declares that a transaction can read rows that have been modified by other transactions but not yet committed. Thus, it can result in dirty reads, non-repeatable reads and phantoms.
• Example - Consider the following transactions T1 and T2 that occur on an account holding an initial balance of Rs. 50,000.
Transaction (T1) →
Deducts Rs. 1000 from the account (Customer_ID=Cid_1105) due to an automated bill payment that happens every month. But, since an error occurred, transaction T1 rolled back without committing.
Transaction (T2) →
At the same time while T1 executes, customer
(Customer_ID=Cid_1105) checks his account balance.

5.5 Transaction Support in SQL

• This is the Tabular representation of the Transactions T1


and T2 explained in the example of the previous slide.

5.5 Transaction Support in SQL

Suppose we have set the isolation level to Read Uncommitted in T2, as shown in the following SQL query.

SET TRANSACTION ISOLATION LEVEL READ


UNCOMMITTED;
BEGIN TRAN;
SELECT balance
FROM Customer_tbl
WHERE Customer_ID = 'Cid_1105';
COMMIT TRAN;

5.5 Transaction Support in SQL

• We get output = 49,000 as the result of the


transaction T2.
• However, the actual balance should be 50,000, since T1 was rolled back and the original value restored.
• Explanation→ 49,000 is the balance as updated by T1. T2 reads the balance (as 49,000) before T1 rolls back. This dirty read occurred because we set the isolation level to “READ UNCOMMITTED” in T2.

5.5 Transaction Support in SQL
Read Committed: Declares that a transaction can only read data that has been committed by other transactions. Thus, it prevents dirty reads, but can still result in non-repeatable reads and phantoms.
• Example - Consider the following transactions T1 and T2 that occur on an account holding an initial balance of Rs. 50,000.
Transaction (T1) →
Deducts Rs. 1000 from the account (Customer_ID=Cid_1105) due to an automated bill payment that happens every month. This transaction completed successfully and committed to the database.
Transaction (T2) →
While T1 executes, the customer (Customer_ID=Cid_1105) checks his account balance twice consecutively, i.e., T2 reads the account balance twice.
5.5 Transaction Support in SQL

• This is the tabular representation of Transactions T1 and


T2 explained in the example of the previous slide.

5.5 Transaction Support in SQL

Suppose we have set the isolation level to Read Committed in T2, as shown in the following SQL query.

SET TRANSACTION ISOLATION LEVEL READ


COMMITTED;
BEGIN TRAN;
SELECT balance
FROM Customer_tbl
WHERE Customer_ID = 'Cid_1105';
COMMIT TRAN;

5.5 Transaction Support in SQL

• We get output = 50,000 as the result of the first read


in transaction T2 and output = 49,000 for the
second read in transaction T2.
• Explanation→ Because we have set the isolation level to “READ COMMITTED” in T2, it reads only data committed by other transactions. Since T1 has not committed when T2 reads the balance for the first time, T2 reads the balance as 50,000. After T1 commits, the second read returns 49,000.
• Reading different values for the same row within one transaction is a non-repeatable read; READ COMMITTED also still allows phantoms.

5.5 Transaction Support in SQL

Repeatable read: Declares that,


• Statements cannot read data that has been modified
but not yet committed by other transactions
and
• No other transactions can modify the data that the
current transaction has read until the current
transaction has completed.

5.5 Transaction Support in SQL

• Example - Consider the transactions T1 and T2 that occur on an account holding an initial balance of Rs. 50,000.

Transaction (T1) →
Deducts Rs. 1000 from the account (Customer_ID=Cid_1105) due to an automated bill payment that happens every month. Then transaction T1 commits.
Transaction (T2) →
While T1 executes, the customer (Customer_ID=Cid_1105) checks his account balance twice consecutively.

5.5 Transaction Support in SQL

• This is the tabular representation of transactions T1 and


T2 explained in the example of the previous slide.

5.5 Transaction Support in SQL

Suppose we have set the isolation level to Repeatable Read in T1, as shown in the following SQL query.

SET TRANSACTION ISOLATION LEVEL REPEATABLE


READ;
BEGIN TRAN;
UPDATE Customer_tbl
SET balance=balance-1000
WHERE Customer_ID = 'Cid_1105';
COMMIT TRAN;

5.5 Transaction Support in SQL

• The first read statement of T2 will not get the balance, but the second read statement in T2 will return the output = 49,000.
• Explanation→ Since we have set the isolation level to “REPEATABLE READ” in T1, the first read statement in T2 is not allowed to read the balance, because T1 has updated the balance but has not committed yet.
• When T2 reads the balance again, T1 has completed and committed to the database. Hence it gets the output = 49,000.

5.5 Transaction Support in SQL

• Serializable: Declares that,


• Statements cannot read data that has been modified
but not yet committed by other transactions
and
• No other transactions can modify the data that the
current transaction has read until the current
transaction has completed.
• Until the current transaction completes, other
transactions cannot insert new rows with key values
that fall inside the range of keys read by any
statements in the current transaction.

5.5 Transaction Support in SQL

• Example - Consider the following transactions T1 and T2 on an employee table.

Transaction (T1) →
Reads the details of employees who are working in the “123” department twice consecutively.
Transaction (T2) →
At the same time, a new record is inserted into the employee table for an employee named “June” who works in the “123” department.

5.5 Transaction Support in SQL

• This is the tabular representation of transactions T1 and


T2 explained in the example of the previous slide.

5.5 Transaction Support in SQL

Suppose we have set the isolation level to Serializable in T1, as shown in the following SQL query.

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;


BEGIN TRAN;
SELECT *
FROM employee
WHERE dept_id = '123';
COMMIT TRAN;

5.5 Transaction Support in SQL

• For T1, both read statements will return 65 rows.


• Explanation→ We have set the isolation level to “SERIALIZABLE” in T1. Therefore, T2 cannot insert the details of an employee working in the “123” department, because that key range has been read by T1.
• Once T1 has completed and committed to the database, T2 is executed and updates the employee table. Then the read in T2 will return 66 rows, including the new insertion.

5.6 Consistency in NoSQL

Consistency
• As we discussed in previous slides, a transaction leads
the database from one consistent state to another.
• In other words, transactions must affect database only
in valid ways.

Consistency in NoSQL
• In NoSQL databases, eventual consistency is preferred over immediate consistency. This will be discussed in detail later.

5.6 Consistency in NoSQL

Update Consistency
• Update consistency in NoSQL makes sure that write-write conflicts do not occur.
• A write-write conflict occurs when two transactions update the same data item at the same time. If the server simply serializes the updates, the later update overwrites the earlier one and a lost update occurs.
• There are two types of approaches for maintaining consistency:
– Pessimistic approach: prevents conflicts from occurring.
– Optimistic approach: lets the conflicts occur, but detects them and takes action to sort them out.

5.6 Consistency in NoSQL

Update Consistency cont.


• The selection of a consistency approach depends on the fundamental tradeoff between safety (avoiding errors such as update conflicts) and liveness (responding quickly to clients).
• Liveness is described as “something good will eventually occur” and safety as “something bad does not occur”.
• For example, in a banking system it is important to maintain the consistency of transactions among bank accounts at all times. Therefore, safety should be the priority.
• In an informational site that displays the real-time score of a cricket match, it is important to prioritize liveness over safety.

5.6 Consistency in NoSQL

Update Consistency cont.


• Pessimistic approach (Prevents conflicts from occurring)
- Write lock
In order to write, a transaction needs to acquire a lock on the record. When two transactions attempt to acquire the write lock, the system ensures that only one of them can obtain it.
The second transaction will then see the result of the first transaction’s update before deciding whether to make its own update.
Pessimistic approaches often severely degrade the responsiveness of a system and may even lead to deadlocks.
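A minimal Python sketch of the write-lock idea (illustrative only; a real data store manages locks per record on the server side):

import threading

balance = 50_000
record_lock = threading.Lock()        # one lock guarding the (hypothetical) account record

def debit(amount):
    global balance
    with record_lock:                 # the second writer blocks here until the first releases
        current = balance             # read
        balance = current - amount    # write, applied without interference

t1 = threading.Thread(target=debit, args=(1000,))
t2 = threading.Thread(target=debit, args=(500,))
t1.start(); t2.start()
t1.join(); t2.join()
print(balance)                        # 48500 -> neither update is lost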

5.6 Consistency in NoSQL

Update Consistency cont.


• Optimistic approach (Let the conflicts occur but detects
and takes action to sort them out)
- Conditional update
Before a transaction updates the value of a data item, it checks whether the value has changed since its last read. If the value has changed, the update fails; otherwise, the update proceeds.

5.6 Consistency in NoSQL

Update Consistency cont.


Conditional update - Example

Samanali and Krishna read the record A which has the value
100. Samanali wants to add 50 to the A value. Just before
writing the value, she checks the value of A, to make sure it
has not changed since her last read and then does the
modification. Meanwhile Krishna wants to subtract 20 from the
value A. Just before the modification, he also checks the value
of A to make sure the value remain unchanged as 100. But as
Samanali has changed A to 150, Krishna fails to do the
update.
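The conditional-update idea can be sketched with a version number acting as the check (an illustrative Python example, not a real database API):

class Record:
    # A single data item protected by an optimistic, conditional update.
    def __init__(self, value):
        self.value = value
        self.version = 0                  # bumped on every successful write

    def read(self):
        return self.value, self.version

    def conditional_update(self, new_value, expected_version):
        # Apply the write only if nothing has changed since the caller's read.
        if self.version != expected_version:
            return False                  # conflict detected: caller must re-read and retry
        self.value = new_value
        self.version += 1
        return True

A = Record(100)
val_s, ver_s = A.read()                   # Samanali reads A = 100
val_k, ver_k = A.read()                   # Krishna also reads A = 100

print(A.conditional_update(val_s + 50, ver_s))   # True  -> A becomes 150
print(A.conditional_update(val_k - 20, ver_k))   # False -> Krishna's update is rejected
print(A.read())                                  # (150, 1)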

5.6 Consistency in NoSQL

Update Consistency cont.


• Optimistic approach (Let the conflicts occur but detects
and takes action to sort them out)
- Save both updates and mark as conflicts
Allow different modifications of the same data item to be completed and then merge those updates. Merging can be done either by showing all the modified values to the user and asking them to sort it out, or automatically by the system.

5.6 Consistency in NoSQL

Update Consistency cont.


Save both updates and mark as conflicts - Example

Samanali and Krishna read the record A, which has the value 100. Then Samanali adds 50 to this value and writes it. Meanwhile, Krishna subtracts 20 from the value of A and writes it. The DBMS will save both values, 150 (written by Samanali) and 80 (written by Krishna), as possible values for A and mark them as being in conflict.
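A toy Python sketch of this "save both and mark as conflict" behaviour (the class name and structure are illustrative assumptions):

class MultiValueRecord:
    # Keeps all concurrently written values ("siblings") until they are merged.
    def __init__(self, value):
        self.siblings = [value]

    def write(self, new_value, based_on):
        if self.siblings == [based_on]:
            self.siblings = [new_value]       # ordinary, non-conflicting update
        else:
            self.siblings.append(new_value)   # concurrent update: keep both values

    def in_conflict(self):
        return len(self.siblings) > 1

A = MultiValueRecord(100)
A.write(150, based_on=100)    # Samanali: 100 + 50
A.write(80, based_on=100)     # Krishna based his change on 100, which is now stale
print(A.siblings)             # [150, 80] -> both values kept for later merging
print(A.in_conflict())        # True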

5.6 Consistency in NoSQL

Read Consistency
• Read consistency in NoSQL will guarantee that readers
will always get consistent responses to their requests.
• Read consistency will prevent “inconsistent read” or
“read-write conflict”.
• Read consistency preserves:
- Logical consistency (ensure that different
data items make sense together).
- Replication consistency (ensure that same
data item has the same value when read from
different replicas).
- Session consistency (within user’s session
there is read-your-writes consistency. It means once
you’ve made an update, you are guaranteed to
continue seeing that update).
5.6 Consistency in NoSQL

Read Consistency cont.


• Session consistency can be maintained using
- sticky session (session tied to one node) or
- version stamps (will be discussed later).
• Sometimes nodes may have replication inconsistencies.
However, if there are no further updates, eventually all
nodes will be updated to the same value. This is known
as eventual consistency.
• The length of time during which the inconsistency is present before eventual consistency is reached is known as the inconsistency window.
• During this inconsistency window, some of the data is out of date; such data is known as stale data.
5.6 Consistency in NoSQL

Replication
• Creating multiple copies of data items over different
servers is known as replication.
• It can be implemented in the following two forms.
- Master-Slave : In master-slave replication, the
master processes the updates and then changes are
propagated to slaves.
- Peer-to-peer: In peer-to-peer replication, all the
nodes can process updates and then synchronize
their copies of data.

5.6 Consistency in NoSQL

Master-Slave Replication
• Master
- The authoritative source for the data
- Responsible for processing updates
- Can be appointed manually or automatically
• Slaves
- Changes propagate to the slaves from the master
- If the master fails, a slave can be appointed as the new master.

5.6 Consistency in NoSQL

Master-Slave Replication cont.


• Pros
- Can easily scale out if more read requests are received
- If the master fails, the slaves can still handle read requests
- Suitable for read-intensive systems
• Cons
- Only the master can process updates, so it may become a bottleneck
- If the master fails, the ability to perform updates is lost until a new master is appointed
- An inconsistency window is inevitable
- Not suitable for write-intensive systems

5.6 Consistency in NoSQL

Peer-to-Peer Replication
• All the replicas have equal weight
• Every replica can process updates
• Even if one replica fails, the system can continue to operate normally.
• Pros
- Resistant to node failures
- Can easily add nodes to improve performance
• Cons
- Write-write inconsistencies can occur
- Read-write inconsistencies can occur due to slow
propagation

5.6 Consistency in NoSQL

Relaxing Consistency
• Even though consistency is a good property, normally it is
impossible to achieve consistency without significant
sacrifices to other characteristics of the system such as
availability.
• Transactions will enforce consistency but it is possible to
relax isolation levels to enable individual transactions to
read data that has not been committed yet.
• Relaxing isolation level will improve the performance but
will reduce the consistency.

5.6 Consistency in NoSQL

CAP Theorem
• In a database which has several connected nodes, given
the three properties of Consistency, Availability and
Partition tolerance, it is possible to enforce only two
properties at a time.
- Consistency: (We discussed earlier).
- Availability: Every request received by a non failing
node in the system must result in a response.
- Partition tolerance: The system continues to operate
despite communication breakages that separate the
cluster into multiple partitions which are unable to
communicate with each other.
• A system designed with the CAP theorem in mind will not be perfectly consistent or perfectly available, but will have a reasonable combination of the two properties.
5.6 Consistency in NoSQL

CAP Theorem cont.

The three properties are Consistency, Availability and Partition tolerance; a system can fully provide only two of them at a time:
• CP Category: some data might become unavailable.
• CA Category: network problems might stop the system.
• AP Category: data inconsistencies may occur.



5.6 Consistency in NoSQL

CAP Theorem cont.


CAP theorem categorizes systems into three categories.
• CP Category
- Availability is sacrificed only in the case of a network
partition.
• CA Category
- Consistent and available systems in the absence of
any network partition.
• AP Category
- Systems that are available and partition tolerant but
cannot guarantee consistency.

5.6 Consistency in NoSQL

Durability
• Durability means that committed transactions survive permanently (even if the system crashes). This is achieved by flushing the records to disk (non-volatile memory) before acknowledging the commit.

Relaxing Durability
• With relaxed durability, the database can apply updates in memory and only periodically flush changes to disk. If durability needs can be specified on a call-by-call basis, the more important updates can still be flushed to disk immediately.
• By relaxing durability, we can gain higher performance.

5.6 Consistency in NoSQL

Relaxing Durability cont.


• Durability trade off for higher performance may be
worthwhile for some scenarios as given below:
- Storing user-session: There are many activities with
respect to a user session, which affect the
responsiveness of the website. Thus, losing the
session data will create less annoyance than a
slower website.

- Capturing telemetric data from physical devices: It


may be necessary to capture data at a faster rate, at
the cost of missing the last updates if the server
crashes.

5.6 Consistency in NoSQL

Relaxing Durability
• Another class of durability tradeoffs comes up with
replicated data.
• A failure of replication durability occurs when a node
processes an update but fails before that update is
replicated to the other nodes.
• For example, assume a peer-to-peer replicated system with three nodes, R1, R2 and R3. If a transaction is applied to the memory of R1, but R1 crashes before the update is sent to R2 and R3, a failure of replication durability occurs. This can be avoided by setting the durability level: if the system does not acknowledge the commit until the update has been propagated to a majority of the nodes, the above scenario will not occur.

5.6 Consistency in NoSQL

Quorums
• Answers the question, “How many nodes need to be
involved to get strong consistency?”
• The write quorum W is the number of nodes that must participate in a write so that conflicting writes cannot both be accepted.
• If W > N/2, then the system is said to have strongly consistent writes.
• W - number of nodes participating in the write
• N - number of nodes involved in replication
• The number of replicas is known as the replication factor.
• If the number of nodes you need to contact for a read is R, then when R + W > N you can have a strongly consistent read.
5.6 Consistency in NoSQL
Quorums Example
• Let’s consider a system with replication factor 3. How
many nodes are required to confirm a write?
For the system to have strongly consistent writes, W should be greater than N/2 (N is the replication factor).
Here, W needs to be greater than 3/2, i.e., W > 1.5.
Therefore, we need at least 2 nodes to confirm a write.
• What is the number of nodes you need to contact for a read?
R + W > N (according to the definition on the previous slide)
R > N - W
R > 3 - 2
R > 1
Therefore, you need to contact at least 2 nodes for a read.
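The quorum arithmetic above can be checked with a small Python sketch (illustrative only, not from the slides):

def strong_write(W, N):
    # write quorum: more than half of the N replicas must take part in the write
    return W > N / 2

def strong_read(R, W, N):
    # read quorum: the read set and the write set must overlap (R + W > N)
    return R + W > N

N = 3                                              # replication factor, as in the example
W = next(w for w in range(1, N + 1) if strong_write(w, N))
R = next(r for r in range(1, N + 1) if strong_read(r, W, N))
print(W, R)                                        # 2 2 -> two nodes to confirm a write, two to read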
5.6 Consistency in NoSQL

Version Stamps
• Transactions have limitations: updates that involve human interaction cannot reasonably be wrapped in a single transaction, because transactions should be kept short.
• Applying locks for long periods of time will affect the performance of the system. The solution to this is version stamps: a field that changes every time the underlying data in the record changes.
• The system can note the version stamp when reading the data and can check whether it has changed before writing the data.

5.6 Consistency in NoSQL

Version Stamps Cont.


• Version stamps can be created by:
i. Using an incrementing counter at each update of
the resource.
ii. Create a GUID, which is a large random number
that is unique.
iii. Make a hash of the contents of the resource.
iv. Use the timestamp of the last update.
We will discuss the advantages and disadvantages of each
method in coming slides.

5.6 Consistency in NoSQL

Version Stamps cont.

i. Using an incrementing counter at each update of the


resource.
• Pros
- Easy to compare and find the most recent
version
• Cons
- Requires a server to generate counter values
- Need a single master to ensure the counters are
not duplicated

5.6 Consistency in NoSQL

Version Stamps cont.

ii.Create a GUID
• Pros
- Can be generated by any node
• Cons
- Large numbers
- Unable to compare and find the most recent
version directly.

5.6 Consistency in NoSQL

Version Stamps cont.

iii. Make a hash of the content


• Pros
- Can be generated by any node
- Deterministic (any node will generate the
same hash for the same content)
• Cons
- Lengthy
- Cannot be directly compared for recentness

5.6 Consistency in NoSQL

Version Stamps cont.

iv. Use the timestamp of the last update


• Pros
- Reasonably short
- Can be directly compared for recentness
- Does not need single master
• Cons
- Clocks of all nodes should be synchronized
- Duplicates can occur if the timestamp is too
granular

5.6 Consistency in NoSQL

Version Stamps cont.


• Vector stamps - A special form of version stamp, that is
used by peer-to-peer NoSQL systems.
• A vector stamp is a set of counters that are defined for
each node.
Example
• Assume there are 3 nodes, A, B and C.
• Vector stamp for these nodes may look like
[A:10,B:15,C:5]
• When there is an internal update, the node will update
its counter.
• Therefore, an update in B will change the vector stamp
to [A:10,B:16,C:5] (increment B count)
• Whenever two nodes communicate, they synchronize their vector stamps.
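A small Python sketch of vector stamps (an illustrative assumption of how the counters could be kept and merged, not a specific product's API):

def bump(stamp, node):
    # a node increments its own counter when it applies a local update
    new = dict(stamp)
    new[node] = new.get(node, 0) + 1
    return new

def merge(s1, s2):
    # when two nodes synchronize, they keep the maximum counter per node
    return {n: max(s1.get(n, 0), s2.get(n, 0)) for n in sorted(set(s1) | set(s2))}

v = {'A': 10, 'B': 15, 'C': 5}                        # vector stamp from the example above
v_after_b = bump(v, 'B')                              # internal update on node B
print(v_after_b)                                      # {'A': 10, 'B': 16, 'C': 5}
print(merge(v_after_b, {'A': 11, 'B': 15, 'C': 5}))   # {'A': 11, 'B': 16, 'C': 5}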
Activity

• Check whether the given schedule is serializable by


drawing a precedence graph. Justify your answer.
S = r3(Y); r3(Z); r1(X); w1(X); w3(Y); w3(Z); r2(Z); r1(Y); w1(Y); r2(Y); w2(Y); r2(X); w2(X);

Activity

Consider the T1 and T2 transactions given in tabular format. If T1 reads 65 rows and 66 rows respectively in its Read1 and Read2 operations, what is the minimum isolation level of transaction T1?

Activity

Consider the T1 and T2 transactions given in tabular format. If T1 reads 65 rows in both its Read1 and Read2 operations, what is the minimum isolation level of transaction T1?

Activity

Consider T1 and T2 transactions


given in tabular format. If the
transaction T1 was executed after
setting isolation level as follows,
SET TRANSACTION ISOLATION
LEVEL REPEATABLE READ;
BEGIN TRAN;
INSERT INTO employee (name, emp_id, dept_id)
VALUES ('Gabi', 0986, '123');
COMMIT TRAN;

What would be the output of


transaction T2?

Activity

Consider T1 and T2 transactions


given in tabular format. Suppose
each of the Read operations
given in T2 transaction was
executed after setting the
isolation level as follows.
SET TRANSACTION ISOLATION
LEVEL SERIALIZABLE;
BEGIN TRAN;
SELECT balance
FROM customer
WHERE customer_ID = '5467';
COMMIT TRAN;
If customer_ID '5467' had 100,000 in the account at the beginning of T2, what would be the output of the second read statement in transaction T2?
Activity

Consider a schedule S with two transactions T1 and T2 as


follows;
S: r1(X); r2(X); w1(Y); w2(Y); c1; c2;
Are there conflicting operations in this schedule?
Represent the schedule in a tabular format and explain the
conflicting operations if any.

Activity

Consider a schedule S with two transactions T1 and T2 as


follows;
S: r1(X); w2(X); r1(X); w1(Y); c1; c2;

Is the schedule S conflict serializable? Provide reasons for your answer.

Activity

Consider the given schedule S for transactions T1, T2 and T3.


S : r1(X); r2(Y); r3(Z); w2(Y); w1(X); w3(X); r2(X); w2(X)
What is the equivalent serial schedule for the above schedule
S?

Activity

Consider a schedule S with three transactions T1, T2 and T3


as follows.
S: r1(X); w1(X); r1(Y); r1(Z); c1; r2(X); w2(X); r2(Z); w2(Z); c2;
r3(Y); w3(Y); r3(Z); w3(Z); c3;
Is the schedule S a serial schedule? Explain the answer.

Activity

Consider a schedule S with two transactions T1 and T2 as


follows.
Is the following schedule S, a recoverable schedule? Justify
your answer.
S: r1(A); r2(A); w1(A); r1(B); w2(A); w1(B); c1; c2;

Activity

Write whether the given statements regarding schedule S are


true or false.
S: r1(X); r2(Y); w3(X); r2(X); r1(Y);

1. S is conflict serializable and view serializable. (_______)


2. S does not have any blind writes.(_______)
3. S is conflict serializable but not view serializable. (_______)

Activity

Write whether the given statements are true or false considering the given schedule S.
1. S is conflict serializable and recoverable. (_______)
2. S is conflict serializable but not recoverable. (_______)
3. S includes blind writes. (_______)
4. S is recoverable but not conflict serializable. (_______)

Schedule S (operations from T1, T2, T3 and T4, shown in execution order):
r(X); w(X); c; w(X); c; w(Y); r(Z); c; r(X); r(Y);

Activity

Match each of the property given in left column with relevant


explanation given in the right column.

Property            | Explanation
Consistency         | System continues to operate even in the presence of node failure.
Availability        | System continues to operate in spite of network failures.
Partition Tolerance | All the users can see the same data at the same time.

Activity

For a system consisting of 15 nodes with a replication factor of 5, what are the write quorum and read quorum respectively?

Write quorum =>

Read quorum =>

Activity

Fill in the blanks with the most suitable word provided.


_______ replication is ideal for a write-intensive system, while __________ replication is better for a read-intensive system.
In master-slave replication, the _________ node is a single point of failure.
Conditional update is a/an ________ approach to maintaining consistency in NoSQL databases.

(peer-to-peer, master-slave, primary, secondary, optimistic,


pessimistic)

Activity
• Drag and drop the correct answer from the given list.

Version stamps help users to detect conflicts.


Among the different version stamp creation methods,
the __________ approach might suffer from duplicates if the system gets many updates per millisecond.
A version stamp is a field that changes ___________ when the underlying data in the record changes.

(concurrency , update , read , version stamp, counters, every


time, often, rarely, use the timestamp of the last update, make
a hash of the content, create a GUID )

Summary

Transaction Processing: single-user systems, multi-user systems and introduction to transactions; problems in concurrent transaction processing; introduction to concurrency control; DBMS failures; introduction to data recovery; transaction states.

Properties of Transactions: ACID properties; levels of isolation.

Summary

Schedules: Schedules of Transactions; Schedules Based on Recoverability.

Serializability: Serial, Nonserial, and Conflict-Serializable Schedules; Testing for Serializability of a Schedule; Using Serializability for Concurrency Control; View Equivalence and View Serializability.

Consistency in NoSQL: Update Consistency, Read Consistency, Relaxing Consistency, CAP theorem, Relaxing Durability and Quorums, Version Stamps.

