
DEPARTMENT OF COMPUTER SCIENCE

DUAL DEGREE INTEGRATED POST GRADUATE PROGRAM

RAJIV GANDHI PROUDYOGIKI VISHWAVIDYALAYA, BHOPAL (M.P.)

Advanced Concept in Databases Assignment File

Roll Number: 0007CS16DD05


Semester: VIII (DDI-PG)

Submitted by: ANIMESH SINGH

Submitted to: PROF. SMITA MADAM
Q.1. What do you mean by data modeling? Compare different data models.

Ans- Data modeling is the process of creating a data model for the data to be stored in a database. This data model is a
conceptual representation of data objects, the associations between different data objects, and the rules. Data modeling
helps in the visual representation of data and enforces business rules, regulatory compliance, and government policies
on the data. Data models ensure consistency in naming conventions, default values, semantics, and security, while also
ensuring the quality of the data.

A data model emphasizes what data is needed and how it should be organized, rather than what operations need to be
performed on the data. A data model is like an architect's building plan: it helps to build a conceptual model and set the
relationships between data items.

Types of Data Models

There are mainly three different types of data models:

1. Conceptual: This data model defines WHAT the system contains. It is typically created by business stakeholders
and data architects. The purpose is to organize, scope and define business concepts and rules.
2. Logical: Defines HOW the system should be implemented regardless of the DBMS. It is typically created by data
architects and business analysts. The purpose is to develop a technical map of rules and data structures.
3. Physical: This data model describes HOW the system will be implemented using a specific DBMS. It is typically
created by DBAs and developers. The purpose is the actual implementation of the database.

Conceptual Data Model

A conceptual data model is a summary-level data model that is most often used on strategic data projects.  It typically
describes an entire enterprise.  Due to its highly abstract nature, it may be referred to as a conceptual model.
Common characteristics of a conceptual data model:

 Enterprise-wide coverage of the business concepts.  Think Customer, Product, Store, Location, Asset.


 Designed and developed primarily for a business audience
 Contains around 20-50 entities (or concepts) with no or extremely limited number of attributes described.
Sometimes architects try to limit it to printing on one page.
 Contains relationships between entities, but may or may not include cardinality and nullability.
 Entities will have definitions.
 Designed and developed to be independent of DBMS, data storage locations or technologies.  In fact, it
would address digital and non-digital concepts. This means it would model paper records and artifacts as well as
database artifacts.

Logical Data Model

A logical data model is a fully-attributed data model that is independent of DBMS, technology, data storage or
organizational constraints.  It typically describes data requirements from the business point of view.  While common
data modeling techniques use a relational model notation, there is no requirement that resulting data implementations
must be created using relational technologies.
Common characteristics of a logical data model:

 Typically describes data requirements for a single project or major subject area.
 May be integrated with other logical data models via a repository of shared entities
 Typically contains 100-1000 entities, although these numbers are highly variable depending on the scope of
the data model.
 Contains relationships between entities that address cardinality and nullability (optionality) of the
relationships.
 Designed and developed to be independent of DBMS, data storage locations or technologies.  In fact, it may
address digital and non-digital concepts.
 Data attributes will typically have datatypes with precisions and lengths assigned.
 Data attributes will have nullability (optionality) assigned.
 Entities and attributes will have definitions.
 All kinds of other meta data may be included (retention rules, privacy indicators,  volumetrics, data lineage,
etc.) In fact, the diagram of a logical data model may show only a tiny percentage of the meta data contained
within the model.
A logical data model will normally be derived from and/or linked back to objects in a conceptual data model.

Physical Data Model

A physical data model is a fully-attributed data model that is dependent upon a specific version of a data persistence
technology.  The target implementation technology may be a relational DBMS, an XML document, a NoSQL data storage
component, a spreadsheet or any other data implementation option.
Common characteristics of a physical data model:

 Typically describes data requirements for a single project or application. Sometimes even a portion of an
application.
 May be integrated with other physical data models via a repository of shared entities
 Typically contains 10-1000 tables, although these numbers are highly variable depending on the scope of the
data model.
 Contains relationships between tables that address cardinality and nullability (optionality) of the
relationships.
 Designed and developed to be dependent on a specific version of a DBMS, data storage location or
technology.
 Columns will have datatypes with precisions and lengths assigned.
 Columns will have nullability (optionality) assigned.
 Tables and columns will have definitions.
Q.2. Draw an E-R diagram of University by determining entities of interest and the relationships that exist between
these entities.

Ans:

Q.3. Compare OODBMS and DBMS.


Ans: DBMS refers to any Database Management System. The most popular DBMSs are relational database management
systems, in which we store everything as relations between entities. Entities are tables.

E.g., Customer entity data is stored in the CUSTOMER table and Order entity data is stored in the ORDER table. We then
establish a relation between the CUSTOMER and ORDER tables by using a foreign key.

OODBMS stands for Object Oriented Database Management System. In an OODBMS we store data in object form. One
object can be composed of other objects, and an object can inherit from another object. We use an OODBMS with
object-oriented programming languages.

For Java programming, we use the Hibernate framework, which helps us map the object domain of our program to a
relational database. This is one of the most popular approaches these days.

Q.4. Explain different types of keys. What are foreign key constraints? Why is such constraint important?

What are Keys?

A DBMS key is an attribute or a set of attributes that helps you identify a row (tuple) in a relation (table). Keys allow
you to find the relation between two tables, and they help you uniquely identify a row in a table by a combination of one
or more columns in that table.

Various Keys in Database Management System

A DBMS has the following seven types of keys, each with its own functionality:

 Super Key
 Primary Key
 Candidate Key
 Alternate Key
 Foreign Key
 Compound Key
 Composite Key

What is the Super key?

A super key is a set of one or more attributes that identifies rows in a table. A super key may have additional
attributes that are not needed for unique identification.

What is a Primary Key?

A PRIMARY KEY is a column or group of columns in a table that uniquely identifies every row in that table. The primary
key can't contain duplicates, meaning the same value can't appear more than once in the table, and a table cannot have
more than one primary key.

Rules for defining Primary key:

 Two rows can't have the same primary key value.
 Every row must have a primary key value.
 The primary key field cannot be null.
 The value in a primary key column can never be modified or updated if any foreign key refers to that primary
key.

What is the Alternate key?


An ALTERNATE KEY is a column or group of columns in a table that uniquely identifies every row in that table. A table can
have multiple candidate keys to choose from for the primary key, but only one can be set as the primary key. All the
candidate keys which are not chosen as the primary key are called alternate keys.

What is a Candidate Key?

A CANDIDATE KEY is a set of attributes that uniquely identifies the tuples in a table. A candidate key is a minimal super
key, i.e., one with no redundant attributes. The primary key should be selected from among the candidate keys. Every
table must have at least one candidate key; a table can have multiple candidate keys but only a single primary key.

Properties of Candidate key:

 It must contain unique values


 Candidate key may have multiple attributes
 Must not contain null values
 It should contain minimum fields to ensure uniqueness
 Uniquely identify each record in a table

What is the Foreign key?

A FOREIGN KEY is a column (or group of columns) that creates a relationship between two tables. The purpose of foreign
keys is to maintain data integrity and allow navigation between two different instances of an entity. A foreign key acts as
a cross-reference between two tables, as it references the primary key of another table.

The purpose of the foreign key constraint is to enforce referential integrity: the DBMS rejects any value in the foreign key
column that does not exist in the referenced primary key column. There are also performance benefits to be had by
including foreign keys in your database design.
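As a minimal SQL sketch of these ideas (the CUSTOMER and ORDERS tables and their columns are hypothetical, chosen only for illustration), a foreign key constraint ties each order to an existing customer:

    CREATE TABLE customer (
        customer_id INT PRIMARY KEY,          -- primary key: uniquely identifies each customer
        name        VARCHAR(100) NOT NULL
    );

    CREATE TABLE orders (
        order_id    INT PRIMARY KEY,
        customer_id INT NOT NULL,
        order_date  DATE,
        -- foreign key constraint: every order must reference an existing customer,
        -- which is how the DBMS enforces referential integrity between the two tables
        FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
    );

    -- This insert would be rejected if no customer with customer_id 99 exists:
    -- INSERT INTO orders (order_id, customer_id) VALUES (1, 99);

With the constraint in place, the DBMS also rejects deleting a customer row that still has orders referencing it (unless an ON DELETE action is specified), which is exactly the referential integrity described above.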

Q.5. Explain the concept of Generalization and Aggregation with appropriate examples.
Ans: Generalization and aggregation in the ER model are used for data abstraction, in which an abstraction mechanism is
used to hide the details of a set of objects.
Generalization is the process of extracting common properties from a set of entities and creating a generalized entity
from them. It is a bottom-up approach in which two or more entities can be generalized to a higher-level entity if they
have some attributes in common. For example, STUDENT and FACULTY can be generalized into a higher-level PERSON
entity that carries the common attributes, such as name and address.
An ER diagram is not capable of representing a relationship between an entity and a relationship, which may be required
in some scenarios. In those cases, the relationship together with its corresponding entities is aggregated into a higher-level entity.
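One common way to realize generalization in a relational schema is to keep the shared attributes in a higher-level entity and let the lower-level entities reference it. The sketch below assumes hypothetical PERSON, STUDENT and FACULTY tables purely for illustration:

    -- Generalized (higher-level) entity holding the common attributes
    CREATE TABLE person (
        person_id INT PRIMARY KEY,
        name      VARCHAR(100),
        address   VARCHAR(200)
    );

    -- Specialized (lower-level) entities keep only their own attributes
    -- and share the key of the generalized entity
    CREATE TABLE student (
        person_id INT PRIMARY KEY REFERENCES person (person_id),
        roll_no   VARCHAR(20)
    );

    CREATE TABLE faculty (
        person_id INT PRIMARY KEY REFERENCES person (person_id),
        salary    DECIMAL(10, 2)
    );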

Q.6.What are different types of relational query languages? Explain different techniques for optimizing the queries.
Ans: Relational query languages use relational algebra to break down user requests and instruct the DBMS to execute
them. A query language is the language by which the user communicates with the database. Relational query languages
can be procedural or non-procedural.
Procedural Query Language
A procedural query language consists of a set of queries instructing the DBMS to perform various transactions in
sequence to meet the user request. For example, a get_CGPA procedure will have various queries to get the marks of
the student in each subject, calculate the total marks, and then decide the CGPA based on the total marks. A procedural
query language tells the database both what is required from it and how to get it.
Relational algebra is a procedural query language.

Non-Procedural Query Language


Non-procedural queries consist of a single query on one or more tables to get the result from the database. For example,
getting the name and address of the student with a particular ID requires a single query on the STUDENT table. Relational
calculus is a non-procedural language which states what to do with the tables but does not state how to accomplish it.
These query languages basically consist of queries on tables in the database. In a relational database, a table is known
as a relation, the records/rows of a table are referred to as tuples, and the columns of a table are also known as
attributes. All these names are used interchangeably in relational databases.

There are two methods of query optimization.

1. Cost based Optimization (Physical)

This is based on the cost of the query. The query can use different paths based on indexes, constraints, sorting methods
etc. This method mainly uses the statistics like record size, number of records, number of records per block, number of
blocks, table size, whether whole table fits in a block, organization of tables, uniqueness of column values, size of
columns etc.

Suppose we have a series of tables joined in a query:

T1 ⋈ T2 ⋈ T3 ⋈ T4 ⋈ T5 ⋈ T6
For the above query we can have any order of evaluation: we can start by taking any two tables in any order and evaluate
the query from there. In general, there are (2(n-1))! / (n-1)! possible join orderings. For example, if we have 5
tables involved in the join, then we can have 8! / 4! = 1680 combinations. But when the query optimizer runs, it does not
always evaluate all of these orderings. It uses dynamic programming, in which it generates the cost for the join orders of
every combination of tables. The cost is calculated and generated only once: the least cost for each table combination is
stored in the database and used in the future. That is, if we have a set of tables T = {T1, T2, T3, ..., Tn}, it generates the
least-cost combination for all the tables and stores it.
 Dynamic Programming
As we learnt above, the least cost for the joins of any combination of tables is generated here. These values are stored in
the database, and when those tables are used in a query, this combination is selected for evaluating the query.
While generating the cost, it follows the steps below:
Suppose we have a set of tables, T = {T1, T2, T3, ..., Tn}, in a DB. The optimizer picks the first table and computes the cost
of joining it with each of the remaining tables in set T. It calculates the cost for each of the tables and then chooses the
best cost. It continues doing the same with the rest of the tables in set T. It will generate 2^n - 1 cases, and it selects the
lowest cost and stores it. When a query uses those tables, it checks the costs here, and that combination is used to
evaluate the query. This is called dynamic programming.
In this method, the time required to find the optimized query is on the order of 3^n, where n is the number of tables.
Suppose we have 5 tables; then the time required is 3^5 = 243, which is less than finding all the combinations of tables
and then deciding the best combination (1680). Also, the space required for computing and storing the cost is on the
order of 2^n. In the above example, it is 2^5 = 32.
 Left Deep Trees
This is another method of determining the cost of the joins. Here, the tables and joins are represented in the form of a
tree. A join always forms the root of the tree and a base table is kept at the right-hand side of the root, while the
left-hand side of the root always points to the next join. Hence the tree gets deeper and deeper on the left, which is why
it is called a left deep tree.

Here, instead of calculating the best join cost for a set of tables, the best cost of joining with each individual table is
calculated. In this method, the time required to find the optimized query is on the order of n·2^n, where n is the number
of tables. Suppose we have 5 tables; then the time required is 5·2^5 = 160, which is less than with dynamic programming.
Also, the space required for computing and storing the cost is on the order of 2^n. In the above example, it is 2^5 = 32,
the same as dynamic programming.
 Interesting Sort Orders
This method is an enhancement of dynamic programming. Here, while calculating the best join-order costs, it also
considers sorted tables, on the assumption that calculating the join orders on sorted tables is more efficient. That is,
suppose we have unsorted tables T1, T2, T3, ..., Tn and we have a join on these tables:
(T1 ⋈ T2) ⋈ T3 ⋈ ... ⋈ Tn
This method uses the hash join or merge join method to calculate the cost. A hash join simply joins the tables, whereas a
merge join produces sorted output but is costlier than a hash join. Even though the merge join is costlier at this stage,
when the evaluation moves on to the join with the third table, that join requires less effort to sort its inputs, because the
first input is already the sorted result of joining the first two tables. Hence it can reduce the total cost of the query.
However, the number of tables involved in a join is usually relatively small, so this cost/space difference is hardly
noticeable.
All these cost-based optimizations are expensive and are suitable for large amounts of data. There is another method of
optimization, called heuristic optimization, which is less expensive than cost-based optimization.
2. Heuristic Optimization (Logical)
This method is also known as rule-based optimization. It is based on equivalence rules for relational expressions; hence
the number of query combinations that must be considered is reduced, and the cost of optimizing the query is reduced
as well.
This method creates a relational tree for the given query based on the equivalence rules. These equivalence rules, by
providing alternative ways of writing and evaluating the query, give a better path for evaluating it. A rule is not
guaranteed to help in every case, so the result needs to be examined after the rules have been applied. The most
important set of rules followed in this method is listed below:
 Perform all the selection operations as early as possible in the query. This should be the first and foremost set of
actions on the tables in the query. By performing the selection operations first, we reduce the number of records
involved in the query, rather than using the whole tables throughout the query.
Suppose we have a query to retrieve the students with age 18 who are studying in class DESIGN_01. We can get the
student details from the STUDENT table and the class details from the CLASS table. We can write this query in two
different ways, as sketched below.
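The two formulations might look like the following SQL sketch; the STUDENT and CLASS tables and their columns (AGE, CLASS_ID, CLASS_NAME) are assumed here for illustration:

    -- Query 1: join first, then filter (less efficient to evaluate naively)
    SELECT s.*
    FROM   student s JOIN class c ON s.class_id = c.class_id
    WHERE  s.age = 18 AND c.class_name = 'DESIGN_01';

    -- Query 2: filter each table first, then join the much smaller intermediate results
    SELECT s.*
    FROM   (SELECT * FROM student WHERE age = 18) s
           JOIN (SELECT * FROM class WHERE class_name = 'DESIGN_01') c
             ON s.class_id = c.class_id;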

Both queries return the same result. But when we observe them closely, we can see that the first query joins the two
tables first and then applies the filters; it traverses the whole tables to perform the join, so the number of records
involved is larger. The second query applies the filters on each table first, which reduces the number of records from
each table (in the CLASS table, the number of records reduces to one in this case!), and only then joins these
intermediate results. Hence the cost in this case is comparatively less.

Rather than working on the written query directly, the optimizer creates the corresponding relational algebra expression
and tree for the above case.
 Perform all the projections as early as possible in the query. This is similar to selection, but it reduces the number
of columns in the query.
Suppose, for example, we have to select only the student name, address and class name of students with age 18 from
the STUDENT and CLASS tables.

Here again, both queries look alike and produce the same results. But when we compare the number of records and
attributes involved at each stage, the second query uses fewer records and attributes and is hence more efficient.

 The next step is to perform the most restrictive joins and selection operations. The most restrictive joins and
selections are those on the tables and views that result in a comparatively small number of records.
Any query performs better when tables with few records are joined. Hence, throughout the heuristic
method of optimization, the rules are designed to produce a small number of records at each stage, so that query
performance is better. The same applies here.

Suppose we have STUDENT, CLASS and TEACHER tables. Any student can attend only one class in an academic year and
only one teacher takes a class. But a class can have more than 50 students. Now we have to retrieve STUDENT_NAME,
ADDRESS, AGE, CLASS_NAME and TEACHER_NAME of each student in a school.

∏_{STD_NAME, ADDRESS, AGE, CLASS_NAME, TEACHER_NAME} ((STUDENT ⋈_{CLASS_ID} CLASS) ⋈_{TECH_ID} TEACHER)   (not so efficient)

∏_{STD_NAME, ADDRESS, AGE, CLASS_NAME, TEACHER_NAME} (STUDENT ⋈_{CLASS_ID} (CLASS ⋈_{TECH_ID} TEACHER))   (efficient)

In the first query, the records of students are joined with each class first. This results in a very large intermediate table,
which is then joined with another small table, so the number of records traversed is larger. In the second query, CLASS
and TEACHER are joined first, which is a one-to-one relationship here; the small intermediate result is then joined with
STUDENT to give the final result. Hence the second method is more efficient.
 Sometimes we can combine the above heuristic steps with the cost-based optimization technique to get better results.
None of these methods is always the best choice. The result also depends on the table size, column size, type of selection,
projection, join, sort order, constraints, indexes, statistics, etc. The rules above describe good general ways of optimizing
queries.
Q.7. Explain select, project and join operations with examples

Ans:

Select Operation

A select operation reduces the length of a table by filtering out unwanted rows. By specifying conditions in
the where clause, the user can filter unwanted rows out of the result set, in sum, the select operation reduces the
results vertically.

For example, find all employees born after 1st Jan 1950:

 SELECT dob > '01/JAN/1950' (employee)

Project Operation

Just as the select operation reduces the number of rows, the project operation reduces the number of columns. The
column names specified in the SQL select determine those columns that are displayed. In sum, the project operation
reduces the size of the result set horizontally.

For example, display only the name and date of birth of each employee:

 PROJECT name, dob (employee)

Join Operation

A join operation is used to relate two or more independent tables that share a common column. In a join, two or more
independent tables are merged according to a common column value.
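In SQL terms, the three operations might look like the sketch below; the employee and department tables and their columns (ename, dob, deptno, dname) are assumed for illustration:

    -- Select (restrict): filter rows, e.g. employees born after 1st Jan 1950
    SELECT * FROM employee WHERE dob > DATE '1950-01-01';

    -- Project: keep only certain columns
    SELECT ename, dob FROM employee;

    -- Join: merge two tables on the common column deptno
    SELECT e.ename, d.dname
    FROM   employee e JOIN department d ON e.deptno = d.deptno;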

Q.8. Describe the three-level architecture of DBMS. Explain how it leads to data independence.
Ans: The goal of the three-schema architecture is to separate the user applications from the physical database. In this
architecture, schemas can be defined at the following three levels:

A) Physical Data Level

The physical schema describes details of how data is stored: files, indices, etc. on the random access disk system.  It also
typically describes the record layout of files and type of files (hash, b-tree, flat).

Early applications worked at this level - explicitly dealt with details. E.g., minimizing physical distances between related
data and organizing the data structures within the file (blocked records, linked lists of blocks, etc.)

Problem:

 Routines are hardcoded to deal with physical representation.


 Changes to data structures are difficult to make.
 Application code becomes complex since it must deal with details.
 Rapid implementation of new features very difficult.

B) Conceptual Data Level

Also referred to as the logical level, the conceptual level hides the details of the physical level.

 In the relational model, the conceptual schema presents data as a set of tables.

The DBMS maps data access between the conceptual to physical schemas automatically.

 Physical schema can be changed without changing application:


 DBMS must change mapping from conceptual to physical.
 Referred to as physical data independence.

C) External Data Level

In the relational model, the external schema also presents data as a set of relations. An external schema specifies
a view of the data in terms of the conceptual level, tailored to the needs of a particular category of users. Portions of the
stored data should not be seen by some users; the external level thus begins to implement a level of security and
simplifies the view for these users.

Examples:

 Students should not see faculty salaries.


 Faculty should not see billing or payment data.

Information that can be derived from stored data might be viewed as if it were stored.

 GPA not stored, calculated when needed.
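An external schema is commonly implemented with views. A minimal SQL sketch, assuming a hypothetical faculty table:

    -- Conceptual level: the full table, including the sensitive salary column
    CREATE TABLE faculty (
        faculty_id INT PRIMARY KEY,
        name       VARCHAR(100),
        dept       VARCHAR(50),
        salary     DECIMAL(10, 2)
    );

    -- External level: a view for students that hides salaries
    CREATE VIEW faculty_public AS
        SELECT faculty_id, name, dept
        FROM   faculty;

    -- Student-facing applications query the view; it is computed when accessed, not stored
    SELECT name, dept FROM faculty_public;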

Applications are written in terms of an external schema. The external view is computed when accessed.  It is not stored.
Different external schemas can be provided to different categories of users. Translation from external level to
conceptual level is done automatically by DBMS at run time. The conceptual schema can be changed without changing
application:

 Mapping from external to conceptual must be changed.


 Referred to as conceptual data independence.

Further the concept of Data Independence has 2 parts which are as following:

Logical data independence

 Immunity of external models to changes in the logical model


 Occurs at user interface level

Physical data independence

 Immunity of logical model to changes in internal model


 Occurs at logical interface level

Q.9. How is concurrency performed? Explain the protocols that are used to maintain concurrency.

Ans: Concurrency control is the procedure in a DBMS for managing simultaneous operations without them conflicting
with one another. Concurrent access is quite easy if all users are just reading data; there is no way they can interfere with
one another. Any practical database, though, has a mix of READ and WRITE operations, and hence concurrency is a
challenge.

Concurrency control is used to address such conflicts which mostly occur with a multi-user system. It helps you to make
sure that database transactions are performed concurrently without violating the data integrity of respective databases.

Therefore, concurrency control is an essential element for the proper functioning of a system where two or more
database transactions that require access to the same data are executed simultaneously.

Concurrency Control Protocols

Different concurrency control protocols offer different trade-offs between the amount of concurrency they allow and
the amount of overhead that they impose.

 Lock-Based Protocols
 Two-Phase Locking Protocol
 Timestamp-Based Protocols
 Validation-Based Protocols

Lock-based Protocols

A lock is a data variable which is associated with a data item. The lock signifies which operations can be performed on
the data item. Locks help synchronize access to the database items by concurrent transactions.

All lock requests are made to the concurrency-control manager. Transactions proceed only once the lock request is
granted.

Binary Locks: A binary lock on a data item can be in either a locked or an unlocked state.

Shared/exclusive: This type of locking mechanism separates the locks based on their uses. If a lock is acquired on a data
item to perform a write operation, it is called an exclusive lock.

1. Shared Lock (S):


A shared lock is also called a read-only lock. With a shared lock, the data item can be shared between transactions,
because no transaction is permitted to update the data item while a shared lock is held on it.

For example, consider a case where two transactions are reading the account balance of a person. The database will let
them read by placing a shared lock. However, if another transaction wants to update that account's balance, shared lock
prevents it until the reading process is over.

2. Exclusive Lock (X):

With an exclusive lock, a data item can be both read and written. The lock is exclusive in that it can't be held concurrently
with any other lock on the same data item. An X-lock is requested using the lock-x instruction. Transactions may unlock
the data item after finishing the 'write' operation.

For example, when a transaction needs to update the account balance of a person, you can allow this transaction by
placing an X-lock on the item. When a second transaction then wants to read or write the same item, the exclusive lock
prevents that operation until the first transaction releases the lock.
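In SQL, these locks are usually taken implicitly, but many systems also expose them explicitly. A sketch in the PostgreSQL/MySQL style (the account table is assumed, and the exact locking clauses vary between DBMSs):

    START TRANSACTION;
    -- Shared (read) lock: other transactions may still read the row,
    -- but cannot update it until this transaction ends
    SELECT balance FROM account WHERE account_id = 1 FOR SHARE;
    COMMIT;

    START TRANSACTION;
    -- Exclusive (write) lock: no other transaction may lock or update the row until commit
    SELECT balance FROM account WHERE account_id = 1 FOR UPDATE;
    UPDATE account SET balance = balance - 100 WHERE account_id = 1;
    COMMIT;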

3. Simplistic Lock Protocol

This type of lock-based protocol allows a transaction to obtain a lock on every object before beginning its operations.
Transactions may unlock a data item after finishing the 'write' operation.

4. Pre-claiming Locking

The pre-claiming lock protocol evaluates the operations and creates a list of the required data items that are needed to
initiate the execution process. When all locks are granted, the transaction executes; after that, all locks are released
when all of its operations are over.

Starvation

Starvation is the situation when a transaction needs to wait for an indefinite period to acquire a lock.

Following are the reasons for Starvation:

 When the waiting scheme for locked items is not properly managed
 In the case of a resource leak
 When the same transaction is selected as a victim repeatedly

Deadlock

Deadlock refers to a specific situation where two or more processes are waiting for each other to release a resource or
more than two processes are waiting for the resource in a circular chain.

Two Phase Locking (2PL) Protocol

The Two-Phase Locking protocol, also known as the 2PL protocol, is a locking protocol in which a transaction must not
acquire any new lock after it has released one of its locks.

This locking protocol divides the execution of a transaction into three different parts.

 In the first phase, as the transaction begins to execute, it acquires the locks it needs.
 The second part is the point at which the transaction has obtained all its locks. As soon as the transaction releases
its first lock, the third phase starts.
 In this third phase, the transaction cannot demand any new locks. Instead, it only releases the acquired locks.

The Two-Phase Locking protocol allows each transaction to make a lock or unlock request in two steps:

 Growing Phase: In this phase transaction may obtain locks but may not release any locks.
 Shrinking Phase: In this phase, a transaction may release locks but not obtain any new lock

It is true that the 2PL protocol offers serializability. However, it does not ensure that deadlocks do not happen.

Local and global deadlock detectors search for deadlocks and resolve them by rolling the affected transactions back to
their initial states.

Strict Two-Phase Locking Method

Strict two-phase locking is almost similar to 2PL. The only difference is that Strict-2PL never releases a lock immediately
after using it; it holds all the locks until the commit point and releases them all in one go when the transaction is over.

Centralized 2PL

In Centralized 2PL, a single site is responsible for the lock management process. There is only one lock manager for the
entire DBMS.

Primary copy 2PL

In the primary copy 2PL mechanism, lock managers are distributed to different sites, and each lock manager is
responsible for managing the locks for a particular set of data items. When the primary copy has been updated, the
change is propagated to the slaves.

Distributed 2PL

In this kind of two-phase locking mechanism, lock managers are distributed to all sites, and each is responsible for
managing the locks for the data at its site. If no data is replicated, it is equivalent to primary copy 2PL. The communication
costs of distributed 2PL are considerably higher than those of primary copy 2PL.

Timestamp-based Protocols

The timestamp-based algorithm uses a timestamp to serialize the execution of concurrent transactions. This protocol
ensures that every pair of conflicting read and write operations is executed in timestamp order. The protocol uses the
system time or a logical counter as the timestamp.
The older transaction is always given priority in this method. It uses the system time to determine the timestamp of a
transaction. This is one of the most commonly used concurrency protocols.

Lock-based protocols help you to manage the order between the conflicting transactions when they will execute.
Timestamp-based protocols manage conflicts as soon as an operation is created.

Q.10. What problems occur in the database when transactions do not satisfy ACID properties? Explain explicitly with
suitable examples.
A transaction is a single logical unit of work which accesses and possibly modifies the contents of a database.
Transactions access data using read and write operations.
In order to maintain consistency in a database, before and after the transaction, certain properties are followed. These
are called ACID properties.

Atomicity
By this, we mean that either the entire transaction takes place at once or doesn’t happen at all. There is no midway i.e.
transactions do not occur partially. Each transaction is considered as one unit and either runs to completion or is not
executed at all. It involves the following two operations.
—Abort: If a transaction aborts, changes made to database are not visible.
—Commit: If a transaction commits, changes made are visible.
Atomicity is also known as the ‘All or nothing rule’.

Consider the following transaction T consisting of T1 and T2: Transfer of 100 from account X to account Y.

If the transaction fails after the completion of T1 but before the completion of T2 (say, after write(X) but before write(Y)),
then the amount has been deducted from X but not added to Y. This results in an inconsistent database state. Therefore,
the transaction must be executed in its entirety in order to ensure the correctness of the database state.
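As a SQL sketch, the transfer is wrapped in a single transaction so that either both updates take effect or neither does (the account table and column names are assumed for illustration):

    START TRANSACTION;
    UPDATE account SET balance = balance - 100 WHERE account_no = 'X';
    UPDATE account SET balance = balance + 100 WHERE account_no = 'Y';
    COMMIT;    -- both changes become permanent and visible together

    -- If anything fails between the two updates, the DBMS rolls the transaction back
    -- (ROLLBACK), so the deduction from X is undone and the database stays consistent.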
 
Consistency
This means that integrity constraints must be maintained so that the database is consistent before and after the
transaction. It refers to the correctness of a database. Referring to the example above,
The total amount before and after the transaction must be maintained.
Total before T occurs = 500 + 200 = 700.
Total after T occurs = 400 + 300 = 700.
Therefore, database is consistent. Inconsistency occurs in case T1 completes but T2 fails. As a result T is incomplete.
 
Isolation
This property ensures that multiple transactions can occur concurrently without leading to an inconsistent database
state. Transactions occur independently, without interference. Changes occurring in a particular transaction are not
visible to any other transaction until the change in that transaction has been written to memory or has been committed.
This property ensures that executing transactions concurrently results in a state that is equivalent to a state achieved if
they had been executed serially in some order.
Let X = 50,000 and Y = 500.
Consider two transactions T and T”.

Suppose T has been executed up to Read(Y) and then T'' starts. As a result, interleaving of operations takes place, due to
which T'' reads the correct value of X but an incorrect value of Y, and the sum computed by
T'': (X + Y = 50,000 + 500 = 50,500)
is thus not consistent with the sum at the end of the transaction:
T: (X + Y = 50,000 + 450 = 50,450).
This results in database inconsistency, due to a loss of 50 units. Hence, transactions must take place in isolation, and
changes should be visible only after they have been made to the main memory.
 
Durability:
This property ensures that once the transaction has completed execution, the updates and modifications to the
database are stored in and written to disk and they persist even if a system failure occurs. These updates now become
permanent and are stored in non-volatile memory. The effects of the transaction, thus, are never lost.
The ACID properties, taken together, provide a mechanism to ensure the correctness and consistency of a database:
each transaction is a group of operations that acts as a single unit, produces consistent results, acts in isolation from
other operations, and makes updates that are durably stored.

Q.11.Explain the deferred and immediate modifications versions of log-based recovery schemes.

Ans: Log-Based Recovery

The most widely used structure for recording database modifications is the log. The log is a sequence of log records and
maintains a history of all update activities in the database. There are several types of log records.
An update log record describes a single database write:

 Transaction identifier.
 Data-item identifier.
 Old value.
 New value.

Other special log records exist to record significant events during transaction processing, such as the start of a
transaction and the commit or abort of a transaction. We denote the various types of log records as:

 <Ti start>. Transaction Ti has started.


 <Ti, Xj, V1, V2> Transaction Ti has performed a write on data item Xj. Xj had value V1 before write, and will have
value V2 after the write.
 < Ti commit> Transaction Ti has committed.
 < Ti abort> Transaction Ti has aborted.

Whenever a transaction performs a write, it is essential that the log record for that write be created before the database
is modified. Once a log record exists, we can output the modification to the database if that is desirable. We also have
the ability to undo a modification that has already been output to the database, by using the old-value field in the log
records.
For log records to be useful for recovery from system and disk failures, the log must reside on stable storage. However,
since the log contains a complete record of all database activity, the volume of data stored in the log may become
unreasonably large.

Deferred Database Modification


The deferred-modification technique ensures transaction atomicity by recording all database modifications in the log,
but deferring all write operations of a transaction until the transaction partially commits (i.e., once the final action of the
transaction has been executed). Then the information in the logs is used to execute the deferred writes. If the system
crashes or if the transaction aborts, then the information in the logs is ignored.
           Let T0 be transaction that transfers $50 from account A to account B:
                             T0: read (A);
                                  A: = A-50;
                                   Write (A);
                                   Read (B);
                                    B: = B + 50;
                                    Write (B).
Immediate Database Modification

The immediate-update technique allows database modifications to be output to the database while the transaction is
still in the active state. These modifications are called uncommitted modifications. In the event of a crash or transaction
failure, the system must use the old-value field of the log records to restore the modified data items.
          Transactions T0 and T1 executed one after the other in the order T0 followed by T1. The portion of the log
containing the relevant information concerning these two transactions appears in the following,
Portion of the system log corresponding to T0 and T1
                             < T0 start >
                                                 < T0, A, 1000, 950 >
                                                  < T0, B, 2000, 2050 >
                              < T0 commit >
                              < T1 start >
                               < T1, C, 700, 600 >
                                 < T1 commit >

Checkpoints

When a system failure occurs, we must consult the log to determine those transactions that need to be redone and
those that need to be undone. Rather than reprocessing the entire log, which is time-consuming and much of it
unnecessary, we can use checkpoints:

 Output onto stable storage all the log records currently residing in main memory.
 Output to the disk all modified buffer blocks.
 Output onto stable storage a log record, <checkpoint>.

Recovery then only needs to process the log records written after the last <checkpoint> record, rather than the entire log.
Q.12. What are triggers? Explain various types of triggers.
Ans: Trigger is a statement that a system executes automatically when there is any modification to the database. In a
trigger, we first specify when the trigger is to be executed and then the action to be performed when the trigger
executes. Triggers are used to specify certain integrity constraints and referential constraints that cannot be specified
using the constraint mechanism of SQL.
Example –
Suppose we are adding a tuple to the 'Donors' table, i.e., some person has donated blood. We can design a trigger that
will automatically add the value of the donated blood to the 'Blood_record' table, as sketched after the list of trigger
types below.
Types of Triggers –
We can define 6 types of triggers for each table:

1. AFTER INSERT: activated after data is inserted into the table.


2. AFTER UPDATE: activated after data in the table is modified.
3. AFTER DELETE: activated after data is deleted/removed from the table.
4. BEFORE INSERT: activated before data is inserted into the table.
5. BEFORE UPDATE: activated before data in the table is modified.
6. BEFORE DELETE: activated before data is deleted/removed from the table.
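For the blood-donation example above, a trigger might look like the following sketch. The Donors and Blood_record tables and their columns are assumed, and the syntax shown is in the MySQL style:

    CREATE TRIGGER add_blood_record
    AFTER INSERT ON Donors            -- fires once for each row inserted into Donors
    FOR EACH ROW
        INSERT INTO Blood_record (blood_group, units)
        VALUES (NEW.blood_group, NEW.units_donated);

Every time a row is added to Donors, the trigger automatically records the donated units in Blood_record without the application having to issue a second statement.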

Q.13. What is two-phase locking and how does it guarantee serializability?

Ans: Two Phase Locking (2PL) Protocol

The Two-Phase Locking protocol, also known as the 2PL protocol, is a locking protocol in which a transaction must not
acquire any new lock after it has released one of its locks.

This locking protocol divides the execution of a transaction into three different parts.

 In the first phase, as the transaction begins to execute, it acquires the locks it needs.
 The second part is the point at which the transaction has obtained all its locks. As soon as the transaction releases
its first lock, the third phase starts.
 In this third phase, the transaction cannot demand any new locks. Instead, it only releases the acquired locks.

The Two-Phase Locking protocol allows each transaction to make a lock or unlock request in two steps:

 Growing Phase: In this phase transaction may obtain locks but may not release any locks.
 Shrinking Phase: In this phase, a transaction may release locks but not obtain any new lock

The 2PL protocol guarantees (conflict) serializability: each transaction has a lock point, the instant at which it acquires
its last lock, and the transactions can be ordered by their lock points; every 2PL schedule is equivalent to the serial
schedule in that order. However, 2PL does not ensure that deadlocks do not happen.

Local and global deadlock detectors search for deadlocks and resolve them by rolling the affected transactions back to
their initial states.
Strict Two-Phase Locking Method

Strict two-phase locking is almost similar to 2PL. The only difference is that Strict-2PL never releases a lock immediately
after using it; it holds all the locks until the commit point and releases them all in one go when the transaction is over.

Centralized 2PL

In Centralized 2PL, a single site is responsible for the lock management process. There is only one lock manager for the
entire DBMS.

Primary copy 2PL

In the primary copy 2PL mechanism, lock managers are distributed to different sites, and each lock manager is
responsible for managing the locks for a particular set of data items. When the primary copy has been updated, the
change is propagated to the slaves.

Distributed 2PL

In this kind of two-phase locking mechanism, lock managers are distributed to all sites, and each is responsible for
managing the locks for the data at its site. If no data is replicated, it is equivalent to primary copy 2PL. The communication
costs of distributed 2PL are considerably higher than those of primary copy 2PL.

Q.14.What is distributed Database System? How it is different from the centralized Database System?

Ans: A database management system is software which is used by organizations to store and manage data in an
efficient manner (Ramamurthy, 2017). This system helps in maintaining the security and integrity of the data. There are
two types of database management system:

 Distributed database system


 Centralized database system

Distributed database system:

A distributed database management system is basically a set of multiple, logically interrelated databases distributed
over a network. It consists of a single logical database that is divided into fragments; each fragment is stored at a
different site under the control of a local DBMS, and the fragments are integrated with each other. The system provides
a mechanism that lets users access the data transparently, as if it were not distributed. Distributed database
management systems are widely used, for example in data warehousing, to access and process the data of many clients
at the same time.

Centralized database management system:

A centralized database is a database that is located, maintained and stored at a single location, such as a mainframe
computer. Data stored in the centralized DBMS is accessed by users distributed across the network. It includes a set of
records which can easily be accessed from any location using a network connection such as a WAN or LAN. Centralized
database systems are commonly used in organizations such as banks, schools and colleges to manage all their data in an
appropriate manner (Onsman, 2018).
Difference:

Centralized DBMS | Distributed DBMS
Data is stored on only one site. | Data is stored on different sites.
Data stored on a single computer can be used by multiple users. | Data is stored over different sites which are connected with each other.
If the centralized system fails, the entire system is halted. | If one of the sites fails, the user can access the data from other sites.
Centralized DBMS is less reliable and reactive. | Distributed DBMS is more reliable and reactive.
Centralized DBMS is less sophisticated. | Distributed DBMS is more sophisticated.

Q.15. Prove that a relation which is in 4 NF must be in BCNF.


Ans: A relation is in 4NF if, for every nontrivial multivalued dependency X →→ Y in F+, X is a superkey of R. So, if a
relation is in BCNF but not in 4NF, then there must exist a nontrivial multivalued dependency (MVD) X →→ Y such that
X is not a superkey. We will show that this contradicts the assumption that the relation is in BCNF and has a candidate
key K consisting of a single attribute (a simple candidate key).
Consider the fact that, in a relation R(T), when a nontrivial MVD X →→ Y holds (assuming, without loss of generality,
that X and Y are disjoint), the MVD X →→ Z must also hold in the same relation, with Z = T - X - Y (that is, Z contains all
the other attributes of the relation). We can now prove that each candidate key must contain at least one attribute of Z
and at least one attribute of Y (so it must contain at least 2 attributes!).
Since we have X →→ Y and X →→ Z, and X is not a candidate key, assume that the claim is false, that is, that there is
a candidate key K which does not contain any attribute of Y (or, by symmetry, any attribute of Z). But, since K is a key, we
have K → Y, with K and Y disjoint.
Now, there is an inference rule which says that, in general, if V →→ W and U → W, where U and W are disjoint, then V →
W.

Applying this rule to our case, since X →→ Y and K → Y, we can conclude that X → Y. But this is a contradiction, since we
have said that R is in BCNF and X is not a superkey.
In other words, if a relation is not in 4NF, then each of its keys must have at least 2 attributes.
Given the initial hypothesis that we have a relation in BCNF with at least one simple candidate key, by the previous
lemma the relation must be in 4NF (otherwise every key would have to consist of at least 2 attributes!).

Q.16. Explain the conceptual modeling of Data Warehouses.


Ans: A data warehouse conceptual data model is nothing but the highest-level relationship between the different entities
(in other words, different tables) in the data model.
Features of Data Warehouse Conceptual Data Model

Following are the features of conceptual data model:

 This is the initial, high-level relation between the different entities in the data model. The conceptual model
includes the important entities and the relationships among them.
 In the data warehouse conceptual data model, you do not specify any attributes for the entities.
 You also do not define any primary keys yet.
Schematic Representation of the Data Warehouse Conceptual Data Model

A schematic figure of such a conceptual data model shows only the entities and the relationships among them.

Q.17. How is data mining different from KDD? Explain various data mining techniques.

Ans: Data, in its raw form, is just a collection of facts, from which little information can be derived. Together with the
development of knowledge discovery methods (data mining and KDD), the value of the data is significantly improved.

Data mining is one of the steps of Knowledge Discovery in Databases (KDD). KDD is a multi-step process that encourages
the conversion of data into useful information. Data mining is the pattern extraction phase of KDD. Data mining can take
on several forms, with the choice influenced by the desired outcomes.
Knowledge Discovery in Databases Steps

Data Selection
KDD is not carried out without human interaction. The choice of the data set, and of the subset of it to analyze, requires
knowledge of the domain from which the data is to be taken. Removing unrelated information elements from the data
set reduces the search space during the data mining phase of KDD. The sample size and structure are also established at
this point, if the data set is to be assessed using a sample of the data.

Pre-processing
Databases often contain incorrect or missing data. During the pre-processing phase, the information is cleaned. This
warrants the removal of outliers, if appropriate; choosing approaches for handling missing data fields; accounting for
time-sequence information; and applicable normalization of the data.

Transformation
In the transformation phase, attempts are made to reduce the number of data elements that have to be assessed while
preserving the quality of the information. During this stage, information is organized, converted from one type to another
(e.g., changing nominal to numeric), and new or 'derived' attributes are defined.

Data mining
Now the data is subjected to one or several data-mining methods such as regression, classification, or clustering. The
data mining part of KDD usually requires repeated, iterative application of particular data mining methods. Different
data-mining techniques or models can be used depending on the expected outcome.

Evaluation
The final step is documentation and interpretation of the outcomes of the previous steps. Steps during this phase might
consist of returning to a previous step in the KDD process to help refine the acquired knowledge, or converting the
knowledge into a form understandable to the user. In this stage the extracted data patterns are also visualized for further
review.

Q.18. What is an Association Rule? Explain various methods to discover Association Rules.
Ans: Association rules are if-then statements that help to show the probability of relationships between data items
within large data sets in various types of databases. Association rule mining has a number of applications and is widely
used to help discover sales correlations in transactional data or in medical data sets.

How association rules work

Association rule mining, at a basic level, involves the use of machine learning models to analyze data for patterns, or co-
occurrence, in a database. It identifies frequent if-then associations, which are called association rules.

An association rule has two parts: an antecedent (if) and a consequent (then). An antecedent is an item found within the
data. A consequent is an item found in combination with the antecedent.

Association rules are created by searching data for frequent if-then patterns and using the
criteria support and confidence to identify the most important relationships. Support is an indication of how frequently
the items appear in the data. Confidence indicates the number of times the if-then statements are found true. A third
metric, called lift, can be used to compare confidence with expected confidence.

Association rules are calculated from itemsets, which are made up of two or more items. If rules were built by analyzing
all the possible itemsets, there could be so many rules that they would hold little meaning. For that reason, association
rules are typically created from rules that are well represented in the data.
Association rule algorithms

Popular algorithms that use association rules include AIS, SETM, Apriori and variations of the latter.

With the AIS algorithm, itemsets are generated and counted as it scans the data. In transaction data, the AIS algorithm
determines which large itemsets are contained in a transaction, and new candidate itemsets are created by extending
the large itemsets with other items in the transaction data.

The SETM algorithm also generates candidate itemsets as it scans a database, but this algorithm accounts for the
itemsets at the end of its scan. New candidate itemsets are generated the same way as with the AIS algorithm, but the
transaction ID of the generating transaction is saved with the candidate itemset in a sequential structure. At the end of
the pass, the support count of candidate itemsets is created by aggregating the sequential structure. The downside of
both the AIS and SETM algorithms is that each one can generate and count many small candidate itemsets.

Examples of association rules in data mining

A classic example of association rule mining refers to a relationship between diapers and beers. The example, which
seems to be fictional, claims that men who go to a store to buy diapers are also likely to buy beer. Data that would point
to that might look like this:

A supermarket has 200,000 customer transactions. About 4,000 transactions, or about 2% of total transactions, include
the purchase of diapers. About 5,500 transactions (2.75%) include the purchase of beer. Of those, about 3,500
transactions, 1.75%, include both the purchase of diapers and beer. Based on the percentages, that number should be
much lower. However, the fact that about 87.5% of diaper purchases include the purchase of beer indicates a link
between diapers and beer.
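Using the standard definitions of support, confidence and lift, the measures for the rule diapers ⇒ beer work out as follows for the numbers above:

\[
\begin{aligned}
\text{support}(\text{diapers} \Rightarrow \text{beer}) &= \frac{3500}{200000} = 1.75\% \\
\text{confidence}(\text{diapers} \Rightarrow \text{beer}) &= \frac{3500}{4000} = 87.5\% \\
\text{lift}(\text{diapers} \Rightarrow \text{beer}) &= \frac{0.875}{5500/200000} = \frac{0.875}{0.0275} \approx 31.8
\end{aligned}
\]

If the two purchases were independent, only about 0.02 × 0.0275 × 200,000 = 110 transactions would be expected to contain both items, so a lift far above 1 confirms a strong association.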

Q.19. What is clustering? What are the various clustering techniques?


Ans:
 Clustering is a data mining technique used to place data elements into their related groups.
 Clustering is the process of partitioning the data (or objects) into classes such that the data in one class are
more similar to each other than to those in other clusters.
 Each of the subclasses produced by the partitioning is called a cluster.
 A cluster consists of data objects with high intra-cluster similarity and low inter-cluster similarity.
 The quality of a cluster depends on the method used.
 Clustering is also called data segmentation, because it partitions large data sets into groups according
to their similarity.

Clustering can be helpful in many fields, such as:


1. Marketing:
Clustering helps to find groups of customers with similar behaviour from a given set of customer records.

2. Biology:
Classification of plants and animals according to their features.

3. Library:
Clustering is very useful for ordering and arranging books.
Types of clustering

Clustering methods can be classified into the following categories:

1. Partitioning
In this approach, several partitions are created and then evaluated based on given criteria.

2.  Hierarchical method
In this method, the set of data objects is decomposed hierarchically (into multiple levels) by using certain criteria.

3. Density-based method
This method is based on density (density reachability and density connectivity).
       
4. Grid-based methods
This approach is based on multi-resolution grid data structure.

Q.20. Write short notes on:


(i) Multidimensional Data Model
(ii) Data Marts
(iii) OLAP

Ans: (i) A multidimensional model views data in the form of a data-cube. A data cube enables data to be modeled and
viewed in multiple dimensions. It is defined by dimensions and facts.

The dimensions are the perspectives or entities with respect to which an organization keeps records. For example, a shop
may create a sales data warehouse to keep records of the store's sales for the dimensions time, item, and location. These
dimensions allow the store to keep track of things such as the monthly sales of items and the locations at which the
items were sold. Each dimension has a table related to it, called a dimension table, which describes the dimension
further. For example, a dimension table for item may contain the attributes item_name, brand, and type.

A multidimensional data model is organized around a central theme, for example, sales. This theme is represented by a
fact table. Facts are numerical measures. The fact table contains the names of the facts or measures of the related
dimensional tables.
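A minimal SQL sketch of such a model as a star schema (the sales fact table and the item dimension table, with their columns, are assumed for illustration):

    -- Dimension table describing the 'item' dimension
    CREATE TABLE dim_item (
        item_key  INT PRIMARY KEY,
        item_name VARCHAR(100),
        brand     VARCHAR(50),
        type      VARCHAR(50)
    );

    -- Fact table: one row per (time, item, location) combination, holding the numeric measures
    CREATE TABLE sales_fact (
        time_key     INT,
        item_key     INT REFERENCES dim_item (item_key),
        location_key INT,          -- would similarly reference time and location dimension tables
        units_sold   INT,
        dollars_sold DECIMAL(12, 2)
    );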
(ii)
A data mart is a structure / access pattern specific to data warehouse environments, used to retrieve client-facing data.
The data mart is a subset of the data warehouse and is usually oriented to a specific business line or team. Whereas data
warehouses have an enterprise-wide depth, the information in data marts pertains to a single department. In some
deployments, each department or business unit is considered the owner of its data mart, including all
the hardware, software and data. This enables each department to isolate the use, manipulation and development of
its own data. In other deployments where conformed dimensions are used, this business unit ownership will not hold
true for shared dimensions like customer, product, etc.
Organizations build data warehouses and data marts because the information in their operational databases is not
organized in a way that makes it readily accessible; accessing it would require queries that are too complicated, too
difficult to write, or too resource-consuming.
While transactional databases are designed to be updated, data warehouses or marts are read only. Data warehouses
are designed to access large groups of related records. Data marts improve end-user response time by allowing users to
have access to the specific type of data they need to view most often by providing the data in a way that supports the
collective view of a group of users.
A data mart is basically a condensed and more focused version of a data warehouse that reflects the regulations and
process specifications of each business unit within an organization. Each data mart is dedicated to a specific business
function or region. This subset of data may span across many or all of an enterprise’s functional subject areas. It is
common for multiple data marts to be used in order to serve the needs of each individual business unit (different data
marts can be used to obtain specific information for various enterprise departments, such as accounting, marketing,
sales, etc.).

(iii)
An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows managers and
analysts to get an insight into the information through fast, consistent, and interactive access to it. This answer covers
the types of OLAP servers and the operations supported by OLAP.

Types of OLAP Servers

We have four types of OLAP servers −

 Relational OLAP (ROLAP)


 Multidimensional OLAP (MOLAP)
 Hybrid OLAP (HOLAP)
 Specialized SQL Servers

Relational OLAP

ROLAP servers are placed between the relational back-end server and the client front-end tools. To store and manage
warehouse data, ROLAP uses a relational or extended-relational DBMS.
ROLAP includes the following −

 Implementation of aggregation navigation logic.


 Optimization for each DBMS back end.
 Additional tools and services.

Multidimensional OLAP

MOLAP uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional
data stores, the storage utilization may be low if the data set is sparse. Therefore, many MOLAP servers use two levels of
data storage representation to handle dense and sparse data sets.

Hybrid OLAP

Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP and the faster
computation of MOLAP. HOLAP servers allow storing large volumes of detailed information, while the aggregations are
stored separately in a MOLAP store.

Specialized SQL Servers

Specialized SQL servers provide advanced query language and query processing support for SQL queries over star and
snowflake schemas in a read-only environment.

OLAP Operations

Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP operations on multidimensional
data.
Here is the list of OLAP operations −

 Roll-up
 Drill-down
 Slice and dice
 Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −

 By climbing up a concept hierarchy for a dimension


 By dimension reduction
For example:
 Roll-up is performed by climbing up a concept hierarchy for the dimension location.
 Initially the concept hierarchy was "street < city < province < country".
 On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of
country.
 The data is grouped into countries rather than cities.
 When roll-up is performed, one or more dimensions from the data cube are removed.
Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of the following ways −

 By stepping down a concept hierarchy for a dimension


 By introducing a new dimension.
For example:
 Drill-down is performed by stepping down a concept hierarchy for the dimension time.
 Initially the concept hierarchy was "day < month < quarter < year."
 On drilling down, the time dimension is descended from the level of quarter to the level of month.
 When drill-down is performed, one or more dimensions from the data cube are added.
 It navigates the data from less detailed data to highly detailed data.
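Over a relational star schema, roll-up and drill-down correspond to grouping at different levels of the concept hierarchy. A sketch reusing the hypothetical sales_fact table from part (i) of Q.20 and an assumed dim_location dimension table:

    -- Drilled-down view: sales aggregated by city
    SELECT l.city, SUM(f.dollars_sold) AS total_sales
    FROM   sales_fact f JOIN dim_location l ON f.location_key = l.location_key
    GROUP  BY l.city;

    -- Rolled-up view: climb the location hierarchy from city to country
    SELECT l.country, SUM(f.dollars_sold) AS total_sales
    FROM   sales_fact f JOIN dim_location l ON f.location_key = l.location_key
    GROUP  BY l.country;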
Slice
The slice operation selects one particular dimension from a given cube and provides a new sub-cube. For example:
 Here the slice is performed for the dimension "time" using the criterion time = "Q1".
 It forms a new sub-cube by selecting one or more dimensions.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube. For example, a dice operation on
the cube based on the following selection criteria involves three dimensions:

 (location = "Toronto" or "Vancouver")
 (time = "Q1" or "Q2")
 (item = "Mobile" or "Modem")
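On a relational (ROLAP) representation of the cube, the slice and dice operations above amount to WHERE clauses over the dimension columns. A sketch, assuming a hypothetical denormalized sales_cube view with location, quarter and item columns:

    -- Slice: fix a single dimension (time = 'Q1')
    SELECT * FROM sales_cube WHERE quarter = 'Q1';

    -- Dice: restrict several dimensions at once
    SELECT *
    FROM   sales_cube
    WHERE  location IN ('Toronto', 'Vancouver')
      AND  quarter  IN ('Q1', 'Q2')
      AND  item     IN ('Mobile', 'Modem');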
Pivot
The pivot operation is also known as rotation. It rotates the data axes in the view in order to provide an alternative
presentation of the data.
