A Data Model in a Database Management System (DBMS) is the set of concepts and tools
developed to describe the structure of a database. It is commonly classified into
three types.
Data independence is the ability to modify the schema at one level without requiring
the programs and applications to be rewritten. Because the data is separated from the
programs, changes made to the data do not affect program execution or the
applications.
We know the main purpose of the three levels of data abstraction is to achieve data
independence. If the database changes and expands over time, it is very important
that the changes in one level should not affect the data at other levels of the
database. This would save time and cost required when changing the database.
There are two levels of data independence based on three levels of abstraction.
These are as follows −
Physical Data Independence
Logical Data Independence
A DBMS is software that allows access to data stored in a database and provides an
easy and effective method of –
Defining the information.
Storing the information.
Manipulating the information.
Protecting the information from system crashes or data theft.
Differentiating access permissions for different users.
The Structure of Database Management System is also referred to as Overall
System Structure or Database Architecture but it is different from the tier
architecture of Database.
The database system is divided into three components: Query Processor, Storage
Manager, and Disk Storage.
1. Query Processor :
It interprets the requests (queries) received from the end user via an application
program into instructions. It also executes the user requests received from the
DML compiler.
Query Processor contains the following components –
DML Compiler –
It translates the DML statements into low-level instructions that the query
evaluation engine can execute.
DDL Interpreter –
It processes the DDL statements into a set of tables containing metadata (data
about data).
Embedded DML Pre-compiler –
It processes DML statements embedded in an application program into
procedural calls.
Query Optimizer –
It optimizes the instructions generated by the DML compiler, choosing the
lowest-cost evaluation plan before execution.
2. Storage Manager :
Storage Manager is a program that provides an interface between the data stored in the database
and the queries received. It is also known as Database Control System. It maintains the consistency
and integrity of the database by applying the constraints and executes the DCL statements. It is
responsible for updating, storing, deleting, and retrieving data in the database.
Authorization Manager –
It enforces role-based access control, i.e., it checks whether a particular user is
privileged to perform the requested operation or not.
Integrity Manager –
It checks the integrity constraints when the database is modified.
Transaction Manager –
It controls concurrent access by scheduling the operations of the transactions it
receives. Thus, it ensures that the database remains in a consistent state before and
after the execution of a transaction.
File Manager –
It manages the file space and the data structure used to represent information in the database.
Buffer Manager –
It is responsible for cache memory and the transfer of data between the secondary storage and
main memory.
COURSE: DM UNIT: 3 Pg. 19
The overall design of a Database Management System (DBMS) depends on its
architecture. Large amounts of data on web servers, personal computers (PCs), and
other elements are linked over networks using a basic client/server architecture.
PCs and workstations form the client side of the architecture and are connected over
the network. The architecture of a DBMS depends on how its users are linked to the
database.
There are three kinds of DBMS Architecture, which are as follows −
Tier-1 Architecture.
Tier-2 Architecture.
Tier-3 Architecture.
In 1-tier architecture, the DBMS is the only entity: the user works directly on the
DBMS, and any changes are made on the DBMS itself. It does not provide handy tools
for end users. Database designers and programmers normally prefer single-tier
architecture.
If the architecture of DBMS is 2-tier, then it must have an application through which
the DBMS can be accessed. Programmers use 2-tier architecture where they access
the DBMS by means of an application. Here the application tier is entirely
independent of the database in terms of operation, design, and programming.
3-tier Architecture:
A 3-tier architecture separates its tiers from each other based on the complexity of
the users and how they use the data present in the database. It is the most widely
used architecture to design a DBMS.
Life Cycle
Database Designing
The next step involves designing the database according to the user requirements and
splitting it into various models so that heavy load or dependencies are not imposed
on a single aspect. This is where a model-centric approach comes in, and logical and
physical models play a crucial role.
Logical Model - This stage is primarily concerned with developing a model based on
the proposed requirements. The entire model is designed on paper, without any
implementation or DBMS-specific considerations.
Physical Model - The physical model is concerned with the practices and
implementation of the logical model.
CREATE TABLE CUSTOMERS (
    ID INT NOT NULL,
    NAME VARCHAR(20) NOT NULL,
    AGE INT NOT NULL,
    ADDRESS CHAR(25),
    SALARY DECIMAL(18, 2),
    PRIMARY KEY (ID)
);
Example: The attribute dept name appears in both entity sets. Since it is the primary key
for the entity set department, it is redundant in the entity set instructor and needs to be
removed. Removing the attribute dept name from the instructor entity set may appear
rather unintuitive, since the relation instructor that we used in the earlier chapters had
an attribute dept name. As we shall see later, when we create a relational schema from
the E-R diagram, the attribute dept name in fact gets added to the relation instructor, but
only if each instructor has at most one associated department. If an instructor has more
than one associated department, the relationship between instructors and departments is
recorded in a separate relation inst dept. Treating the connection between instructors and
departments uniformly as a relationship, rather than as an attribute of instructor, makes
the logical relationship explicit, and helps avoid a premature assumption that each
instructor is associated with only one department. Similarly, the student entity set is
related to the department entity set through the relationship set student dept, and thus
there is no need for a dept name attribute in student.
Let there be ‘m’ number of entities and ‘n’ number of relationships in a given
E-R diagram.
The number of relations, after converting the E-R diagram to relations, is m+n.
Example: An E-R diagram with 2 entity sets and 1 relationship set converts to
2 + 1 = 3 relations.
Although the basic E-R concepts can model most database features, some aspects of a database
may be more aptly expressed by certain extensions to the basic E-R model. The extended E-R
features are specialization, generalization, higher- and lower-level entity sets, attribute
inheritance, and aggregation.
Specialization – An entity set is broken down into sub-entity sets that are distinct in some
way from other entities in the set. For instance, a subset of entities within an entity set may
have attributes that are not shared by all the entities in the entity set. The E-R model
provides a means for representing these distinctive entity groupings. Specialization is a
top-down approach, where a high-level entity is specialized into two or more lower-level
entities.
Generalization – It is the process of extracting common properties from a set of entities and
creating a generalized entity from them. Generalization is a bottom-up approach, in which two
or more entities can be combined to form a higher-level entity if they have some attributes in
common. In generalization, subclasses are combined to make a superclass.
Inheritance – An entity that is a member of a subclass inherits all the attributes of the
entity as a member of the superclass; the entity also inherits all the relationships in which
the superclass participates. Inheritance is an important feature of generalization and
specialization: it allows lower-level entities to inherit the attributes of higher-level
entities.
Aggregation – In aggregation, the relation between two entities is treated as a single entity. In
aggregation, the relationship with its corresponding entities is aggregated into a higher-level entity.
Domain constraints
Domain constraints can be defined as the definition of a valid set of values for an
attribute.
The data type of domain includes string, character, integer, time, date, currency,
etc. The value of the attribute must be available in the corresponding domain.
Entity integrity constraints
The entity integrity constraint states that primary key value can't be null.
This is because the primary key value is used to identify individual rows in relation
and if the primary key has a null value, then we can't identify those rows.
Columns other than the primary key field can contain null values.
Referential Integrity Constraints
A referential integrity constraint is specified between two tables: a foreign key value in
one table must either match a primary key value in the referenced table or be null.
Key constraints
Keys are attributes that are used to uniquely identify an entity within its entity
set.
An entity set can have multiple keys, out of which one is chosen as the primary
key. A primary key must contain a unique value and can never be null in the relational table.
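These constraints can be exercised with a short sketch using Python's built-in sqlite3 module; the department/student tables and all column names are invented for illustration, and note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite leaves FK checks off by default

conn.execute("""CREATE TABLE department (
    dept_id INTEGER PRIMARY KEY,           -- key constraint: unique identifier
    dept_name TEXT NOT NULL)""")
conn.execute("""CREATE TABLE student (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    dept_id INTEGER REFERENCES department(dept_id))""")

conn.execute("INSERT INTO department VALUES (1, 'CSE')")
conn.execute("INSERT INTO student VALUES (10, 'Asha', 1)")     # valid foreign key

# Key constraint: a duplicate primary key value is rejected.
try:
    conn.execute("INSERT INTO department VALUES (1, 'ECE')")
    pk_violation_allowed = True
except sqlite3.IntegrityError:
    pk_violation_allowed = False

# Referential integrity: a student row pointing at a missing department is rejected.
try:
    conn.execute("INSERT INTO student VALUES (11, 'Ravi', 99)")
    fk_violation_allowed = True
except sqlite3.IntegrityError:
    fk_violation_allowed = False

print(pk_violation_allowed, fk_violation_allowed)   # False False
```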
Querying relational data
Relational algebra is used to break down user requests and instruct the DBMS to execute them.
Relational query languages are used by users to communicate with the database. They are
generally at a higher level than other programming languages.
This is further divided into two types
Procedural Query Language
Non-Procedural Language
Procedural Query Language
The user instructs the system to perform a set of operations on the database to determine the
desired results.
Non-Procedural Language
The user outlines the desired information without giving a specific procedure for attaining the
information.
Relational Algebra
The query language ‘Relational Algebra’ defines a set of operations on relations.
Consider STUD, DEPT and FACULTY DATABASES.
There are five types of operators:
Select (σ): Returns the rows of the input relation that satisfy the predicate.
Projection (π): Outputs the specified attributes from all rows of the input relation.
Duplicate tuples are removed from the output.
Natural Join (⋈): Outputs pairs of rows from the two input relations that have the same
value on all common attributes.
r1 ⋈ r2
Cartesian Product (X) : Outputs all pairs of rows from both input relations.
r1 X r2
Union (U) : Outputs the union of tuples from both the relations.
r1 U r2
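The five operators can be sketched in plain Python, representing a relation as a list of dicts; the STUD and DEPT sample data below are invented for illustration:

```python
def select(rel, pred):                       # sigma: rows satisfying a predicate
    return [r for r in rel if pred(r)]

def project(rel, attrs):                     # pi: chosen attributes, duplicates removed
    seen = []
    for r in rel:
        row = {a: r[a] for a in attrs}
        if row not in seen:
            seen.append(row)
    return seen

def cartesian(r1, r2):                       # x: all pairs of rows
    return [{**a, **{f"r2.{k}": v for k, v in b.items()}} for a in r1 for b in r2]

def natural_join(r1, r2):                    # join: pairs agreeing on common attributes
    common = set(r1[0]) & set(r2[0])
    return [{**a, **b} for a in r1 for b in r2
            if all(a[c] == b[c] for c in common)]

def union(r1, r2):                           # U: tuples from either relation
    return r1 + [r for r in r2 if r not in r1]

STUD = [{"sid": 1, "name": "Asha", "dno": 10},
        {"sid": 2, "name": "Ravi", "dno": 20}]
DEPT = [{"dno": 10, "dname": "CSE"}, {"dno": 20, "dname": "ECE"}]

print(select(STUD, lambda r: r["dno"] == 10))   # only Asha's row
print(project(STUD, ["dno"]))                   # [{'dno': 10}, {'dno': 20}]
print(natural_join(STUD, DEPT))                 # each student paired with its dept
```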
A view is a virtual or logical table that allows users to view or manipulate parts of the
tables. To reduce redundant data to the minimum possible, Oracle allows the creation of an
object called a VIEW.
A view is mapped to a SELECT statement. The table on which the view is based is described in
the FROM clause of the SELECT statement.
Creating view
A view can be created using the CREATE VIEW statement. We can create a view from a single
table or multiple tables.
Syntax:
CREATE VIEW view_name AS
SELECT column1, column2, ...
FROM table_name
WHERE condition;
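A minimal runnable sketch of the syntax above, using Python's sqlite3 module; the customers table and the high_earners view are invented examples:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Asha", 3000.0), (2, "Ravi", 1500.0), (3, "Mina", 4500.0)])

# The view is mapped to a SELECT; its FROM clause names the base table.
conn.execute("""CREATE VIEW high_earners AS
                SELECT id, name FROM customers WHERE salary > 2000""")

print(conn.execute("SELECT name FROM high_earners ORDER BY id").fetchall())
# [('Asha',), ('Mina',)]
```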
Relational algebra specifies procedures and methods to fetch data, and hence is called
a procedural query language, whereas relational calculus is a non-procedural query
language that focuses on what data to fetch rather than on how the query will work or
how the data will be fetched.
Simply put, relational calculus focuses on what to do rather than on how to do it.
Relational calculus is present in two formats –
Tuple relational calculus (TRC)
Domain relational calculus (DRC)
Again, the above query will return the names and ages of the students in the table
Student who are not older than 21 years.
These commands can be used to add, remove or modify tables within a database.
DDL has pre-defined syntax for describing the data.
SQL Datatype
SQL Datatype is used to define the values that a column can contain.
Every column is required to have a name and data type in the database table.
Database Schema
A database schema is the skeleton structure that represents the logical view of the entire
database. It defines how the data is organized and how the relations among them are associated.
It formulates all the constraints that are to be applied on the data.
A database schema defines its entities and the relationship among them. It contains a descriptive
detail of the database, which can be depicted by means of schema diagrams. It’s the database
designers who design the schema to help programmers understand the database and make it
useful.
The fundamental structure of SQL queries includes three clauses that are select,
from, and where clause. What we want in the final result relation is specified in
the select clause. Which relations we need to access to get the result is specified
in from clause. How the relation must be operated to get the result is specified in
the where clause.
• In the select clause, you have to specify the attributes that you want to see in the
result relation
• In the from clause, you have to specify the list of relations that has to be accessed
for evaluating the query.
• The where clause involves a predicate that includes attributes of the relations that
we have listed in the from clause.
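The three clauses can be seen together in a small sqlite3 sketch; the student table below is an invented example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO student VALUES (?, ?, ?)",
                 [(1, "Asha", 20), (2, "Ravi", 23), (3, "Mina", 21)])

rows = conn.execute(
    "SELECT name, age "      # select clause: attributes in the result
    "FROM student "          # from clause: relations to access
    "WHERE age <= 21"        # where clause: predicate over those relations
).fetchall()
print(rows)   # the two students aged 21 or under
```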
A special value supported by SQL, called null, is used to represent attribute values
that are unknown or do not apply to a particular row.
For example, if the age of a particular student is not available in the age column of
the student table, it is represented as null, not as zero.
It is important to know that a null value is always different from a zero value.
A null value is used to represent the following different interpretations
Value unknown (value exists but is not known)
Value not available (exists but is purposely hidden)
Attribute not applicable (undefined for that row)
A nested query is a query that has another query embedded within it. The
embedded query is called a subquery.
A subquery typically appears within the WHERE clause of a query. It can
sometimes appear in the FROM clause or HAVING clause.
Example:
Find the names of employees who have regno = 103.
The query is as follows −
select E.ename from employee E where E.eid IN (select S.eid from salary S where
S.regno=103);
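The same subquery can be run end to end with Python's sqlite3 module; the employee and salary schemas below are assumptions made to match the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (eid INTEGER, ename TEXT)")
conn.execute("CREATE TABLE salary (eid INTEGER, regno INTEGER)")
conn.executemany("INSERT INTO employee VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")])
conn.executemany("INSERT INTO salary VALUES (?, ?)", [(1, 103), (2, 101)])

# The subquery in the WHERE clause finds the eids with regno = 103 first;
# the outer query then selects the matching employee names.
names = conn.execute(
    "SELECT E.ename FROM employee E "
    "WHERE E.eid IN (SELECT S.eid FROM salary S WHERE S.regno = 103)"
).fetchall()
print(names)   # [('Asha',)]
```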
Arithmetic Operators
These operators are used to perform operations such as addition, multiplication,
subtraction etc.
Logical Operators
The logical operators, such as ALL, ANY, NOT, and BETWEEN, are used to combine or
qualify conditions.
The following scalar functions perform an operation on a string input value and return a
string or numeric value:
ASCII, CHAR, CHARINDEX, CONCAT, CONCAT_WS, DIFFERENCE, FORMAT, LEFT, LEN, LOWER,
LTRIM, NCHAR, PATINDEX, QUOTENAME, REPLACE, REPLICATE, REVERSE, RIGHT, RTRIM,
SOUNDEX, SPACE, STR, STRING_AGG, STRING_ESCAPE, STRING_SPLIT, STUFF, SUBSTRING,
TRANSLATE, TRIM, UNICODE, UPPER
Triggers
Triggers are the SQL statements that are automatically executed when there is any change
in the database. The triggers are executed in response to certain events(INSERT, UPDATE
or DELETE) in a particular table. These triggers help in maintaining the integrity of the data
by changing the data of the database in a systematic fashion.
Syntax
create trigger trigger_name
(before | after) [insert | update | delete]
on [table_name]
[for each row]
[trigger_body]
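A runnable trigger sketch using Python's sqlite3 module; note that SQLite wraps the trigger body in BEGIN ... END, and the student/audit_log tables are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE audit_log (msg TEXT)")

# Fires automatically after every INSERT on student; NEW refers to the new row.
conn.execute("""CREATE TRIGGER log_insert AFTER INSERT ON student
                BEGIN
                    INSERT INTO audit_log VALUES ('inserted id ' || NEW.id);
                END""")

conn.execute("INSERT INTO student VALUES (1, 'Asha')")
print(conn.execute("SELECT msg FROM audit_log").fetchall())
# [('inserted id 1',)]
```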
Redundancy means having multiple copies of the same data in the database. This problem
arises when a database is not normalized. Suppose the attributes of a student details
table are: student id, student name, college name, college rank, course opted.
1. Insertion Anomaly –
If the details of a student whose course has not yet been decided must be inserted,
the insertion is not possible until a course is decided for the student.
2. Deletion Anomaly –
If the details of the students in this table are deleted, then the details of the
college are also deleted, which should not happen.
This anomaly occurs when the deletion of a record results in losing some unrelated
information that was stored as part of the deleted record: it is not possible to
delete some information without losing some other information in the table as well.
3. Updation Anomaly –
If the rank of a college changes, the change must be made all over the database,
which is time-consuming and computationally costly.
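The updation anomaly can be demonstrated with a small sqlite3 sketch of the unnormalized student table described above (the actual values are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE student (sid INTEGER, sname TEXT,
                                      college TEXT, college_rank INTEGER)""")
# college_rank is repeated in every row for the same college:
conn.executemany("INSERT INTO student VALUES (?, ?, ?, ?)",
                 [(1, "Asha", "ABC", 7), (2, "Ravi", "ABC", 7)])

# The college's rank changes, but only one of the rows is updated:
conn.execute("UPDATE student SET college_rank = 5 WHERE sid = 1")

ranks = conn.execute(
    "SELECT DISTINCT college_rank FROM student WHERE college = 'ABC'").fetchall()
print(ranks)   # two different ranks for one college -> inconsistent data
```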
There are several problems regarding decomposition in DBMS, mentioned below:
Redundant Storage
The same information stored in many places can confuse the programmers and wastes
space in the system.
Insertion Anomalies
It may not be possible to store certain details unless some other, unrelated
information is stored along with them.
Deletion Anomalies
It may not be possible to delete some details without also eliminating some other,
unrelated information.
A multi-valued dependency occurs when the existence of one or more rows in a table
implies the existence of one or more other rows in the same table.
If a table has attributes P, Q, and R, then Q and R are multi-valued facts of P.
It is represented by a double arrow:
P ->-> Q
In this case, a multivalued dependency exists only if Q and R are independent
attributes.
A table with a multivalued dependency violates 4NF.
Concurrency Control is the management procedure that is required for controlling concurrent
execution of the operations that take place on a database.
But before knowing about concurrency control, we should know about concurrent execution.
Concurrent Execution in DBMS
In a multi-user system, multiple users can access and use the same database at the same
time; this is known as concurrent execution of the database. It means the same database
is accessed simultaneously by different users on a multi-user system.
While working on the database transactions, there occurs the requirement of using the
database by multiple users for performing different operations, and in that case, concurrent
execution of the database is performed.
Simultaneous execution must be done in an interleaved manner, and no operation should
affect the other executing operations, so that the consistency of the database is
maintained. Thus, concurrent execution of transaction operations gives rise to several
challenging problems that need to be solved.
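One classic such problem, the lost update, can be sketched in a few lines of Python that simulate two interleaved transactions (this is a plain simulation, not a real DBMS schedule):

```python
# Two "transactions" each intend to add 50 to the same balance, but both read
# the value before either writes, so one update is silently lost.
balance = 100

t1_read = balance          # T1 reads 100
t2_read = balance          # T2 reads 100 (before T1 has written)
balance = t1_read + 50     # T1 writes 150
balance = t2_read + 50     # T2 overwrites with 150 -- T1's update is lost

print(balance)   # 150, not the expected 200
```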
A deadlock is a condition where two or more transactions wait indefinitely for one
another to give up locks. Deadlock is said to be one of the most feared complications
in DBMS, as no task ever gets finished and every task waits forever.
For example: In the student table, transaction T1 holds a lock on some rows and
needs to update some rows in the grade table. Simultaneously, transaction T2 holds
locks on some rows in the grade table and needs to update the rows in the Student
table held by Transaction T1.
Deadlock Detection
The resource scheduler can detect a deadlock as it keeps track of all the resources that
are allocated to different processes. After a deadlock is detected, it can be resolved
using the following methods:
All the processes involved in the deadlock are terminated. This is not a good
approach as all the progress made by the processes is destroyed.
Resources can be preempted from some processes and given to others till the
deadlock is resolved.
Deadlock Prevention
It is imperative to prevent a deadlock before it can occur. So, the system rigorously
checks each transaction before it is executed to make sure it does not lead to deadlock.
If there is even a chance that a transaction may lead to deadlock, it is never allowed to
execute.
Deadlock Avoidance
It is better to avoid a deadlock rather than take measures after it has occurred. The
wait-for graph can be used for deadlock avoidance. This is, however, only practical for
smaller databases, as the graph can get quite complex in larger databases.
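Deadlock detection with a wait-for graph can be sketched in Python: an edge T1 -> T2 means T1 is waiting for a lock held by T2, and a cycle in the graph means deadlock. The transactions below mirror the student/grade example above:

```python
def has_cycle(graph):
    """Depth-first search for a cycle in a directed wait-for graph."""
    visiting, done = set(), set()

    def dfs(node):
        visiting.add(node)
        for nxt in graph.get(node, []):
            if nxt in visiting:          # back edge -> cycle -> deadlock
                return True
            if nxt not in done and dfs(nxt):
                return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(dfs(n) for n in graph if n not in done)

# T1 waits for T2 (grade-table rows) and T2 waits for T1 (student-table rows):
print(has_cycle({"T1": ["T2"], "T2": ["T1"]}))   # True  -> deadlock
print(has_cycle({"T1": ["T2"], "T2": []}))       # False -> no deadlock
```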
Use of Lock Conversions
ARIES algorithm
Assumptions
– Strict 2PL => no cascaded aborts
– “in place” disk updates: data overwritten on disk
• Page read into buffer, changed in buffer, written out again
• Write of page to disk is atomic
Log:
– Sequential writes on separate disk
– Write differences only
• Multiple updates on single log page
• Each log record has unique Log Sequence Number
– LSN strictly sequential.
Crash recovery is the process by which the database is restored to a consistent and
operational state after a failure. In a DBMS, this is done by rolling back incomplete
transactions and completing committed transactions whose effects were still only in
memory when the crash took place.
With many transactions being executed every second, a DBMS can be a tremendously
complex system. The robustness of the software depends on its design as well as on the
underlying hardware. The system is expected to follow some methodology or technique to
restore lost data when it fails or crashes in the middle of transactions.
System Crash: There are issues that may stop the system unexpectedly from outside and
cause it to crash. For example, a disturbance or interruption in the power supply may
cause the underlying hardware or software to fail.
File Organization refers to the logical relationships among various records that
constitute the file, particularly with respect to the means of identification and access
to any specific record. In simple terms, Storing the files in certain order is called file
Organization. File Structure refers to the format of the label and data blocks and of
any logical control record.
Hash index
This technique is widely used for creating indices in main memory because of its fast
retrieval. It has average O(1) operation complexity and O(n) storage complexity.
In many books, people use the term bucket to denote a unit of storage that stores one
or more records.
•Hash index is suitable for equality or primary key lookup. Queries can benefit from hash
index to get amortized O(1) lookup cost.
For example: SELECT name, id FROM student WHERE id = '1315';
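A hash index can be sketched in Python with a dict mapping each search-key value to a bucket (list) of records; the record contents are invented to match the example query:

```python
from collections import defaultdict

records = [{"id": "1315", "name": "Asha"},
           {"id": "2001", "name": "Ravi"},
           {"id": "1315", "name": "Mina"}]   # a bucket may hold several records

index = defaultdict(list)          # search-key value -> bucket of records
for rec in records:
    index[rec["id"]].append(rec)

# Equivalent of: SELECT name, id FROM student WHERE id = '1315';
hits = [(r["name"], r["id"]) for r in index["1315"]]
print(hits)   # [('Asha', '1315'), ('Mina', '1315')]
```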
B+Tree:
This is a self-balancing tree data structure that keeps data in sorted order and allows fast search
within each node, typically using binary search.
B+Tree is a standard index implementation in almost all relational database systems.
B+Tree is basically an M-way search tree with the following structure:
perfectly balanced: leaf nodes are always at the same depth.
every inner node other than the root is at least half full (M/2 − 1 <= number of keys <= M − 1).
every inner node with k keys has k+1 non-null children.
Every node of the tree has an array of sorted key-value pairs. In root and inner
nodes, each pair is (search-key value, pointer). Leaf-node values can be one of two
possibilities:
the actual record
the pointer to actual record
Lookup a value v:
Start with the root node.
While the current node is not a leaf node:
Find the smallest Ki such that Ki >= v.
If Ki == v: set the current node to the node pointed to by Pi+1.
Otherwise, set the current node to the node pointed to by Pi (if no such Ki exists,
follow the last pointer).
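The lookup procedure can be sketched on a tiny hand-built B+Tree in Python; the node layout (dicts with keys/children, leaves holding (key, record) pairs) is a simplification for illustration:

```python
from bisect import bisect_left

leaf1 = {"leaf": True, "pairs": [(5, "rec5"), (10, "rec10")]}
leaf2 = {"leaf": True, "pairs": [(20, "rec20"), (30, "rec30")]}
root = {"leaf": False, "keys": [20], "children": [leaf1, leaf2]}
# keys < 20 go to children[0]; keys >= 20 go to children[1]

def lookup(node, v):
    while not node["leaf"]:
        i = bisect_left(node["keys"], v)     # smallest i with keys[i] >= v
        if i < len(node["keys"]) and node["keys"][i] == v:
            node = node["children"][i + 1]   # Ki == v: follow P(i+1)
        else:
            node = node["children"][i]       # otherwise follow Pi (or last pointer)
    for k, rec in node["pairs"]:             # within the leaf, scan for the key
        if k == v:
            return rec
    return None

print(lookup(root, 20))   # 'rec20'
print(lookup(root, 7))    # None (no record with key 7)
```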
Duplicate keys
In general, search keys can contain duplicates. To solve this, most database
implementations use a composite search key. For example, if we want to create an index
on student_name, the composite search key should be (student_name, Ap), where Ap is
the primary key of the table.
Pros
There are two major features that B+Tree offers:
•Minimizing I/O operations
•Reduced height: B+Tree has a quite large branching factor (values between 50 and
2000 are often used), which makes the tree fat and short. The figure below illustrates
a B+Tree with a height of 2. Because the nodes are spread out, it takes fewer node
accesses to traverse down to a leaf. The cost of looking up a single value is the
height of the tree + 1 for the random access to the table.
•Scalability:
•You have predictable performance for all cases, O(log(n)) in particular. For
databases, it is usually more important than having better best or average case
performance.
Conclusion:
Although a hash index performs better for exact-match queries, B+Tree is arguably the
most widely used index structure in RDBMSs thanks to its consistent overall
performance and high scalability.
Operation       B+Tree      Hash
Lookup Time     O(log(n))   O(1)
Insertion Time  O(log(n))   O(1)
Deletion Time   O(log(n))   O(1)
Recently, the log-structured merge tree (LSM-tree) has attracted significant interest as
a contender to the B+-tree, because its data structure can enable better storage-space
efficiency.