
DBMS – Joins

A join is a combination of a Cartesian product followed by a selection process. A join operation pairs two tuples from different relations if and only if a given join condition is satisfied.
Inner Join

An inner join focuses on the commonality between two tables. When using an
inner join, there must be at least some matching data between two (or more) tables
that are being compared. An inner join searches tables for matching or overlapping
data. Upon finding it, the inner join combines and returns the information into one
new table.
Example of Inner Join
Let's consider a common scenario of two tables: product prices and quantities. The
common information in the two tables is product name, so that is the logical
column to join the tables on. There are some products that are common in the two
tables; others are unique to one of the tables and don't have a match in the other
table.
An inner join on Products returns information about only those products that are
common in both tables.
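
The query below is a minimal sketch of such an inner join. The table and column names (Prices(product, price) and Quantities(product, quantity)) are assumptions for illustration, not part of the original example data:

SELECT p.product, p.price, q.quantity
FROM Prices AS p
INNER JOIN Quantities AS q
    ON p.product = q.product;
-- Only products that appear in BOTH tables come back in the result.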
Outer Join

An outer join returns a set of records (or rows) that include what an inner join
would return but also includes other rows for which no corresponding match is
found in the other table.
There are three types of outer joins:

 Left Outer Join (or Left Join)


 Right Outer Join (or Right Join)
 Full Outer Join (or Full Join)

Each of these outer joins refers to the part of the data that is being compared,
combined, and returned. Sometimes nulls will be produced in this process as some
data is shared while other data is not.
Left Outer Join
A left outer join returns all the data in Table 1 plus all the shared data (the inner part of the Venn diagram example), but only the matching data from Table 2, the right table.
Left Join Example
In our example database, there are two products — oranges and tomatoes — on the
'left' (Prices table) that do not have a corresponding entry on the 'right' (Quantities
table). In a left join, these rows are included in the result set with a NULL in the
Quantity column. The other rows in the result are the same as the inner join.
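
A sketch of the corresponding query, using the same assumed Prices/Quantities tables as above:

SELECT p.product, p.price, q.quantity
FROM Prices AS p
LEFT OUTER JOIN Quantities AS q
    ON p.product = q.product;
-- Oranges and tomatoes appear in the result with quantity = NULL,
-- because they have no matching row in Quantities.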
Right Outer Join
A right outer join returns all of Table 2's data plus all the shared data, but only the matching data from Table 1, the left table.
Right Join Example
Similar to the left join example, the output of a right outer join includes all rows of
the inner join and two rows — broccoli and squash — from the 'right' (Quantities
table) that do not have matching entries on the left.
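
The mirror-image query, again assuming the Prices/Quantities tables from above:

SELECT q.product, p.price, q.quantity
FROM Prices AS p
RIGHT OUTER JOIN Quantities AS q
    ON p.product = q.product;
-- Broccoli and squash appear in the result with price = NULL,
-- because they have no matching row in Prices.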
Full Outer Join
A full outer join, or full join, which is not supported by the popular MySQL
database management system, combines and returns all data from two or more
tables, regardless of whether there is shared information. Think of a full join as
simply duplicating all the specified information, but in one table, rather than
multiple tables. Where matching data is missing, nulls will be produced.
Example:
EmpDetails Table:

EmployeeID EmployeeName

1 John

2 Samantha

3 Hakuna

4 Silky

5 Ram

6 Arpit

7 Lily

8 Sita

9 Farah

10 Jerry

EmpSalary Table:
EmployeeID EmployeeName EmployeeSalary

1 John 50000

2 Samantha 120000

3 Hakuna 75000

4 Silky 25000

5 Ram 150000
6 Arpit 80000

11 Rose 90000

12 Sakshi 45000

13 Jack 250000

Inner join result:
EmployeeID EmployeeName EmployeeSalary

1 John 50000

2 Samantha 120000

3 Hakuna 75000

4 Silky 25000

5 Ram 150000

6 Arpit 80000

Left outer join result:
EmployeeID EmployeeName EmployeeSalary

1 John 50000

2 Samantha 120000

3 Hakuna 75000

4 Silky 25000

5 Ram 150000

6 Arpit 80000

7 Lily NULL
8 Sita NULL

9 Farah NULL

10 Jerry NULL

Right outer join result:
EmployeeID EmployeeName EmployeeSalary

1 John 50000

2 Samantha 120000

3 Hakuna 75000

4 Silky 25000

5 Ram 150000

6 Arpit 80000

NULL NULL 90000

NULL NULL 45000

NULL NULL 250000

Full outer join result:
EmployeeID EmployeeName EmployeeID EmployeeName EmployeeSalary

1 John 1 John 50000

2 Samantha 2 Samantha 120000

3 Hakuna 3 Hakuna 75000


4 Silky 4 Silky 25000

5 Ram 5 Ram 150000

6 Arpit 6 Arpit 80000

7 Lily NULL NULL NULL

8 Sita NULL NULL NULL

9 Farah NULL NULL NULL

10 Jerry NULL NULL NULL

NULL NULL 11 Rose 90000

NULL NULL 12 Sakshi 45000

NULL NULL 13 Jack 250000
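
For reference, queries along the following lines would produce the result sets shown above (a sketch; the exact column lists are inferred from the result layouts):

-- Inner, Left and Right results: EmployeeID and EmployeeName are taken from EmpDetails.
SELECT d.EmployeeID, d.EmployeeName, s.EmployeeSalary
FROM EmpDetails d
INNER JOIN EmpSalary s ON d.EmployeeID = s.EmployeeID;
-- Replace INNER JOIN with LEFT JOIN or RIGHT JOIN for the Left / Right results.

-- Full result: all columns from both tables.
SELECT d.EmployeeID, d.EmployeeName, s.EmployeeID, s.EmployeeName, s.EmployeeSalary
FROM EmpDetails d
FULL OUTER JOIN EmpSalary s ON d.EmployeeID = s.EmployeeID;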

Difference between Inner Join and Outer Join

Rows returned:
 Inner Join: returns only the rows that have matching values in both tables.
 Outer Join: includes the matching rows as well as some of the non-matching rows between the two tables.

Performance:
 Inner Join: when there are a large number of rows in the tables and there is an index to use, an INNER JOIN is generally faster than an OUTER JOIN.
 Outer Join: generally slower than an INNER JOIN because it needs to return more records; however, there are specific scenarios where an OUTER JOIN is faster.

When a match is not found:
 Inner Join: does not return anything.
 Outer Join: places a NULL in the column values returned for the missing side.

Typical use:
 Inner Join: use it when you want to look up detailed information for a specific column.
 Outer Join: use it when you want to display all the information in the two tables.

Behaviour:
 Inner Join: acts like a filter; there must be a match in both tables for an inner join to return data.
 Outer Join: acts like a "data add"; non-matching rows are added to the result.

DBMS – Indexing
o Indexing is used to optimize the performance of a database by
minimizing the number of disk accesses required when a query is
processed.
o The index is a type of data structure. It is used to locate and access
the data in a database table quickly.

Index structure:
Indexes are created using one or more columns of a database table.

o The first column of the index is the search key, which contains a copy of the primary key or candidate key of the table. The values of the primary key are stored in sorted order so that the corresponding data can be accessed easily.
o The second column of the index is the data reference. It contains a set of pointers holding the address of the disk block where the value of the particular key can be found.
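
As a rough sketch, an index such as the one described above is typically created with a single statement; the table and index names here (Employee, idx_emp_id) are hypothetical:

CREATE INDEX idx_emp_id ON Employee (EmployeeID);

-- With the index in place, a lookup such as the one below can be answered
-- via the index instead of scanning the whole table.
SELECT * FROM Employee WHERE EmployeeID = 543;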

Indexing Methods:

1- Ordered indices

The indices are usually sorted to make searching faster. The indices
which are sorted are known as ordered indices.
Example: Suppose we have an employee table with thousands of records, each 10 bytes long. The IDs start at 1, 2, 3, ... and we have to find the employee with ID 543.
o In a database with no index, we must scan the disk blocks from the beginning until we reach record 543. The DBMS will find the record after reading 543*10 = 5430 bytes.
o With an index (assuming each index entry occupies 2 bytes), the DBMS will find the record after reading 542*2 = 1084 bytes, which is far less than in the previous case.

2- Primary Index
o If the index is created on the basis of the primary key of the table, then it is known as primary indexing. These primary keys are unique to each record, so there is a 1:1 relation between index entries and records.
o As primary keys are stored in sorted order, the performance of the
searching operation is quite efficient.
o The primary index can be classified into two types: Dense index
and Sparse index.

 Dense index

o The dense index contains an index record for every search key
value in the data file. It makes searching faster.
o In this, the number of records in the index table is the same as the number of records in the main table.
o It needs more space to store the index records themselves. Each index record holds the search-key value and a pointer to the actual record on the disk.

 Sparse index

o In the data file, an index record appears for only a few of the items. Each index entry points to a block of records.
o Instead of pointing to each record in the main table, the index points to records in the main table at intervals (gaps).
3- Clustering Index

o A clustered index can be defined as an ordered data file. Sometimes the index is created on non-primary-key columns, which may not be unique for each record.
o In this case, to identify a record faster, we group two or more columns to get a unique value and create an index out of them. This method is called a clustering index.
o Records that have similar characteristics are grouped together, and indexes are created for these groups.

Example: suppose a company has several employees in each department. If we use a clustering index, all employees that belong to the same Dept_ID are considered to be within a single cluster, and the index pointers point to the cluster as a whole. Here Dept_ID is a non-unique key.
This scheme can be a little confusing when one disk block is shared by records that belong to different clusters. Using a separate disk block for each cluster is considered the better technique.
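
As a sketch, the grouping of two or more columns described above corresponds to a composite index; the table and column names are hypothetical, and a true clustered data file (where the rows themselves are reordered) needs vendor-specific syntax such as SQL Server's CREATE CLUSTERED INDEX:

CREATE INDEX idx_dept_emp ON Employee (Dept_ID, Emp_ID);

-- All index entries with the same Dept_ID sit next to each other,
-- so a query for one department touches only that group of entries.
SELECT * FROM Employee WHERE Dept_ID = 10;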
4- Secondary Index

In sparse indexing, as the size of the table grows, the size of the mapping also grows. These mappings are usually kept in primary memory so that address fetches are faster; the actual data is then searched in secondary memory using the address obtained from the mapping. If the mapping itself grows too large, fetching the address becomes slower and the sparse index is no longer efficient. To overcome this problem, secondary indexing is introduced.
In secondary indexing, to reduce the size of the mapping, another level of indexing is introduced. In this method, a large range of column values is chosen initially so that the first-level mapping stays small. Each range is then divided into smaller ranges. The first-level mapping is stored in primary memory, so address fetches are fast. The second-level mapping and the actual data are stored in secondary memory (hard disk).

For example:
o If we want to find the record with roll number 111, we search the first-level index for the highest entry that is smaller than or equal to 111. We get 100 at this level.
o Then, in the second-level index, we again find the highest entry that is smaller than or equal to 111 and get 110. Using the address stored with 110, we go to the data block and search each record sequentially until we reach 111.
o This is how a search is performed in this method. Inserting, updating or deleting is done in the same manner.
B+ Tree
o In the B+ tree, leaf nodes denote actual data pointers. B+ tree
ensures that all leaf nodes remain at the same height.
o In the B+ tree, the leaf nodes are linked using a linked list. Therefore, a B+ tree can support random access as well as sequential access.

Structure of B+ Tree

o In the B+ tree, every leaf node is at equal distance from the root
node. The B+ tree is of the order n where n is fixed for every B+
tree.
o It contains internal nodes and leaf nodes.

*Searching a record in B+ Tree

Suppose we have to search for 55 in a B+ tree whose intermediary (internal) node contains the keys 50 and 75. First, we look at the intermediary node, which directs us to the leaf node that can contain the record for 55.
In the intermediary node, we follow the branch between the keys 50 and 75, which leads us to the third leaf node. There, the DBMS performs a sequential search to find 55.
B+ Tree Insertion

Suppose we want to insert a record with key 60. It belongs in the third leaf node, after 55. The tree is balanced, but that leaf node is already full, so we cannot simply insert 60 there.
In this case, we have to split the leaf node so that 60 can be inserted without affecting the fill factor, balance and order of the tree.
With 60 included, the third leaf node would hold the values (50, 55, 60, 65, 70), and the key pointing to it from the intermediate node is 50. We split the leaf node in the middle so that the balance of the tree is not altered, grouping (50, 55) and (60, 65, 70) into two leaf nodes.
If these two are to be separate leaf nodes, the intermediate node cannot branch only from 50. The key 60 is added to the intermediate node, and a pointer to the new leaf node is created.
This is how an entry is inserted when there is an overflow. In the normal case, it is easy to find the leaf node where the key fits and simply place it there.

B+ Tree Deletion

Suppose we want to delete 60 from the example above. In this case, we have to remove 60 from the intermediate node as well as from the leaf node that contains it. If we simply remove it from the intermediate node, the tree no longer satisfies the rules of a B+ tree, so we must rearrange the nodes to keep the tree balanced.
After deleting 60 from the B+ tree and re-arranging the nodes, the tree is balanced again, with the remaining keys redistributed among the leaf nodes.
DBMS - Hashing
In a huge database structure, it is very inefficient to search all the index values to reach the desired data. The hashing technique is used to calculate the direct location of a data record on the disk without using an index structure.
In this technique, data is stored in data blocks whose addresses are generated using a hashing function. The memory locations where these records are stored are known as data buckets or data blocks.
A hash function can use any column value to generate the address. Most of the time, the hash function uses the primary key to generate the address of the data block. A hash function can be anything from a simple mathematical function to a complex one. We can even use the primary key itself as the address of the data block, in which case each row is stored in the data block whose address equals its primary key.
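
Some systems expose hashing directly as an index type. The example below uses PostgreSQL's hash index syntax on a hypothetical Employee table; the equality lookup is then resolved by hashing the key to a bucket rather than walking a sorted index:

CREATE INDEX idx_emp_hash ON Employee USING HASH (EmployeeID);

SELECT * FROM Employee WHERE EmployeeID = 543;  -- answered via the hash buckets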
Types of Hashing: Static Hashing and Dynamic Hashing
Comparison of Ordered Indexing and Hashing

Storing of address:
 Ordered indexing: addresses in the index are sorted according to a key value (the primary key).
 Hashing: addresses are always generated by applying a hash function to the key value.

Performance:
 Ordered indexing: performance can degrade as the data in the file grows, because the data is kept in sorted form and every insert/delete/update operation must maintain that order.
 Hashing: performance is best when there is a constant stream of additions and deletions. However, when the database is huge, hash file organization and its maintenance become costlier.

Memory management:
 Ordered indexing: delete/update operations leave many unused data blocks that cannot be released for re-use, so regular maintenance of the memory is required.
 Hashing: in both static and dynamic hashing methods, memory is managed; bucket overflow is also handled, for example by extending static hashing.

DBMS - Transaction
A transaction can be defined as a group of tasks. A single task is the
minimum processing unit which cannot be divided further.
Let’s take an example of a simple transaction. Suppose a bank employee
transfers Rs 500 from A's account to B's account. This very simple and
small transaction involves several low-level tasks.
A’s Account
Open_Account(A)
Old_Balance = A.balance
New_Balance = Old_Balance - 500
A.balance = New_Balance
Close_Account(A)
B’s Account
Open_Account(B)
Old_Balance = B.balance
New_Balance = Old_Balance + 500
B.balance = New_Balance
Close_Account(B)

The same transfer, written using read/write shorthand:

1. R(A);
2. A = A - 500;
3. W(A);
4. R(B);
5. B = B + 500;
6. W(B);
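
In SQL, the same transfer is usually wrapped in a single transaction. This is only a sketch: the Accounts(account_id, balance) table is hypothetical, and the exact transaction-start keyword (BEGIN vs. START TRANSACTION) varies by DBMS:

START TRANSACTION;
UPDATE Accounts SET balance = balance - 500 WHERE account_id = 'A';
UPDATE Accounts SET balance = balance + 500 WHERE account_id = 'B';
COMMIT;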

Operations in Transaction:

The main operations in a transaction are:

1. Read Operation
2. Write Operation

1. Read Operation:

 Read operation reads the data from the database and then stores it
in the buffer in main memory.
 For example: Read (A) instruction will read the value of A from
the database and will store it in the buffer in main memory.

The read_item(X) command includes the following steps:

1. Find the address of the disk block that contains item X.
2. Copy that disk block into a buffer in main memory (if that
disk block is not already in some main memory buffer).
3. Copy item X from the buffer to the program variable named
X.
2. Write Operation:

 Write operation writes the updated data value back to the database
from the buffer.
 For example: Write (A) will write the updated value of A from the
buffer to the database.

write_item(X) command includes the following steps:


1. Find the address of the disk block that contains item X.
2. Copy that disk block into a buffer in main memory (if that
disk block is not already in some main memory buffer).
3. Copy item X from the program variable named X into its
correct location in the buffer.
4. Store the updated block from the buffer back to disk (either
immediately or at some later point in time).

States of Transactions

A transaction in a database can be in one of the following states:


 Active: In this state, the transaction is being executed. This is the
initial state of every transaction. All the changes made by the
transaction now are stored in the buffer in main memory.
 Partially Committed: When a transaction executes its final
operation, it is said to be in a partially committed state. It is not
considered fully committed because all the changes made by the
transaction are still stored in the buffer in main memory.
 Committed: If a transaction executes all its operations
successfully, it is said to be committed. All its effects are now
permanently established on the database system.
(Note) Once a transaction is committed, it is not possible to undo the changes it has made, because the system has moved to a new consistent state.
The only way to undo the changes is to run another transaction, called a compensating transaction, that performs the reverse operations.
 Failed: A transaction is said to be in a failed state if any of the
checks made by the database recovery system fails. A failed
transaction can no longer proceed further.
 Aborted: If any of the checks fails and the transaction has reached
a failed state, then the recovery manager rolls back all its write
operations on the database to bring the database back to its original
state where it was prior to the execution of the transaction.
Transactions in this state are called aborted. The database recovery
module can select one of the two operations after a transaction
aborts:

 Re-start the transaction


 Kill the transaction

ACID Properties
 It is important to ensure that the database maintains the following properties before and after a transaction.
 These properties are called the ACID properties of a transaction.

1. Atomicity:
 This property ensures that either the transaction occurs completely or it
does not occur at all.
 In other words, it ensures that no transaction occurs partially.

 That is why, it is also referred to as “All or nothing rule “.

 It is the responsibility of the transaction control manager to ensure atomicity of the transactions.
 Let's take a banking example to understand this: suppose account A has a balance of 400$ and B has 700$, and A is transferring 100$ to B. This transaction has two operations: a) debiting 100$ from A's balance, and b) crediting 100$ to B's balance. If the first operation succeeds but the second fails, A's balance would be 300$ while B would still have 700$ instead of 800$. This is unacceptable in a banking system. Either the transaction should fail without executing any of the operations, or it should perform both of them. The atomicity property ensures exactly that.
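
A sketch of how this looks in SQL, using the same hypothetical Accounts table as before: either both updates are made permanent by COMMIT, or ROLLBACK discards both of them, so A and B can never end up at 300$ and 700$:

START TRANSACTION;
UPDATE Accounts SET balance = balance - 100 WHERE account_id = 'A';
UPDATE Accounts SET balance = balance + 100 WHERE account_id = 'B';
COMMIT;
-- If either UPDATE fails, issue ROLLBACK instead of COMMIT;
-- both balances then revert to their original values (400$ and 700$).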
2. Consistency:
 This property ensures that integrity constraints are maintained.

 In other words, it ensures that the database remains consistent before and after the transaction.
 It is the responsibility of the DBMS and the application programmer to ensure consistency of the database.
 For example, account A has a balance of 400$ and transfers 100$ to account B and 100$ to account C, so we have two transactions. If these transactions run concurrently and both read the 400$ balance, the final balance of A would be 300$ instead of 200$, which is wrong. If the transactions ran in isolation, the second transaction would have read the correct balance of 300$ (before debiting its own 100$) once the first transaction completed successfully.
3. Isolation:
 This property ensures that multiple transactions can occur
simultaneously without causing any inconsistency.
 For every pair of transactions, one transaction should start execution only after the other has finished execution.
 It is the responsibility of the concurrency control manager to ensure isolation for all transactions.
 The example given under the Consistency property above also illustrates what goes wrong when isolation is not enforced.
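
Most systems let the application request a stricter isolation level explicitly. The sketch below uses MySQL-style statement ordering (the placement of SET TRANSACTION differs between systems) and the hypothetical Accounts table from the earlier examples:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;  -- applies to the next transaction
START TRANSACTION;
UPDATE Accounts SET balance = balance - 100 WHERE account_id = 'A';
UPDATE Accounts SET balance = balance + 100 WHERE account_id = 'B';
COMMIT;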

4. Durability:
 This property ensures that all the changes made by a transaction after
its successful execution are written successfully to the disk.
 It is the responsibility of the recovery manager to ensure durability in the database.
