
B-tree

B-tree is a special type of self-balancing search tree in which each node can contain more than
one key and can have more than two children. It is a generalized form of the binary search tree.
It is also known as a height-balanced m-way tree.

Why do you need a B-tree data structure?
The need for the B-tree arose from the need to reduce the time spent accessing physical storage
media such as hard disks. Secondary storage devices are slower but have a larger capacity, so
data structures were needed that minimize the number of disk accesses.

Other data structures such as the binary search tree, AVL tree, and red-black tree can store only
one key per node. To store a large number of keys, the height of such trees becomes very large,
and the access time increases.

A B-tree, however, can store many keys in a single node and can have multiple child nodes. This
decreases the height significantly, allowing faster disk accesses.

B-tree Properties
1. For each node x, the keys are stored in increasing order.
2. In each node, there is a boolean value x.leaf which is true if x is a leaf.
3. If n is the order of the tree, each internal node can contain at most n - 1 keys along with a pointer
to each child.
4. Each node except the root can have at most n children and at least ⌈n/2⌉ children.
5. All leaves have the same depth (i.e. height-h of the tree).
6. The root has at least 2 children and contains a minimum of 1 key.
7. If n ≥ 1, then for any n-key B-tree of height h and minimum degree t ≥ 2, h ≤ logt((n + 1)/2).

Operations on a B-tree
Searching an element in a B-tree
Searching for an element in a B-tree is the generalized form of searching an element in a Binary
Search Tree. The following steps are followed.

1. Starting from the root node x, compare k with the first key of the node.
If k = the first key of the node, return the node and the index.
2. If x.leaf = true and k is not found in x, return NULL (i.e. not found).
3. If k < the first key of the node, search the left child of this key recursively.
4. If there is more than one key in the current node and k > the first key, compare k with the next
key in the node.
If k < the next key, search the left child of this key (i.e. k lies between the first and the second
keys).
Else, search the right child of the key.
5. Repeat steps 1 to 4 until a leaf is reached.
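The steps above can be sketched in Python; the minimal BTreeNode class (a keys list, a
children list, and a leaf flag) is an assumed representation, not from the text:

```python
class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                  # keys stored in increasing order
        self.children = children or []    # children[i] lies left of keys[i]
        self.leaf = not self.children     # a node is a leaf if it has no children

def search(x, k):
    """Return (node, index) if k is in the subtree rooted at x, else None."""
    i = 0
    # Walk past every key in this node that is smaller than k.
    while i < len(x.keys) and k > x.keys[i]:
        i += 1
    if i < len(x.keys) and k == x.keys[i]:
        return (x, i)                     # k found at index i of this node
    if x.leaf:
        return None                       # reached a leaf without finding k
    return search(x.children[i], k)       # descend into the appropriate child
```

For the walkthrough that follows, a tree with root key 11 and a right child holding 16 and 18
would lead `search` down the same path the text describes for k = 17.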
Searching Example
1. Let us search key k = 17 in the tree below of degree 3.

2. k is not found in the root, so compare it with the root key.

3. Since k > 11, go to the right child of the root node.

4. Compare k with 16. Since k > 16, compare k with the next key 18.

5. Since k < 18, k lies between 16 and 18. Search in the right child of 16 or the left child of 18.
6. k is found


Hashing
Hashing is a technique for mapping keys and values into a hash table using a hash function. It
is done for faster access to elements. The efficiency of the mapping depends on the efficiency of
the hash function used.
Let a hash function H(x) map the value x to the index x % 10 in an array. For example, if the list
of values is [11, 12, 13, 14, 15], they will be stored at positions {1, 2, 3, 4, 5} in the array (hash
table) respectively.
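The H(x) = x % 10 example can be sketched as:

```python
def insert(table, x):
    """Place x at index x % 10; collisions are ignored for simplicity."""
    table[x % 10] = x

table = [None] * 10            # the hash table: an array of 10 slots
for value in [11, 12, 13, 14, 15]:
    insert(table, value)
# 11..15 land at indices 1..5, matching the example in the text.
```

Real hash tables must also handle collisions (e.g. by chaining or open addressing), which this
sketch deliberately leaves out.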
Transaction processing:

What is a Transaction?

A set of logically related operations is known as a transaction. The main operations of a
transaction are:
Read(A): Read operations Read(A) or R(A) reads the value of A from the database and stores it
in a buffer in the main memory.
Write(A): Write operation Write(A) or W(A) writes the value back to the database from the
buffer.
(Note: the write does not always reach the database immediately; it may only update the buffer,
which is why dirty reads come into the picture.)
Let us take a debit transaction from an account that consists of the following operations:
1. R(A);
2. A=A-1000;
3. W(A);
Assume A’s value before starting the transaction is 5000.
● The first operation reads the value of A from the database and stores it in a buffer.
● The second operation decreases its value by 1000, so the buffer will contain 4000.
● The third operation writes the value from the buffer to the database, so A’s final value
will be 4000.
But it may also be possible that the transaction fails after executing some of its operations. The
failure can be due to hardware or software faults, power failure, etc. For example, if the debit
transaction discussed above fails after executing operation 2, the value of A will remain 5000 in
the database, which is not acceptable to the bank. To avoid this, databases provide two important
operations:
Commit: After all instructions of a transaction are successfully executed, the changes made by a
transaction are made permanent in the database.
Rollback: If a transaction is not able to execute all operations successfully, all the changes made
by a transaction are undone.
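The commit and rollback behaviour of the debit transaction can be sketched with a toy
in-memory "database"; the SimpleDB class and its method names are illustrative only, not a real
DBMS API:

```python
class SimpleDB:
    """Toy single-item 'database' illustrating buffer, commit and rollback."""

    def __init__(self, value):
        self.value = value       # committed value 'on disk'
        self.buffer = None       # transaction-local copy

    def read(self):              # R(A): copy the database value into the buffer
        self.buffer = self.value

    def commit(self):            # make the buffered change permanent
        self.value = self.buffer

    def rollback(self):          # discard the buffered change; A is unchanged
        self.buffer = None

db = SimpleDB(5000)
db.read()                # R(A): buffer holds 5000
db.buffer -= 1000        # A = A - 1000: buffer holds 4000
db.commit()              # W(A) made durable: A's final value is 4000
```

If the transaction fails before `commit`, calling `rollback` leaves the committed value at 5000,
which is exactly the behaviour the text requires.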

Properties of a transaction:

Atomicity: As a transaction is a set of logically related operations, either all of them should be
executed or none. A debit transaction discussed above should either execute all three operations
or none. If the debit transaction fails after executing operations 1 and 2 then its new value of
4000 will not be updated in the database which leads to inconsistency.
Consistency: If operations of debit and credit transactions on the same account are executed
concurrently, it may leave the database in an inconsistent state.
● For Example, with T1 (debit of Rs. 1000 from A) and T2 (credit of 500 to A) executing
concurrently, the database reaches an inconsistent state.
● Let us assume the account balance of A is Rs. 5000. T1 reads A(5000) and stores the value
in its local buffer space. Then T2 reads A(5000) and also stores the value in its local buffer
space.
● T1 performs A=A-1000 (5000-1000=4000) and 4000 is stored in T1’s buffer space. Then T2
performs A=A+500 (5000+500=5500) and 5500 is stored in T2’s buffer space. T1 writes
the value from its buffer back to the database.
● A’s value is updated to 4000 in the database and then T2 writes the value from its buffer back
to the database. A’s value is updated to 5500 which shows that the effect of the debit
transaction is lost and the database has become inconsistent.
● To maintain consistency of the database, we need concurrency control protocols which will
be discussed in the next article. The operations of T1 and T2 with their buffers and database
have been shown in Table 1.

T1            T1’s buffer space   T2           T2’s buffer space   Database
                                                                   A=5000
R(A);         A=5000                                               A=5000
              A=5000              R(A);        A=5000              A=5000
A=A-1000;     A=4000                           A=5000              A=5000
              A=4000              A=A+500;     A=5500              A=5000
W(A);         A=4000                           A=5500              A=4000
                                  W(A);        A=5500              A=5500

Isolation: The result of a transaction should not be visible to others before the transaction is
committed. For example, let us assume that A’s balance is Rs. 5000 and T1 debits Rs. 1000 from
A. A’s new balance will be 4000. If T2 credits Rs. 500 to A’s new balance, A will become 4500,
and after this T1 fails. Then we have to roll back T2 as well because it is using the value
produced by T1. So a transaction’s results are not made visible to other transactions before it
commits.
Durability: Once the database has committed a transaction, the changes made by the transaction
should be permanent. For example, if a person has credited $500000 to his account, the bank
can’t say that the update has been lost. To avoid this problem, multiple copies of the database are
stored at different locations.

Advantages of Concurrency:

In general, concurrency means that more than one transaction can work on a system at the same
time.
The advantages of a concurrent system are:
● Waiting Time: The time a process spends in the ready state before it gets the system to
execute on is called waiting time. Concurrency leads to less waiting time.
● Response Time: The time taken to get the first response from the CPU is called response
time. Concurrency leads to less response time.
● Resource Utilization: The amount of resources being used in a particular system is called
resource utilization. Multiple transactions can run in parallel in a system, so concurrency
leads to more resource utilization.
● Efficiency: The amount of output produced in comparison to the given input is called
efficiency. Concurrency leads to more efficiency.
What is Serializability?
Serializability of schedules ensures that a non-serial schedule is equivalent to some serial
schedule. It allows transactions to execute simultaneously without interfering with one another.
In simple words, serializability is a way to check whether the execution of two or more
transactions maintains database consistency.
Schedules and Serializable Schedules in DBMS
Schedules in DBMS are series of operations from one transaction to another.
R(X) means reading the value X, and W(X) means writing the value X.

Schedules in DBMS are of two types:

1. Serial Schedule - A schedule in which only one transaction is executed at a time, i.e.,
one transaction is executed completely before starting another transaction.

Example:
Transaction-1    Transaction-2
R(a)
W(a)
R(b)
W(b)
                 R(b)
                 W(b)
                 R(a)
                 W(a)

Here, we can see that Transaction-2 starts its execution after the completion of Transaction-1.

2. Non-serial Schedule - A schedule in which the transactions are interleaved. Several
transactions execute simultaneously, as happens in real-world database operations, and
they may be working on the same piece of data. Hence, the serializability of non-serial
schedules is a major concern, so that our database is consistent before and after the
execution of the transactions.

Example:

Transaction-1    Transaction-2
R(a)
W(a)
                 R(b)
                 W(b)
R(b)
                 R(a)
W(b)
                 W(a)

We can see that Transaction-2 starts its execution before the completion of Transaction-1, and
they are interchangeably working on the same data, i.e., "a" and "b".
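A standard way to check whether a schedule such as the one above is conflict-serializable is to
build a precedence graph and test it for cycles; the following sketch uses an assumed
(transaction, operation, item) encoding of a schedule, which is my own, not from the text:

```python
def precedence_graph(schedule):
    """schedule: list of (txn, op, item) tuples, op in {'R', 'W'}.
    Add an edge Ti -> Tj whenever an operation of Ti conflicts with a
    later operation of Tj (same item, different txns, at least one W)."""
    edges = set()
    for i, (ti, oi, xi) in enumerate(schedule):
        for tj, oj, xj in schedule[i + 1:]:
            if ti != tj and xi == xj and 'W' in (oi, oj):
                edges.add((ti, tj))
    return edges

def is_conflict_serializable(schedule):
    """The schedule is conflict-serializable iff the graph is acyclic."""
    edges = precedence_graph(schedule)
    nodes = {t for t, _, _ in schedule}
    # Kahn's algorithm: repeatedly remove nodes with no incoming edges.
    while nodes:
        free = [n for n in nodes if not any(v == n for _, v in edges)]
        if not free:
            return False          # a cycle remains, so not serializable
        nodes -= set(free)
        edges = {(u, v) for u, v in edges if u in nodes and v in nodes}
    return True
```

Encoding the non-serial example above gives edges T1 -> T2 (on "a") and T2 -> T1 (on "b"), a
cycle, so that schedule is not conflict-serializable; the serial example is.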

DBMS - Concurrency Control

In a multiprogramming environment where multiple transactions can be executed
simultaneously, it is highly important to control the concurrency of transactions. We have
concurrency control protocols to ensure atomicity, isolation, and serializability of concurrent
transactions. Concurrency control protocols can be broadly divided into two categories −
● Lock based protocols
● Time stamp based protocols
Lock-based Protocols
Database systems equipped with lock-based protocols use a mechanism by which any transaction
cannot read or write data until it acquires an appropriate lock on it. Locks are of two kinds −
● Binary Locks − A lock on a data item can be in two states; it is either locked or unlocked.
● Shared/exclusive − This type of locking mechanism differentiates the locks based on
their use. If a lock is acquired on a data item to perform a write operation, it is an
exclusive lock. Allowing more than one transaction to write on the same data item would
lead the database into an inconsistent state. Read locks are shared because no data value is
being changed.
There are four types of lock protocols available −
Simplistic Lock Protocol
Simplistic lock-based protocols allow transactions to obtain a lock on every object before a
'write' operation is performed. Transactions may unlock the data item after completing the ‘write’
operation.
Pre-claiming Lock Protocol
Pre-claiming protocols evaluate their operations and create a list of data items on which they
need locks. Before initiating an execution, the transaction requests the system for all the locks it
needs beforehand. If all the locks are granted, the transaction executes and releases all the locks
when all its operations are over. If all the locks are not granted, the transaction rolls back and
waits until all the locks are granted.

Two-Phase Locking (2PL)


This locking protocol divides the execution phase of a transaction into three parts. In the first
part, when the transaction starts executing, it seeks permission for the locks it requires. The
second part is where the transaction acquires all the locks. As soon as the transaction releases its
first lock, the third phase starts. In this phase, the transaction cannot demand any new locks; it
only releases the acquired locks.

Two-phase locking has two phases, one is growing, where all the locks are being acquired by the
transaction; and the second phase is shrinking, where the locks held by the transaction are being
released.
To claim an exclusive (write) lock, a transaction must first acquire a shared (read) lock and then
upgrade it to an exclusive lock.
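The growing/shrinking discipline can be sketched as follows; the class is a toy illustration with
names of my own choosing, not a real lock manager:

```python
class TwoPhaseTxn:
    """Toy transaction enforcing 2PL: once any lock has been released
    (the shrinking phase), no new lock may be acquired."""

    def __init__(self):
        self.locks = set()
        self.shrinking = False   # becomes True at the first unlock

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violated: lock requested after first unlock")
        self.locks.add(item)     # growing phase: locks are acquired

    def unlock(self, item):
        self.shrinking = True    # shrinking phase starts at the first release
        self.locks.discard(item)
```

A transaction that locks A and B, releases A, and then tries to lock C would be rejected, which
is exactly the rule that makes 2PL schedules serializable.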
Strict Two-Phase Locking
The first phase of Strict-2PL is the same as in 2PL. After acquiring all the locks in the first phase, the
transaction continues to execute normally. But in contrast to 2PL, Strict-2PL does not release a
lock after using it. Strict-2PL holds all the locks until the commit point and releases all the locks
at a time.
Strict-2PL does not have cascading abort as 2PL does.
Timestamp-based Protocols
The most commonly used concurrency protocol is the timestamp-based protocol. This protocol
uses either the system time or a logical counter as a timestamp.
Lock-based protocols manage the order between the conflicting pairs among transactions at the
time of execution, whereas timestamp-based protocols start working as soon as a transaction is
created.
Every transaction has a timestamp associated with it, and the ordering is determined by the age
of the transaction. A transaction created at clock time 0002 would be older than all other
transactions that come after it. For example, any transaction 'y' entering the system at 0004 is
two seconds younger, and priority would be given to the older one.
In addition, every data item is given its latest read-timestamp and write-timestamp. This lets the
system know when the last read and write operations were performed on the data item.
Timestamp Ordering Protocol
The timestamp-ordering protocol ensures serializability among transactions in their conflicting
read and write operations. This is the responsibility of the protocol system that the conflicting
pair of tasks should be executed according to the timestamp values of the transactions.
● The timestamp of transaction Ti is denoted as TS(Ti).
● Read time-stamp of data-item X is denoted by R-timestamp(X).
● Write time-stamp of data-item X is denoted by W-timestamp(X).
Timestamp ordering protocol works as follows −
● If a transaction Ti issues a read(X) operation −
o If TS(Ti) < W-timestamp(X)
▪ Operation rejected and Ti rolled back.
o If TS(Ti) >= W-timestamp(X)
▪ Operation executed, and R-timestamp(X) is updated to
max(R-timestamp(X), TS(Ti)).

● If a transaction Ti issues a write(X) operation −


o If TS(Ti) < R-timestamp(X)
▪ Operation rejected and Ti rolled back.
o If TS(Ti) < W-timestamp(X)
▪ Operation rejected and Ti rolled back.
o Otherwise, operation executed and W-timestamp(X) is updated to TS(Ti).
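The read and write rules above can be sketched as follows; representing a data item as a
dictionary holding its R- and W-timestamps is my own assumption, not from the text:

```python
def try_read(ts, item):
    """Read rule: reject if a younger transaction has already written the item."""
    if ts < item['W']:
        return False                     # operation rejected, Ti rolled back
    item['R'] = max(item['R'], ts)       # record the latest read timestamp
    return True

def try_write(ts, item):
    """Write rule: reject if a younger transaction has read or written the item."""
    if ts < item['R'] or ts < item['W']:
        return False                     # operation rejected, Ti rolled back
    item['W'] = ts                       # record the latest write timestamp
    return True
```

For example, if a transaction with timestamp 5 reads X, a later write by an older transaction
with timestamp 3 is rejected, while a write by a younger transaction with timestamp 7 succeeds.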
