
FILE ORGANIZATION IN DBMS

The logical relationships among the many records that make up the file, particularly in terms
of means of identification and access to any given record, are referred to as file organization.
Simply put, file organization is the process of storing files in a specific order.

A file is a collection of records. We can get to the records by using the primary key. The type of file organization employed for a given set of records determines the type and frequency of access.

File organization refers to the logical relationships between distinct records. This method specifies how disk blocks are mapped to file records. The term “file organization” refers to the way records are organized into blocks and then placed on a storage medium.

The first method of mapping a database to files is to use several files, each containing records of only one fixed length. Another option is to structure our files so that we can store records of various lengths. Fixed-length record files are easier to implement than variable-length record files.

Objectives of File Organization in Data Structure

Here are some ways in which file organization helps:

• Optimal record selection: records can be retrieved as quickly as possible.
• Insert, delete, and update transactions on records should be simple and fast.
• Inserting, updating, or deleting records should not create duplicate records.
• Records should be stored efficiently to minimize storage cost.

Types of File Organization

There are a variety of ways to organize files. Each method has advantages and disadvantages in terms of access and selection, and programmers choose the file organization strategy that best fits their needs.

The following are the different types of file organization:

Sequential File Organization


This is the most straightforward file organization technique. In this method, files are saved in sequential order.

Methods of Sequential File Organization

There are two approaches to implementing this method:

1. Pile File Method

It’s a straightforward procedure. In this method, we store the records in sequential order,
one after the other. The record will be entered in the same order as it is inserted into the
tables.

When a record is updated or deleted, the memory blocks are searched for that record. Once found, it is marked for deletion and, in the case of an update, the revised record is added in its place.

Insertion of the New Record:

Assume we have four records in order: R1, R3, R9, and R8. A record is nothing more than a table row. If we wish to add a new record R2 to the sequence, we place it at the end of the file.
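The pile insertion described above can be sketched as a plain append (a minimal sketch; the record names R1, R3, … are just the illustrative labels from the example):

```python
# Minimal sketch of a pile file: records are kept in arrival order.
pile = ["R1", "R3", "R9", "R8"]  # existing records, stored as inserted

def insert_pile(records, new_record):
    """In the pile method a new record always goes at the end of the file."""
    records.append(new_record)
    return records

insert_pile(pile, "R2")
print(pile)  # ['R1', 'R3', 'R9', 'R8', 'R2']
```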

2. Sorted File Method

In this approach, the new record is always added at the end of the file, and the sequence is then sorted in ascending or descending order. Records are sorted on the primary key or another key.
If a record is modified, the record is updated first, the file is re-sorted, and the revised record is stored in its correct position.

Insertion of the New Record:

Assume there is a pre-existing sorted sequence of four records: R1, R3, R8, and R9. If a new record R2 needs to be added to the sequence, it is appended at the end of the file, and the series is then re-sorted.
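The sorted file method can be sketched as an insert followed by re-sorting; here Python's `bisect` keeps the file ordered on each insert (the record names are the illustrative labels from the example):

```python
import bisect

# Sketch of the sorted file method: the file is kept in ascending order,
# so each new record lands in its correct position.
sorted_file = ["R1", "R3", "R8", "R9"]  # already sorted by key

def insert_sorted(records, new_record):
    """Insert the new record and restore ascending order."""
    bisect.insort(records, new_record)
    return records

insert_sorted(sorted_file, "R2")
print(sorted_file)  # ['R1', 'R2', 'R3', 'R8', 'R9']
```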

Heap File Organization


It is the most fundamental and basic form of file organization, based on data blocks. In heap file organization, records are inserted at the end of the file; no ordering or sorting of records is required when entries are added.

The new record is put in a different block when the data block is full. This new data block
does not have to be the next data block in the memory; it can store new entries in any data
block in the memory. An unordered file is the same as a heap file. Every record in the file
has a unique id, and every page in the file is the same size. The DBMS is in charge of storing
and managing the new records.

Insertion of a New Record

Let’s say we have five records in a heap, R1, R3, R6, R4, and R5, and we wish to add a sixth record, R2. If data block 3 is full, the DBMS inserts R2 into whichever data block it chooses, such as data block 1.
In a heap file organization, if we wish to search, update, or remove data, we must traverse
the data from the beginning of the file until we find the desired record.

Because there is no ordering or sorting of records, searching, updating, or removing records takes a long time if the database is huge: in heap file organization we must scan the data from the beginning until we find the required record.
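Because heap files are unordered, a lookup is a full scan from the first block. A minimal sketch, with illustrative blocks and record names:

```python
# Heap files are unordered, so lookup means scanning every block in order.
heap_blocks = [["R1", "R3"], ["R6", "R4"], ["R5"]]  # blocks of records

def find_record(blocks, key):
    """Scan block by block from the start until the record is found."""
    for block_no, block in enumerate(blocks):
        if key in block:
            return block_no
    return None  # record not present: the whole file was scanned

print(find_record(heap_blocks, "R4"))  # found in the 2nd block → index 1
print(find_record(heap_blocks, "R2"))  # None: full scan with no match
```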

Hash File Organization


Hashing in DBMS is a technique for quickly locating a data record in a database, irrespective of the database’s size. For large databases containing thousands or millions of records, indexing becomes inefficient, because searching for a specific record through an index takes more time. This conflicts with the goals of a DBMS, which aims to maximize performance and minimize data retrieval time. Hashing is used to counter this problem.
What is Hashing?
The hashing technique uses an auxiliary hash table to store data records via a hash function. There are three key components in hashing:
Hash Table: A hash table is an array or similar data structure whose size is determined by the total volume of data records in the database. Each memory location in a hash table is called a ‘bucket’ or hash index; it stores a data record’s exact location and is accessed through the hash function.
Bucket: A bucket is a memory location (index) in the hash table that stores the data record. Buckets generally store a disk block, which in turn holds multiple records. It is also known as the hash index.
Hash Function: A hash function is a mathematical equation or algorithm that takes a data record’s primary key as input and computes the hash index as output.
Hash Function
A hash function is a mathematical algorithm that computes the index or the location where
the current data record is to be stored in the hash table so that it can be accessed efficiently
later. This hash function is the most crucial component that determines the speed of
fetching data.
Working of Hash Function
The hash function generates a hash index from the primary key of the data record.
There are two possibilities:
1. The generated hash index is not yet occupied by any other value, so the address of the data record is stored there.
2. The generated hash index is already occupied by some other value. This is called a collision, and a collision resolution technique is applied to counter it.
Whenever we query a specific record, the hash function is applied and returns the record faster than indexing, because we can reach the exact location of the data record directly rather than searching through indices one by one.
Example:

[Figure: hashing]
Types of Hashing in DBMS
There are two primary hashing techniques:
1. Static Hashing
In static hashing, the hash function always generates the same bucket address. For example, suppose we have a data record with employee_id = 106 and the hash function is H(x) = x mod 5, where x is the id. Then the operation takes place like this:
H(106) = 106 mod 5 = 1.
This indicates that the data record should be placed or searched for in bucket 1 (hash index 1) of the hash table.
Example:

[Figure: static hashing technique]
The primary key is used as the input to the hash function and the hash function generates
the output as the hash index (bucket’s address) which contains the address of the actual
data record on the disk block.
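The mod-5 example above can be sketched as follows (the bucket count and employee ids are the illustrative values from the text):

```python
# Static hashing with the modulo hash function from the text:
# H(x) = x mod 5 maps every key into one of 5 fixed buckets.
NUM_BUCKETS = 5

def h(key):
    return key % NUM_BUCKETS

buckets = [[] for _ in range(NUM_BUCKETS)]
for employee_id in (106, 112, 119):
    buckets[h(employee_id)].append(employee_id)

print(h(106))   # 1 → the record for id 106 lives in bucket 1
print(buckets)  # ids distributed across the fixed buckets
```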
Static hashing has the following properties:
Data buckets: The number of buckets in memory remains constant. The size of the hash table is fixed at the start. Chaining may be added to handle some collisions, though this is only a slight optimization and may not prove worthwhile if the database size keeps fluctuating.
Hash function: It uses the simplest hash function to map data records to their appropriate buckets, generally a modulo hash function.
Efficient for known data sizes: Static hashing is very efficient when we know the data size and its distribution in the database.
Static hashing is inefficient and inaccurate when the data size varies dynamically, because space is limited and the hash function always generates the same value for a given input. When the data size fluctuates often, collisions keep happening and lead to problems such as bucket skew and insufficient buckets.
To resolve this problem of bucket overflow, techniques such as chaining and open addressing are used. Here’s a brief overview of both:
1. Chaining
Chaining is a mechanism in which the hash table is implemented as an array of nodes, where each bucket can hold a linked list of data records. Even if the hash function generates the same value for several data records, they can all be stored in the bucket by appending a new node.
However, this gives rise to the problem of bucket skew: if the hash function keeps generating the same value again and again, hashing becomes inefficient because the remaining buckets stay unoccupied or hold minimal data.
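A minimal sketch of chaining, where each bucket is a Python list acting as the chain (the class name, keys, and values are illustrative):

```python
# Chaining: each bucket holds a list (the "chain"), so colliding keys
# share a bucket instead of overwriting each other.
class ChainedHashTable:
    def __init__(self, size=5):
        self.buckets = [[] for _ in range(size)]

    def insert(self, key, value):
        # colliding keys simply extend the same bucket's chain
        self.buckets[key % len(self.buckets)].append((key, value))

    def get(self, key):
        # walk the chain in the key's bucket until the key matches
        for k, v in self.buckets[key % len(self.buckets)]:
            if k == key:
                return v
        return None

t = ChainedHashTable()
t.insert(106, "Alice")  # 106 % 5 == 1
t.insert(111, "Bob")    # 111 % 5 == 1 → collision, same chain
print(t.get(111))  # Bob
```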
2. Open Addressing/Closed Hashing
Also called closed hashing, this approach resolves a collision by looking for the next empty slot that can store the data. It uses techniques such as linear probing, quadratic probing, and double hashing.
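Linear probing, the simplest open-addressing technique, can be sketched as follows (the table size and keys are illustrative):

```python
# Open addressing with linear probing: on a collision, step to the next
# slot (wrapping around) until an empty one is found.
SIZE = 5
table = [None] * SIZE

def insert_linear(key):
    idx = key % SIZE
    while table[idx] is not None:   # slot taken → probe the next slot
        idx = (idx + 1) % SIZE
    table[idx] = key
    return idx

print(insert_linear(106))  # lands in slot 1
print(insert_linear(111))  # collides at slot 1, probes on to slot 2
```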
2. Dynamic Hashing
Dynamic hashing, also known as extendible hashing, is used to handle databases whose data sets change frequently. This method offers a way to add and remove data buckets on demand, so as the number of data records varies, the buckets grow and shrink accordingly.
Properties of Dynamic Hashing
The buckets vary in size dynamically as changes are made, offering more flexibility.
Dynamic hashing improves overall performance by minimizing or completely preventing collisions.
It has three major components: data buckets, a flexible hash function, and directories.
A flexible hash function generates more dynamic values and keeps changing to meet the requirements of the database.
Directories are containers that store pointers to buckets. If problems such as bucket overflow or bucket skew occur, buckets are split to maintain efficient retrieval time for data records. Each directory has a directory id.
Global Depth: The number of bits in each directory id. The more records there are, the more bits are used.
Working of Dynamic Hashing
Example: if the global depth is k = 2, keys are mapped to hash indices using their k least-significant bits (LSBs). That leaves us with four possible directory ids: 00, 01, 10, and 11.

[Figure: dynamic hashing – mapping]


As the figure shows, the k LSBs of each key form the hash index that maps the key to its appropriate bucket through the directory ids. Each bucket holds the values whose binary representations end in its directory id.
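The LSB mapping for global depth k = 2 can be sketched as follows (the sample keys are illustrative):

```python
# Mapping keys to directories by their k least-significant bits,
# as in the global-depth example (k = 2 → directories 00, 01, 10, 11).
k = 2  # global depth

def directory_id(key):
    """Take the k LSBs of the key's binary representation."""
    return format(key & ((1 << k) - 1), f"0{k}b")

for key in (4, 5, 6, 7):
    print(key, "→ directory", directory_id(key))
# 4 → 00, 5 → 01, 6 → 10, 7 → 11
```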

B+ Tree File Organization

The B+ tree file organization is an advanced form of the indexed sequential access mechanism. Records are stored in a tree-like structure. It employs a similar key-index idea, in which the primary key is used to sort the records. An index value is generated for each primary key and mapped to the record.

Unlike a binary search tree (BST), a B+ tree node can contain more than two children. All records are stored solely at the leaf nodes in this method. The intermediate nodes point to the leaf nodes and hold no records themselves.

Consider a B+ tree with the following structure:

• The tree has a single root node, holding the value 25.
• There is an intermediate layer of nodes. They do not keep the original records; the only thing they hold are pointers to the leaf nodes.
• The node to the left of the root stores the values smaller than the root, while the node to the right stores the larger values, i.e. 15 and 30.
• The leaf nodes contain only values, namely 10, 12, 17, 20, 24, 27, and 29.
• Because all of the leaf nodes are at the same depth, the tree is balanced and finding any record is much easier.
• This method allows you to search for any record by following a single root-to-leaf path and access it quickly.
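The single-path search described above can be sketched as follows. The nested-dict layout mirrors the example tree (root 25, intermediates 15 and 30, values only in the leaves), but it is only an illustration, not a real B+ tree implementation:

```python
import bisect

def search(node, key):
    """Follow one root-to-leaf path; only the leaves hold the values."""
    while "leaf" not in node:
        # pick the child whose key range covers `key`
        i = bisect.bisect_right(node["keys"], key)
        node = node["children"][i]
    return key in node["leaf"]

# Illustrative tree matching the example in the text.
tree = {
    "keys": [25],
    "children": [
        {"keys": [15], "children": [{"leaf": [10, 12]},
                                    {"leaf": [17, 20, 24]}]},
        {"keys": [30], "children": [{"leaf": [27, 29]},
                                    {"leaf": []}]},
    ],
}

print(search(tree, 20))  # True: path 25 → 15 → [17, 20, 24]
print(search(tree, 23))  # False: same path, value absent
```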

Indexed Sequential Access Method or ISAM


ISAM (indexed sequential access method) is an advanced sequential file organization method. Records are stored in the file with the help of the primary key: for each primary key, an index value is created and mapped to the record. This index contains the address of the record in the file.

If a record has to be retrieved by its index value, the data block’s address is looked up and the record is fetched from memory.
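The index-to-address lookup can be sketched as a dictionary mapping each primary key to a (block, slot) address; all keys, addresses, and record contents here are illustrative:

```python
# ISAM sketch: the index maps each primary key to the address of its
# record, so a lookup is a single index probe instead of a file scan.
index = {101: (0, 0), 102: (0, 1), 103: (1, 0)}   # key → (block, slot)
blocks = [["rec-101", "rec-102"], ["rec-103"]]    # the data blocks

def fetch(key):
    addr = index.get(key)           # look up the record's address
    if addr is None:
        return None
    block, slot = addr
    return blocks[block][slot]      # fetch the record from its block

print(fetch(102))  # rec-102
```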

Cluster File Organization

Clusters are created when records from two or more tables are stored in the same file. In these files, two or more tables are placed in the very same data block, and the key attributes used to link the tables together are stored only once.
