
File Organization:

 A file is a collection of records. Using the primary key, we can access the records. The
type and frequency of access are determined by the file organization chosen for a given
set of records.
 File organization is a logical relationship among various records. It defines how file
records are mapped onto disk blocks.
 File organization describes the way records are stored in terms of blocks, and how the
blocks are placed on the storage medium.
 One approach to mapping the database to files is to use several files and store records
of only one fixed length in any given file. An alternative approach is to structure our
files so that they can contain records of multiple lengths. Files of fixed-length records are
easier to implement than files of variable-length records.

Objective of file organization:


 Optimal selection of records, i.e., records can be selected as quickly as
possible.
 Insert, delete, and update operations on the records should be quick and
easy.
 Duplicate records should not be introduced as a result of insert, update, or delete
operations.
 Records should be stored efficiently, for minimal storage cost.

The following are the types of file organization:


(i). Heap File Organization:

 When a file is created using the heap file organization mechanism, the operating system
allocates a memory area to that file without any further accounting details.
 It is the responsibility of the software to manage the records.
 A heap file does not support any ordering, sequencing, or indexing on its own.
 In heap file organization, records are stored in no particular order. When a new
record is inserted, it is simply appended to the end of the file. This method is simple but
can lead to inefficient retrieval, since the entire file must be scanned to locate specific
records.
(ii). Sequential File Organization:

 Every file record contains an attribute to uniquely identify that record.


 In sequential file organization, records are stored in a specific order based on a primary
key or some other ordering criterion.
 Practically, it is not possible to store all the records sequentially in physical form.
(iii). Hash File Organization:
 This mechanism uses a hash function computed on some field of the records.
 Hash file organization is a method of organizing data files using a hash function. It
involves assigning a hash value to each record or data item and using that hash
value to determine the storage location or address of the data in the file.
 The output of the hash function determines the disk block where the record may exist.
(iv). Clustered File Organization:

 In clustered file organization, similar records are physically grouped together based on a
common attribute.
 This organization helps retrieve data easily based on a particular join condition.
 Clustered file organization is not considered good for large databases.

Organization of records in files:


We can organize records in two ways: unordered and ordered file organization.

Unordered Files: An unordered file, sometimes called a heap file, is the simplest type of file
organization. Records are placed in the file in the same order as they are inserted. A new record
is inserted in the last page of the file; if there is insufficient space in the last page, a new page is
added to the file. This makes insertion very efficient. However, as a heap file has no particular
ordering with respect to field values, a linear search must be performed to access a record. A
linear search involves reading pages from the file until the required record is found. This makes
retrievals from heap files that have more than a few pages relatively slow, unless the retrieval
involves a large proportion of the records in the file.
To delete a record, the required page first has to be retrieved, the record marked as deleted,
and the page written back to disk. The space with deleted records is not reused. Consequently,
performance progressively deteriorates as deletions occur. This means that heap files have to
be periodically reorganized by the Database Administrator (DBA) to reclaim the unused space
of deleted records.
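The heap-file behaviour described above can be sketched in a few lines of Python. This is an illustrative in-memory model, not real disk I/O; names such as PAGE_SIZE and the record layout are made up for the example:

```python
# Minimal sketch of a heap (unordered) file: pages are lists of records.
PAGE_SIZE = 4  # records per page (illustrative)

class HeapFile:
    def __init__(self):
        self.pages = [[]]  # list of pages

    def insert(self, record):
        # Append to the last page; add a new page when it is full.
        if len(self.pages[-1]) >= PAGE_SIZE:
            self.pages.append([])
        self.pages[-1].append(record)

    def search(self, key):
        # Linear search: read pages until the record is found.
        for page in self.pages:
            for rec in page:
                if rec is not None and rec["id"] == key:
                    return rec
        return None

    def delete(self, key):
        # Mark the record as deleted; its space is not reused.
        for page in self.pages:
            for i, rec in enumerate(page):
                if rec is not None and rec["id"] == key:
                    page[i] = None
                    return True
        return False

    def reorganize(self):
        # Periodic reorganization reclaims the space of deleted records.
        records = [r for p in self.pages for r in p if r is not None]
        self.pages = [[]]
        for r in records:
            self.insert(r)
```

Note how deletion only marks the slot; the `reorganize` step is what the DBA's periodic reorganization corresponds to.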

Ordered Files: The records in a file can be sorted on the values of one or more of the fields,
forming a key-sequenced data set. The resulting file is called an ordered or sequential file. The
field(s) that the file is sorted on is called the ordering field. If the ordering field is also a key of
the file, and therefore guaranteed to have a unique value in each record, the field is also called
the ordering key for the file.
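Because an ordered file is sorted on the ordering field, lookups can use binary search instead of a linear scan. A small sketch (the record layout is illustrative):

```python
# Binary search over a file ordered on its ordering key, using bisect.
import bisect

records = [(10, "A"), (20, "B"), (30, "C"), (40, "D"), (70, "E")]
keys = [k for k, _ in records]  # the ordering field, already sorted

def find(key):
    # bisect_left performs the binary search the sorted order enables.
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return records[i]
    return None
```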

Data Dictionary Storage:


A data dictionary contains metadata, i.e., data about the database. The data dictionary is very
important as it contains information such as what is in the database, who is allowed to access it,
where the database is physically stored, etc. The users of the database normally don't interact
with the data dictionary; it is handled only by the database administrators.
The data dictionary in general contains information about the following –
 Names of all the database tables and their schemas.
 Details about all the tables in the database, such as their owners, their security
constraints, when they were created etc.
 Physical information about the tables such as where they are stored and how.
 Table constraints such as primary key attributes, foreign key information etc.
 Information about the database views that are visible.
This is a data dictionary describing a table that contains employee details.

Field Name       Data Type   Field Size for display   Description                  Example
EmployeeNumber   Integer     10                       Unique ID of each employee   1645000001
Name             Text        20                       Name of the employee         David
Date of Birth    Date/Time   10                       DOB of the employee          08/03/1995
Phone Number     Integer     10                       Mobile number of employee    6583648648
The different types of data dictionary are –


Active Data Dictionary: If the structure of the database or its specifications change at any point
of time, it should be reflected in the data dictionary. This is the responsibility of the database
management system in which the data dictionary resides. So, the data dictionary is
automatically updated by the database management system when any changes are made in the
database. This is known as an active data dictionary as it is self-updating.
Passive Data Dictionary: This is not as useful or easy to handle as an active data dictionary. A
passive data dictionary is maintained separately from the database whose contents are stored in
the dictionary. That means that if the database is modified, the data dictionary is not
automatically updated, as it is in the case of an active data dictionary. So, the passive data
dictionary has to be manually updated to match the database. This needs careful handling, or
else the database and data dictionary will be out of sync.
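Most database engines expose their data dictionary as ordinary queryable tables. As one concrete illustration, SQLite keeps its catalog in the `sqlite_master` table and updates it automatically, which makes it an active data dictionary:

```python
# Querying SQLite's data dictionary (the sqlite_master catalog table).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (EmployeeNumber INTEGER PRIMARY KEY, Name TEXT)")

# The new table appears in the dictionary without any manual step,
# because the DBMS maintains the catalog itself.
tables = [row[0] for row in
          conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)  # -> ['employee']
```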

Basic concept of Indexing:


Indexing is one of the techniques used to optimize performance of a database by reducing the
number of disk accesses that are required when a query is processed.
A database index is a data structure that is helpful to quickly locate and access the data in a
database table.
Indexes are created using database columns.
 The first column is the Search key which contains a copy of the primary key or candidate
key of the table.
 The second column is the data reference that contains a set of pointers which hold the
address of the disk block where the key value can be found.

Structure of Index

The structure of an index in the database management system (DBMS) is given below −

Search key | Data reference

Types of indexes

The different types of index are as follows −

 Primary: 1. Dense, 2. sparse


 Clustering
 Secondary
Cluster Index:
Summary {used with a non-unique key, such as department_name, which can be the same for
many students.}
 An index entry is created only for the distinct values of the indexed attribute.
 A clustered index is a special type of index which reorders the way records in
the table are physically stored on the disk. It sorts and stores the data rows of the table
or view based on their key values. It is essentially a sorted copy of the data in the
indexed columns.
 It may use a combination of two or more columns to create an index. A group of records
consists of records with the same characteristics, and these groups create the indexes.
 A clustering index can behave as both a dense and a sparse type of index.
Secondary Index

Summary {two level indexing}

 An index entry (unique value) is created for each record in the data file on a candidate key.
 A secondary index is a type of dense index and is also called a non-clustering index.
 The secondary mapping size will be small, as two-level database indexing is used.
 It contains another level of indexing to minimize the size of the mapping.

Primary Index:

Summary {When the index is based on the primary key of the table, it is called a primary index.

Draw both dense and sparse.}

 When the index is based on the primary key of the table, it is called a primary index.
There are two types of primary index, called dense and sparse. A dense
index contains an index record for every search-key value in the data file. In a sparse
index, there are index records for only some of the data items.
 The number of entries in the index file will be equal to the number of blocks in the main file.
Example: if the main file has three blocks holding the record pairs (10, 20), (30, 40) and
(70, 80), the sparse index has three entries, one pointing to each block.

Types of primary Index:

Dense Index:

 In a dense index, an index record is created for every search-key value in the database.
 Dense indexing helps you search faster but needs more space to store index records.
 In dense indexing, each index record contains the search-key value and a pointer to the
real record on the disk.

Sparse Index:

 The sparse index is an index record that appears for only some of the values in the file.
 Sparse Index helps you to resolve the issues of dense indexing.
 In the sparse indexing technique, a range of index columns stores the same data block
address, and when data needs to be retrieved, this block address is fetched.
 The sparse indexing method stores index records for only some search-key values.
 It needs less space and less maintenance overhead for insertions and deletions, but it is
slower than a dense index for locating records.
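The dense/sparse contrast can be sketched over a small file of blocks. The block contents and structures below are illustrative (two records per block), not a real storage layout:

```python
# Dense vs. sparse index over a file of blocks keyed by record ID.
blocks = [[10, 20], [30, 40], [70, 80]]

# Dense index: one entry per search-key value.
dense = {key: b for b, block in enumerate(blocks) for key in block}

# Sparse index: one entry per block (its first search-key value).
sparse = [(block[0], b) for b, block in enumerate(blocks)]

def lookup_sparse(key):
    # Find the last index entry whose key <= search key, then scan that block.
    candidate = 0
    for first_key, b in sparse:
        if first_key <= key:
            candidate = b
    return candidate if key in blocks[candidate] else None
```

The dense index answers `dense[70]` directly, while the sparse index first locates the right block and then scans it; that is exactly the space-versus-speed trade-off described above.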

Ordered indices
The indices are usually sorted to make searching faster. The indices which are sorted are known
as ordered indices.

Example: Suppose we have an employee table with thousands of records, each of which is
10 bytes long. If the IDs start with 1, 2, 3, ... and so on, and we have to search for the
employee with ID 543:

o In the case of a database with no index, we have to scan the disk blocks from the start
until we reach 543. The DBMS will find the record after reading 543*10 = 5430 bytes.
o In the case of an index, we search using the index. If each index entry is 2 bytes, the
DBMS will find the record after reading 542*2 = 1084 bytes, which is far less than in the
previous case.

B+ Tree Index Files:


The concept of a B+ tree is used to store records in secondary memory. If the records are
stored using this concept, then those files are called B+ tree index files. Since this tree is
balanced and sorted, all leaf nodes are at the same distance from the root, and only the leaf
nodes hold the actual values; this makes searching for any record easy and quick in B+ tree
index files. Even insertion/deletion in a B+ tree does not take much time. Hence the B+ tree
forms an efficient method to store records.

Searching, inserting, and deleting a record are done in the same way as in any B+ tree. Since it
is a balanced tree, it searches for the position of the record in the file, and then it
fetches/inserts/deletes the record. If it finds that the tree would become unbalanced because of
an insert/delete/update, it performs the proper rearrangement of nodes so that the definition
of a B+ tree is not violated.

A simple B+ tree can be represented as below:

B+ Tree
o The B+ tree is a balanced, multi-way search tree (not a binary tree). It follows a multi-level index format.
o In the B+ tree, leaf nodes denote actual data pointers. B+ tree ensures that all leaf nodes remain
at the same height.
o In the B+ tree, the leaf nodes are linked using a linked list. Therefore, a B+ tree can support
random access as well as sequential access.
Structure of B+ Tree
o In the B+ tree, every leaf node is at equal distance from the root node. The B+ tree is of the
order n where n is fixed for every B+ tree.
o It contains an internal node and leaf node.

Internal node

o An internal node of the B+ tree can contain at least ⌈n/2⌉ child pointers, except the root node.
o At most, an internal node of the tree contains n pointers.

Leaf node

o The leaf node of the B+ tree can contain at least ⌈n/2⌉ record pointers and ⌈n/2⌉ key values.
o At most, a leaf node contains n record pointers and n key values.
o Every leaf node of the B+ tree contains one block pointer P pointing to the next leaf node.

Searching a record in B+ Tree


Suppose we have to search for 55 in the B+ tree structure below. First, we go to the
intermediary node, which directs us to the leaf node that can contain the record for 55.

So, in the intermediary node, we take the branch between the 50 and 75 keys. At the end,
we are directed to the third leaf node. Here the DBMS will perform a sequential search to find
55.
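The search walk just described can be sketched over a hand-built two-level tree. The keys and layout below are illustrative, chosen to match the 50/75 branching in the example:

```python
# Search in a tiny hand-built B+ tree: internal node -> leaf -> sequential scan.
root = {
    "keys": [25, 50, 75],
    "children": [
        {"leaf": True, "keys": [5, 10]},
        {"leaf": True, "keys": [25, 30]},
        {"leaf": True, "keys": [50, 55, 65]},
        {"leaf": True, "keys": [75, 80]},
    ],
}

def search(node, key):
    while not node.get("leaf"):
        # Take the first branch whose separator key exceeds the search key.
        i = 0
        while i < len(node["keys"]) and key >= node["keys"][i]:
            i += 1
        node = node["children"][i]
    # Sequential search within the leaf, as the DBMS does.
    return key in node["keys"]
```

Searching for 55 follows the branch between 50 and 75 into the third leaf, then scans that leaf.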
B+ Tree Insertion
Suppose we want to insert a record with key 60 in the structure below. It belongs in the 3rd leaf
node, after 55. The tree is balanced, and that leaf node is already full, so we cannot insert 60
there directly.

In this case, we have to split the leaf node so that the record can be inserted into the tree
without affecting the fill factor, balance, and order.

After inserting 60, the 3rd leaf node would hold the values (50, 55, 60, 65, 70); its parent key is
50. We split the leaf node in the middle so that the tree's balance is not altered, grouping
(50, 55) and (60, 65, 70) into two leaf nodes.

If these two are to be leaf nodes, the intermediate node cannot branch only on 50. The key 60
must be added to it, and then we can have a pointer to the new leaf node.

This is how we insert an entry when there is an overflow. In the normal scenario, it is very easy
to find the node where the new key fits and place it in that leaf node.

B+ Tree Deletion
Suppose we want to delete 60 from the above example. In this case, we have to remove 60
from the intermediate node as well as from the 4th leaf node. If we simply remove it, the tree
will no longer satisfy the rules of the B+ tree, so we need to rearrange the nodes to keep the
tree balanced.

After deleting 60 from the above B+ tree and rearranging the nodes, the result is as follows:

B+ Tree Extensions: As the number of records in the database grows, the intermediary and
leaf nodes need to be split and spread out to keep the tree balanced. This is called B+ tree
extension. As the tree spreads out, searching for records remains fast.

The main goal of the B+ tree is fast traversal of records. As the branches spread out, fewer
disk I/Os are required to reach a record; any record can be fetched in a logarithmic number of
node accesses. Suppose each node can hold n pointers and the file has K search-key values.
Then any record in the B+ tree can be reached in at most ⌈log⌈n/2⌉(K)⌉ node accesses.

Suppose each index entry takes 40 bytes and each disk block is 4 Kbytes. That means a node
can hold about 100 entries (n = 100). Say we have 1 million search-key values. Then we need at
most ⌈log50(1,000,000)⌉ = 4 node accesses to reach any record. If each block access takes
about a millisecond, it costs only around 4 milliseconds to fetch any record in the tree. This is
the advantage of extending the B+ tree into more intermediary nodes: as the intermediary
nodes spread out more and more, fetching records remains efficient even for very large files.
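The arithmetic above can be checked with a short computation of the worst-case number of node accesses:

```python
# Worst-case node accesses in a B+ tree: ceil(log base ceil(n/2) of K),
# where n is the pointers per node and K the number of search-key values.
import math

def max_node_accesses(n, K):
    fanout = math.ceil(n / 2)
    return math.ceil(math.log(K, fanout))

print(max_node_accesses(100, 1_000_000))  # -> 4
```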

Index definition in SQL:

SQL INDEX: The index in SQL is a special lookup structure used to speed up searching for data
in the database tables, particularly when a vast amount of data is retrieved from the tables
frequently. The INDEX requires its own space on the hard disk.

The index concept in SQL is the same as the index of a novel or a book. It is one of the best SQL
techniques for improving the performance of queries. The drawback of using indexes is that they
slow down the execution time of UPDATE and INSERT statements.

But they have one advantage also: they speed up the execution time of SELECT queries and
WHERE-clause lookups.

Create an INDEX: In SQL, we can easily create an index using the following CREATE statement:

CREATE INDEX Index_Name ON Table_Name (Column_Name);

Here, Index_Name is the name of that index that we want to create, and Table_Name is the
name of the table on which the index is to be created. The Column_Name represents the name
of the column on which index is to be applied.

Create UNIQUE INDEX: A unique index is similar to the primary key in SQL in that it enforces
uniqueness: it does not allow duplicate values to be inserted into the indexed columns. This
index is one of the best ways to maintain the data integrity of SQL tables.

Syntax for creating the Unique Index is as follows:

CREATE UNIQUE INDEX Index_Name ON Table_Name (Column_Name);

Rename an INDEX: We can easily rename the index of the table in the relational database using
the ALTER command.

Syntax:

ALTER INDEX old_Index_Name RENAME TO new_Index_Name;


Remove an INDEX: An Index of the table can be easily removed from the SQL database using
the DROP command. If you want to delete an index from the data dictionary, you must be the
owner of the database or have the privileges for removing it.

Syntaxes for Removing an Index in relational databases are as follows:

DROP INDEX Index_Name;

Alter an INDEX: An index of the table can be easily modified in the relational database using the
ALTER command. The basic syntax for modifying the Index in SQL is as follows:

ALTER INDEX Index_Name ON Table_Name REBUILD;
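The CREATE/DROP statements above can be tried against SQLite from Python. Note that SQLite supports CREATE [UNIQUE] INDEX and DROP INDEX but not ALTER INDEX ... RENAME or REBUILD; those forms are engine-specific (e.g., Oracle/PostgreSQL and SQL Server respectively):

```python
# Running index DDL against SQLite; table and index names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (EmployeeNumber INTEGER, Name TEXT)")
conn.execute("CREATE UNIQUE INDEX idx_empno ON employee (EmployeeNumber)")

conn.execute("INSERT INTO employee VALUES (1, 'David')")
rejected = False
try:
    # The unique index rejects a duplicate key value.
    conn.execute("INSERT INTO employee VALUES (1, 'Again')")
except sqlite3.IntegrityError:
    rejected = True

conn.execute("DROP INDEX idx_empno")
```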

Comparison of Ordered indexing and hashing:


Ordered indexing and hashing are two commonly used techniques in database systems for
efficient data retrieval.

Let's compare them based on several aspects:

 Search Complexity:

Ordered Indexing: Searching in ordered indexing is typically performed using binary search or
interpolation search, which has a logarithmic time complexity of O(log n), where n is the
number of indexed records.

Hashing: Hashing allows direct access to the desired record using a hash function. In ideal
cases, the search complexity is O(1), providing constant-time access. However, collisions can
occur, requiring additional steps to resolve them, which may increase the search complexity.

 Insertion and Deletion:

Ordered Indexing: Insertion and deletion operations in ordered indexing require maintaining
the sorted order of the index. Insertion may require shifting existing records, resulting in
additional time complexity, typically O(n). Deletion also requires reordering the index, making it
a costly operation.

Hashing: Insertion and deletion in hashing involve computing the hash value and placing the
record in the corresponding bucket. In general, the time complexity for these operations is
considered O(1). However, in the case of collisions, additional steps such as probing or chaining
may be required, affecting the overall complexity.

 Range Queries:

Ordered Indexing: Ordered indexing excels in range queries. Since the data is sorted, it is easy
to find records within a specified range by traversing the index sequentially. Range queries have
a complexity of O(k + log n), where k is the number of records in the range.

Hashing: Hashing is not optimized for range queries since the records are not stored in a
specific order. To perform range queries, all buckets need to be scanned, resulting in a time
complexity of O(m), where m is the total number of buckets.

 Space Efficiency:

Ordered Indexing: Ordered indexing typically requires additional storage space to store the
index structure. The size of the index is proportional to the number of indexed records,
resulting in higher space requirements.

Hashing: Hashing can be more space-efficient since it only requires space for the hash table
itself and the records. However, depending on the level of collisions, additional space might be
needed to handle chaining or probing.

 Handling Updates:

Ordered Indexing: Ordered indexing handles updates well, as it requires updating the index
structure and maintaining the sorted order. However, frequent updates can result in overhead
due to the need for reordering the index.

Hashing: Hashing can handle updates efficiently, especially when collisions are minimal.
Updates only require accessing the appropriate bucket and modifying the record. However,
excessive collisions can impact performance and require additional steps to resolve.

Static Hashing:
In static hashing, when a search-key value is provided, the hash function always computes the same
address. For example, if a mod-4 hash function is used, it can generate only 4 values (0 through 3).
The output address is always the same for a given key. The number of buckets provided remains
unchanged at all times.
Operation:
 Insertion − When a record is required to be entered using static hash, the hash
function h computes the bucket address for search key K, where the record will be
stored.
Bucket address = h(K)
 Search − When a record needs to be retrieved, the same hash function can be used to
retrieve the address of the bucket where the data is stored.
 Delete − This is simply a search followed by a deletion operation.
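The three operations above can be sketched with a fixed set of buckets and h(K) = K mod 4. The bucket count and record layout are illustrative:

```python
# Static hashing: a fixed number of buckets; h(K) always maps the same
# key to the same bucket address.
NUM_BUCKETS = 4
buckets = [[] for _ in range(NUM_BUCKETS)]

def h(key):
    return key % NUM_BUCKETS  # bucket address = h(K)

def insert(record):
    buckets[h(record["key"])].append(record)

def search(key):
    for rec in buckets[h(key)]:   # the same hash function locates the bucket
        if rec["key"] == key:
            return rec
    return None

def delete(key):
    rec = search(key)             # delete = search followed by removal
    if rec:
        buckets[h(key)].remove(rec)
    return rec is not None
```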

1. Open Hashing:
When the hash function generates an address at which data is already stored, the next
bucket is allocated to the record. This mechanism is called linear probing.

For example: suppose R3 is a new address which needs to be inserted, the hash
function generates address as 112 for R3. But the generated address is already full. So
the system searches next available data bucket, 113 and assigns R3 to it.
2. Close Hashing
When a bucket is full, a new data bucket is allocated for the same hash result and
is linked after the previous one. This mechanism is known as overflow chaining.

For example: Suppose R3 is a new record which needs to be inserted into the table, and
the hash function generates the address 110 for it. But this bucket is too full to store the
new data. In this case, a new bucket is appended after bucket 110 and linked to it.
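The two overflow policies can be contrasted in a short sketch. The slot count and record names (R1, R3) are illustrative:

```python
# Two overflow policies: linear probing places the record in the next free
# slot; overflow chaining links extra records to the same hash address.
NUM_SLOTS = 8

def insert_probing(table, addr, record):
    # table is a fixed list of slots; walk forward from the hash address.
    i = addr
    while table[i % NUM_SLOTS] is not None:
        i += 1
    table[i % NUM_SLOTS] = record
    return i % NUM_SLOTS

def insert_chaining(table, addr, record):
    # table maps an address to a chain (list) of records for that address.
    table.setdefault(addr, []).append(record)
```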

Dynamic Hashing
o The dynamic hashing method is used to overcome the problems of static hashing like
bucket overflow.
o In this method, data buckets grow or shrink as the number of records increases or decreases.
This method is also known as the extendable hashing method.
o This method makes hashing dynamic, i.e., it allows insertion or deletion without resulting
in poor performance.

How to search a key.

o First, calculate the hash address of the key.


o Check how many bits are used in the directory; this number of bits is called i.
o Take the least significant i bits of the hash address. This gives an index of the directory.
o Now using the index, go to the directory and find bucket address where the record
might be.

How to insert a new record:


o Firstly, you have to follow the same procedure for retrieval, ending up in some bucket.
o If there is still space in that bucket, then place the record in it.
o If the bucket is full, then we will split the bucket and redistribute the records.
o For example:
o Consider the following grouping of keys into buckets, depending on the last
i = 2 bits of their hash addresses:

o The last two bits of the hash addresses of 2 and 4 are 00, so they go into bucket
B0. The last two bits for 5 and 6 are 01, so they go into bucket B1. The last two
bits for 1 and 3 are 10, so they go into bucket B2. The last two bits for 7 are 11,
so it goes into B3.
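The grouping in this example can be sketched directly. The hash-address values below are made up so that the last two bits match the groups described above (the original figure is not shown):

```python
# Directing keys to buckets by the i least-significant bits of their
# hash addresses (extendible hashing directory lookup).
I = 2  # number of directory bits

# Hypothetical hash addresses chosen so the last two bits match the text.
hash_address = {2: 0b100, 4: 0b000, 5: 0b101, 6: 0b001,
                1: 0b110, 3: 0b010, 7: 0b111}

def bucket_of(key):
    # The last i bits of the hash address index the directory (B0..B3).
    return hash_address[key] & ((1 << I) - 1)
```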
