You are on page 1of 69

File Organization and Storage

Structures
Unit-5
• File Organization
• File Organization defines how file records are mapped onto disk blocks.
We have four types of File Organization to organize file records −
• File Operations
• Operations on database files can be broadly classified into two categories −
• Update Operations
• Retrieval Operations
• Update operations change the data values by insertion, deletion, or update.
Retrieval operations, on the other hand, do not alter the data but retrieve
them after optional conditional filtering. In both types of operations,
selection plays a significant role. Other than creation and deletion of a file,
there could be several operations, which can be done on files.
• Open − A file can be opened in one of the two modes, read mode or write
mode. In read mode, the operating system does not allow anyone to alter
data. In other words, data is read only. Files opened in read mode can be
shared among several entities. Write mode allows data modification. Files
opened in write mode can be read but cannot be shared.
• Locate − Every file has a file pointer, which tells the current position
where the data is to be read or written. This pointer can be adjusted
accordingly. Using find (seek) operation, it can be moved forward or
backward.
• Read − By default, when files are opened in read mode, the file pointer
points to the beginning of the file. There are options where the user can tell
the operating system where to locate the file pointer at the time of opening
a file. The very next data to the file pointer is read.

• Write − User can select to open a file in write mode, which enables them to
edit its contents. It can be deletion, insertion, or modification. The file
pointer can be located at the time of opening or can be dynamically
changed if the operating system allows to do so.

• Close − This is the most important operation from the operating system’s
point of view. When a request to close a file is generated, the operating
system
– removes all the locks (if in shared mode),
– saves the data (if altered) to the secondary storage media, and
– releases all the buffers and file handlers associated with the file.
• The organization of data inside a file plays a major role here. The process
to locate the file pointer to a desired record inside a file various based on
whether the records are arranged sequentially or clustered.
Sequential File Organization
• This method is the easiest method for file organization. In this method, files
are stored sequentially. This method can be implemented in two ways:
• 1. Pile File Method:
• It is a quite simple method. In this method, we store the record in a
sequence, i.e., one after another. Here, the record will be inserted in the
order in which they are inserted into tables.
• In case of updating or deleting of any record, the record will be searched in
the memory blocks. When it is found, then it will be marked for deleting,
and the new record is inserted.
• Insertion of the new record:
• Suppose we have four records R1, R3 and so on upto R9 and R8 in a
sequence. Hence, records are nothing but a row in the table. Suppose we
want to insert a new record R2 in the sequence, then it will be placed at the
end of the file. Here, records are nothing but a row in any table.
• 2. Sorted File Method:
• In this method, the new record is always inserted at the file's end, and then
it will sort the sequence in ascending or descending order. Sorting of
records is based on any primary key or any other key.
• In the case of modification of any record, it will update the record and then
sort the file, and lastly, the updated record is placed in the right place.
• Insertion of the new record:
• Suppose there is a preexisting sorted sequence of four records R1, R3 and
so on upto R6 and R7. Suppose a new record R2 has to be inserted in the
sequence, then it will be inserted at the end of the file, and then it will sort
the sequence.
• Pros of sequential file organization
• It contains a fast and efficient method for the huge amount of data.
• In this method, files can be easily stored in cheaper storage
mechanism like magnetic tapes.
• It is simple in design. It requires no much effort to store the data.
• This method is used when most of the records have to be accessed
like grade calculation of a student, generating the salary slip, etc.
• This method is used for report generation or statistical calculations.

• Cons of sequential file organization


• It will waste time as we cannot jump on a particular record that is
required but we have to move sequentially which takes our time.
• Sorted file method takes more time and space for sorting the
records.
• Heap file organization
• It is the simplest and most basic type of organization. It works with data
blocks. In heap file organization, the records are inserted at the file's end.
When the records are inserted, it doesn't require the sorting and ordering of
records.
• When the data block is full, the new record is stored in some other block.
This new data block need not to be the very next data block, but it can
select any data block in the memory to store new records. The heap file is
also known as an unordered file.
• In the file, every record has a unique id, and every page in a file is of the
same size. It is the DBMS responsibility to store and manage the new
records.
• Insertion of a new record
• Suppose we have five records R1, R3, R6, R4 and R5 in a heap and
suppose we want to insert a new record R2 in a heap. If the data block 3 is
full then it will be inserted in any of the database selected by the DBMS,
let's say data block 1.
• If we want to search, update or delete the data in heap file organization,
then we need to traverse the data from staring of the file till we get the
requested record.
• If the database is very large then searching, updating or deleting of record
will be time-consuming because there is no sorting or ordering of records.
In the heap file organization, we need to check all the data until we get the
requested record.
• Pros of Heap file organization
• It is a very good method of file organization for bulk insertion. If there is a
large number of data which needs to load into the database at a time, then
this method is best suited.
• In case of a small database, fetching and retrieving of records is faster than
the sequential record.
• Cons of Heap file organization
• This method is inefficient for the large database because it takes time to
search or modify the record.
• This method is inefficient for large databases.
Hash File Organization
• Hash File Organization uses the computation of hash function on some
fields of the records. The hash function's output determines the location of
disk block where the records are to be placed.
• When a record has to be received using the hash key columns, then
the address is generated, and the whole record is retrieved using that
address. In the same way, when a new record has to be inserted, then
the address is generated using the hash key and record is directly
inserted. The same process is applied in the case of delete and
update.
• In this method, there is no effort for searching and sorting the entire
file. In this method, each record will be stored randomly in the
memory.
• A hashed file uses a mathematical function to accomplish this mapping. The user
gives the key, the function maps the key to the address and passes it to the operating
system, and the record is retrieved
• Hashing methods
• For key-address mapping, we can select one of several hashing methods. We
discuss a few of them here.
• Direct hashing
• In direct hashing, the key is the data file address without any algorithmic
manipulation. The file must therefore contain a record for every possible key..
Indexed sequential access method (ISAM)
• ISAM method is an advanced sequential file organization. In this method,
records are stored in the file using the primary key. An index value is
generated for each primary key and mapped with the record. This index
contains the address of the record in the file.
• If any record has to be retrieved based on its index value, then the address
of the data block is fetched and the record is retrieved from the memory.
• Pros of ISAM:
• In this method, each record has the address of its data block,
searching a record in a huge database is quick and easy.
• This method supports range retrieval and partial retrieval of records.
Since the index is based on the primary key values, we can retrieve
the data for the given range of value. In the same way, the partial
value can also be easily searched, i.e., the student name starting with
'JA' can be easily searched.
• Cons of ISAM
• This method requires extra space in the disk to store the index value.
• When the new records are inserted, then these files have to be
reconstructed to maintain the sequence.
• When the record is deleted, then the space used by it needs to be
released. Otherwise, the performance of the database will slow
down.

Cluster file organization
• When the two or more records are stored in the same file, it is known as
clusters. These files will have two or more tables in the same data block,
and key attributes which are used to map these tables together are stored
only once.
• This method reduces the cost of searching for various records in different
files.
• The cluster file organization is used when there is a frequent need for
joining the tables with the same condition. These joins will give only a few
records from both tables. In the given example, we are retrieving the record
for only particular departments. This method can't be used to retrieve the
record for the entire department.
• In this method, we can directly insert, update or delete any record. Data is
sorted based on the key with which searching is done. Cluster key is a type
of key with which joining of the table is performed.
• Pros of Cluster file organization
• The cluster file organization is used when there is a frequent request for
joining the tables with same joining condition.
• It provides the efficient result when there is a 1:M mapping between the
tables.
• Cons of Cluster file organization
• This method has the low performance for the very large database.
• If there is any change in joining condition, then this method cannot use. If
we change the condition of joining then traversing the file takes a lot of
time.
• This method is not suitable for a table with a 1:1 condition.
Inverted Files
• Definition: an inverted file is a word-oriented mechanism for indexing a
text collection in order to speed up the searching task.

• The inverted file may be the database file itself, rather than its index. It is
the most popular data structure used in document retrieval systems, used
on a large scale for example in search engines.

• An inverted index catalogs a collection of objects in their textual


representations. Given a set of documents, keywords and other attributes
(possibly including relevance ranking) are assigned to each document. The
inverted index is the list of keywords and links to the corresponding
document.
• Structure of inverted file:
– Vocabulary: is the set of all distinct words in the text
– Occurrences: lists containing all information necessary for each word
of the vocabulary (text position, frequency, documents where the word
appears, etc.)

• Stopwords
• The stopword list is a group of keywords which should not be indexed.
Stopwords are terms that are typically irrelevant in searches, like "a",
"and", and "the". Removing these terms while indexing can significantly
reduce index size without adversely impacting user query results.
Example
• Text:
1 6 12 16 18 25 29 36 40 45 54 58 66 70

That house has a garden. The garden has many flowers. The flowers are
beautiful
• Inverted file
Vocabulary Occurrences
beautiful 70
flowers 45, 58
garden 18, 29
house 6
Example
• Text:
Block 1 Block 2 Block 3 Block 4

That house has a garden. The garden has many flowers. The flowers are
beautiful

• Inverted file
Vocabulary Occurrences
beautiful 4
flowers 3
garden 2
house 1
B-Tree
B Tree is a specialized m-way tree that can be widely used for disk
access.
A B-Tree of order m can have at most m-1 keys and m children. One
of the main reason of using B tree is its capability to store large
number of keys in a single node and large key values by keeping the
height of the tree relatively small.
• A B tree of order m contains all the properties of an M way tree. In
addition, it contains the following properties.
• Every node in a B-Tree contains at most m children.
• Every node in a B-Tree except the root node and the leaf node
contain at least m/2 children.
• The root nodes must have at least 2 nodes.
• All leaf nodes must be at the same level.
• It is not necessary that, all the nodes contain the same number of
children but, each node must have m/2 number of nodes.
While performing some operations on B Tree, any property of B Tree may violate such as
number of minimum children a node can have. To maintain the properties of B Tree, the tree may
split or join.
Operations
Searching :
Searching in B Trees is similar to that in Binary search tree. For example, if we search for an item
49 in the following B Tree.
The process will something like following :
1. Compare item 49 with root node 78. since 49 < 78 hence, move to its left sub-tree.
2. Since, 40<49<56, traverse right sub-tree of 40.
3. 49>45, move to right. Compare 49.
4. match found, return.
Searching in a B tree depends upon the height of the tree. The search algorithm takes O(log n)
time to search any element in a B tree.
B Tree
B+ Tree
• B+ Tree is an extension of B Tree which allows efficient insertion, deletion and
search operations.
• In B Tree, Keys and records both can be stored in the internal as well as leaf
nodes. Whereas, in B+ tree, records (data) can only be stored on the leaf nodes
while internal nodes can only store the key values.
• The leaf nodes of a B+ tree are linked together in the form of a singly linked lists
to make the search queries more efficient.
• B+ Tree are used to store the large amount of data which can not be stored in the
main memory. Due to the fact that, size of main memory is always limited, the
internal nodes (keys to access records) of the B+ tree are stored in the main
memory whereas, leaf nodes are stored in the secondary memory.
• The internal nodes of B+ tree are often called index nodes. A B+ tree of order 3 is
shown in the following figure.
• Advantages of B+ Tree
• Records can be fetched in equal number of disk accesses.
• Height of the tree remains balanced and less as compare to B tree.
• We can access the data stored in a B+ tree sequentially as well as directly.
• Keys are used for indexing.
• Faster search queries as the data is stored only on the leaf nodes.
• Insertion in B+ Tree
• Step 1: Insert the new node as a leaf node
• Step 2: If the leaf doesn't have required space, split the node and copy the
middle node to the next index node.
• Step 3: If the index node doesn't have required space, split the node and copy
the middle element to the next index page.
• Example :
• Insert the value 195 into the B+ tree of order 5 shown in the following figure.

RAID
• RAID, or “Redundant Arrays of Independent Disks” is a technique
which makes use of a combination of multiple disks instead of using
a single disk for increased performance, data redundancy or both.
The term was coined by David Patterson, Garth A. Gibson, and
Randy Katz at the University of California, Berkeley in 1987.
• There are 7 levels of RAID schemes. These schemas are as RAID 0,
RAID 1, ...., RAID 6

• Why data redundancy?


• Data redundancy, although taking up extra space, adds to disk
reliability. This means, in case of disk failure, if the same data is
also backed up onto another disk, we can retrieve the data and go on
with the operation. On the other hand, if the data is spread across
just multiple disks without the RAID technique, the loss of a single
disk can affect the entire data.
• Key evaluation points for a RAID System
• Reliability: How many disk faults can the system tolerate?

• Availability: What fraction of the total session time is a


system in uptime mode, i.e. how available is the system for
actual use?

• Performance: How good is the response time? How high is


the throughput (rate of processing work)? Note that
performance contains a lot of parameters and not just the two.

• Capacity: Given a set of N disks each with B blocks, how


much useful capacity is available to the user?
RAID
• RAID refers to redundancy array of the independent disk. It is a technology
which is used to connect multiple secondary storage devices for increased
performance, data redundancy or both. It gives you the ability to survive one or
more drive failure depending upon the RAID level used.
• It consists of an array of disks in which multiple disks are connected to achieve
different goals.
• RAID technology
• There are 7 levels of RAID schemes. These schemas are as RAID 0, RAID 1, ....,
RAID 6.
• These levels contain the following characteristics:
• It contains a set of physical disk drives.
• In this technology, the operating system views these separate disks as a single
logical disk.
• In this technology, data is distributed across the physical drives of the array.
• Redundancy disk capacity is used to store parity information.
• In case of disk failure, the parity information can be helped to recover the data.
Standard RAID levels
• RAID 0
• RAID level 0 provides data stripping, i.e., a data can place across multiple
disks. It is based on stripping that means if one disk fails then all data in the
array is lost.
• This level doesn't provide fault tolerance but increases the system
performance.
• Pros of RAID 0:
• In this level, throughput is increased because multiple
data requests probably not on the same disk.
• This level full utilizes the disk space and provides high
performance.
• It requires minimum 2 drives.
• Cons of RAID 0:
• It doesn't contain any error detection mechanism.
• The RAID 0 is not a true RAID because it is not fault-
tolerance.
• In this level, failure of either disk results in complete data
loss in respective array.
• RAID 1
• This level is called mirroring of data as it copies the data from drive 1 to drive 2. It
provides 100% redundancy in case of a failure.
• Only half space of the drive is used to store the data. The other half of drive is just
a mirror to the already stored data.
• Pros of RAID 1:
• The main advantage of RAID 1 is fault tolerance. In this level, if one disk
fails, then the other automatically takes over.
• In this level, the array will function even if any one of the drives fails.
• Cons of RAID 1:
• In this level, one extra drive is required per drive for mirroring, so the
expense is higher.
• RAID 2
• RAID 2 consists of bit-level striping using hamming code parity. In this
level, each data bit in a word is recorded on a separate disk and ECC code
of data words is stored on different set disks.
• Due to its high cost and complex structure, this level is not commercially
used. This same performance can be achieved by RAID 3 at a lower cost.

• Pros of RAID 2:
• This level uses one designated drive to store parity.
• It uses the hamming code for error detection.
• Cons of RAID 2:
• It requires an additional drive for error detection.
• RAID 3
• This uses byte level striping. i.e Instead of striping the blocks across the
disks, it stripes the bytes across the disks.
• In the above diagram B1, B2, B3 are bytes. p1, p2, p3 are parities.
• Uses multiple data disks, and a dedicated disk to store parity.
• The disks have to spin in sync to get to the data.
• Sequential read and write will have good performance.
• Random read and write will have worst performance.
• This is not commonly used.
• Pros of RAID 3:
• In this level, data is regenerated using parity drive.
• It contains high data transfer rates.
• In this level, data is accessed in parallel.

• Cons of RAID 3:
• It required an additional drive for parity.
• It gives a slow performance for operating on small sized files.
• RAID 4
• RAID 4 consists of block-level stripping with a parity disk. Instead of
duplicating data, the RAID 4 adopts a parity-based approach.
• This level allows recovery of at most 1 disk failure due to the way parity
works. In this level, if more than one disk fails, then there is no way to
recover the data.
• Level 3 and level 4 both are required at least three disks to implement
RAID.
• In this figure, we can observe one disk dedicated to parity.
• In this level, parity can be calculated using an XOR function. If the data
bits are 0,0,0,1 then the parity bits is XOR(0,1,0,0) = 1. If the parity bits are
0,0,1,1 then the parity bit is XOR(0,0,1,1)= 0. That means, even number of
one results in parity 0 and an odd number of one results in parity 1.
• Suppose that in the above figure, C2 is lost due to some disk failure. Then
using the values of all the other columns and the parity bit, we can
recompute the data bit stored in C2. This level allows us to recover lost
data.
• RAID 5
• RAID 5 is a slight modification of the RAID 4 system. The only difference
is that in RAID 5, the parity rotates among the drives.
• It consists of block-level striping with DISTRIBUTED parity.
• Same as RAID 4, this level allows recovery of at most 1 disk failure. If
more than one disk fails, then there is no way for data recovery.
• This figure shows that how parity bit rotates.
• This level was introduced to make the random write performance better.
• Pros of RAID 5:
• This level is cost effective and provides high performance.
• In this level, parity is distributed across the disks in an array.
• It is used to make the random write performance better.
• Cons of RAID 5:
• In this level, disk failure recovery takes longer time as parity
has to be calculated from all available drives.
• This level cannot survive in concurrent drive failure.
• RAID 6
• This level is an extension of RAID 5. It contains block-level stripping with
2 parity bits.
• In RAID 6, you can survive 2 concurrent disk failures. Suppose you are
using RAID 5, and RAID 1. When your disks fail, you need to replace the
failed disk because if simultaneously another disk fails then you won't be
able to recover any of the data, so in this case RAID 6 plays its part where
you can survive two concurrent disk failures before you run out of options.
• Pros of RAID 6:
• This level performs RAID 0 to strip data and RAID 1 to mirror. In this
level, stripping is performed before mirroring.
• In this level, drives required should be multiple of 2.
• Cons of RAID 6:
• It is not utilized 100% disk capability as half is used for mirroring.
• It contains very limited scalability.

object-oriented database
• An object-oriented database is a collection of object-oriented programming
and relational database. There are various items which are created using
object-oriented programming languages like C++, Java which can be stored
in relational databases, but object-oriented databases are well-suited for
those items.
• An object-oriented database is organized around objects rather than actions,
and data rather than logic. For example, a multimedia record in a relational
database can be a definable data object, as opposed to an alphanumeric
value.
Data Warehousing
• Data warehousing is the process of constructing and using a data
warehouse. A data warehouse is constructed by integrating data from
multiple heterogeneous sources that support analytical reporting, structured
and/or ad hoc queries, and decision making. Data warehousing involves
data cleaning, data integration, and data consolidations.
• Using Data Warehouse Information
• There are decision support technologies that help utilize the data available
in a data warehouse. These technologies help executives to use the
warehouse quickly and effectively. They can gather data, analyze it, and
take decisions based on the information present in the warehouse. The
information gathered in a warehouse can be used in any of the following
domains −
• Tuning Production Strategies − The product strategies can be well tuned
by repositioning the products and managing the product portfolios by
comparing the sales quarterly or yearly.
• Customer Analysis − Customer analysis is done by analyzing the
customer's buying preferences, buying time, budget cycles, etc.
• Operations Analysis − Data warehousing also helps in customer
relationship management, and making environmental corrections. The
information also allows us to analyze business operations.
• The Important features of Data Warehouse are given below:
• 1. Subject Oriented
• A data warehouse is subject-oriented. It provides useful data about a
subject instead of the company's ongoing operations, and these subjects can
be customers, suppliers, marketing, product, promotion, etc.
• A data warehouse usually focuses on modeling and analysis of data that
helps the business organization to make data-driven decisions.

• 2. Time-Variant:
• The different data present in the data warehouse provides information for a
specific period.

• 3. Integrated
• A data warehouse is built by joining data from heterogeneous sources, such
as social databases, level documents, etc.

• 4. Non- Volatile
• It means, once data entered into the warehouse cannot be change.
• Functions of Data Warehouse Tools and Utilities
• The following are the functions of data warehouse tools and utilities −
• Data Extraction − Involves gathering data from multiple heterogeneous
sources.
• Data Cleaning − Involves finding and correcting the errors in data.
• Data Transformation − Involves converting the data from legacy format
to warehouse format.
• Data Loading − Involves sorting, summarizing, consolidating, checking
integrity, and building indices and partitions.
• Refreshing − Involves updating from data sources to warehouse.
Data mining

• It is a process of discovering patterns in large data sets involving methods


at the intersection of machine learning, statistics, and database systems.

• Data mining is an interdisciplinary subfield of computer


science and statistics with an overall goal to extract information (with
intelligent methods) from a data set and transform the information into a
comprehensible structure for further use.

• Data mining is the analysis step of the "knowledge discovery in databases"


process, or KDD. Aside from the raw analysis step, it also involves
database and data management aspects, data pre-
processing, model and inference considerations, interestingness
metrics, complexity considerations, post-processing of discovered
structures, visualization, and online updating.
Multimedia database
• A Multimedia database (MMDB) is a collection of related
for multimedia data.
• The multimedia data include one or more primary media data types such
is text, images, graphic objects (including drawings, sketches and illustratio
ns) animation sequences, audio and video.

• A Multimedia Database Management System (MMDBMS) is


a framework that manages different types of data potentially represented in
a wide diversity of formats on a wide array of media sources. It provides
support for multimedia data types, and facilitate for creation, storage,
access, query and control of a multimedia database
• Multimedia Database (MMDB) needs to manage additional information
pertaining to the actual multimedia data. The information is about the
following:
• Media data: the actual data representing an object.
• Media format data: information about the format of the media data after it
goes through the acquisition, processing, and encoding phases.
• Media keyword data: the keyword descriptions, usually relating to the
generation of the media data.
• Media feature data: content dependent data such as contain information
about the distribution of colours, the kinds of textures and the different
shapes present in an image.
• The last three types are called metadata as they describe several different
aspects of the media data. The media keyword data and media feature data
are used as indices for searching purpose. The media format data is used to
present the retrieved information.
Distributed Database
• A distributed database is a database that consists of two or more files located in
different sites either on the same network or on entirely different networks.
Portions of the database are stored in multiple physical locations and
processing is distributed among multiple database nodes.
• A centralized distributed database management system (DDBMS) integrates
data logically so it can be managed as if it were all stored in the same location.
The DDBMS synchronizes all the data periodically and ensures that data
updates and deletes performed at one location will be automatically reflected in
the data stored elsewhere.
• By contrast, a centralized database consists of a single database file located at
one site using a single network.
• Features of distributed databases
• When in a collection, distributed databases are logically interrelated with each
other, and they often represent a single logical database. With distributed
databases, data is physically stored across multiple sites and independently
managed. The processors on each site are connected by a network, and they
don't have any multiprocessing configuration.

You might also like