Professional Documents
Culture Documents
Structures
Unit-5
• File Organization
• File Organization defines how file records are mapped onto disk blocks.
We have four types of File Organization to organize file records −
• File Operations
• Operations on database files can be broadly classified into two categories −
• Update Operations
• Retrieval Operations
• Update operations change the data values by insertion, deletion, or update.
Retrieval operations, on the other hand, do not alter the data but retrieve
them after optional conditional filtering. In both types of operations,
selection plays a significant role. Other than creation and deletion of a file,
there could be several operations, which can be done on files.
• Open − A file can be opened in one of the two modes, read mode or write
mode. In read mode, the operating system does not allow anyone to alter
data. In other words, data is read only. Files opened in read mode can be
shared among several entities. Write mode allows data modification. Files
opened in write mode can be read but cannot be shared.
• Locate − Every file has a file pointer, which tells the current position
where the data is to be read or written. This pointer can be adjusted
accordingly. Using find (seek) operation, it can be moved forward or
backward.
• Read − By default, when files are opened in read mode, the file pointer
points to the beginning of the file. There are options where the user can tell
the operating system where to locate the file pointer at the time of opening
a file. The very next data to the file pointer is read.
• Write − User can select to open a file in write mode, which enables them to
edit its contents. It can be deletion, insertion, or modification. The file
pointer can be located at the time of opening or can be dynamically
changed if the operating system allows to do so.
• Close − This is the most important operation from the operating system’s
point of view. When a request to close a file is generated, the operating
system
– removes all the locks (if in shared mode),
– saves the data (if altered) to the secondary storage media, and
– releases all the buffers and file handlers associated with the file.
• The organization of data inside a file plays a major role here. The process
to locate the file pointer to a desired record inside a file various based on
whether the records are arranged sequentially or clustered.
Sequential File Organization
• This method is the easiest method for file organization. In this method, files
are stored sequentially. This method can be implemented in two ways:
• 1. Pile File Method:
• It is a quite simple method. In this method, we store the record in a
sequence, i.e., one after another. Here, the record will be inserted in the
order in which they are inserted into tables.
• In case of updating or deleting of any record, the record will be searched in
the memory blocks. When it is found, then it will be marked for deleting,
and the new record is inserted.
• Insertion of the new record:
• Suppose we have four records R1, R3 and so on upto R9 and R8 in a
sequence. Hence, records are nothing but a row in the table. Suppose we
want to insert a new record R2 in the sequence, then it will be placed at the
end of the file. Here, records are nothing but a row in any table.
• 2. Sorted File Method:
• In this method, the new record is always inserted at the file's end, and then
it will sort the sequence in ascending or descending order. Sorting of
records is based on any primary key or any other key.
• In the case of modification of any record, it will update the record and then
sort the file, and lastly, the updated record is placed in the right place.
• Insertion of the new record:
• Suppose there is a preexisting sorted sequence of four records R1, R3 and
so on upto R6 and R7. Suppose a new record R2 has to be inserted in the
sequence, then it will be inserted at the end of the file, and then it will sort
the sequence.
• Pros of sequential file organization
• It contains a fast and efficient method for the huge amount of data.
• In this method, files can be easily stored in cheaper storage
mechanism like magnetic tapes.
• It is simple in design. It requires no much effort to store the data.
• This method is used when most of the records have to be accessed
like grade calculation of a student, generating the salary slip, etc.
• This method is used for report generation or statistical calculations.
• The inverted file may be the database file itself, rather than its index. It is
the most popular data structure used in document retrieval systems, used
on a large scale for example in search engines.
• Stopwords
• The stopword list is a group of keywords which should not be indexed.
Stopwords are terms that are typically irrelevant in searches, like "a",
"and", and "the". Removing these terms while indexing can significantly
reduce index size without adversely impacting user query results.
Example
• Text:
1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are
beautiful
• Inverted file
Vocabulary Occurrences
beautiful 70
flowers 45, 58
garden 18, 29
house 6
Example
• Text:
Block 1 Block 2 Block 3 Block 4
That house has a garden. The garden has many flowers. The flowers are
beautiful
• Inverted file
Vocabulary Occurrences
beautiful 4
flowers 3
garden 2
house 1
B-Tree
B Tree is a specialized m-way tree that can be widely used for disk
access.
A B-Tree of order m can have at most m-1 keys and m children. One
of the main reason of using B tree is its capability to store large
number of keys in a single node and large key values by keeping the
height of the tree relatively small.
• A B tree of order m contains all the properties of an M way tree. In
addition, it contains the following properties.
• Every node in a B-Tree contains at most m children.
• Every node in a B-Tree except the root node and the leaf node
contain at least m/2 children.
• The root nodes must have at least 2 nodes.
• All leaf nodes must be at the same level.
• It is not necessary that, all the nodes contain the same number of
children but, each node must have m/2 number of nodes.
While performing some operations on B Tree, any property of B Tree may violate such as
number of minimum children a node can have. To maintain the properties of B Tree, the tree may
split or join.
Operations
Searching :
Searching in B Trees is similar to that in Binary search tree. For example, if we search for an item
49 in the following B Tree.
The process will something like following :
1. Compare item 49 with root node 78. since 49 < 78 hence, move to its left sub-tree.
2. Since, 40<49<56, traverse right sub-tree of 40.
3. 49>45, move to right. Compare 49.
4. match found, return.
Searching in a B tree depends upon the height of the tree. The search algorithm takes O(log n)
time to search any element in a B tree.
B Tree
B+ Tree
• B+ Tree is an extension of B Tree which allows efficient insertion, deletion and
search operations.
• In B Tree, Keys and records both can be stored in the internal as well as leaf
nodes. Whereas, in B+ tree, records (data) can only be stored on the leaf nodes
while internal nodes can only store the key values.
• The leaf nodes of a B+ tree are linked together in the form of a singly linked lists
to make the search queries more efficient.
• B+ Tree are used to store the large amount of data which can not be stored in the
main memory. Due to the fact that, size of main memory is always limited, the
internal nodes (keys to access records) of the B+ tree are stored in the main
memory whereas, leaf nodes are stored in the secondary memory.
• The internal nodes of B+ tree are often called index nodes. A B+ tree of order 3 is
shown in the following figure.
• Advantages of B+ Tree
• Records can be fetched in equal number of disk accesses.
• Height of the tree remains balanced and less as compare to B tree.
• We can access the data stored in a B+ tree sequentially as well as directly.
• Keys are used for indexing.
• Faster search queries as the data is stored only on the leaf nodes.
• Insertion in B+ Tree
• Step 1: Insert the new node as a leaf node
• Step 2: If the leaf doesn't have required space, split the node and copy the
middle node to the next index node.
• Step 3: If the index node doesn't have required space, split the node and copy
the middle element to the next index page.
• Example :
• Insert the value 195 into the B+ tree of order 5 shown in the following figure.
•
RAID
• RAID, or “Redundant Arrays of Independent Disks” is a technique
which makes use of a combination of multiple disks instead of using
a single disk for increased performance, data redundancy or both.
The term was coined by David Patterson, Garth A. Gibson, and
Randy Katz at the University of California, Berkeley in 1987.
• There are 7 levels of RAID schemes. These schemas are as RAID 0,
RAID 1, ...., RAID 6
• Pros of RAID 2:
• This level uses one designated drive to store parity.
• It uses the hamming code for error detection.
• Cons of RAID 2:
• It requires an additional drive for error detection.
• RAID 3
• This uses byte level striping. i.e Instead of striping the blocks across the
disks, it stripes the bytes across the disks.
• In the above diagram B1, B2, B3 are bytes. p1, p2, p3 are parities.
• Uses multiple data disks, and a dedicated disk to store parity.
• The disks have to spin in sync to get to the data.
• Sequential read and write will have good performance.
• Random read and write will have worst performance.
• This is not commonly used.
• Pros of RAID 3:
• In this level, data is regenerated using parity drive.
• It contains high data transfer rates.
• In this level, data is accessed in parallel.
• Cons of RAID 3:
• It required an additional drive for parity.
• It gives a slow performance for operating on small sized files.
• RAID 4
• RAID 4 consists of block-level stripping with a parity disk. Instead of
duplicating data, the RAID 4 adopts a parity-based approach.
• This level allows recovery of at most 1 disk failure due to the way parity
works. In this level, if more than one disk fails, then there is no way to
recover the data.
• Level 3 and level 4 both are required at least three disks to implement
RAID.
• In this figure, we can observe one disk dedicated to parity.
• In this level, parity can be calculated using an XOR function. If the data
bits are 0,0,0,1 then the parity bits is XOR(0,1,0,0) = 1. If the parity bits are
0,0,1,1 then the parity bit is XOR(0,0,1,1)= 0. That means, even number of
one results in parity 0 and an odd number of one results in parity 1.
• Suppose that in the above figure, C2 is lost due to some disk failure. Then
using the values of all the other columns and the parity bit, we can
recompute the data bit stored in C2. This level allows us to recover lost
data.
• RAID 5
• RAID 5 is a slight modification of the RAID 4 system. The only difference
is that in RAID 5, the parity rotates among the drives.
• It consists of block-level striping with DISTRIBUTED parity.
• Same as RAID 4, this level allows recovery of at most 1 disk failure. If
more than one disk fails, then there is no way for data recovery.
• This figure shows that how parity bit rotates.
• This level was introduced to make the random write performance better.
• Pros of RAID 5:
• This level is cost effective and provides high performance.
• In this level, parity is distributed across the disks in an array.
• It is used to make the random write performance better.
• Cons of RAID 5:
• In this level, disk failure recovery takes longer time as parity
has to be calculated from all available drives.
• This level cannot survive in concurrent drive failure.
• RAID 6
• This level is an extension of RAID 5. It contains block-level stripping with
2 parity bits.
• In RAID 6, you can survive 2 concurrent disk failures. Suppose you are
using RAID 5, and RAID 1. When your disks fail, you need to replace the
failed disk because if simultaneously another disk fails then you won't be
able to recover any of the data, so in this case RAID 6 plays its part where
you can survive two concurrent disk failures before you run out of options.
• Pros of RAID 6:
• This level performs RAID 0 to strip data and RAID 1 to mirror. In this
level, stripping is performed before mirroring.
• In this level, drives required should be multiple of 2.
• Cons of RAID 6:
• It is not utilized 100% disk capability as half is used for mirroring.
• It contains very limited scalability.
•
object-oriented database
• An object-oriented database is a collection of object-oriented programming
and relational database. There are various items which are created using
object-oriented programming languages like C++, Java which can be stored
in relational databases, but object-oriented databases are well-suited for
those items.
• An object-oriented database is organized around objects rather than actions,
and data rather than logic. For example, a multimedia record in a relational
database can be a definable data object, as opposed to an alphanumeric
value.
Data Warehousing
• Data warehousing is the process of constructing and using a data
warehouse. A data warehouse is constructed by integrating data from
multiple heterogeneous sources that support analytical reporting, structured
and/or ad hoc queries, and decision making. Data warehousing involves
data cleaning, data integration, and data consolidations.
• Using Data Warehouse Information
• There are decision support technologies that help utilize the data available
in a data warehouse. These technologies help executives to use the
warehouse quickly and effectively. They can gather data, analyze it, and
take decisions based on the information present in the warehouse. The
information gathered in a warehouse can be used in any of the following
domains −
• Tuning Production Strategies − The product strategies can be well tuned
by repositioning the products and managing the product portfolios by
comparing the sales quarterly or yearly.
• Customer Analysis − Customer analysis is done by analyzing the
customer's buying preferences, buying time, budget cycles, etc.
• Operations Analysis − Data warehousing also helps in customer
relationship management, and making environmental corrections. The
information also allows us to analyze business operations.
• The Important features of Data Warehouse are given below:
• 1. Subject Oriented
• A data warehouse is subject-oriented. It provides useful data about a
subject instead of the company's ongoing operations, and these subjects can
be customers, suppliers, marketing, product, promotion, etc.
• A data warehouse usually focuses on modeling and analysis of data that
helps the business organization to make data-driven decisions.
• 2. Time-Variant:
• The different data present in the data warehouse provides information for a
specific period.
• 3. Integrated
• A data warehouse is built by joining data from heterogeneous sources, such
as social databases, level documents, etc.
• 4. Non- Volatile
• It means, once data entered into the warehouse cannot be change.
• Functions of Data Warehouse Tools and Utilities
• The following are the functions of data warehouse tools and utilities −
• Data Extraction − Involves gathering data from multiple heterogeneous
sources.
• Data Cleaning − Involves finding and correcting the errors in data.
• Data Transformation − Involves converting the data from legacy format
to warehouse format.
• Data Loading − Involves sorting, summarizing, consolidating, checking
integrity, and building indices and partitions.
• Refreshing − Involves updating from data sources to warehouse.
Data mining