File Structure and Hashing

Introduction
• Most Applications collect huge amount of

Data, [ Bank Accounts, College Admissions,
File Structure And Hashing Scientific Experiments etc.]
• Files stored on computer are a good alternative
to store.
• Primary Storage vs. Secondary Storage.
• Less Time and Less Space vs. More Time and
More Space.
File structure and data structure File structure and data structure
• A data structure could be present both in • In most (binary) file structures, the basic
RAM and DISK. organizing elements are Pages and Physical
• Technically the file structures are more Links.
standardized, especially if one wants to allow • Page (computer memory) map to file system
a third party app or a newer version of the
same app to open the file. chunks that are (hopefully) guaranteed to be
stored contiguously and can be read and
• Data structures can change between app
versions and there is no need for compatibility. written in a single operation.
2015-11-2 2015-11-2
Data Hierarchy File Attributes
• Every file contains data which can be • Every file is stored in directory in computer
organized in hierarchy for systematic system.
organization. • Each file is having list of attributes associated
– Data Field with it. They state how to use file to any
– Record software program looks up the directory.
– File • Just like hidden files are not displayed in DOS
– Directory when we apply DIR command.
2015-11-2 2015-11-2
File Attributes File Attributes

• File name – string of characters vary by OS • Attributes Flag – a single byte is used to
• File position – position where the next specify 6 flags attached to a file
read/write operation can be performed Attribute Byte
• File structure – whether text file or binary file Read-Only 0000 0001
Hidden 0000 0010
• File access method – how to access records, System 0000 0100
sequentially or randomly Volume Label 0000 1000
Directory 0001 0000
Archive 0010 0000
2015-11-2 2015-11-2
File Attributes File Attributes
• Read-only – can’t get deleted or modified. Get • Directory – Extra bit to differentiate directory
message ‘access denied’ if you do so. from file. In directory listing it is used to
• Hidden – Not displayed in directory listing. specify sub directories. Theoretically by
• System – Important as used by system and changing this bit, we can convert a file to
must not be altered or removed. More than directory but practically results in failure.
Read-only. • Archive – Backup program uses it by setting 1
• Volume label – Every disk volume is assigned to 0 and 0 to 1 to have the latest copy.
for identification, mainly at the time of
formatting.
2015-11-2 2015-11-2
Text files vs. Binary files Text files vs. Binary files
• Basically every file is just a series of bytes one • A binary file is basically any file that is not
after the other. "line-oriented". Any file where besides the
actual written characters and newlines there
• In general every file is a binary file, but if the are other symbols as well.
data in it contains only text like letter, • A Microsoft word file is a binary file as besides
numbers and other symbols one would use in the actual text, it also contains various
writing, and if it consists of lines, then we characters representing font size and color.
consider it a text file. • Program written in the C programming
language is a text file, but after you compiled
it, the compiled version is binary.
2015-11-2 2015-11-2
File Organization File Organization
• File is a collection of related records so main • Sequential Organization
issue in file management is the way records • Relative File Organization
are organized inside the file.
• Indexed Sequential File Organization
• How to select file organization method:
– Rapid access to one or more records
– Ease of insert/update/delete one or more records
w/o affecting speed of accessing records
– Efficient storage of records
2015-11-2 2015-11-2
Sequential Organization Sequential Organization

• In a sequential file, we place records in a block • Suppose we only wish to find one record. We
one after another until there is no room for may have to search the entire file, reading
another complete record in the block. one block at a time to find the record.
• If we wish to access all the records, one after
another, this is a highly efficient file
organization.
2015-11-2 2015-11-2
Relative File Organization Relative File Organization
• In contrast to SEQ files, records of a RELATIVE • Address of ith record =
file can be accessed by specifying the record base_add + (i-1)*record_length
sequence number/relative key and without • Add of 5th record = 1000+(4)*20 = 1080
needing to read all the previous records. • Possible only if records are of equal lenths.
Relative Record Number Records in Memory
• Record with sequence number 16 is located 0 Record 0
1 Record 1
just after the 15th record.
2 Free
• Range from 0 to n-1. 3 Free
…… …..
98 Free
99 Record 99
2015-11-2 2015-11-2
Relative File Organization Indexed Sequential File Organization

• This can be used as Relative as well as • Provides fast data retrieval
sequential • Mostly records are of fixed length
• Random access, ease of processing, low • Index table stores address of each record
overhead, add/delete based on relative key • Sequential and random access is possible
• Only disk devices uses this, limited to fixed • Index table is read sequentially to find the
length records, relative record number must address of the desired record
be known in advance to access any record
2015-11-2 2015-11-2
Indexed Sequential File Organization Indexing
• Advantage is that indexes are small and • Index to file is like a catalogue to library
searched quickly • Indexed sequential file organization is efficient
• Disadvantage is extra space for index to use but in practical where a file may have
Record Number Address of Record
millions of records, method fails
1 1234 Record • Based on access time, access type, insertion
2 NULL
time, deletion time, we have two categories,
…. ….
99 4567 Record – Ordered Indices
– Hash Indices
2015-11-2 2015-11-2
Ordered Indices Ordered Indices

• Indexes are used to provide fast and random • Secondary Index, while primary index
access to records provides sequential order only. Here the
• Primary Index – Index whose search key search key which provides random access or
specifies the sequential order of file. different from sequential order will work as
• Example, STUDENT file has sequential order secondary index.
starting from roll no. 1 to 70. So roll number is • Example, in STUDENT file use student name as
primary index. Call roll no 20 and you get index which never provides sequential order,
record. as they are sorted roll number wise.
2015-11-2 2015-11-2
Hash Indices Why Hashing ?
• Based on Hashing Techniques • Internet has grown to millions of users
generating terabytes of content every day.
• With this kind of growth, it is impossible to

find anything in the internet, unless we
develop new data structures and algorithms
for storing and accessing data.
2015-11-2 2015-11-2
Why not traditional data structures? Why Hashing ?

• Traditional data structures like Arrays and • Therefore we discuss a new technique called
Linked Lists? hashing that allows us to update and retrieve
• Amount of time required to look up an any entry in constant time O(1).
element in the array is either O(log n) or O( n). • The constant time or O(1) performance means,
– If array is sorted then binary search or searched the amount of time to perform the operation
linearly. does not depend on data size 'n'.
• Either case may not be desirable if we need to
process a very large data set.
2015-11-2 2015-11-2
Hash Function Hash Function
• Pair is of the form (key, value), where for • The concept of a hash table is a generalized
given a key, we can find a value using some idea of an array where key does not have to
kind of a “function” that maps keys to values. be an integer.
• The key for a given object can be calculated • We can have a name as a key, or for that
matter any object as the key.
using a function called a hash function.
• The trick is to find a hash function to compute
• For example, given an array A, if i is the key, an index so that an object can be stored at a
then we can find the value by simply looking specific location in a table such that it can
up A[i]. easily be found.
2015-11-2 2015-11-2
Example Example
• Suppose we have a set of strings {“abc”, “def”, “ghi”} • If we assume that we have a table of size 5 to store
that we’d like to store in a table. these strings,
• Our objective here is to find or update them quickly • we can compute the location of the string by taking
from a table, actually in O(1). the sum mod 5. So we will then store “abc” in 6 mod
• We are not concerned about ordering them or 5 = 1, “def” in 15 mod 5 = 0, and “ghi” in 24 mod 5 =
maintaining any order at all. 4.
• Suppose we assign “a” = 1, “b”=2, … etc to all • So locations 1, 0 and 4 as follows.
alphabetical characters, then simply compute a
number for each of the strings.
• “abc” = 1 + 2 + 3=6, “def” = 4 + 5 + 6=15 , “ghi” = 7 +
8 + 9=24
2015-11-2 2015-11-2
Example Problem with Hashing
• We can immediately compute the location • The method discussed seems too good as we begin
using a simple hash function, which is sum of to think more about the hash function.
the characters mod Table size. • First of all, the hash function we used, that is the
• Using this hash value, we can search for the sum of the letters, is a bad one.
string. This seems to be great way to store a • In case we have permutations of the same letters,
Dictionary. "abc", "bac" etc in the set, we will end up with the
• Therefore the idea of hashing seems to be a same value for the sum and hence the key.
great way to store pairs of (key, value) in a • In this case, the strings would hash into the same
table. location, creating what we call a "collision".
2015-11-2 2015-11-2
Problem with Hashing Question 1

• Question 1: How do we pick a good hash • Picking a “good” hash function is key to
function? successfully implementing a hash table.
• What we mean by “good” is that the function
• Question 2: How do we deal with collisions? must be easy to compute and avoid collisions
as much as possible.
2015-11-2 2015-11-2
Question 1 Division Method
• Properties of Good Hash Function • This method divides Key K by Prime number
– Low Cost – Binary search find value with log n, so M and then uses the reminder thus obtained.
hash function must cost less than this
– Determinism – same hash value for same input
• h(K) = K mod M
– Uniformity – uniform hash values minimize the
number of collisions
• Example:
– h(12345) = 12345 % 95 = 90
2015-11-2 2015-11-2
Midsquare Method Folding Method

• Square the value of the Key • Divide the Key value into a number of parts.
• Extract the middle r bits of the result – K into k1, k2, k3, .....kn. where each part has same
number of digits except the last one.
• Add the individual parts.
• Example: – obtain k1+k2+k3+....+kn. the hash value is
– if K = 12345 obtained by ignoring the last carry.
– Square(K) = 152399025 • Example:
– middle 3 bits: 399 – K = 12345: k1=12, k2=34,k3=5
– middle 2 bits: 39 – k1+k2+k3 = 51
2015-11-2 2015-11-2
Multiplication Method Algebraic Method
• Choose a constant A such that 0 < A < 1 • Suppose we need to store a dictionary in a
• Multiply Key K by A
• Extract the fractional part of KA
hash table. A dictionary is a set of Strings and
• Multiply the result by m and take floor we can define a hash function as follows.
Assume that S is a string of length n and
• h(K) = floor [m (KA mod 1)]
• Example:
– h(12345) = floor[ 100 (12345*0.357840 mod 1)]
= floor[ 100 (4417.5348 mod 1) ]
= floor[ 100 (0.5348) ]
= floor[ 53.48 ] • where p is a prime number. Obviously, each
= 53 string will lead to a unique number
2015-11-2 2015-11-2
Collisions Collision Resolution

• It is difficult to find a “perfect” hash function, • If two keys hash to the same index, the
that is a function that has no collisions. corresponding records cannot be stored in the
same location.
• One problem with hashing is that it is possible • So, if it's already occupied, we must find
that two strings can hash into the same another location to store the new record, and
location. This is called a collision. do it so that we can find it when we look it up
later on.
• There are number of collision resolution
techniques, but the most popular
are chaining and open addressing.
2015-11-2 2015-11-2
Open Addressing Open Addressing
• Open addressing hash tables can store the • Linear probing, looking for the next available
records directly within the array. location i+1, i+2, etc. from the hashed value i
• A hash collision is resolved by probing, or • Quadratic probing, same as linear probing,
searching through alternate locations in the except we look for available positions i+1 , i +
array (the probe sequence) until either the 4, i + 9, etc from the hashed value i
target record is found, or an unused array slot
is found, which indicates that there is no such
key in the table.
2015-11-2 2015-11-2
Chaining
• Each slot in the array references a linked list of
inserted records that collide to the same slot.
• Insertion requires finding the correct slot, and
appending to either end of the list in that slot; Thank You...
deletion requires searching the list and removal.
2015-11-2 2015-11-2

File Structure and Hashing

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

File Structure and Hashing

Uploaded by

Copyright:

Available Formats

Introduction

• Most Applications collect huge amount of

File Attributes File Attributes

Sequential Organization Sequential Organization

Relative File Organization Indexed Sequential File Organization

Ordered Indices Ordered Indices

• With this kind of growth, it is impossible to

Why not traditional data structures? Why Hashing ?

Problem with Hashing Question 1

Midsquare Method Folding Method

Collisions Collision Resolution

You might also like