
9 HASHING

Hashing is the transformation of a string of characters into a usually shorter fixed-length value or key that
represents the original string. Hashing is used to index and retrieve items in a database because it is faster
to find the item using the shorter hashed key than to find it using the original value. It is also used in many
encryption algorithms.

The hashing algorithm is called the hash function (and probably the term is derived from the idea that the
resulting hash value can be thought of as a "mixed up" version of the represented value). In addition to
faster data retrieval, hashing is also used to create and verify digital signatures (used to authenticate
message senders and receivers). The digital signature is transformed with the hash function and then both
the hashed value (known as a message digest) and the signature are sent in separate transmissions to the
receiver. Using the same hash function as the sender, the receiver derives a message digest from the
signature and compares it with the message digest it also received. The two should be the same.

The hash function is used to index the original value or key and to retrieve the data associated with the
value or key. Thus, hashing is always a one-way operation. There is no need to "reverse engineer" the hash
function by analyzing the hashed values. In fact, the ideal hash function cannot be derived by such analysis.
A good hash function also should not produce the same hash value from two different keys. If it does, this
is known as a collision. A hash function that offers an extremely low risk of collision may be considered
acceptable.

9.1 HASH FUNCTION

A hash function h(k) is any well-defined mathematical function that converts data of varying size into a
single integer, called an index, used to access a hash table (discussed later). The value returned by a hash
function is called a hash value; it is also referred to as a hash code, hash sum, or simply a hash.

Hash functions are frequently used to speed up table lookup and data comparison tasks, such as finding
items in a database, detecting duplicate or similar records in a large file, and finding similar stretches in
DNA sequences.

A hash function may map two or more keys to the same hash value. If h(k1) = h(k2) for two keys k1 ≠ k2,
the condition is called a hash collision. It is desirable to minimize the occurrence of such collisions, which
means that the hash function must map the keys to the hash values as evenly as possible. Depending on
the application, other properties may be required as well.

Hash functions are related to checksums, check digits, fingerprints, randomization functions, error-
correcting codes, and cryptographic hash functions.

A hash function that is injective, i.e. maps each valid input to a different hash value, is said to be a perfect
or ideal hash function. With such a function, one can directly locate the desired entry in a hash table,
without any additional searching.

Unfortunately, perfect hash functions are effective only in situations where the inputs are fixed and
entirely known in advance.

9.2 HASH TABLES

A hash table or hash map is a data structure that uses a hash function to map identifiers or keys (e.g., a
person's ID) to their associated values (e.g., the person's name, telephone number, etc.).

In static hashing the hash table ht is partitioned into b buckets, ht[0], ..., ht[b - 1]. Each bucket is capable
of holding s dictionary pairs (or pointers to that many pairs). Thus, a bucket is said to consist of s slots,
each large enough to hold one dictionary pair. Usually s = 1, and each bucket holds exactly one pair.

Ideally the hash function should map each possible key to a different slot index, but this ideal is rarely
achievable in practice (unless the hash keys are fixed; i.e. new entries are never added to the table after
creation).

In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is independent
of the number of elements stored in the table.

9.3 LOAD FACTOR

The key density of the hash table is the ratio n / T, where n is the number of pairs in the table and T is the
total number of possible keys. The load factor of the hash table is n / (sb), where b is the number of
buckets and s the number of slots per bucket.
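
For example, a hash table with b = 50 buckets of s = 1 slot each that currently holds n = 40 pairs has a
load factor of 40 / (1 × 50) = 0.8.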

The performance of most collision resolution methods does not depend directly on the number n of
stored entries, but depends strongly on the table's load factor.

9.4 WELL-KNOWN HASH FUNCTIONS

Here are some relatively simple hash functions that have been used:

9.4.1 THE DIVISION-REMAINDER METHOD

The number of items to be stored in the table is estimated. That number is then used as a divisor of each
original value or key to extract a quotient and a remainder. The remainder is the hashed value. (Since this
method is liable to produce a number of collisions, any search mechanism would have to be able to
recognize a collision and offer an alternate search mechanism.)
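
As an illustration, here is a minimal sketch in Python; the table size of 97 and the sample key are arbitrary
choices for this example:

def division_remainder_hash(key, table_size):
    # The remainder of dividing the key by the table size is the hash value.
    return key % table_size

# Example: with an estimated table size of 97 (a prime is a common choice),
# the key 132437 hashes to slot 132437 % 97.
print(division_remainder_hash(132437, 97))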

9.4.2 FOLDING

This method divides the original value (digits in this case) into several parts, adds the parts together, and
then uses the last four digits (or some other arbitrary number of digits that will work) as the hashed value
or key.
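
A minimal sketch of this idea in Python, assuming the key is split into three-digit parts and the last four
digits of the sum are kept (both choices are arbitrary for the example):

def folding_hash(key, part_size=3, num_digits=4):
    # Split the decimal digits of the key into fixed-size parts,
    # add the parts together, and keep the last num_digits digits of the sum.
    digits = str(key)
    parts = [int(digits[i:i + part_size]) for i in range(0, len(digits), part_size)]
    return sum(parts) % (10 ** num_digits)

# Example: 123456789 splits into 123, 456, and 789; their sum is 1368,
# which already fits in four digits and is the hash value.
print(folding_hash(123456789))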

9.4.3 RADIX TRANSFORMATION

Where the value or key is digital, the number base (or radix) can be changed resulting in a different
sequence of digits. (For example, a decimal numbered key could be transformed into a hexadecimal
numbered key.) High-order digits could be discarded to fit a hash value of uniform length.
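
As a sketch, discarding the high-order digits of the transformed key is equivalent to reducing the key
modulo a power of the new base; the base and digit count below are arbitrary choices for the example:

def radix_transform_hash(key, base=16, num_digits=4):
    # Re-expressing the key in the new base and keeping only the
    # num_digits low-order digits is equivalent to key mod base**num_digits.
    return key % (base ** num_digits)

# Example: the decimal key 423174 is 0x67506 in hexadecimal;
# keeping four hex digits yields 0x7506.
print(hex(radix_transform_hash(423174)))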

9.4.4 DIGIT REARRANGEMENT

This is simply taking part of the original value or key such as digits in positions 3 through 6, reversing their
order, and then using that sequence of digits as the hash value or key.
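
A minimal sketch in Python, using digits 3 through 6 as in the description (the sample key is arbitrary):

def digit_rearrangement_hash(key, start=3, end=6):
    # Take the digits in positions start through end (counting from 1)
    # and reverse their order to form the hash value.
    digits = str(key)
    return int(digits[start - 1:end][::-1])

# Example: for the key 13579246, digits 3 through 6 are 5, 7, 9, 2;
# reversed, they give the hash value 2975.
print(digit_rearrangement_hash(13579246))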

9.4.5 MID-SQUARE METHOD

In this section, we consider a hashing method that avoids the use of division. Since integer division is
usually slower than integer multiplication, avoiding division can potentially improve the running time of
the hashing algorithm. We can avoid division by making use of the fact that a computer does finite-
precision integer arithmetic.

The middle-square hashing method works as follows. Assume that the machine word is w bits, so that
single-word arithmetic is done modulo W = 2^w, and that the table size M is a power of two, say M = 2^k
for some k >= 1. Then, to hash an integer x, we use the following hash function:

h(x) = floor((M / W) (x^2 mod W))

Notice that since M and W are both powers of two, the ratio W/M = 2^(w-k) is also a power of two.
Therefore, in order to multiply the term x^2 mod W by M/W we simply shift it to the right by w - k bits! In
effect, we are extracting k bits from the middle of the square of the key, hence the name of the method.
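
A minimal sketch in Python, assuming a hypothetical word size of w = 32 bits and a table of M = 2^k slots:

def mid_square_hash(x, k, w=32):
    # Square the key modulo W = 2**w (finite-precision arithmetic),
    # then shift right by w - k bits to extract k middle bits.
    W = 1 << w
    return ((x * x) % W) >> (w - k)

# Example: hash a key into a table of M = 2**10 = 1024 slots.
print(mid_square_hash(123456789, k=10))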

9.5 COLLISION RESOLUTION TECHNIQUES

Collisions are practically unavoidable when hashing a random subset of a large set of possible keys.
Therefore, most hash table implementations have some collision resolution strategy to handle such
events. Some common strategies are described below. All these methods require that the keys (or pointers
to them) be stored in the table, together with the associated values.

9.5.1 OPEN ADDRESSING

In this strategy, all entries are stored in the bucket array itself. When a new entry is inserted, the buckets
are examined in some probe sequence, starting from the hashed slot, until an unoccupied slot is found.
When searching, the buckets are scanned in the same sequence until either the target record or an empty
slot is found; an empty slot indicates that there is no such key in the table. The name "open addressing"
refers to the fact that the location ("address") of the item is not determined solely by its hash value. This
method is also called closed hashing.

Well known probe sequences include:

9.5.1.1 LINEAR PROBING

Linear probing is a scheme in computer programming for resolving hash collisions of values of hash
functions by sequentially searching the hash table for a free location. This is accomplished using two values
- one as a starting value and one as an interval between successive values in modular arithmetic. The
second value, which is the same for all keys and known as the stepsize, is repeatedly added to the starting
value until a free space is found, or the entire table is traversed.

newLocation = (startingValue + stepSize) % arraySize

This algorithm, which is used in open-addressed hash tables, provides good memory caching (if the
stepsize is equal to one) through good locality of reference, but also results in clustering: an unfortunately
high probability that where there has been one collision there will be more. The performance of linear
probing is also more sensitive to input distribution than that of double hashing.

Given an ordinary hash function H(x), a linear probing function would be:

H x, i   H  x   i mod n 

Here H(x) is the starting value, n is the size of the hash table, i is the probe number, and the stepsize is 1.
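
The following Python sketch shows insertion and search with linear probing; the fixed-size list, the use of
Python's built-in hash, and the None marker for empty slots are all conventions of this example:

def linear_probe_insert(table, key):
    n = len(table)
    start = hash(key) % n              # H(x), the starting value
    for i in range(n):                 # probe H(x), H(x)+1, ... mod n
        slot = (start + i) % n
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")

def linear_probe_search(table, key):
    n = len(table)
    start = hash(key) % n
    for i in range(n):
        slot = (start + i) % n
        if table[slot] is None:        # empty slot: the key is absent
            return None
        if table[slot] == key:
            return slot
    return None

table = [None] * 11
linear_probe_insert(table, 42)
print(linear_probe_search(table, 42))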

9.5.1.2 QUADRATIC PROBING

Quadratic probing is a scheme in computer programming for resolving collisions in hash tables.

Quadratic probing operates by taking the original hash value and adding successive values of an arbitrary
quadratic polynomial to the starting value. This algorithm is used in open-addressed hash tables. Quadratic
probing provides good memory caching because it preserves some locality of reference; however, linear
probing has greater locality and, thus, better cache performance. Quadratic probing better avoids the
clustering problem that can occur with linear probing.

Quadratic probing is used in the Berkeley Fast File System to allocate free blocks. The allocation routine
uses quadratic probing to choose a new cylinder group when the current one is nearly full, because of the
speed it shows in finding unused cylinder groups.

Let h(k) be a hash function that maps an element k to an integer in [0,m − 1], where m is the size of the
table.

Let the ith probe position for a value k be given by the function h(k, i) = (h(k) + c1i + c2i²) mod m, where
c2 ≠ 0. If c2 = 0, then h(k, i) degrades to a linear probe. For a given hash table, the values of c1 and c2
remain constant.
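
A minimal sketch in Python, with c1 = c2 = 1 chosen arbitrarily; note that whether every slot is eventually
probed depends on the choice of m, c1, and c2:

def quadratic_probe(table, key, c1=1, c2=1):
    # Probe h(k), then h(k)+c1+c2, h(k)+2*c1+4*c2, ... (mod m)
    # until a free or matching slot is found.
    m = len(table)
    h = hash(key) % m
    for i in range(m):
        slot = (h + c1 * i + c2 * i * i) % m
        if table[slot] is None or table[slot] == key:
            return slot
    return None                        # no free or matching slot found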

9.5.1.3 DOUBLE HASHING

Double hashing is a computer programming technique used in hash tables to resolve hash collisions, cases
when two different values to be searched for produce the same hash key. It is a popular collision-
resolution technique in open-addressed hash tables.

Like linear probing, it uses one hash value as a starting point and then repeatedly steps forward an interval
until the desired value is located, an empty location is reached, or the entire table has been searched; but
this interval is decided using a second, independent hash function (hence the name double hashing).
Unlike linear probing and quadratic probing, the interval depends on the data, so that even values mapping
to the same location have different bucket sequences; this minimizes repeated collisions and the effects of
clustering. In other words, given independent hash functions h1 and h2, the jth location in the bucket
sequence for value k in a hash table of size m is:

hk , j   h1 k   j.h2 k mod n 

Linear probing and, to a lesser extent, quadratic probing are able to take advantage of the data cache by
accessing locations that are close together. Double hashing has larger intervals and is not able to achieve
this advantage.
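
A minimal sketch of the probe sequence in Python; the particular second hash function below is only one
common convention and an assumption of this sketch (it must never evaluate to zero, and ideally is
coprime with m):

def double_hash_probe(key, j, m):
    # j-th probe location: (h1(k) + j * h2(k)) mod m.
    h1 = hash(key) % m
    h2 = 1 + (hash(key) // m) % (m - 1)    # second, data-dependent step size
    return (h1 + j * h2) % m

# Example: the first three probe locations for a key in a table of size 13.
print([double_hash_probe("alpha", j, 13) for j in range(3)])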

9.5.2 SEPARATE CHAINING

In this strategy (also known as direct chaining, or simply chaining), each slot of the bucket array holds a
pointer to a linked list containing the key-value pairs that hashed to the same location. For lookup, the list
is scanned for the given key until a match is found or the end of the list is reached. For insertion, a new
record is added at either end of the list in the hashed slot; deletion requires searching the list and
removing the element if found. The technique is also called open hashing or closed addressing.
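
A minimal sketch in Python, where ordinary Python lists stand in for the linked chains:

class ChainedHashTable:
    def __init__(self, num_buckets=16):
        self.buckets = [[] for _ in range(num_buckets)]   # one chain per slot

    def insert(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: update it
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # otherwise add to the chain

    def lookup(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:
            if k == key:
                return v
        return None                       # end of chain: key not found

t = ChainedHashTable()
t.insert("alice", "555-0101")
print(t.lookup("alice"))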

Chained hash tables with linked lists are popular because they require only basic data structures with
simple algorithms, and can use simple hash functions that are unsuitable for other methods. The cost of a
table operation is that of scanning the entries of the selected bucket for the desired key. For a uniform
distribution of keys, the average cost of a lookup depends only on the average number of keys per bucket,
i.e. on the load factor.
Chained hash tables remain effective even when the number of entries is much higher than the number of
slots.

If all entries are inserted into the same bucket, the hash table becomes ineffective and the cost of
searching depends only on the bucket's data structure: with a linear data structure, the lookup procedure
may have to scan all the entries, and the worst-case complexity is directly proportional to the number of
entries in the bucket. However, if the bucket chains are implemented as lists ordered by the key field
rather than unordered lists, a 50% gain in average complexity may be achieved. For large load factors,
balanced search trees may be considered as the bucket data structure for better complexity.

Since a chained hash table uses linked lists, it inherits their disadvantages too. When storing small keys
and values, the space overhead of the next pointer in each entry record can be significant. Another
disadvantage is that traversing a linked list exhibits poor locality of reference.

10 FILE AND INDEXING

10.1 INTRODUCTION

File organization is the methodology applied to structured computer files. Files contain records, which can
be documents or information stored in a certain way for later retrieval. A record is a collection of related
fields, and a key uniquely identifies a record. File organization refers primarily to the logical arrangement
of data (which can itself be organized in a system of records with correlation between the fields/columns)
in a file system. It should not be confused with the physical storage of the file on some type of storage
medium. There are certain basic types of computer file, which include files stored as blocks of data and
streams of data, where the information streams out of the file as it is being read until the end of the file is
encountered.

We will look at two components of file organization here:

1. The way the internal file structure is arranged and


2. The external file as it is presented to the O/S or program that calls it. Here we will also examine the
concept of file extensions.

We will examine various ways that files can be stored and organized. Files are presented to the application
as a stream of bytes and then an EOF (end of file) condition.

A program that uses a file needs to know the structure of the file and needs to interpret its contents.

10.2 INTERNAL FILE STRUCTURE

It is a high-level design decision to specify a system of file organization for a computer software program or
a computer system designed for a particular purpose. Performance is high on the list of priorities for this
design process, depending on how the file is being used. The design of the file organization usually
depends mainly on the system environment: for instance, whether the file is going to be used for
transaction-oriented processes such as OLTP or for data warehousing, and whether the file is shared
among various processes, as in a typical distributed system, or used standalone. It must also be asked
whether the file is on a network and used by a number of users, whether it may be accessed internally or
remotely, and how often it is accessed.

However, overall the most important considerations might be as follows:

1. Rapid access to a record or a number of records that are related to each other,
2. The adding, modification, or deletion of records,
3. Efficiency of storage and retrieval of records, and
4. Redundancy, as a method of ensuring data integrity.

A file should be organized in such a way that the records are always available for processing with no delay.
This should be done in line with the activity and volatility of the information.

10.3 TYPES OF FILE ORGANIZATION

Organizing a file depends on what kind of file it happens to be: a file in the simplest form can be a text file,
in other words a file composed of ASCII (American Standard Code for Information Interchange) text. Files
can also be created as binary or executable types (containing elements other than plain text). In addition,
files are keyed by the host operating system with attributes that help determine their use.

10.4 TECHNIQUES OF FILE ORGANIZATION

The four techniques of file organization are as follows:

1. Sequential (SAM)
2. Line Sequential (LSAM)
3. Indexed Sequential (ISAM)
4. Hashed or Direct

In addition to these four techniques, there is another method of organizing files: the inverted list.

10.4.1 SEQUENTIAL ORGANIZATION

A sequential file contains records organized in the order in which they were entered. The order of the
records is fixed. The records are stored in physically contiguous blocks, and within each block the records
are in sequence.

Records in these files can only be read or written sequentially.

Once stored in the file, a record cannot be made shorter or longer, or deleted. However, a record can be
updated if its length does not change. (Changes of this kind are instead made by rewriting the records into
a new file.) New records always appear at the end of the file.

If the order of the records in a file is not important, sequential organization will suffice, no matter how
many records you may have. Sequential output is also useful for report printing and for the sequential
reads that some programs prefer to do.

10.4.2 LINE-SEQUENTIAL ORGANIZATION

Line-sequential files are like sequential files, except that the records can contain only characters as data.
Line-sequential files are maintained by the native byte stream files of the operating system.

In the COBOL environment, line-sequential files that are created with WRITE statements with the
ADVANCING phrase can be directed to a printer as well as to a disk.

10.4.3 INDEXED-SEQUENTIAL ORGANIZATION

Key searches are improved by this system too. The simplest indexing structure is the single-level index: a
file whose records are key-pointer pairs, in which the pointer gives the position in the data file of the
record with the given key. A subset of the records, evenly spaced along the data file, is indexed in order to
mark intervals of data records.
in order to mark intervals of data records.

A key search is performed as follows. The search key is compared with the index keys to find the highest
index key preceding the search key; a linear search is then performed from the record that this index key
points to, until the search key is matched or until the record pointed to by the next index entry is reached.
Despite the double file access (index + data) required by this sort of search, the reduction in access time is
significant compared with sequential file searches.

Let us examine, for the sake of example, a simple linear search on a sequentially organized file of 1,000
records. On average, 500 key comparisons are needed (assuming the search keys are uniformly distributed
among the data keys). However, using an evenly spaced index of 100 entries, each index entry covers 10
data records, so an average search costs about 50 comparisons in the index file plus 5 in the data file:
roughly a ten-to-one reduction in the operations count!
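
A minimal sketch of this search in Python, where the index is a list of (key, position) pairs for evenly spaced
records and the data file is a key-sorted list of records; all names and sample values are illustrative:

def indexed_sequential_search(index, data, search_key):
    # Find the highest index key that does not exceed the search key.
    start = 0
    for key, pos in index:
        if key <= search_key:
            start = pos
        else:
            break
    # Scan the data file linearly from that position.
    for record in data[start:]:
        if record[0] == search_key:
            return record
        if record[0] > search_key:        # passed the point where it would be
            return None
    return None

index = [(10, 0), (30, 3), (50, 6)]
data = [(10, "a"), (15, "b"), (20, "c"), (30, "d"), (35, "e"), (40, "f"), (50, "g")]
print(indexed_sequential_search(index, data, 35))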

Hierarchical extension of this scheme is possible, since an index is a sequential file in itself and can in turn
be indexed by a second-level index, and so forth. Exploiting deeper hierarchical decomposition of the
search pays increasing dividends in reduced access time, up to the point where the advantage starts to be
offset by the increased cost of storage, which in turn increases the index access time.

Indexed-sequential organization usually requires disk-based hardware rather than tape. Records are
physically ordered by primary key, and the index gives the physical location of each record. Records can be
accessed sequentially or directly, via the index. The index is stored in a file and read into memory when
the file is opened. In addition, the indexes must be maintained.

Like sequential organization, the data is stored in physically contiguous blocks. The difference, however, is
in the use of indexes. There are three areas in the disk storage:

1. Primary area: contains file records stored by key or ID numbers.
2. Overflow area: contains records that cannot be placed in the primary area.
3. Index area: contains the keys of records and their locations on the disk.

10.4.4 INVERTED LIST

In this organization, the file is indexed on many of the attributes of the data itself. The inverted list method
has a single index for each key type. The records are not necessarily stored in sequence; they are placed in
the data storage area, and the indexes are updated with the record keys and locations.

For example, in a company file an index could be maintained for all products, and another might be
maintained for product types. It is thus faster to search the indexes than every record. These types of files
are also known as "inverted indexes." Nevertheless, inverted list files use more media space, and storage
devices fill up quickly with this type of organization. The benefits are apparent immediately because
searching is fast; however, updating is much slower.

Content-based queries in text retrieval systems use inverted indexes as their preferred mechanism. Data
items in these systems are usually stored compressed, which would normally slow the retrieval process,
but the compression algorithm is chosen to support this technique.

When querying a file, there are certain circumstances in which the query is designed to be modal, which
means that rules are set requiring different information to be held in the index. Here is an example of this
modality: when phrase querying is undertaken, the particular algorithm requires that offsets to word
classifications be held in addition to document numbers.

10.4.5 DIRECT OR HASHED ACCESS

With direct or hashed access, a portion of disk space is reserved, and a "hashing" algorithm computes the
record address, so additional space is required in the store for this kind of file. Records are placed
randomly throughout the file and are accessed by addresses that specify their disk location. In addition,
this type of file organization requires disk storage rather than tape. It offers excellent search and retrieval
performance, but care must be taken to maintain the indexes; if the indexes become corrupt, what is left
may as well go to the bit bucket, so it is wise to keep regular backups of this kind of file, just as for all
valuable stored data.
