
There are two types of hashing - Internal and External Hashing.

In Internal Hashing the hash table is in memory, where each slot holds only one entry. This type of hashing is covered in a separate lesson. This lesson covers the applications of hashing techniques for indexing records on disk, where slots are called buckets and refer to pages on disk. Each bucket may hold multiple data entries. This is called External Hashing. It is used to create hashed files (indexes), in which records are positioned based on a hash function on some field(s).

External Hashing
When searching a database for a record with a specific field value or search key, we can use hashing to find the records containing that key on disk. This is done with a hash function, which takes the key and computes an integer. This integer can be mapped to the record on disk through a:
Direct file - the integer maps directly to the record. The operating system must provide support for this type of file.
Heap file - the integer maps to the id of the page containing the record, and that data page is searched sequentially. This technique works well if one directory page can hold the correct number of page ids.
Lookup table - translates the relative page address (the integer computed by the hash function) to a physical page address.
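The lookup-table variant can be illustrated with a minimal sketch. The bucket count, the physical page addresses, and the choice of CRC32 as the hash function are all assumptions made for illustration; the lesson does not prescribe any of them.

```python
import zlib

NUM_BUCKETS = 8  # hypothetical number of relative page addresses

# Hypothetical lookup table: relative page address -> physical page address.
lookup_table = {rel: 1000 + 16 * rel for rel in range(NUM_BUCKETS)}

def hash_key(key: str) -> int:
    """Map a search key to a relative page address in 0..NUM_BUCKETS-1."""
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS

def physical_page(key: str) -> int:
    """Translate the hashed key to a physical page address via the lookup table."""
    return lookup_table[hash_key(key)]
```

The same key always yields the same relative address, so the page that stores a record is also the page that is read when searching for it.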

How does External Hashing work?


The hash table is called a directory and is composed of pages on disk, or buckets, which map the key to a data page on disk containing the actual records with that key. The hash function takes the key we are looking for and computes a page number. The hash page is searched for the correct data page, containing the record with the given key. The data page is then searched to locate the actual record(s).

This technique makes searches of large databases fast. Since disk access is slow, it is very important to access as few pages on disk as possible. External hashing keeps the number of disk accesses low even for very large databases.

For example, assume we have a database of 1 million records with 4 records per page, which is 250,000 pages of data. Also assume that each key-and-pointer entry is 24 bytes and that a page is 512 bytes. Then a page can hold 21 entries. Assuming buckets are 3/4 full, each bucket holds 15 entries, so we need about 67,000 buckets. The hash function computes a value in the range 0 to 66,999. The index entries for all records then fit in the hash table, and provided there are no collisions we have to read only two pages from disk to locate a record: one hash page and one data page.
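The arithmetic in this example can be checked with a few lines of Python. The variable names are chosen here for illustration. Note that the exact bucket count works out to 66,667, which the text rounds up to 67,000.

```python
RECORDS = 1_000_000
RECORDS_PER_DATA_PAGE = 4
PAGE_SIZE = 512       # bytes per page
ENTRY_SIZE = 24       # bytes per key-and-pointer entry
FILL_FACTOR = 3 / 4   # buckets are assumed to be 3/4 full

data_pages = RECORDS // RECORDS_PER_DATA_PAGE                   # 250000
entries_per_full_page = PAGE_SIZE // ENTRY_SIZE                 # 21
entries_per_bucket = int(entries_per_full_page * FILL_FACTOR)   # 15
buckets = -(-RECORDS // entries_per_bucket)                     # ceiling division: 66667

print(data_pages, entries_per_full_page, entries_per_bucket, buckets)
```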

Static and Dynamic Hashing


Static Hashing keeps the number of primary pages in the directory fixed. Thus, when a bucket is full, we need an overflow bucket to store any additional records that hash to the full bucket. This can be done with a link to an overflow page, or a linked list of overflow pages. The linked list can be separate for each bucket, or shared by all buckets that overflow. When searching for a record, the original bucket is accessed first, then the overflow buckets. If many keys hash to the same bucket, locating a record may require accessing multiple pages on disk, which greatly degrades performance. The problem of lengthy searches through overflow buckets is solved by Dynamic Hashing. In Dynamic Hashing the size of the directory grows with the number of collisions to accommodate new records and avoid long overflow page chains. Extendible and Linear Hashing are two dynamic hashing techniques.
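The cost of overflow chains in Static Hashing can be seen in a small in-memory sketch. It assumes integer keys, a simple modulo hash function, and a separate overflow chain per bucket. The class and method names are hypothetical, and each Bucket object stands in for one page on disk.

```python
class Bucket:
    """One page: a primary bucket or an overflow page."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []      # (key, record) pairs stored in this page
        self.overflow = None   # link to the next overflow page, if any

class StaticHashFile:
    def __init__(self, num_buckets, capacity):
        self.capacity = capacity
        self.buckets = [Bucket(capacity) for _ in range(num_buckets)]

    def _primary(self, key):
        # The number of primary buckets never changes.
        return self.buckets[key % len(self.buckets)]

    def insert(self, key, record):
        page = self._primary(key)
        # Walk the chain until a page with free space is found.
        while len(page.entries) >= page.capacity:
            if page.overflow is None:
                page.overflow = Bucket(self.capacity)
            page = page.overflow
        page.entries.append((key, record))

    def search(self, key):
        """Return (matching records, number of pages read)."""
        page, pages_read, found = self._primary(key), 0, []
        while page is not None:
            pages_read += 1
            found.extend(r for k, r in page.entries if k == key)
            page = page.overflow
        return found, pages_read

f = StaticHashFile(num_buckets=4, capacity=2)
for k in (0, 4, 8, 12, 16):        # all of these hash to bucket 0
    f.insert(k, f"r{k}")
print(f.search(16))                # (['r16'], 3): primary page plus two overflow pages
print(f.search(1))                 # ([], 1): bucket 1 has no overflow chain
```

Five keys colliding on a bucket of capacity 2 already turn a one-page lookup into three page reads, which is the degradation that Dynamic Hashing is meant to avoid.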

Advantages of Extendible Hashing


When the index exceeds one page, only the upper few bits of the hashed key need to be checked to determine whether a key hashes to a bucket referred to in a given page of the index. Although the mechanism is different from that of a tree, the net effect is not much different. Extendible Hashing allows the index to grow smoothly without changes to the hash function or drastic rewriting of many pages on disk.

Disadvantages of Extendible Hashing


Extendible Hashing does not come without problems. First, when the index is doubled, additional work is added to the insertion of a single record. Moreover, when the whole index cannot fit in memory, substantial input and output of pages between memory and disk may occur. Another consideration is that when the number of records per bucket is small, we can end up with a much larger Global Depth than needed. Suppose we have only two records per bucket and 3 records whose hashed keys share the same last 20 bits. Since the Local Depth of a bucket cannot be bigger than the Global Depth, we will have a Global Depth of at least 20, even though most of the Local Depths are in the range from 1 to 5 bits.
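The Global Depth problem described above can be reproduced with a minimal extendible hashing sketch. It assumes integer keys that are used directly as hash values, a directory indexed by the low-order bits of the key, and hypothetical class names. To keep the directory small, the demonstration uses three keys that share their last 10 bits rather than 20.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

class ExtendibleHash:
    def __init__(self, capacity):
        self.capacity = capacity
        self.global_depth = 0
        self.directory = [Bucket(0)]   # 2**global_depth slots

    def _slot(self, key):
        # Index the directory by the last global_depth bits of the key.
        return key & ((1 << self.global_depth) - 1)

    def insert(self, key):
        while True:
            bucket = self.directory[self._slot(key)]
            if len(bucket.keys) < self.capacity:
                bucket.keys.append(key)
                return
            self._split(bucket)        # may need to repeat several times

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # Double the directory; slots i and i + old_size share a bucket.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bit = 1 << bucket.local_depth  # the next bit now distinguishes keys
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            (sibling if k & bit else bucket).keys.append(k)
        # Redirect the slots whose distinguishing bit is set to the sibling.
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = sibling

h = ExtendibleHash(capacity=2)
for key in (0, 1 << 10, 2 << 10):      # three keys sharing their last 10 bits
    h.insert(key)
print(h.global_depth, len(h.directory))  # 11 2048
```

Three records are enough to force a directory of 2,048 slots, because the full bucket has to keep splitting until a bit beyond the shared 10 separates the keys. In this sketch the Global Depth ends at 11, one more than the number of shared bits.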