
Hash Data Structure and Hash Table

There are several searching techniques like linear search, binary search, search trees etc. In these
techniques, time taken to search any particular element depends on the total number of elements.

Example:

1. Linear search takes O(n) time to search an unsorted array of n elements.
2. Binary search takes O(log n) time to search a sorted array of n elements.
3. Searching a balanced binary search tree of n elements takes O(log n) time; an unbalanced tree
can degrade to O(n).

Drawback:

The main drawback of these techniques is:


✔ As the number of elements increases, the time taken to perform the search also increases.
✔ This becomes problematic when the total number of elements becomes very large.

We want to perform searching in O(1) time, independent of the number of elements.


Solution: Hashing
A hash table is a data structure that stores data in an associative manner. The data is held in an
array, and each value is placed at an index computed from its key. Access becomes very fast when we
know the index of the desired data.

Thus, insertion and search are very fast on average, largely irrespective of the size of the data. A
hash table uses an array as its storage medium and uses a hash technique to generate the index at
which an element is to be inserted or located.

Definition: Hashing is a method for storing and retrieving data in O(1) time on average. It is
sometimes called mapping, because a hash function maps a large range of key values onto a small
range of array indices. Given a search key, the hash function computes the index, and that index
points to where the record is stored.

Hans Peter Luhn of IBM is generally credited with originating hashing in the early 1950s. In November 1958, he presented his
work on hashing at a six-day international conference dedicated to scientific information.

In this technique, we give an input called a key to the hash function. The function uses this key to
compute an index into the hash table. The computed index is known as the hash value, and the data
associated with the key is stored at, and later retrieved from, that index.

Data of any size can be hashed into a shorter, fixed-length value for quicker access. This is how
key-value pairs are stored in hash tables.

Hash key: It is the raw data to be hashed into the hash table, and it can be a string or an integer.
The hashing algorithm translates the hash key into the hash value.

A simple hash function takes the remainder of the key divided by the number of slots in the table:

Hash Value = Key Value % Number of Slots in the Table (k % M)

The word "key" is also used in cryptography, where it means something different from a hash key.
Cryptographic keys include:

1. Public key:- A public key is one half of an asymmetric key pair. It can be shared openly and is
used to encrypt data or to verify signatures. Asymmetric algorithms are relatively slow compared
with symmetric ones. Common uses include cryptography in general, the transfer of bitcoins, and
securing online sessions. Because a public key works together with a private key that is kept
secret, publishing it does not compromise the overall security.

2. Private key:- A private key is the secret half of an asymmetric key pair and is used to decrypt
data or to create signatures. It should not be confused with a symmetric (shared secret) key,
where both parties hold the same key and use it for both encryption and decryption. In either
case the key is typically a long, hard-to-guess string of bits generated randomly or
pseudo-randomly.

3. SSH key pair:- SSH uses both a public and a private key to authenticate a connection to a
remote machine. The public key is placed on the remote servers; the private key stays with the
user.

Hash Function: It performs the mathematical operation of accepting the key value as input and
producing the hash code or hash value as the output. Some of the characteristics of an ideal hash
function are as follows:
✔ It must be deterministic: the same hash key always produces the same hash value.
✔ Different inputs should rarely share a hash code. Truly unique codes are impossible once there
are more possible keys than hash values.
✔ It must be collision-resistant.
✔ A small change in the input leads to a drastic change in the output.
✔ The calculation must be quick.

Hash Table: It is a type of data structure that stores data in an array format. The table maps keys to
values using a hash function.

Properties of Hash Function: The properties of a good hash function are-


1. It is efficiently computable.
2. It minimizes the number of collisions.
3. It distributes the keys uniformly over the table.


Use cases of Hashing:

• Password Storage: Hash functions are commonly used to securely store passwords. Instead of
storing the actual passwords, the system stores their hash values, in practice computed with a
salted, deliberately slow password-hashing function. When a user enters a password, it is hashed
and compared with the stored hash value for authentication.

• Data Integrity: Hashing is used to ensure data integrity by generating hash values for files or
messages. By comparing the hash values before and after transmission or storage, it's possible
to detect if any changes or tampering occurred.

• Data Retrieval: Hashing is used in data structures like hash tables, which provide efficient data
retrieval based on key-value pairs. The hash value serves as an index to store and retrieve data
quickly.

• Digital Signatures: Hash functions are an integral part of digital signatures. They are used to
generate a fixed-length hash value for a message, which is then signed with the signer's private
key. This allows for verification of the authenticity and integrity of the message using the
signer's public key.

Types of Hash Functions: The primary types of hash functions are:


1. Division Method.
2. Mid Square Method.
3. Folding Method.
4. Multiplication Method.

1. Division Method

The easiest and quickest way to create a hash value is through division. The key k is divided by M in
this hash function, and the remainder is used as the hash value.
Formula:
h(k) = k mod M (k % M)
(where k = key value and M = the size of the hash table)
Example of Division Method
k = 1987
M = 13
Therefore, h(1987) = 1987 mod 13 (1987 % 13)
Since 1987 = 13 x 152 + 11,
h(1987) = 11
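
The division method can be sketched in C as follows. This is a minimal illustration rather than a definitive implementation; the function name division_hash is an assumption made for this example, and keys are taken to be unsigned integers.

```c
#include <assert.h>
#include <stddef.h>

/* Division-method hash: returns k mod m, a slot index in the range [0, m).
   The name division_hash is illustrative, not part of any library. */
size_t division_hash(unsigned long k, size_t m) {
    assert(m > 0);              /* a table needs at least one slot */
    return (size_t)(k % m);
}
```

With this sketch, division_hash(1987, 13) evaluates to 11, because 1987 = 13 x 152 + 11.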

Advantages:
✔ This method works for any value of M, although some values distribute keys better than others.
✔ The division strategy only requires one operation, thus it is quite quick.

Disadvantages:
✗ Consecutive keys map to consecutive hash values, which can cluster related keys and result in
poor performance.
✗ M must be chosen with care. A prime number not close to a power of two is a common choice; a
power of two or ten makes the hash depend only on the low-order bits or digits of the key.

2. Mid Square Method

The following steps are required to calculate this hash method:

1. Square the value of k, i.e. compute k x k.
2. Take the middle r digits of the square as the hash value.

Formula:
h(k) = middle r digits of (k x k)

(where k = key value and r = the number of digits needed to index the table)

Example of Mid Square Method

k = 60, r = 2
Therefore, k x k = 60 x 60 = 3600
The middle two digits of 3600 are 60.
Thus,
h(60) = 60
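
A possible C sketch of the mid square method is shown below. The names mid_square_hash and digit_count are hypothetical, the digits are decimal digits, and the rule for an uneven split (the extra digit is dropped from the left) is an assumption of this example, since the text does not specify one.

```c
#include <assert.h>
#include <stdint.h>

/* Counts the decimal digits of n (0 is treated as one digit). */
static int digit_count(uint64_t n) {
    int digits = 1;
    while (n >= 10) {
        n /= 10;
        digits++;
    }
    return digits;
}

/* Mid-square hash: squares k and returns the middle r decimal digits. */
uint32_t mid_square_hash(uint32_t k, int r) {
    assert(r >= 1 && r <= 9);           /* keeps the result within 32 bits */
    uint64_t square = (uint64_t)k * k;  /* 64 bits, so the square cannot overflow */
    int digits = digit_count(square);
    if (digits <= r) {
        return (uint32_t)square;        /* fewer than r digits: use them all */
    }
    /* Assumption: when the surplus digits cannot be split evenly,
       the extra digit is dropped from the left. */
    int drop_right = (digits - r) / 2;
    for (int i = 0; i < drop_right; i++) {
        square /= 10;
    }
    uint64_t modulus = 1;
    for (int i = 0; i < r; i++) {
        modulus *= 10;
    }
    return (uint32_t)(square % modulus);
}
```

For k = 60 and r = 2 the square is 3600 and the function returns 60. For k = 1234 and r = 3 the square is 1522756 and the middle three digits are 227.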

Advantages:
✔ This technique works well because most or all of the digits in the key value affect the result. All
of the necessary digits participate in a process that results in the middle digits of the squared
result.
✔ The result is not dominated by the top or bottom digits of the initial key value.

Disadvantages:
✗ The size of the key is one of the limitations of this system; if the key is large, its square will
contain roughly twice as many digits.
✗ Collisions can still occur, since different keys may share the same middle digits.

3. Folding Method

The process involves two steps:

1. Divide the key value k into a number of parts k1, k2, k3, ..., kn, each having the same number
of digits, except for the last part, which may have fewer digits than the others.
2. Add the parts together. The hash value is the sum, ignoring the final carry, if any.

Formula:
k = k1, k2, k3, k4, ..., kn
s = k1 + k2 + k3 + k4 + ... + kn
h(k) = s

(where s = the sum of the parts of key k)

Example of Folding Method

k = 678912 (split into two-digit parts)
k1 = 67; k2 = 89; k3 = 12
Therefore,
s = k1 + k2 + k3
s = 67 + 89 + 12
s = 168
h(k) = s = 168
(If the table has only 100 slots, the carry is dropped and the hash value is 68.)
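
The folding steps can be sketched in C as below. The function name folding_hash and the choice to split the decimal digits from the left are assumptions of this example. The sketch returns the full sum; a caller who wants to drop the final carry can take the result modulo the table size.

```c
#include <assert.h>
#include <stdio.h>

/* Folding hash: splits the decimal digits of k, from the left, into parts
   of part_digits digits (the last part may be shorter) and adds the parts. */
unsigned long folding_hash(unsigned long k, int part_digits) {
    assert(part_digits > 0);
    char digits[32];                                  /* enough for a 64-bit value */
    int len = snprintf(digits, sizeof digits, "%lu", k);
    unsigned long sum = 0;
    for (int start = 0; start < len; start += part_digits) {
        unsigned long part = 0;
        /* Build the numeric value of one part, digit by digit. */
        for (int i = start; i < start + part_digits && i < len; i++) {
            part = part * 10 + (unsigned long)(digits[i] - '0');
        }
        sum += part;
    }
    return sum;
}
```

Splitting 678912 into two-digit parts gives 67 + 89 + 12, so folding_hash(678912, 2) returns 168. Splitting 123456789 into three-digit parts gives 123 + 456 + 789 = 1368.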

Advantages:
✔ Creates a simple hash value by splitting the key value into equal-sized segments.
✔ Works without regard to how the keys are distributed.

Disadvantages:
✗ When there are too many collisions, efficiency suffers.

4. Multiplication Method

1. Choose a constant value A, where 0 < A < 1.
2. Multiply the key value k by A.
3. Take the fractional part of kA.
4. Multiply the outcome of the preceding step by M, the hash table's size, and round down.

Formula:

h(k) = floor (M (kA mod 1))

(where M = size of the hash table, k = key value and A = constant value)
Example of Multiplication Method

k = 5678
A = 0.6829
M = 200
Now, calculating the value of h(5678):
h(5678) = floor(200((5678 x 0.6829) mod 1))
h(5678) = floor(200(3877.5062 mod 1))
h(5678) = floor(200(0.5062))
h(5678) = floor(101.24)
h(5678) = 101
So h(5678) is 101.
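
The four steps can be sketched in C with floating-point arithmetic. The function name multiplication_hash is an assumption of this example, and the sketch assumes the product k x A is small enough to be represented accurately as a double; production code often uses fixed-point integer arithmetic instead.

```c
#include <assert.h>
#include <stddef.h>

/* Multiplication-method hash: floor(m * (k*a mod 1)) for 0 < a < 1. */
size_t multiplication_hash(unsigned long k, double a, size_t m) {
    assert(a > 0.0 && a < 1.0);
    assert(m > 0);
    double product = (double)k * a;
    /* Fractional part of k*a: subtract the integer part. */
    double fraction = product - (double)(unsigned long long)product;
    /* Converting a non-negative double to an integer truncates,
       which is the same as taking the floor. */
    return (size_t)((double)m * fraction);
}
```

With k = 5678, A = 0.6829 and M = 200, the product is 3877.5062, the fractional part is 0.5062, and the function returns 101.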

Advantages:
✔ Any constant A between 0 and 1 can be used, although some values yield better outcomes than
others.
✔ The table size M is not critical. A power of two is convenient because the index can then be
computed quickly from the key.

Disadvantages:
✗ The quality of the distribution depends on the choice of A, so A has to be chosen with some care.

What is a Hash Collision?

A collision in a hash table occurs when the hash function produces the same hash value for two or
more different keys. Collisions are not a defect of a particular algorithm; they are an unavoidable
part of hashing. A hash function converts every input, regardless of its length, into a fixed-length
code, so there are far more possible inputs than outputs, and some inputs must eventually share a
hash value.
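
A small C sketch can make collisions concrete. It hashes a list of keys with the division method into 13 slots and counts how many keys land in a slot that is already taken. The name count_collisions and the table size of 13 are assumptions chosen for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SLOTS 13   /* illustrative table size */

/* Counts how many of the n keys hash (key % SLOTS) to a slot that an
   earlier key in the list has already claimed. */
size_t count_collisions(const unsigned long *keys, size_t n) {
    assert(keys != NULL || n == 0);
    bool used[SLOTS] = { false };
    size_t collisions = 0;
    for (size_t i = 0; i < n; i++) {
        size_t slot = keys[i] % SLOTS;
        if (used[slot]) {
            collisions++;        /* slot already claimed by an earlier key */
        } else {
            used[slot] = true;
        }
    }
    return collisions;
}
```

The keys 1987, 24 and 50 all leave remainder 11 when divided by 13, so passing them to count_collisions reports 2 collisions, while the keys 1, 2 and 3 report none.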

Collision Resolution Techniques in Data Structures


1. Open hashing/separate chaining/closed addressing
2. Open addressing/closed hashing

1. Open hashing/separate chaining/closed addressing


A typical collision handling technique called "separate chaining" links elements with the same hash
value using linked lists. It is also known as closed addressing and employs an array of linked lists
to resolve hash collisions.

Advantages:
1. Implementation is simple and easy.
2. We can always add more keys, because a chain can grow and the table never fills up.
3. Less sensitive to the hash function and to changing load factors.
4. Typically used when the number and frequency of keys to be stored in the hash table are not
known in advance.
Disadvantages:
1. Space is wasted: some slots are never used, and every node stores an extra pointer.
2. Search time grows with the length of the chain.
3. Worse cache performance than closed hashing, because chain nodes are scattered in memory.
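
Separate chaining might be sketched in C as follows. This is a minimal example under stated assumptions: integer keys are non-negative, the table has 7 buckets, and the names Node, chain_put and chain_get are hypothetical. Memory is never freed in this sketch.

```c
#include <assert.h>
#include <stdlib.h>

#define BUCKETS 7   /* illustrative number of chains */

struct Node {
    int key;
    int value;
    struct Node *next;   /* next entry in the same chain */
};

static struct Node *buckets[BUCKETS];   /* zero-initialized: all chains empty */

static size_t bucket_of(int key) {
    return (size_t)key % BUCKETS;       /* keys are assumed non-negative */
}

/* Inserts key or updates its value. Returns 0 on success, -1 if out of memory. */
int chain_put(int key, int value) {
    assert(key >= 0);
    size_t b = bucket_of(key);
    for (struct Node *n = buckets[b]; n != NULL; n = n->next) {
        if (n->key == key) {
            n->value = value;           /* key already present: update */
            return 0;
        }
    }
    struct Node *node = malloc(sizeof *node);
    if (node == NULL) {
        return -1;
    }
    node->key = key;
    node->value = value;
    node->next = buckets[b];            /* push onto the front of the chain */
    buckets[b] = node;
    return 0;
}

/* Returns a pointer to the stored value, or NULL if the key is absent. */
const int *chain_get(int key) {
    if (key < 0) {
        return NULL;
    }
    for (struct Node *n = buckets[bucket_of(key)]; n != NULL; n = n->next) {
        if (n->key == key) {
            return &n->value;
        }
    }
    return NULL;
}
```

Keys 3 and 10 both map to bucket 3 here, so they end up in the same chain, and a lookup walks that chain comparing keys until it finds a match or reaches the end.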

2. Closed hashing (Open addressing)

Instead of using linked lists, open addressing stores each entry in the array itself. The hash value
gives only the starting position, not necessarily the final location of an entry. To insert, the
algorithm checks the array beginning at the hashed index and then searches for an empty slot by
following a probe sequence. The probe sequence is the order in which slots are examined; the gap
between successive probes may be fixed or may change. There are three methods for dealing with
collisions in closed hashing:

1. Linear Probing

Linear probing inspects the hash table sequentially, starting from the hashed index. If the
requested slot is already occupied, the next one is examined, wrapping around at the end of the
table. The distance between probes in linear probing is fixed (usually 1).

Formula:
index = (hash(key) + i) % hashTableSize, for i = 0, 1, 2, ...

Sequence (for key n and table size T):
index = ( hash(n) % T)
(hash(n) + 1) % T
(hash(n) + 2) % T
(hash(n) + 3) % T … and so on.
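
The probe sequence above can be sketched in C. The example assumes a table of 10 integer slots, non-negative keys, the division method as hash(n), and -1 as the marker for an empty slot; the names linear_probe and linear_insert are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 10   /* T in the sequence above */
#define EMPTY (-1)      /* marks a slot that holds no key */

/* Returns the i-th slot of the linear probe sequence starting at home. */
size_t linear_probe(size_t home, size_t i) {
    return (home + i) % TABLE_SIZE;
}

/* Inserts key into the first free slot of its probe sequence.
   Returns the slot used, or -1 if the table is full. */
int linear_insert(int table[TABLE_SIZE], int key) {
    assert(key >= 0);
    size_t home = (size_t)key % TABLE_SIZE;
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        size_t slot = linear_probe(home, i);
        if (table[slot] == EMPTY) {
            table[slot] = key;
            return (int)slot;
        }
    }
    return -1;
}
```

Starting from a table filled with EMPTY, inserting 15, 25 and 35 (all of which hash to 5) places them in slots 5, 6 and 7.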
2. Quadratic Probing

The distance between subsequent probes is the only difference between linear and quadratic
probing. If the hashed slot is already taken, you keep probing until you find an available slot for
the entry. The i-th probe is offset from the original hashed index by the value of a quadratic
polynomial in i, most simply i x i, so the gap between successive probes keeps growing.

Formula:
index = (hash(key) + i x i) % hashTableSize, for i = 0, 1, 2, ...

Sequence (for key n and table size T):
index = ( hash(n) % T)
(hash(n) + 1 x 1) % T
(hash(n) + 2 x 2) % T
(hash(n) + 3 x 3) % T … and so on
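
A C sketch of the quadratic sequence, under the same kind of assumptions as before: a table of 10 integer slots, non-negative keys, the division method as hash(n), and -1 for an empty slot. The names quadratic_probe and quadratic_insert are hypothetical. Note that a quadratic sequence does not necessarily visit every slot, so the insert can fail even when free slots remain.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 10   /* T in the sequence above */
#define EMPTY (-1)      /* marks a slot that holds no key */

/* Returns the i-th slot of the quadratic probe sequence starting at home. */
size_t quadratic_probe(size_t home, size_t i) {
    return (home + i * i) % TABLE_SIZE;
}

/* Tries TABLE_SIZE probes and inserts key into the first free slot found.
   Returns the slot used, or -1 if none of the probed slots was free. */
int quadratic_insert(int table[TABLE_SIZE], int key) {
    assert(key >= 0);
    size_t home = (size_t)key % TABLE_SIZE;
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        size_t slot = quadratic_probe(home, i);
        if (table[slot] == EMPTY) {
            table[slot] = key;
            return (int)slot;
        }
    }
    return -1;
}
```

Starting from an empty table, inserting 15, 25, 35 and 45 (all of which hash to 5) places them in slots 5, 6, 9 and 4, following the offsets 0, 1, 4 and 9.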

3. Double-Hashing

The distance between probes is determined by a second hash function. Double hashing is an effective
technique for decreasing clustering, because keys that collide at the same index usually follow
different probe sequences. The second hash function must never evaluate to zero, otherwise the
probe sequence would not move.

Formula:
index = (firstHash(key) + i * secondHash(key)) % size of the table, for i = 0, 1, 2, ...

Sequence:
index = hash(x) % S
(hash(x) + 1*hash2(x)) % S
(hash(x) + 2*hash2(x)) % S
(hash(x) + 3*hash2(x)) % S … and so on
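
The following C sketch shows one way to realise double hashing. The table size of 11, the constant 7, and the second hash function 7 - (key % 7) are assumptions chosen for this example (a commonly used form that never yields zero); the text above does not prescribe a particular second function.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 11   /* S in the sequence above; a prime, by assumption */
#define STEP_PRIME 7    /* a prime smaller than TABLE_SIZE, by assumption */

/* First hash: the home slot of the key. */
size_t first_hash(unsigned key) {
    return key % TABLE_SIZE;
}

/* Second hash: the step between probes, always in the range [1, STEP_PRIME]. */
size_t second_hash(unsigned key) {
    return STEP_PRIME - (key % STEP_PRIME);
}

/* Returns the i-th slot of the double-hashing probe sequence for key. */
size_t double_hash_probe(unsigned key, size_t i) {
    assert(second_hash(key) != 0);   /* a zero step would never move */
    return (first_hash(key) + i * second_hash(key)) % TABLE_SIZE;
}
```

Keys 25 and 14 share the home slot 3, but their steps differ (3 and 7), so their probe sequences are 3, 6, 9, 1, ... and 3, 10, 6, ... respectively.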

Importance of Hashing

1. Easy and efficient retrieval of required information from large data sets.
2. The hash code produced by the hash function acts as a compact fingerprint of the data, which
can be compared to check data integrity.
3. The data is stored in a structured manner, as the hash function assigns every record an index in
the hash table. This ensures efficient storage and retrieval.

Limitations of Hashing
1. Collisions often occur, where two or more inputs have the same hash value.
2. The performance of the hashing algorithm depends upon the quality of the hash function. A
poorly chosen hash function may lead to many collisions, thus reducing the efficiency of the
algorithm.

Hash Table in Data Structures


In the previous section, we saw what hashing is and how it works. We saw that a hash table is a data
structure that stores data in an array format. The table maps keys to values using a hash function.

What is a Hash Table in Data Structures?

Definition: A hash table is a data structure used to insert, look up, and remove key-value pairs
quickly. It implements an associative array. Each key is translated by a hash function into an index
in an array, and that index serves as the storage location for the matching value. Ideally each key
gets a distinct index; when two keys share one, a collision resolution technique decides where the
second pair goes.

How Does the Hash Table Work?

1. Initialize the Hash Table: Start with an empty array of a fixed size (let's say 10). Each index in
the array represents a bucket in the hash table.
[empty, empty, empty, empty, empty, empty, empty, empty, empty, empty]

2. Insertion: When inserting a key-value pair into the hash table, the hash function is used to
calculate the index where the pair should be stored. The hash function takes the key as input and
returns an integer value.
Let's say we want to insert the key "apple" with the value "fruit" into the hash table, and the
hash function gives hash("apple") = 5. Since the hash value is 5, we store the key-value pair at
index 5 of the array:
[empty, empty, empty, empty, empty, ("apple", "fruit"), empty, empty, empty, empty]

3. Retrieval: To retrieve a value associated with a given key, we use the same hash function to
calculate the index. We then look for the key-value pair at that index.
For example, if we want to retrieve the value for the key "apple," the hash function calculates
the index as 5. We check index 5 in the array and find the key-value pair ("apple", "fruit"):
[empty, empty, empty, empty, empty, ("apple", "fruit"), empty, empty, empty, empty]

4. Collision: Hash tables can encounter collisions when two different keys map to the same index.
This can happen if the hash function does not distribute keys evenly or if the array size is
relatively small compared to the number of keys.
For example, if we try to insert the key "banana" and the hash function calculates the index as 5,
we encounter a collision. We search for the next available slot and find index 6 is empty. So, we
store the key-value pair ("banana", "fruit") at index 6:
[empty, empty, empty, empty, empty, ("apple", "fruit"), ("banana", "fruit"), empty, empty, empty]
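
The walkthrough above can be sketched in C with string keys. Everything named here is an assumption for illustration: the toy hash function string_hash (sum of character codes modulo the table size), the functions table_put and table_get, and the choice of linear probing for collisions. The index values it produces differ from the illustrative hash("apple") = 5 used in the text. The table stores the caller's pointers without copying, so the strings must outlive the table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CAPACITY 10   /* fixed table size, as in the walkthrough */

struct Entry {
    const char *key;     /* NULL marks an empty slot */
    const char *value;
};

static struct Entry slots[CAPACITY];   /* zero-initialized: all slots empty */

/* Toy hash: sum of the character codes, reduced to a slot index. */
static size_t string_hash(const char *s) {
    size_t sum = 0;
    while (*s != '\0') {
        sum += (unsigned char)*s++;
    }
    return sum % CAPACITY;
}

/* Stores or updates a pair. Returns the slot used, or -1 if the table is full. */
int table_put(const char *key, const char *value) {
    assert(key != NULL);
    size_t home = string_hash(key);
    for (size_t i = 0; i < CAPACITY; i++) {
        size_t idx = (home + i) % CAPACITY;   /* linear probing */
        if (slots[idx].key == NULL || strcmp(slots[idx].key, key) == 0) {
            slots[idx].key = key;
            slots[idx].value = value;
            return (int)idx;
        }
    }
    return -1;
}

/* Returns the value stored for key, or NULL if the key is absent. */
const char *table_get(const char *key) {
    assert(key != NULL);
    size_t home = string_hash(key);
    for (size_t i = 0; i < CAPACITY; i++) {
        size_t idx = (home + i) % CAPACITY;
        if (slots[idx].key == NULL) {
            return NULL;                      /* empty slot: key not present */
        }
        if (strcmp(slots[idx].key, key) == 0) {
            return slots[idx].value;
        }
    }
    return NULL;
}
```

With this toy hash, anagrams such as "listen" and "silent" always collide, so the second one is placed in the slot after the first, mirroring the "banana" step of the walkthrough.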

Hash Table Data Structure Algorithm

1. Initialize an array of fixed size to serve as the underlying storage for the hash table. The size of
the array depends on the expected number of key-value pairs to be stored.
2. Define a hash function that takes a key as input and generates a hash code, which is reduced to
an index in the array.
3. To insert a key-value pair into the hash table, apply the hash function to the key to determine
the index. If the calculated index is already occupied, handle collisions using a collision
resolution strategy.
4. To retrieve the value associated with a key, apply the hash function to the key to determine the
index. If the index contains an entry, compare the stored key with the given key. If the keys
match, return the corresponding value. If they don't match, continue along the chain or probe
sequence used by the collision resolution strategy. If an empty slot or the end of the chain is
reached, the key is not present in the hash table.
5. To remove a key-value pair from the hash table, locate the key in the same way as for
retrieval. If the key is found, remove the key-value pair from the table; otherwise the key is
not present. With open addressing, mark the slot as deleted instead of emptying it, so that
later searches do not stop early.

Characteristics of a Good Hash Function


1. Deterministic: A hash function should always produce the same hash value for the same input.
This property ensures consistency and allows for easy verification and comparison.
2. Uniform distribution: The hash function should evenly distribute the hash values across the
output space. This property helps to minimize collisions and ensures that different inputs have a
low probability of producing the same hash value.
3. Fixed output size: A good hash function has a fixed output size, regardless of the input size.
This property enables efficient storage and comparison of hash values.
4. Fast computation: Hash functions should be computationally efficient to handle large amounts
of data quickly. Speed is crucial for applications that involve hashing large datasets or
frequently computing hash values.
5. Avalanche effect: A small change in the input should result in a significant change in the output
hash value. This property ensures that even a minor modification in the input leads to a
completely different hash value, making it difficult to predict or reverse-engineer the original
input.
6. Resistance to collisions: While it is impossible to eliminate collisions (different inputs mapping
to the same hash value), a good hash function minimizes their occurrence. It should be difficult
to find two different inputs that produce the same hash value.
7. Non-invertibility: Given a hash value, it should be computationally infeasible to determine the
original input that generated it. This property provides security and helps protect sensitive
information.
8. Well-defined behavior: The hash function should have an unambiguous definition for all
possible inputs. It should handle edge cases and corner cases gracefully, without producing
unexpected results or errors.
9. Security properties: For cryptographic applications, a good hash function should exhibit
additional security properties such as resistance to preimage attacks (finding an input given its
hash value) and second preimage attacks (finding a different input with the same hash value).

Examples of widely-used hash functions that possess these characteristics include:

1. SHA-3 (Secure Hash Algorithm 3)
2. BLAKE2
3. MD5 (no longer considered secure, because practical collision attacks are known)

Note that the choice of a hash function depends on the specific requirements of the application, and
different hash functions may be suitable for different use cases.

Hash Collision

1. A hash collision refers to a situation where two different inputs produce the same hash value or
hash code when processed by a hash function. In other words, it occurs when two distinct pieces
of data result in an identical hash output.
2. Hash collisions can have practical implications depending on the context, e.g. in hash tables,
collisions can lead to degraded performance if not handled efficiently.
3. Cryptographic hash functions, used in areas like data security and digital signatures, have
stricter requirements to prevent collisions. These functions aim to provide a high level of
collision resistance, making it computationally infeasible to find two inputs that produce the
same hash value.
Cryptographic hash collisions are of particular concern because they can undermine the
integrity and security of cryptographic systems. Researchers and cryptographers continually
evaluate and analyze hash functions to ensure their collision resistance and security properties,
as vulnerabilities or weaknesses discovered in hash functions could have significant
implications for various applications that rely on them.

Types of Hash Collision

There are primarily two types of hash collisions: accidental or random collisions and intentional or
malicious collisions. Let's explore each type in more detail:

1. Accidental Collisions: Accidental collisions occur when two different inputs produce the same
hash value due to the nature of hashing algorithms and the limited range of possible hash
values. These collisions are unintended and usually considered rare and coincidental. The
probability of accidental collisions depends on the quality of the hashing algorithm and the size
of the hash space. Good hashing algorithms strive to minimize the chances of accidental
collisions by distributing hash values uniformly across the output space.

2. Intentional Collisions: Intentional collisions occur when an attacker purposefully generates two
or more different inputs that produce the same hash value. These collisions are often the result
of exploiting vulnerabilities or weaknesses in the hashing algorithm. Intentional collisions can
be a significant security concern in various scenarios, such as digital signatures, certificate
authorities, password hashing, and data integrity checks.

Two main categories of attack on hash functions are relevant here:

(a) Preimage Attacks: In a preimage attack, an attacker tries to find any input that matches
a given hash value. This type of attack is aimed at breaking the one-way property of a hash
function, which means that it is computationally infeasible to determine the original input from
its hash value.

(b) Collision Attacks: In a collision attack, an attacker aims to find two different inputs that
produce the same hash value. The goal is to break the collision resistance property of a hash
function, which ensures that it is computationally difficult to find any two inputs with the same
hash value.

Hash Table Applications


1. Dictionaries: Hash tables are commonly used to implement dictionary data structures, where
words or terms are associated with their definitions or values. Hashing allows for quick lookup
and retrieval of definitions based on the provided key.
2. Caching: Hash tables are employed in caching systems to store frequently accessed data. By
using a hash table, the most recently used or important items can be stored, and retrieval can be
performed in constant time, improving overall system performance.
3. Symbol tables: In compilers and interpreters, hash tables are used to implement symbol tables.
Symbol tables store information about variables, functions, classes, and other symbols used in a
program. Hashing enables efficient lookup and manipulation of symbols during compilation or
interpretation.
4. Database indexing: Hash tables are used in database systems to implement indexing structures.
Hash-based indexing allows for fast data retrieval based on a specific key or attribute, speeding
up query processing in databases.
5. Set membership: Hash tables are utilized to implement sets, where membership operations like
adding, removing, and checking for the presence of an element are performed efficiently.
Hashing allows for constant-time lookup and insertion operations, making sets suitable for tasks
such as duplicate removal and membership testing.
6. Caching of computed values: Hash tables are employed to store pre-computed results or
expensive computations. By caching the results based on the input parameters, subsequent
requests can be satisfied instantly, avoiding redundant calculations.
7. Spell checkers: Hash tables are used in spell-checking algorithms to store a dictionary of valid
words. By hashing the words, the algorithm can quickly check if a given word is present in the
dictionary, helping identify potential misspellings.
8. Cryptographic systems: Hash functions, rather than hash tables themselves, are employed in
cryptographic systems for tasks like password storage and digital signatures. Hash functions
convert sensitive data into fixed-length hashes, and a hash table can then provide an efficient
way to store and look up the hashed values.

What is Load Factor?


The load factor is essentially a measurement of how full (occupied) the hash table is, and is defined as:
λ = number of occupied slots / total number of slots.
In basic terms, if we have a hash table with a capacity of 1000 slots and 500 slots occupied, the load
factor is λ = 500/1000 = 0.5.
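
The definition translates directly into a short C function. The name load_factor is an assumption made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Load factor: occupied slots divided by total slots. */
double load_factor(size_t occupied, size_t capacity) {
    assert(capacity > 0);            /* avoid division by zero */
    assert(occupied <= capacity);    /* holds for open addressing tables */
    return (double)occupied / (double)capacity;
}
```

For the table described above, load_factor(500, 1000) returns 0.5.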

The higher the load factor, the more space is conserved, but at the cost of more collisions, for
example more items chained into each slot under separate chaining (described above). This increases
the time it takes for items to be accessed.

In general, when faced with the choice of whether time or space is more essential, designers usually
prefer to optimise time (raising speed) over memory space. This tradeoff between time and space is
known as the time-space tradeoff.

Analysis of Hashing
We stated earlier that in the best case hashing would provide an O(1), constant time search technique.
However, due to collisions, the number of comparisons is typically higher. Even though a
complete analysis of hashing is beyond the scope of this text, we can state some well-known results
that approximate the number of comparisons necessary to search for an item.

The most important piece of information we need to analyze the use of a hash table is the load factor, λ.
Conceptually, if λ is small, then there is a lower chance of collisions, meaning that items are more
likely to be in the slots where they belong. If λ is large, meaning that the table is filling up, then there
are more and more collisions. This means that collision resolution is more difficult, requiring more
comparisons to find an empty slot. With chaining, increased collisions means an increased number of
items on each chain.

As before, we will have a result for both a successful and an unsuccessful search. For a successful
search using open addressing with linear probing, the average number of comparisons is approximately

$$\frac{1}{2}\left(1 + \frac{1}{1-\lambda}\right)$$

and an unsuccessful search gives

$$\frac{1}{2}\left(1 + \left(\frac{1}{1-\lambda}\right)^{2}\right)$$

If we are using chaining, the average number of comparisons is

$$1 + \frac{\lambda}{2}$$

for the successful case, and simply λ comparisons if the search is unsuccessful.
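
To get a feel for these approximations, they can be evaluated for a given load factor. The C sketch below computes (1/2)(1 + 1/(1-λ)) and (1/2)(1 + (1/(1-λ))^2) for linear probing, and 1 + λ/2 and λ for chaining. The function names are hypothetical, and the linear probing formulas are only meaningful for λ below 1.

```c
#include <assert.h>

/* Linear probing, successful search: (1/2) * (1 + 1/(1 - lf)). */
double probing_hit_comparisons(double lf) {
    assert(lf >= 0.0 && lf < 1.0);
    return 0.5 * (1.0 + 1.0 / (1.0 - lf));
}

/* Linear probing, unsuccessful search: (1/2) * (1 + (1/(1 - lf))^2). */
double probing_miss_comparisons(double lf) {
    assert(lf >= 0.0 && lf < 1.0);
    double t = 1.0 / (1.0 - lf);
    return 0.5 * (1.0 + t * t);
}

/* Chaining, successful search: 1 + lf/2. */
double chaining_hit_comparisons(double lf) {
    assert(lf >= 0.0);
    return 1.0 + lf / 2.0;
}

/* Chaining, unsuccessful search: lf. */
double chaining_miss_comparisons(double lf) {
    assert(lf >= 0.0);
    return lf;
}
```

At λ = 0.5 these give 1.5 and 2.5 comparisons for linear probing, and 1.25 and 0.5 for chaining. At λ = 0.9 the linear probing figures rise to about 5.5 and 50.5, which shows how quickly open addressing degrades as the table fills.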

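Complete Implementation of Hashing and a Hash Table in C


#include <stdio.h>
#include <stdlib.h>

#define SIZE 20

/* One key-value entry stored in the table. */
struct DataItem {
    int data;
    int key;
};

/* The table itself: NULL marks a slot that has never been used. */
struct DataItem* hashArray[SIZE];

/* dummyItem (key -1) marks a deleted slot; item holds the last search result. */
struct DataItem* dummyItem;
struct DataItem* item;

/* Division-method hash. Keys are assumed non-negative; -1 is reserved for dummyItem. */
int hashCode(int key) {
    return key % SIZE;
}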

struct DataItem *search(int key) {
    //get the hash
    int hashIndex = hashCode(key);

    //probe at most SIZE cells, stopping early at a never-used (NULL) cell
    for(int probes = 0; probes < SIZE && hashArray[hashIndex] != NULL; probes++) {

        if(hashArray[hashIndex]->key == key)
            return hashArray[hashIndex];

        //go to next cell
        ++hashIndex;
        //wrap around the table
        hashIndex %= SIZE;
    }
    return NULL;
}

void insert(int key,int data) {
    struct DataItem *newItem = (struct DataItem*) malloc(sizeof(struct DataItem));
    if(newItem == NULL)
        return; //out of memory: nothing is inserted
    newItem->data = data;
    newItem->key = key;

    //get the hash
    int hashIndex = hashCode(key);

    //move in array until an empty or deleted cell, giving up once every cell was seen
    int probes = 0;
    while(hashArray[hashIndex] != NULL && hashArray[hashIndex]->key != -1) {
        if(++probes == SIZE) {
            free(newItem); //table is full
            return;
        }
        //go to next cell
        ++hashIndex;
        //wrap around the table
        hashIndex %= SIZE;
    }
    hashArray[hashIndex] = newItem;
}

struct DataItem* delete(struct DataItem* item) {
    if(item == NULL)
        return NULL; //nothing to delete

    int key = item->key;
    //get the hash
    int hashIndex = hashCode(key);
    //probe at most SIZE cells, stopping early at a never-used (NULL) cell
    for(int probes = 0; probes < SIZE && hashArray[hashIndex] != NULL; probes++) {

        if(hashArray[hashIndex]->key == key) {
            struct DataItem* temp = hashArray[hashIndex];
            //assign a dummy item at deleted position
            hashArray[hashIndex] = dummyItem;
            return temp;
        }
        //go to next cell
        ++hashIndex;
        //wrap around the table
        hashIndex %= SIZE;
    }
    return NULL;
}

void display(void) {
    for(int i = 0; i < SIZE; i++) {
        //skip never-used cells and cells marked as deleted
        if(hashArray[i] != NULL && hashArray[i] != dummyItem)
            printf("(%d,%d) ",hashArray[i]->key,hashArray[i]->data);
    }

    printf("\n");
}

int main() {
    dummyItem = (struct DataItem*) malloc(sizeof(struct DataItem));
    if(dummyItem == NULL)
        return 1;
    dummyItem->data = -1;
    dummyItem->key = -1;
    insert(1, 20);
    insert(2, 70);
    insert(42, 80);
    insert(4, 25);
    insert(12, 44);
    insert(14, 32);
    insert(17, 11);
    insert(13, 78);
    insert(37, 97);
    printf("Insertion done: \n");
    printf("Contents of Hash Table: ");
    display();
    int ele = 37;
    printf("The element to be searched: %d", ele);
    item = search(ele);
    if(item != NULL) {
        printf("\nElement found: %d\n", item->key);
        //delete only an element that was found, then release its memory
        free(delete(item));
    } else {
        printf("\nElement not found\n");
    }
    printf("Hash Table contents after deletion: ");
    display();
    return 0;
}
