
Hashing Part I

CS 367 Introduction to Data Structures

Searching
Up to now, the only way to find a key has been to search through all or part of the data
linked list: O(n)
AVL tree: O(log n)
binary search of an array: O(log n)

If there is a lot of data, or the data is searched very often, these times can be long
given the key, we would like to get to the data directly

Hashing
The solution to this problem is to put the key through a function that says exactly where the data is (or where it should be placed)
this function is called a hash function
h(key) = integer

the integer obtained from a hash function can be used as an index into an array
if the hash function is perfect (it always generates a unique integer for different keys), the time to place and access data is O(1)

Hashing
[Figure: the keys A, M, and X are fed through a hashing function, which maps each key to a slot in an array indexed 0 to 11]

Hashing Functions
So what is the hashing function?
the simplest hashing function is to use the remainder of a division
assume the array is 1000 elements in size
translate the data into a number, n
h(n) = n % 1000

Hashing Functions
simple example
consider a small school
each student is tracked by a 4-digit ID number
each student's ID number begins with the year they started
2000 -> 0, 2001 -> 1, 2002 -> 2, etc.

all student records are stored in an array
maximum of 1000 students per year

let's look at records for all sophomores
assume they were freshmen in 2001

Hashing Functions
To find John's record in the array:

Student   ID #
Mary      1000
Pete      1004
John      1009
Amy       1011

1009 % 1000 = 9
Go to index number 9.

Index   Contents
0       Mary's records
4       Pete's records
9       John's records
11      Amy's records
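The lookup above can be sketched in Java. This is a minimal illustration, not the course's implementation: the class name StudentTable and its put/get methods are made up for this example, and it assumes IDs are non-negative and that no two stored IDs share a remainder.

```java
// Sketch of the student-ID example; all names here are illustrative.
final class StudentTable {
    static final int TABLE_SIZE = 1000;   // at most 1000 students per year
    private final String[] records = new String[TABLE_SIZE];

    // h(n) = n % 1000: the remainder of the ID is the array index
    static int hash(int id) {
        return id % TABLE_SIZE;
    }

    // Store a record directly at the hashed index (no collision handling)
    void put(int id, String record) {
        records[hash(id)] = record;
    }

    // Fetch a record in O(1): one hash computation, one array access
    String get(int id) {
        return records[hash(id)];
    }

    public static void main(String[] args) {
        StudentTable table = new StudentTable();
        table.put(1000, "Mary's records");
        table.put(1004, "Pete's records");
        table.put(1009, "John's records");
        table.put(1011, "Amy's records");
        System.out.println(hash(1009));        // prints 9
        System.out.println(table.get(1009));   // prints John's records
    }
}
```

No searching is involved: finding John's record takes one remainder and one array access.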

Generating n
The previous example is rather simplistic in that it hashes integers that are already unique
seems kind of pointless
maybe not if the integers are large
consider the UW's 10-digit ID numbers

Often it is desirable to hash some other kind of data
a person's name, for example

Generating n
How is a string converted into an integer?
the simplest method is to add together the ASCII values of all the characters
example
convert amy into an integer
a = 97; m = 109; y = 121
a + m + y = 327

there are lots of other ways to convert strings to integers
what are a few of them?
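The ASCII-sum conversion for "amy" can be written as a short Java sketch. The class and method names (AsciiSum, toNumber) are illustrative only.

```java
// Sketch: convert a string to an integer by summing its character codes.
final class AsciiSum {
    static int toNumber(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);   // a char widens to its numeric code (a = 97, ...)
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(toNumber("amy"));   // prints 327 (97 + 109 + 121)
    }
}
```

Because addition ignores order, rearrangements of the same letters (such as "may") produce the same number. This is one reason to look at other conversions.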

Hashing Functions
There are millions of possible hashing functions
we will not be considering them all
basically, anything you can think of to generate an integer could be used as a hashing function

Mathematicians have spent lots of time and effort to come up with some basic methods that work pretty well

Division
We have already seen the division method
it involves taking the remainder of division
h(key) = key % tableSize

A few notes about making this work better
table size should be a prime number
usually a good method if very little is known about the keys
the remaining methods will all use division as the final step in their calculation
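A minimal Java sketch of the division method follows. The key set (multiples of 10) and the two table sizes are assumptions chosen only to illustrate why a prime table size can help. Math.floorMod is used instead of % so that a negative key still yields a valid index.

```java
// Sketch of h(key) = key % tableSize; names and sample keys are illustrative.
final class DivisionHash {
    static int hash(int key, int tableSize) {
        // floorMod keeps the result in [0, tableSize) even for negative keys
        return Math.floorMod(key, tableSize);
    }

    public static void main(String[] args) {
        int[] keys = {10, 20, 30, 40};   // keys sharing a common factor
        for (int key : keys) {
            System.out.println(key + " -> " + hash(key, 10)
                    + " (size 10), " + hash(key, 11) + " (size 11)");
        }
    }
}
```

With a table size of 10, every sample key lands on index 0. With the prime size 11, the same keys land on four different indices.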

Folding
Separate the key into various equally sized parts and then recombine them
usually with addition

Two kinds of folding


shift folding
just add the various parts together as they are

boundary folding
reverse the order of every other part and add them together

Folding
Consider an SSN as a key
break it into 3 parts
first 3 digits, second 3, last 3

Shift folding example
SSN = 123-45-6789
first = 123; second = 456; third = 789
h(key) = (first + second + third) % tableSize
h(SSN) = 1368 % tableSize

Boundary folding example
R(x) reverses the digits of x
h(key) = (first + R(second) + third) % tableSize
h(SSN) = (123 + 654 + 789) % tableSize = 1566 % tableSize
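Both folding variants can be sketched in Java for a 9-digit SSN stored as an int. The names (Folding, reverse3, shiftFold, boundaryFold) and the table size 1009 are assumptions made for this illustration.

```java
// Sketch of shift folding and boundary folding on a 9-digit key.
final class Folding {
    // Reverse the digits of a 3-digit part, e.g. 456 -> 654
    static int reverse3(int part) {
        return (part % 10) * 100 + ((part / 10) % 10) * 10 + part / 100;
    }

    // Shift folding: add the three parts as they are
    static int shiftFold(int ssn, int tableSize) {
        int first = ssn / 1_000_000;
        int second = (ssn / 1_000) % 1_000;
        int third = ssn % 1_000;
        return (first + second + third) % tableSize;
    }

    // Boundary folding: reverse every other part (here the middle one) before adding
    static int boundaryFold(int ssn, int tableSize) {
        int first = ssn / 1_000_000;
        int second = (ssn / 1_000) % 1_000;
        int third = ssn % 1_000;
        return (first + reverse3(second) + third) % tableSize;
    }

    public static void main(String[] args) {
        System.out.println(shiftFold(123456789, 1009));     // 1368 % 1009 = 359
        System.out.println(boundaryFold(123456789, 1009));  // 1566 % 1009 = 557
    }
}
```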

Increasing Performance
Consider using shifting and exclusive ORing to generate the index
exclusive OR the parts together

Example
consider the string abcdefgh
if each part is a letter, just exclusive OR them
a ^ b ^ c ^ d ^ e ^ f ^ g ^ h

often, a character is represented by 8 bits
what's the problem with this?

might be better to exclusive OR chunks of the string
abcd ^ efgh
why were four characters chosen in this case?
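One way to explore the question about 8-bit characters is to run the letter-by-letter version. The sketch below (the names XorChars and xorChars are illustrative) treats each character as 8 bits. XORing 8-bit values can only ever give an 8-bit result, so no more than 256 distinct indices are reachable no matter how long the string is.

```java
// Sketch: exclusive OR a string one 8-bit character at a time.
final class XorChars {
    static int xorChars(String s) {
        int result = 0;
        for (int i = 0; i < s.length(); i++) {
            result ^= (s.charAt(i) & 0xFF);   // keep only the low 8 bits of each character
        }
        return result;                        // always in the range 0..255
    }

    public static void main(String[] args) {
        System.out.println(xorChars("abcdefgh"));   // prints 8
    }
}
```

Combining four characters into one chunk before XORing fills a 32-bit int, which suggests one answer to the question about chunk size when Java's int is the working type.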

Increasing Performance
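int shiftFold(String key, int tableSize) {
    int result = 0;
    byte[] st = key.getBytes();
    for (int i = 0; i < st.length; i += 4) {
        // pack up to 4 bytes into one 32-bit chunk
        int chunk = 0;
        for (int j = 0; (j < 4) && (j + i < st.length); j++) {
            chunk = chunk << 8;                   // make room first, so no byte is shifted out
            chunk = chunk | (st[j + i] & 0xFF);   // mask to avoid sign extension
        }
        result = result ^ chunk;
    }
    return Math.floorMod(result, tableSize);      // result may be negative; index must not be
}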

Increasing Performance
The performance could be increased even more if the table size were a power of 2
can get rid of the modulo operation at the end
modulo is an expensive calculation
could just do a subtraction and an AND operation instead
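The subtraction-and-AND trick can be sketched as follows. The names are illustrative, and the method assumes that tableSize is a power of two; for any other size the result is not a valid replacement for the remainder.

```java
// Sketch: replace "hash % tableSize" with a mask when tableSize is a power of two.
final class PowerOfTwoIndex {
    static int index(int hash, int tableSize) {
        // tableSize - 1 has all low bits set (e.g. 1024 - 1 = 1023 = ten 1 bits),
        // so the AND keeps exactly the bits that the remainder would keep
        return hash & (tableSize - 1);
    }

    public static void main(String[] args) {
        System.out.println(index(9740641, 1024));   // prints 353, same as 9740641 % 1024
        System.out.println(index(-1, 1024));        // prints 1023, never negative
    }
}
```

Unlike the % operator in Java, the mask also gives a non-negative index for a negative hash value.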

Mid-Square Function
Square the number and take the middle part as the index
a string must first be converted to get the number to square

The entire key gets used to generate the address


less chance for conflicts
more on this later

This method works best if the table size is a power of two

Mid-Square Function
Table size equals 1024 (2^10)
The key is 3121
3121^2 = 9740641 = (1001010 0101000010 1100001)_2
the middle 10 bits of this value are the center group

Index in array is
(0101000010)_2 = 322

This is all very quick and easy to calculate using mask and shift operations

Mid-Square Function
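// table size must be a power of two
int shiftBits = 7;   // low-order bits to discard; fits the 24-bit square in the example

int midSquare(String key, int tableSize) {
    int mask = tableSize - 1;      // e.g. 1023 = ten 1 bits
    long n = stringToNum(key);     // stringToNum: string-to-integer helper, defined elsewhere
    long square = n * n;           // long avoids int overflow
    return (int) ((square >> shiftBits) & mask);   // shift down, then keep the middle bits
}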
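A self-contained variant that takes a numeric key directly is sketched below. It is an illustration under stated assumptions, not the slides' method. The shift amount is derived from the bit length of the square, which is one possible way to locate the "middle". It also assumes tableSize is a power of two and the key is small enough that its square fits in a long.

```java
// Sketch of the mid-square method for a numeric key; names are illustrative.
final class MidSquare {
    static int midSquare(long key, int tableSize) {
        int indexBits = Integer.numberOfTrailingZeros(tableSize);   // log2 of a power of two
        long square = key * key;
        int squareBits = 64 - Long.numberOfLeadingZeros(square);    // bit length of the square
        // Discard an equal number of bits on each side of the middle group
        int shiftBits = Math.max(0, (squareBits - indexBits) / 2);
        return (int) ((square >> shiftBits) & (tableSize - 1));
    }

    public static void main(String[] args) {
        // 3121 * 3121 = 9740641, a 24-bit value, so 7 bits are dropped on each side
        System.out.println(midSquare(3121, 1024));   // prints 322
    }
}
```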

Extraction
Simply pull out a certain part of the key and use it as the index
example
SSN = 123-45-6789
index = middle of key = 456
alternative index = first, middle, and last digits = 159

Should try to choose a part of the key that is most likely unique
consider foreign student SSNs that start with 999
probably not a great idea to extract the first three digits
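The two extraction choices from the SSN example can be sketched in Java. The names (Extraction, middleThree, firstMiddleLast) are illustrative, and the code assumes a well-formed 9-digit SSN written with dashes.

```java
// Sketch of extraction: pull selected digits out of the key and use them as the index.
final class Extraction {
    // Middle three digits of the 9-digit key: 123-45-6789 -> 456
    static int middleThree(String ssn) {
        String digits = ssn.replace("-", "");
        return Integer.parseInt(digits.substring(3, 6));
    }

    // First, middle, and last digit: 123-45-6789 -> 159
    static int firstMiddleLast(String ssn) {
        String digits = ssn.replace("-", "");
        String picked = "" + digits.charAt(0)
                + digits.charAt(digits.length() / 2)
                + digits.charAt(digits.length() - 1);
        return Integer.parseInt(picked);
    }

    public static void main(String[] args) {
        System.out.println(middleThree("123-45-6789"));       // prints 456
        System.out.println(firstMiddleLast("123-45-6789"));   // prints 159
    }
}
```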
