ming topics. This is directly opposed to the often made assumption that a student's academic discipline is a good predictor of potential competency in programming. This result implies that if a course is suitably constructed and presented, there is no need to segregate students from different academic disciplines due to concerns based on learning ability or interdisciplinary competitiveness.

5.2 Gender Differences
No significant difference was found in academic performance between the genders. The only observable difference is that women appear to be more uniform in their academic performance than men.

5.3 Semester in School
The correlations found between semester in school and academic performance were very low. This clearly indicates that in a suitably constructed course, there is no reason to worry more about those beginning their university experience as opposed to those well along in their academic careers. They all seem to succeed or fail in much the same manner.

5.4 PAT Use
The results indicate that future programming skill is not predictable by the most commonly used written test (IBM's Programming Aptitude Test). The suggestion is that this type of test should not be administered to college level people as the results are unreliable.

6. Conclusions

It is not necessary to construct separate competing courses for those from differing disciplines and levels of academic experience as there is no apparent need to be concerned with unequal capability. It is also not possible to reliably predict success in learning programming on the basis of normally observable external personal attributes or standardized written tests when the people involved are of college level.

Received May 1978; revised September 1978; accepted September 1979

References
1. Bateman, C.R. Predicting performance in a basic computer course. Proc. of the Fifth Annual Meeting of the Amer. Inst. for Decision Sciences, Boston, Mass., 1973.
2. Newstead, P.R. Grade and ability predictions in an introductory programming course. SIGCSE Bull. 7, 2 (June 1975), pp. 87-91.


Programming and Data Structures
M.D. McIlroy, Editor

Minimal Perfect Hash Functions Made Simple

Richard J. Cichelli
Software Consulting Services, Allentown, Pennsylvania

A method is presented for computing machine independent, minimal perfect hash functions of the form: hash value ← key length + the associated value of the key's first character + the associated value of the key's last character. Such functions allow single probe retrieval from minimally sized tables of identifier lists. Application areas include table lookup for reserved words in compilers and filtering high frequency words in natural language processing. Functions for Pascal's reserved words, Pascal's predefined identifiers, frequently occurring English words, and month abbreviations are presented as examples.

Key Words and Phrases: hashing, hashing methods, hash coding, direct addressing, dictionary lookup, information retrieval, lexical analysis, identifier-to-address transformations, perfect hashing functions, perfect hash coding, scatter storage, searching, Pascal, Pascal reserved words, backtracking
CR Categories: 3.7, 3.74, 4.34, 5.25, 5.39

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
Author's present address: R.J. Cichelli, Software Consulting Services, 901 Whittier Drive, Allentown, PA 18103.
© 1980 ACM 0001-0782/80/0100-0017 $00.75.

17    Communications of the ACM    January 1980, Volume 23, Number 1

Introduction

In several recent articles [2, 3] it has been asserted that in general computing minimal perfect hash functions for static identifier lists (keys) is difficult. In [1], Knuth also notes the difficulty in computing perfect hash functions. He estimates that for the set of words used here in Example No. 3 a search for such a function might include the examination of 10^43 possibilities. A program implementing the procedure outlined in this paper finds a machine independent minimal perfect hash function for these words in less than one second on a DEC PDP-11/45 minicomputer. Other examples of such functions are shown and an effective method for computing them is described.

Although my algorithm requires exponential search of potentially large spaces, it is based on intelligent search of these spaces. The procedure has been shown to find solutions for many sets in a few seconds. For thirty-nine of Pascal's forty predefined identifiers (Example No. 2), the potential search space was 20^39 nodes. Computation of the function required seven minutes on the PDP-11/45. For lookup tables that will be used many times, the method has proved quite practical. So far I have encountered no instance of an interminably long search.

The form of my hash function is:

    Hash value ← key length +
                 associated value of the key's first character +
                 associated value of the key's last character.

Example No. 1: Pascal's Reserved Words
For Pascal's 36 reserved words, the following list defines the associated value for each letter:

    A = 11, B = 15, C = 1, D = 0, E = 0, F = 15, G = 3, H = 15, I = 13,
    J = 0, K = 0, L = 15, M = 15, N = 13, O = 0, P = 15, Q = 0, R = 14,
    S = 6, T = 6, U = 14, V = 10, W = 6, X = 0, Y = 13, Z = 0.

(For lookup routines these values are stored in an integer array indexed by the letters. Note: Associated values need not be unique.)

The corresponding hash table with hash values running from 2 through 37 is as follows:

    DO, END, ELSE, CASE, DOWNTO, GOTO, TO, OTHERWISE, TYPE, WHILE,
    CONST, DIV, AND, SET, OR, OF, MOD, FILE, RECORD, PACKED, NOT, THEN,
    PROCEDURE, WITH, REPEAT, VAR, IN, ARRAY, IF, NIL, FOR, BEGIN, UNTIL,
    LABEL, FUNCTION, PROGRAM.

As an example, consider the computation for "CASE":

    (1 ← "C") + (0 ← "E") + (4 ← length("CASE")) = 5.

The advantage of hash functions of the above form is that they are simple, efficient, and machine (i.e., character representation) independent. It is also likely that any lexical scanning process will have, as a by-product of its identifier scanning logic, the identifier length and the values of the first and last characters. Two disadvantages of functions of this form are: (1) it requires that no two keys share length and first and last characters and (2) for lists with more than about 45 items, segmentation into sublists may be necessary. (This is a result of the limited range of hash values which the functions produce.)

The associated values for each of the letters are computed by the following procedure: (1) order the identifier list, and (2) search, by backtracking, for a solution.

The ordering process is twofold. First, order the keys by the sum of the frequencies of the occurrences of each key's first and last letter in the list. For example: "E" occurs 9 times as a first or last letter in the Pascal reserved word list. It is the most frequent letter and thus, "ELSE" is the first word in the search list. "D" is the next most frequent letter, and thus "END" is second. After the words have been put in order by character occurrence frequencies, modify the order of the list such that any word whose hash value is determined by assigning the associated character values already determined by previous words is placed next. Thus, after "OTHERWISE"¹ has been placed as the third element of the frequency ordered list, the hash value of the word "DO" is determined and so it is placed fourth. (For example, during search, after the placement of the word "END," a value will be associated with "D," and after the placement of the word "OTHERWISE," a value will be associated with "O.") The ordering process causes inevitable hash value conflicts to occur during search as early as possible, thus pruning the search tree and speeding the computation.

¹ Inclusion of the word "OTHERWISE" in Pascal's reserved word list anticipates the acceptance by the Pascal Users Group of the recommendation for a revised CASE construct submitted by its International Working Group for Extensions.

The completely ordered search list for Pascal's reserved words is:

    ELSE, END, OTHERWISE, DO, DOWNTO, TYPE, TO, FILE, OF, THEN, NOT,
    FUNCTION, RECORD, REPEAT, OR, FOR, PROCEDURE, PACKED, WHILE, CASE,
    CONST, DIV, VAR, AND, MOD, PROGRAM, NIL, LABEL, SET, IN, IF, GOTO,
    BEGIN, UNTIL, ARRAY, WITH.

The backtracking search procedure then attempts to find a set of associated values which will permit the unique referencing of all members of the key word list. It does this by trying the words one at a time in order. The backtracking procedure is as follows: If both the first and last letter of the identifier already have associated values, try the word. If either the first or last letter has an associated value, vary the associated value of the unassigned character from zero to the maximum allowed associated value, trying each occurrence. If both letters are as yet unassociated, vary the first and then the second, trying each possible combination. (An exception test is required to catch situations in which the first and last letters are the same.) Each "try" tests whether the given hash value is already assigned and, if not, reserves the value and assigns the letters. If all identifiers have been selected, print the solution and halt. Otherwise, invoke the search procedure recursively to place the next word. If the "try" fails, remove the word by backtracking.

The search time for computing such functions is related to the number of identifiers to be placed, the maximum value which is allowed to be associated with a character, and the density of the resultant hash table.
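The two steps described above, ordering the keys and then searching by backtracking, can be sketched in Python as follows. This is an illustrative sketch, not the author's Pascal program. The names `order_keys` and `find_values` are hypothetical, ties in the frequency ordering are broken by input order, keys are assumed to be non-empty strings, and minimality is enforced by an assumed rule: all hash values must stay distinct and fall within a window of n consecutive integers for n keys.

```python
from collections import Counter


def order_keys(keys):
    """Order keys by first/last letter frequency, then pull determined keys forward."""
    freq = Counter()
    for k in keys:
        freq[k[0]] += 1
        freq[k[-1]] += 1
    # Step 1: keys whose end letters are most frequent come first (stable on ties).
    pending = sorted(keys, key=lambda k: -(freq[k[0]] + freq[k[-1]]))
    ordered, seen = [], set()
    while pending:
        # Step 2: a key whose two letters were both fixed by earlier keys goes next.
        pick = next((k for k in pending if k[0] in seen and k[-1] in seen), pending[0])
        pending.remove(pick)
        ordered.append(pick)
        seen.update((pick[0], pick[-1]))
    return ordered


def find_values(keys, max_value):
    """Backtracking search; returns {letter: value} or None if no solution exists."""
    keys = order_keys(keys)
    n = len(keys)
    values = {}   # letter -> associated value chosen so far
    used = set()  # hash values already reserved

    def fits(h):
        # Assumed minimality rule: distinct values inside a window of n integers.
        if h in used:
            return False
        span = used | {h}
        return max(span) - min(span) < n

    def place(i):
        if i == n:
            return True
        key = keys[i]
        # dict.fromkeys removes the duplicate when first and last letter coincide.
        free = [c for c in dict.fromkeys((key[0], key[-1])) if c not in values]
        return assign(i, key, free)

    def assign(i, key, free):
        if not free:
            h = len(key) + values[key[0]] + values[key[-1]]
            if not fits(h):
                return False
            used.add(h)
            if place(i + 1):
                return True
            used.remove(h)  # backtrack: release the slot
            return False
        letter = free[0]
        for v in range(max_value + 1):
            values[letter] = v
            if assign(i, key, free[1:]):
                return True
        del values[letter]  # backtrack: letter becomes unassigned again
        return False

    return dict(values) if place(0) else None
```

For instance, `find_values(["DO", "END", "ELSE", "CASE"], 6)` returns a letter-to-value dictionary, while a key set such as `["JAN", "JUN"]`, whose members share length, first letter, and last letter, yields `None`.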

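Once a set of associated values is known, it can be checked mechanically and used for single-probe lookup. The Python sketch below is not from the paper. It plugs the letter values of Example No. 1 into the stated hash form, rebuilds the 36-entry table for hash values 2 through 37, and offers a membership test. The names `cichelli_hash`, `TABLE`, and `is_reserved` are illustrative.

```python
# Associated letter values from Example No. 1; letters not listed have value 0.
VALUES = {"A": 11, "B": 15, "C": 1, "F": 15, "G": 3, "H": 15, "I": 13,
          "L": 15, "M": 15, "N": 13, "P": 15, "R": 14, "S": 6, "T": 6,
          "U": 14, "V": 10, "W": 6, "Y": 13}

# Reserved words listed in hash-value order, 2 through 37, as in Example No. 1.
RESERVED = ("DO END ELSE CASE DOWNTO GOTO TO OTHERWISE TYPE WHILE CONST DIV "
            "AND SET OR OF MOD FILE RECORD PACKED NOT THEN PROCEDURE WITH "
            "REPEAT VAR IN ARRAY IF NIL FOR BEGIN UNTIL LABEL FUNCTION "
            "PROGRAM").split()


def cichelli_hash(word):
    """Key length plus the associated values of the first and last letters."""
    return len(word) + VALUES.get(word[0], 0) + VALUES.get(word[-1], 0)


# Slot i of the table holds the word whose hash value is i + 2.
TABLE = [None] * len(RESERVED)
for w in RESERVED:
    TABLE[cichelli_hash(w) - 2] = w


def is_reserved(word):
    """Single-probe membership test: hash once, compare once."""
    word = word.upper()
    if not word:
        return False
    slot = cichelli_hash(word) - 2
    return 0 <= slot < len(TABLE) and TABLE[slot] == word
```

Because the function is perfect and minimal for this word list, every slot of `TABLE` is filled exactly once, and an identifier that is not a reserved word is rejected after a single string comparison.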
If the table density is one (i.e., a minimal perfect hash) and the maximum associated value is allowed to be the count of distinct first and last letter occurrences (21 for Pascal's reserved words), then the above procedure finds a solution for Pascal's reserved words in about seven seconds on a DEC PDP-11/45 using a straightforward implementation of the algorithm in Pascal. (Without the second ordering, the search required 5.5 hours.) If the maximum associated value is limited to 15, as in the above list, the search requires about 40 minutes. (There is no solution with 14 as a maximum value.)

Incorporation of the above hash function into a Pascal cross-reference program yielded a 10 percent reduction in total run time for processing large programs. The method replaced a well-coded binary search which was used to exclude reserved words from cross-referencing.

Example No. 2
The second example is for the list of Pascal's predefined identifiers.

    A = 15, B = 9, C = 11, D = 19, E = 5, F = 3, G = 0, H = 0, I = 3,
    J = 0, K = 16, L = 13, M = 1, N = 19, O = 0, P = 18, Q = 0, R = 0,
    S = 15, T = 0, U = 17, V = 0, W = 10, X = 0, Y = 0, Z = 0.

    GET, TEXT, RESET, OUTPUT, MAXINT, INPUT, TRUE, INTEGER, EOF, REWRITE,
    FALSE, CHR, CHAR, TRUNC, REAL, SQR, SQRT, WRITE, PUT, ORD, READ,
    ROUND, READLN, EXP, PAGE, EOLN, COS, SUCC, DISPOSE, NEW, ABS, LN,
    BOOLEAN, WRITELN, SIN, PACK, UNPACK, ARCTAN, PRED.

Computation of this function required about seven minutes. Note: Since the predefined identifier "ODD" conflicts with "ORD," it was not included in the list.

Example No. 3: Frequently Occurring English Words
This example uses the word list of [1, 2]. Search time was less than one second.

    A = 3, B = 15, C = 0, D = 7, E = 0, F = 15, G = 0, H = 10, I = 0,
    J = 0, K = 0, L = 0, M = 12, N = 13, O = 7, P = 0, Q = 0, R = 12,
    S = 6, T = 0, U = 15, V = 0, W = 14, X = 0, Y = 9, Z = 0.

    I, it, the, that, at, are, a, is, to, this, as, he, and, have, in, not, be, but,
    his, had, or, on, was, of, her, by, you, with, which, for, from.

Example No. 4: Month Abbreviations
This example is from [3]. The function's form was modified slightly to:

    Hash value ← associated value of the key's second character +
                 associated value of the key's third character.

    A = 4, B = 5, C = 2, D = 0, E = 0, F = 0, G = 3, H = 0, I = 0,
    J = 0, K = 0, L = 6, M = 0, N = 0, O = 5, P = 1, Q = 0, R = 6,
    S = 0, T = 6, U = 0, V = 6, W = 0, X = 0, Y = 5, Z = 0.

    JUN, SEP, DEC, AUG, JAN, FEB, JUL, APR, OCT, MAY, MAR, NOV.

This form avoids the conflict between "JAN" and "JUN" and takes into account the constant key length. Search time was again well less than one second. Note: The method presented here is applicable to sets up to four times as large as those said to be feasible by the methods described in [3].

Moral

This article does not have a conclusion, but it does have a moral. In the words of the renowned chess programmer, J. Gillogly, author of the Technology Chess Program which was the prototype of the current generation of highly successful chess programs, "When all else fails, try brute force."

Received May 1979; revised and accepted October 1979

References
1. Knuth, D.E. The Art of Computer Programming. Volume 3: Sorting and Searching. Addison-Wesley, Reading, Mass., 1973, pp. 506-507.
2. Sheil, B.A. Median split trees: A fast lookup technique for frequently occurring keys. Comm. ACM 21, 11 (Nov. 1978), 947-958.
3. Sprugnoli, R. Perfect hashing functions: A single probe retrieving method for static sets. Comm. ACM 20, 11 (Nov. 1977), 841-850.