Category : C Source Code
Archive   : CSRC2.ZIP
Filename : PERFHASH.DOC

 
Output of file : PERFHASH.DOC contained in archive : CSRC2.ZIP
/*
* P e r f e c t H a s h
*
*
* Program to search for Minimal Perfect Hash Functions for
* use in lexical analyzers. C.D. Havener Jan 26 1982
* GenRad Inc. 37 Great Road, Bolton Mass. 01740
* Based on paper "Minimal Perfect Hash Functions Made Simple"
* by Richard J. Cichelli - Comm. of ACM Jan 1980 pp 17-19
*
* Synopsis: The hash function is h = assoc value of 1st
* letter + length of keyword + assoc value of last letter
*
* This program finds the associated values of the letters
* given a list of keywords, 1 per line. It works most of
* the time for up to about 40 keywords but certain
* pathological cases exist. A semi-perfect hash is usually
* found by the program. The user can then tighten the
* default limits for max associated char value or the
* table limit using the -v and -t options. Sometimes the
* presort heuristics actually make the search process
* much more difficult. The user can try his luck at manual
* sorting using the -n option. Since the hash function
* produces such a limited range of numbers it can only work
* for up to about 40 keywords. If a language needs say 80
* keywords just split them up into two tables and let the
* lexical analyzer look in first one then the other, this
* will still be much faster than any other keyword lookup.
*
* This program has run sucessfully on a VAX 11/780 under
* 4.1BSD and VMS (using Vax-11 C and Decus C).
*/
synopsis

perfect [-options]

description

perfect reads a list of keywords from the standard input
and computes a "perfect hash" function for the set.
The following options are defined:
-d Enable debug code.
-n Disable pre-sort of keywords. (See below).
-t Limit the maximum table size to entries.
-v Limit the maximum associated character value.
-k Use only the first keywords in the list.
-a Give a status report every seconds.
(Unix and Vax-11C/VMS only).
-o file Write a sample keyword lookup routine to the
indicated file.
The program prints a running commentary on the standard
output.

Discussion

This technical note describes an implementation of a
pragmatic algorithm for finding perfect or semi-perfect hash
functions for lists of keywords. The resulting hash function can
be used to speed up lexical analyzers used in translators and
compilers. The algorithm was described by R.J. Cichelli in The
January 1980 issue of Communications of the ACM under the title
"Minimal Perfect Hash Functions made Simple." The article did not
include a computer language description of the algorithm and some
important implementation details were unclear. It is assumed that
the user will know what to do with the output of this program.
The -o option may be used to produce a lexical analyser kernel
from the tables built by this program. Another reference for
those wishing to pursue the topic further is "More On Minimal
Perfect Hash tables" by Curtis Cook and R. Oldehoeft, a Colorado
State University Technical Report.

The program takes a list of keywords and sorts them in such
a way that the search time for a hash function will be
reasonable. Once sorted a recursive trial and error procedure
hunts for a hash function satisfying user supplied bounds
of table size and associated character value limits. The
built-in hash function is

hash = assoc value of first character
+ keyword length
+ assoc value of last character

It is critically important to select a good ordering of the
keywords before searching begins. I ran up 100 hours of VAX time
searching for a hash function with an unordered list, and gave
up. Once the sort heuristics were debugged it found the hash
function in minutes. Typically it will find the function in
minutes or you may as well give up. A status reporting feature is
built into the program on the UNIX system that lets you follow
the progress of the search depth. If it has trouble, you can
tell just which word it can't get past, and take appropriate
action. If it has trouble you can attempt to alter its choice of
pre-ordering by moving troublesome words to the front of the
list. In a sequel to his paper, Cichelli stated that sometimes
the sort heuristics makes the search longer. There is also no
guarantee that a hash function can be found. The program does the
obvious precheck that no two keywords have the same first and
last letter and length in common (e.g. BAK KAB). Nonetheless,
as pointed out by Jaeschke and Osterberg in an overly harsh
criticism, there are many pathological sets of keywords that fail.
(On Cichelli's minimal perfect hash functions method. Comm. ACM
Dec. 1980, pp 728-729) The algorithm also only works for up to
about 40 keywords due to the limited range of numbers the formula
can generate. If you have, say 80 keywords, just make two hash
tables and look in each one. Here are some examples of the
program's output:

Perfect hash function finder, CDH Ver. 2.9
Start time = Mon Oct 17 20:40:40 1983
28 keywords, 19 distinct letters.
The associated char value limit is 19
The table size limit is 100
The search ordering is
else double continue case float struct sizeof
static short extern typedef default register for
char while entry int if return do unsigned switch
union goto auto long break

Success! Associated Char Values Follow:
a =19, b = 4, c = 1, d = 0, e = 0, f = 3, g =17,
h =14, i =17, k =19, l = 6, n = 8, o = 0, r =11,
s = 6, t = 0, u =16, w =13, y =14,
Hash min = 2, max = 30, spread = 29
do 2, else 4, case 5, double 6, default 7,
float 8, continue 9, typedef 10, short 11, struct 12,
static 13, extern 14, sizeof 15, char 16, for 17,
while 18, entry 19, int 20, goto 21, if 22, auto 23,
unsigned 24, return 25, switch 26, long 27, break 28,
union 29, register 30,

Total search() invocations = 15913
Started Mon Oct 17 20:40:40 1983
Finished Mon Oct 17 20:40:42 1983

Perfect hash function finder, CDH Ver. 2.9
Start time = Mon Oct 17 20:41:11 1983
39 keywords, 19 distinct letters.
The associated char value limit is 19
The table size limit is 100
The search ordering is
TEXT RESET TRUE REWRITE READLN SQRT SQR EOLN TRUNC
PUT EXP PAGE CHR CHAR COS SUCC READ ROUND DISPOSE
PRED SIN OUTPUT ORD INPUT INTEGER GET MAXINT REAL
WRITE EOF FALSE NEW WRITELN LN ARCTAN ABS BOOLEAN
PACK UNPACK

Success! Associated Char Values Follow:
A =18, B =11, C =16, D =18, E = 3, F = 3, G = 0,
I = 1, K =19, L = 9, M =18, N =19, O = 8, P = 9,
R = 0, S =14, T = 0, U =13, W =19,
Hash min = 3, max = 45, spread = 43
GET 3, TEXT 4, RESET 5, INPUT 6, TRUE 7,
INTEGER 8, EOF 9, REWRITE 10, FALSE 11, PUT 12,
REAL 13, OUTPUT 14, EXP 15, PAGE 16, SQR 17, SQRT 18,
CHR 19, CHAR 20, TRUNC 21, READ 22, ROUND 23,
MAXINT 24, READLN 25, EOLN 26, WRITE 27, DISPOSE 28,
ORD 29, LN 30, PRED 31, PACK 32, COS 33, SUCC 34,
ABS 35, SIN 36, BOOLEAN 37, UNPACK 38, NEW 41,
ARCTAN 43, WRITELN 45,

Total search() invocations = 149292
Started Mon Oct 17 20:41:11 1983
Finished Mon Oct 17 20:41:35 1983

Usually, the first time you run perf, just let everything
default. The second time,
use the -t option to limit the table size to the first hash
value plus the number of keywords.

On Unix and Vax/VMS (Vax-11 C), The program will accept
SIGTERM signals (CTRL/C on VMS) for an update status report
since it may take quite a while to find the hash function values.

Sample keyword tables

The following tables are known to work correctly.
The first defines the keywords for the C programming
language; the second for a toy computer language.

int char float double struct union long short
unsigned auto extern register typedef static
goto return sizeof break continue if else for
do while switch case default entry

GET TEXT RESET OUTPUT MAXINT INPUT TRUE
INTEGER EOF REWRITE FALSE CHR CHAR TRUNC REAL
SQR SQRT WRITE PUT ORD READ ROUND READLN EXP
PAGE EOLN COS SUCC DISPOSE NEW ABS LN BOOLEAN
WRITELN SIN PACK UNPACK ARCTAN PRED



  3 Responses to “Category : C Source Code
Archive   : CSRC2.ZIP
Filename : PERFHASH.DOC

  1. Very nice! Thank you for this wonderful archive. I wonder why I found it only now. Long live the BBS file archives!

  2. This is so awesome! 😀 I’d be cool if you could download an entire archive of this at once, though.

  3. But one thing that puzzles me is the “mtswslnkmcjklsdlsbdmMICROSOFT” string. There is an article about it here. It is definitely worth a read: http://www.os2museum.com/wp/mtswslnk/