Hashing
Hashing is a technique used
for performing insertions, deletions and finds in constant average time. Tree
operations that require any ordering information among the elements are not
supported efficiently. Thus, operations such as find_min, find_max, and the
printing of the entire table in sorted order in linear time are not supported.
The
ideal hash table data structure is merely an array of some fixed size,
containing the keys. Typically, a key is a string with an associated value (for
instance, salary information). We will refer to the table size as H_SIZE, with
the understanding that this is part of a hash data structure and not merely
some variable floating around globally. The common convention is to have the
table run from 0 to H_SIZE-1; we will see why shortly.
Each key
is mapped into some number in the range 0 to H_SIZE - 1 and placed in the
appropriate cell. The mapping is called a hash function, which ideally should
be simple to compute and should ensure that any two distinct keys get different
cells. Since there are a finite number of cells and a virtually inexhaustible
supply of keys, this is clearly impossible, and thus we seek a hash function
that distributes the keys evenly among the cells. Figure 1 is typical of a perfect situation. In this
example, john hashes to 3, phil hashes to 4, dave hashes to 6, and mary hashes
to 7
This is the basic idea of
hashing. The only remaining problems deal with choosing a function, deciding
what to do when two keys hash to the same value (this is known as a collision),
and deciding on the table size.
Hash Function:
If the
input keys are integers, then simply returning key mod H_SIZE is generally a
reasonable strategy, unless key happens to have some undesirable properties. In
this case, the choice of hash function needs to be carefully considered. For
instance, if the table size is 10 and the keys all end in zero, then the
standard hash function is obviously a bad choice. For reasons we shall see
later, and to avoid situations like the one above, it is usually a good idea to
ensure that the table size is prime. When the input keys are random integers,
then this function is not only very simple to compute but also distributes the
keys evenly.
Usually, the keys are strings; in this case, the hash
function needs to be chosen carefully.
One option is to add up the ASCII values of the
characters in the string. we
declare the type INDEX, which is returned by the hash function. The routine implements this strategy and uses the typical
C method of stepping through a string.
The hash function given
below is simple to implement and computes an answer quickly. However, if the
table size is large, the function does not distribute the keys well. For
instance, suppose that H_SIZE = 10,007 (10,007 is a prime number). Suppose all
the keys are eight or fewer characters long. Since a char has an integer value
that is always at most 127, the hash function can only assume values between 0
and 1016, which is 127 * 8. This is clearly not an equitable distribution!
