Hash Functions
A hash function typically is designed to take a "large" piece of data such as a string or chunk of binary and reduce it to a signature, typically a number or a string that can be used as a simpler representation of the original data. Since the original information is represented by a shorter value, hash functions are many-to-one (several of the original values may map onto the same hash value). A hash functions fall into two major categories, cryptographic and non-cryptographic. Non-cryptographic hash functions are typically used for error detection, and searching. Crytographic ones are used as steps in secure protocols to ensure the immutability of the original content. Cryptographic hashes are meant to be one way; you can't come up with any of the possible original data values given only the hash value.
There are many complicated aspects to coming up with a proper hashing function, especially cryptographic ones. Lots of complicated mathematical theory and proofs are sometimes used to prove various intrinsic properties. Naively choosing a function can be a mistake, since it may not work too well.
Searching
Non-cryptographic hash functions are often used to generate keys for KeyDataAssociation for insertion in HashTables and many other structures used to store SymbolTables - in fact, sometimes the term hash function is used to refer to a function that not only produces a signature from a piece of input data but also calculates the index of the entry in a hash table to place it in.
A relatively even distribution on the range of hash values from the original value domain is important. It minimizes the likelihood of handling a collision in practice, since more than one original object may be mapped onto the same hash value. Only with a "good" hash function can Hashing make certain search operations linear (amortized) rather than N*LOG(N). The tricky part is coming up with a "good" hashing function, given all the different types of original domain objects that various applications could utilize. Attempts at providing good hashing functions in many different circumstances and abstracting the process fly underneath the banner of Universal Hashing.
In addition to a good hashing function, the range of hash values has to be kept somewhat sparse to avoid collisions. A rule of thumb some hashing libraries use is to only allow the range to become 75% full, otherwise performance degradation may become an issue. When this problem occurs, too many original values being mapped to too small a hash range. The problem is either solved by knowing how many objects need to be hashed a-priori, or by remapping the objects from one hash-space to a larger one. Like a dynamic string's current length and allocation, a certain geometric rate maybe used to control the cardinality of the increase (i.e. twice the allocation or HashTable entries (hash range values) as before, etc). Hashing schemes are classic examples of the trade-offs between time (cpu) and space (ram) in finite machines. Many books on computer algorithms will go into more depth than possible here. See the works and of the likes of Knuth, Aho/Hopcroft/Ullman, Cormen/Leiserson/Rivest, and Sedgewick for more algorithmic details.
Underfilled HashTables could be candidates for reaping unused memory, when an application has reached MemoryExhaustion. The cost would be to map all the values to a lesser space (still only 75% full or less) and to do this within existing memory, which could take a fair degree of cpu and time. Most applications don't go this far though; mostly HashTables are reclaimed largely in their entirety only when going out of scope and use.
Most of the time finding a good library will suffice, once you know how to use it properly.
Good Hash Functions
Good hash functions can be hard to come by; many have to be specially made for the task at hand. However, there are a variety of tried and proven hashing techniques that have been developed.
So, add links to them here.
Related Topics
Online Resources
None yet.
Example Implementations
A simple example of a hash function for strings (in CLanguage):
int string_hash(char *s) {
int hval = 0;
while(*s != '\0') {
hval *= 3;
hval += *s;
++s;
} return hval;
};This isn't a very good example; many strings will likely share the same signature.
