Hash Tables
Be comfortable with KeyDataAssociation and the LinkedList.
Hash tables are associative arrays. The basic hash table consists of an array or list of entries - each entry consisting of a key/data pair. There must be a definite upper limit on the number of entries in the table - the table size - and each entry in the table should be indexed sequentially; ie entries 0 through size.
Insertion, deletion and searches start off with this equation:
i = k mod size
This calculates an index (i) for the given key (k) using the size of the hash table. This index determines which entry in the table the key/data pair for the given key should be stored. Thus, extracting or inserting an item into the table is as easy as calculating the appropriate index using the key.
That should be enough to understand how one works, but there are implementation details that need to be worked out. The main thing is hash clash, when two different keys give the same index in the table. If you can be sure that none of your keys will clash, or you can safely ignore clashing, then everything is just peachy. The two simplest techniques of dealing with this can be seen in the following subsections.
Btw, using this technique without hash clash resolution gives insertion and search times of O(1) - the best possible with /any/ data structure.
Chained Hash Table
Instead of storing key/data pairs directly in each table entry, a pointer to the head of a LinkedList is stored instead. The actual key/data pairs are inserted into the LinkedList for the calculated entry, and searching is performed by searching through the appropriate LinkedList linearly. The performance isn't stellar, but it's faster than putting all the key/data pairs into a single LinkedList. Insertion is fast, and there's no hash clash because every entry can store a theoretically infinite number of key/data pairs.
This kind of hash table is easy to program and convenient to use, but is space inefficient (overhead from the linked lists) and time inefficient (traversing a linked list to find items isn't all that fast).
Probe Hash Table
Key/data pairs are still stored directly in each table entry, but the solution to hash clash is entirely different. Every entry in the table has an additional flag, marking it either as empty, used or deleted. Usually these are flagged using special key values, referred to as sentinels. For instance, if the keys in a given hash table were integers, 0 might be used for empty, 1 for deleted, and any other key value would be considered a regular entry.
Searching is performed by calculating the table index as usual. Once that is done, the found entry's key is checked if it matches the search key. If it does, the search returns successfully. Otherwise the table index is increased by one (wrapping around if size is hit). The next entry's key is checked and so on until a match is found. But before each key is matched, the entry's flag is checked. An empty flag immediately terminates the search unsuccessfully, and entries that are deleted are skipped over. The search should terminate unsuccessfully if the entire table is checked and no entries are found.
Insertion is performed by calculating the table index and looping through the table until an empty or deleted cell is found, at which point the new key/data pair is inserted. Again, the the entire table is checked and there is no space free it should terminate unsuccessfully (or possibly resize the table, an expensive operation involving rehashing all of the entries into the new table).
Deletetion is like a search, except upon success the found entry is marked as deleted.
This hashing technique is space efficient (no extra fields used in linked list nodes) but also is limited in that the table size bounds the number of items that can be inserted (a problem not suffered with chained tables). Searching speed is good, and can be improved with DoubleHashing and other techniques. Insertion is slightly more expensive than with chained tables, but all around every operation will typically have good runtime.
Not Perfect Yet
Unfortunately, both of the above techniques are useless when it comes to resizing (the lists in the chaining technique will eventually become overly cumbersome). They require the program to create a new table of the desired size and rehash all the elements from the old array - painfully slow and inefficient. I can't remember any of the clever hash techniques right now, please add them to these subsections if you know them. There are also a lot of strange hybrid list-hash-tree structures out there, but create new pages for them unless the modification is small or relevant to the basic hash table.
Related Data Structures
None yet.
Related Algorithms
Sample Implementations
None yet.
Online Resources
None yet.
I find this writeup to be poorly organized and difficult to understand. Also there seem to be certain assumptions that are not stated. For example, the key is not explained at all. Will the author please rewrite it in a simpler manner? If nobody fixes it, then I will at a later time. --MikeLeonhard
I'm sorry, the writing is confused. I'm fixing the first section now and other sections as best as I can, but make whatever changes you feel are needed. After all, it is a wiki :l I'll assume the reader knows what I mean by modulo, but if I have to explain what a key is then perhaps another page should be made for that?
Hey, mysterious aliant.net user: How about signing in with a WikiWord name since you're making so many changes? -- AdamChlipala
--- Ah, this is SebastianHubbard, but i'm on vacation. I'll be back home tomorrow, so it was hardly worth signing in.
