Environment: Developed in (but not restricted to) VC6
Abstract
This article gives a brief, non-academic (no dwelling into time-complexity issues), explanation of what hash tables are about.
Table of Contents
Introduction
There are numerous ways to structure elements into a collection:
- Linked lists
- Binary trees
- Arrays
- And so forth…
They all have their advantages and disadvantages. A common disadvantage is that you have to do (sometimes a lot of) comparisons when you want to find an element (or find out that the element is not stored in the collection).
Although the Binary Tree can help to reduce the required comparisons, there are even faster ways to go element-hunting, ways that ideally require 0 (zero) comparisons!
The Array
The array might have its disadvantages, such as reallocation of the entire array, plus more as the array needs to grow. That has somehow given the array concept a “bad name.” Combine that with something like this:
SomeThing* findSomeThing(const CString& name,
const SomeArray& array)
{
for(unsigned int i=0; i<array.size(); ++i)
{
if (array[i]!=0 && name == array[i]->GetName())
return array[i];
}
return NULL;
}
This mimics the way one, for example, traverses a linked list to find an element. I think this is a rather common way that arrays are used (Have you ever done std::find on a std::vector?). However, when using arrays like that, you miss out the one major advantage it has over the rest of collections:
If you know the index to an element, you can receive it “instantenously.”
Now, doesn’t instantenously sound cool? Unfortunately, is there a small if, but that’s where the hash function comes in…
The Hash Function
So, what is the deal? Well, the concept is quite simple:
Let there be a function that can convert an element’s key into an index. This is the so-called “hash function.”
In other words, a modified version of the example above would be something like this:
SomeThing* findSomeThing(const CString& name,
const SomeArray& array)
{
unsigned int i = calculateIndexOfSomeThing(name, array.size());
return array[i];
}
Of course, the interesting question is this: What should the hash function, calculateIndexOfSomeThing, do? Note that all that calculateIndexOfSomeThing “sees” is the name and the array’s size, not the array itself. To put it generally, a hash function…
- …must return a value within the array (in other words, 0..array’s size – 1). This is the so-called “hash value.”
- …must, given a certain key and array size, always return the same hash value for that key. It must not be dependent on what else might be stored in the array, what day it is, and so forth.
- …should try to “spread out” the values to avoid different elements getting the same index (more about this further down).
- …should preferably be fast.
Here is an example of a simple (and not all that clever) hash function, that, given a string, returns a hash value:
unsigned int getHashValueForString(const char* str,
unsigned int arraySize)
{
return str[0] % arraySize;
};
As you see, it simply returns the first char, with an % arraySize to ensure that the value is between 0 and arraySize-1. There’d obviously be a conflict for an array such as {“Foo”, “Fnurt”}. And that’s where the thing about collisions come in.
Collisions
There probably will be collisions sooner or later. Consider an array where the key is a string of a maximum of 25 chars. How many possible strings are there that are up to 25 characters? Well, let’s not go into number-crunching; let’s just say that it’s insanely many. Consequently, to have an array that can have a unique index for each string, it would have to be insanely large. That in turn would naturally be…well…insane (and 25 chars isn’t all that much, now, is it?). If we don’t intend to buy all the memory storage in the world, there must be another way.
So, what can we do? Well, we can…
- …be prepared for collisions and make sure we can handle them.
- …at least try to avoid getting collisions.
Handling collisions
Uh oh, we have a collision! What can we do? There are a number of solutions:
- Chaining
Chaining lets the array’s elements actually be collections such as linked lists, binary trees (as is done in Rex Schilasky’s article), or even another hash table.
In other words, you can use the hash function to quickly resolve which collection the element is in, and then use a regular find in that collection. - Probing
Probing uses slots in the array other than the one originally indexed by the hash function. For example, when searching, it’d roughly (not the complete algorithm) be something like this: - Get the hash value, i
while (!someStopCondition && array[i%array.size()]->
getKey()!=theKeySearched)
{
i+=someValueLargerThan1;
…
}
If the array’s size is a prime number (it is—see the next section), the i%array.size() wraparound when i>array.size() will be quite efficient because it won’t be the original hash value again until all other slots have been checked.
When handling collisions, we no longer have the (oh, so cool) instantenous access to the elements; thus, we try to avoid them.
Avoiding collisions
There are some techniques to avoid collisions:
- Let the array size be a prime number
It can be mathematically proven that the probability of collisions is less if we let the hash function perform % on a prime number rather than on a non-prime number.
Consequently, the first step to avoid collisions is to make sure the array is sized after some prime number. - Spreading the values
For example, let the hash function be smart enough to distribute its return values as evenly as possible. How to determine the best algorithm for this is beyond the scope if this article (and its author)… - Increasing the array’s size
The more “free slots” in the array, the less probability of collisions. So, to avoid collisions, let the array grow proportionally and try to keep a fixed (free slots)/(array size) ratio. See Resizing the Array below.
Resizing the Array
As always with arrays, if there’s a need for it to grow, one will have to:
- Allocate an entirely new array of a larger scale
- Move elements to the new array
- Discard the old array
With hash arrays, Step 2 is a bit special; we can’t just copy them over. Instead, we need to use the hash function to ensure they get put in the right place. Note that the hash value of “Foo” in an array of 29 elements may be different than the hash value of “Foo” in an array of 23 elements.
The sequence of shuffling elements like this is commonly called “rehashing.”
Typical rehashing:
for (unsigned int i=0;i<oldArray.size();++)
{
if (oldArray[i]!=NULL)
{
unsigned int newHashValue = theHashFunction(oldArray[i]->
GetKey(), newArray.size());
newArray[newHashValue] = oldArray[i]
}
}
The same hash function is used when the old array was dealt with. The only difference is that it gets a different array size.
Disclaimer
Because this article is designed to be a brief and non-academic introduction to the hash concept, the mathematically most correct definitions of the concept are found elsewhere.
Hashing is a widely known concept and the author makes no claims in having invented it.
Downloads
About the source code (src.zip)
The source contains two hash table stuctures, HashTableChained and HashTableProbed, that illustrate what I’ve just been talking about.
Making really generic structures is always somewhat tricky as we all have different demands. The tables in the provided source address this by handling just the core structure. Much of the descisions are made in other classes and these classes are given as template parameters. Thus, they can easily be modified/replaced with something more apropriate.
The arrays are arrays of pointers where a NULL value denotes a (surprise) available slot.
The hash tables’ interface is somewhat inspired by std::map. They have value_type, iterator, const_iterator, find etc, mainly as its good to follow form (besides, it makes them interchangeable with the std::map as Collection parameter (see below)).
HashTableChained.h:
template <class Key, class Value, class MyHasher = Hasher<Key>, class MyGrower = DefaultGrower, class Collection = std::map<Key, Value> > class HashTableChained { ... }
- Key, Value
The key and value types—nothing special about these.. - MyHasher
A class that works like a function (a.k.a. a function object, a.k.a. functor). It is called whenever a Key’s hash value needs to be computed. See GenericHashers.h and/or the demo source for more info. - MyGrower
A class that determines if the array needs to grow. It is expected to have two methods: - size_t getPrimeGreaterThan(size_t size) const
That…well…should return the next prime bigger than the size parameter. - size_t getNewSize(size_t oldSize, size_t freeSlots) const
Queried when new elements are to be inserted. If the returned value > oldSize, the array will be increased (and rehashed, of course) to the returned size. - Collection
Because it’s a chained hash table, we need to define what kind of collection should be used. By default, std::map is used.
HashTableProbed.h:
Pretty much the same as HashTableChained except that it doesn’t have a Collection as a template parameter. It simply holds an array of pointers to a std::pair(Key, Value) structure.
As the name suggests, it will handle collisions by probing; in other words, go look for a free spot/key in the array.
GenericHashers.h:
A set of classes that can be used to compute the hash value.
DefaultGrower.h:
Holds the default implementation of a Grower class that holds a fairly small list of prime numbers. It will return a new, bigger, size if less than 10% of the slots are free.
The demo source (demo.zip)
TestHash.cpp
The demo is a simple console application that tests that the various hash tables work, illustrating things such as…
- …how to reuse the generic hasher classes for other Key types.
- …how to use another HashTable as Collection.
- …tow to use your own, customized, template parameters.
Check its source to find out more.
Enjoy!