Introduction to Hash Tables

Environment: Developed in (but not restricted to) VC6

Abstract

This article gives a brief, non-academic (no dwelling into time-complexity issues), explanation of what hash tables are about.

Table of Contents

Introduction

There are numerous ways to structure elements into a collection:

They all have their advantages and disadvantages. A common disadvantage is that you have to do (sometimes a lot of) comparisons when you want to find an element (or find out that the element is not stored in the collection).

Although the Binary Tree can help to reduce the required comparisons, there are even faster ways to go element-hunting, ways that ideally require 0 (zero) comparisons!

The Array

The array might have its disadvantages, such as reallocation of the entire array, plus more as the array needs to grow. That has somehow given the array concept a "bad name." Combine that with something like this:

SomeThing* findSomeThing(const CString& name,
                         const SomeArray& array)
{
  for(unsigned int i=0; i<array.size(); ++i)
  {
    if (array[i]!=0 && name == array[i]->GetName())
      return array[i];
  }
  return NULL;
}

This mimics the way one, for example, traverses a linked list to find an element. I think this is a rather common way that arrays are used (Have you ever done std::find on a std::vector?). However, when using arrays like that, you miss out the one major advantage it has over the rest of collections:

If you know the index to an element, you can receive it "instantenously."

Now, doesn't instantenously sound cool? Unfortunately, is there a small if, but that's where the hash function comes in...

The Hash Function

So, what is the deal? Well, the concept is quite simple:

Let there be a function that can convert an element's key into an index. This is the so-called "hash function."

In other words, a modified version of the example above would be something like this:

SomeThing* findSomeThing(const CString& name,
                         const SomeArray& array)
{
  unsigned int i = calculateIndexOfSomeThing(name, array.size());
  return array[i];
}

Of course, the interesting question is this: What should the hash function, calculateIndexOfSomeThing, do? Note that all that calculateIndexOfSomeThing "sees" is the name and the array's size, not the array itself. To put it generally, a hash function...

  • ...must return a value within the array (in other words, 0..array's size - 1). This is the so-called "hash value."
  • ...must, given a certain key and array size, always return the same hash value for that key. It must not be dependent on what else might be stored in the array, what day it is, and so forth.
  • ...should try to "spread out" the values to avoid different elements getting the same index (more about this further down).
  • ...should preferably be fast.

Here is an example of a simple (and not all that clever) hash function, that, given a string, returns a hash value:

  unsigned int getHashValueForString(const char* str,
                                     unsigned int arraySize)
  {
    return str[0] % arraySize;
  };

As you see, it simply returns the first char, with an % arraySize to ensure that the value is between 0 and arraySize-1. There'd obviously be a conflict for an array such as {"Foo", "Fnurt"}. And that's where the thing about collisions come in.

Collisions

There probably will be collisions sooner or later. Consider an array where the key is a string of a maximum of 25 chars. How many possible strings are there that are up to 25 characters? Well, let's not go into number-crunching; let's just say that it's insanely many. Consequently, to have an array that can have a unique index for each string, it would have to be insanely large. That in turn would naturally be...well...insane (and 25 chars isn't all that much, now, is it?). If we don't intend to buy all the memory storage in the world, there must be another way.

So, what can we do? Well, we can...

  1. ...be prepared for collisions and make sure we can handle them.
  2. ...at least try to avoid getting collisions.

Handling collisions

Uh oh, we have a collision! What can we do? There are a number of solutions:

  • Chaining
    Chaining lets the array's elements actually be collections such as linked lists, binary trees (as is done in
    Rex Schilasky's article), or even another hash table.
    In other words, you can use the hash function to quickly resolve which collection the element is in, and then use a regular find in that collection.
  • Probing
    Probing uses slots in the array other than the one originally indexed by the hash function. For example, when searching, it'd roughly (not the complete algorithm) be something like this:
    1. Get the hash value, i
    2. while (!someStopCondition && array[i%array.size()]->
                                getKey()!=theKeySearched)
      {
        i+=someValueLargerThan1;
        ...
      }
      
      If the array's size is a prime number (it is—see the next section), the i%array.size() wraparound when i>array.size() will be quite efficient because it won't be the original hash value again until all other slots have been checked.

When handling collisions, we no longer have the (oh, so cool) instantenous access to the elements; thus, we try to avoid them.

Avoiding collisions

There are some techniques to avoid collisions:

  1. Let the array size be a prime number
    It can be mathematically proven that the probability of collisions is less if we let the hash function perform % on a prime number rather than on a non-prime number.
    Consequently, the first step to avoid collisions is to make sure the array is sized after some prime number.
  2. Spreading the values
    For example, let the hash function be smart enough to distribute its return values as evenly as possible. How to determine the best algorithm for this is beyond the scope if this article (and its author)...
  3. Increasing the array's size
    The more "free slots" in the array, the less probability of collisions. So, to avoid collisions, let the array grow proportionally and try to keep a fixed (free slots)/(array size) ratio. See
    Resizing the Array below.

Resizing the Array

As always with arrays, if there's a need for it to grow, one will have to:

  1. Allocate an entirely new array of a larger scale
  2. Move elements to the new array
  3. Discard the old array

With hash arrays, Step 2 is a bit special; we can't just copy them over. Instead, we need to use the hash function to ensure they get put in the right place. Note that the hash value of "Foo" in an array of 29 elements may be different than the hash value of "Foo" in an array of 23 elements.

The sequence of shuffling elements like this is commonly called "rehashing."

Typical rehashing:

for (unsigned int i=0;i<oldArray.size();++)
{
   if (oldArray[i]!=NULL)
   {
      unsigned int newHashValue = theHashFunction(oldArray[i]->
                                  GetKey(), newArray.size());
      newArray[newHashValue] = oldArray[i]
   }
}

The same hash function is used when the old array was dealt with. The only difference is that it gets a different array size.

Disclaimer

Because this article is designed to be a brief and non-academic introduction to the hash concept, the mathematically most correct definitions of the concept are found elsewhere.

Hashing is a widely known concept and the author makes no claims in having invented it.

Downloads

About the source code (src.zip)

The source contains two hash table stuctures, HashTableChained and HashTableProbed, that illustrate what I've just been talking about.

Making really generic structures is always somewhat tricky as we all have different demands. The tables in the provided source address this by handling just the core structure. Much of the descisions are made in other classes and these classes are given as template parameters. Thus, they can easily be modified/replaced with something more apropriate.

The arrays are arrays of pointers where a NULL value denotes a (surprise) available slot.

The hash tables' interface is somewhat inspired by std::map. They have value_type, iterator, const_iterator, find etc, mainly as its good to follow form (besides, it makes them interchangeable with the std::map as Collection parameter (see below)).

HashTableChained.h:

template <class Key, class Value,
         class MyHasher   = Hasher<Key>,
         class MyGrower   = DefaultGrower,
         class Collection = std::map<Key, Value>
         >
class HashTableChained
{ ... }
  • Key, Value
    The key and value types—nothing special about these..
  • MyHasher
    A class that works like a function (a.k.a. a function object, a.k.a. functor). It is called whenever a Key's hash value needs to be computed. See GenericHashers.h and/or the demo source for more info.
  • MyGrower
    A class that determines if the array needs to grow. It is expected to have two methods:
    • size_t getPrimeGreaterThan(size_t size) const
      That...well...should return the next prime bigger than the size parameter.
    • size_t getNewSize(size_t oldSize, size_t freeSlots) const
      Queried when new elements are to be inserted. If the returned value > oldSize, the array will be increased (and rehashed, of course) to the returned size.
  • Collection
    Because it's a chained hash table, we need to define what kind of collection should be used. By default, std::map is used.

HashTableProbed.h:
Pretty much the same as HashTableChained except that it doesn't have a Collection as a template parameter. It simply holds an array of pointers to a std::pair(Key, Value) structure.

As the name suggests, it will handle collisions by probing; in other words, go look for a free spot/key in the array.

GenericHashers.h:
A set of classes that can be used to compute the hash value.

DefaultGrower.h:
Holds the default implementation of a Grower class that holds a fairly small list of prime numbers. It will return a new, bigger, size if less than 10% of the slots are free.

The demo source (demo.zip)

TestHash.cpp
The demo is a simple console application that tests that the various hash tables work, illustrating things such as...

  • ...how to reuse the generic hasher classes for other Key types.
  • ...how to use another HashTable as Collection.
  • ...tow to use your own, customized, template parameters.

Check its source to find out more.

Enjoy!



Downloads

Comments

  • help...

    Posted by chungwufeiQ on 02/20/2010 09:39am

    can u simplify the codes just enaf for dummies... i'm having my report for hashing, hash tables and scatter tables... can't really understand the codes... anyway, were just gonna hash integer values...

    Reply
  • Thanks...

    Posted by Promotional Engine on 01/02/2007 12:48am

    Your easy to understand and usefull article is very good. (All of Introductions must be the same!)

    Reply
  • Added source

    Posted by PerFnurt on 08/31/2004 02:56am

    Added source for illustration

    Reply
  • Addendum: Another plain introduction to hash tables

    Posted by Legacy on 02/23/2004 12:00am

    Originally posted by: Fred

    http://epaperpress.com/sortsearch/index.html

    Reply
  • Yet again thanks

    Posted by Legacy on 02/22/2004 12:00am

    Originally posted by: Darwen

    Look, I have to say that most of the articles about hashing go into huge details about the number theory behind it.

    Which in my opinion makes them unreadable.

    Your article was concise, and neat.

    This (Hey listen up admins !) should be added to the FAQ.

    Well done again... a lovely article.

    Darwen.

    Reply
  • Slow container

    Posted by Legacy on 02/20/2004 12:00am

    Originally posted by: Caprice

    Where it can be used?

    Reply
  • Good article

    Posted by Legacy on 02/19/2004 12:00am

    Originally posted by: Darwen

    What a good article ! Should be self explanitory for most people. Except chimpanzees and oranges of course.

    Reply
Leave a Comment
  • Your email address will not be published. All fields are required.

Top White Papers and Webcasts

  • As all sorts of data becomes available for storage, analysis and retrieval - so called 'Big Data' - there are potentially huge benefits, but equally huge challenges...
  • The agile organization needs knowledge to act on, quickly and effectively. Though many organizations are clamouring for "Big Data", not nearly as many know what to do with it...
  • Cloud-based integration solutions can be confusing. Adding to the confusion are the multiple ways IT departments can deliver such integration...

Most Popular Programming Stories

More for Developers

RSS Feeds

Thanks for your registration, follow us on our social networks to keep up-to-date