Introduction to Hash Tables

Environment: Developed in (but not restricted to) VC6

Abstract

This article gives a brief, non-academic (no dwelling into time-complexity issues), explanation of what hash tables are about.

Table of Contents

Introduction

There are numerous ways to structure elements into a collection:

They all have their advantages and disadvantages. A common disadvantage is that you have to do (sometimes a lot of) comparisons when you want to find an element (or find out that the element is not stored in the collection).

Although the Binary Tree can help to reduce the required comparisons, there are even faster ways to go element-hunting, ways that ideally require 0 (zero) comparisons!

The Array

The array might have its disadvantages, such as reallocation of the entire array, plus more as the array needs to grow. That has somehow given the array concept a "bad name." Combine that with something like this:

SomeThing* findSomeThing(const CString& name,
                         const SomeArray& array)
{
  for(unsigned int i=0; i<array.size(); ++i)
  {
    if (array[i]!=0 && name == array[i]->GetName())
      return array[i];
  }
  return NULL;
}

This mimics the way one, for example, traverses a linked list to find an element. I think this is a rather common way that arrays are used (Have you ever done std::find on a std::vector?). However, when using arrays like that, you miss out the one major advantage it has over the rest of collections:

If you know the index to an element, you can receive it "instantenously."

Now, doesn't instantenously sound cool? Unfortunately, is there a small if, but that's where the hash function comes in...

The Hash Function

So, what is the deal? Well, the concept is quite simple:

Let there be a function that can convert an element's key into an index. This is the so-called "hash function."

In other words, a modified version of the example above would be something like this:

SomeThing* findSomeThing(const CString& name,
                         const SomeArray& array)
{
  unsigned int i = calculateIndexOfSomeThing(name, array.size());
  return array[i];
}

Of course, the interesting question is this: What should the hash function, calculateIndexOfSomeThing, do? Note that all that calculateIndexOfSomeThing "sees" is the name and the array's size, not the array itself. To put it generally, a hash function...

  • ...must return a value within the array (in other words, 0..array's size - 1). This is the so-called "hash value."
  • ...must, given a certain key and array size, always return the same hash value for that key. It must not be dependent on what else might be stored in the array, what day it is, and so forth.
  • ...should try to "spread out" the values to avoid different elements getting the same index (more about this further down).
  • ...should preferably be fast.

Here is an example of a simple (and not all that clever) hash function, that, given a string, returns a hash value:

  unsigned int getHashValueForString(const char* str,
                                     unsigned int arraySize)
  {
    return str[0] % arraySize;
  };

As you see, it simply returns the first char, with an % arraySize to ensure that the value is between 0 and arraySize-1. There'd obviously be a conflict for an array such as {"Foo", "Fnurt"}. And that's where the thing about collisions come in.

Collisions

There probably will be collisions sooner or later. Consider an array where the key is a string of a maximum of 25 chars. How many possible strings are there that are up to 25 characters? Well, let's not go into number-crunching; let's just say that it's insanely many. Consequently, to have an array that can have a unique index for each string, it would have to be insanely large. That in turn would naturally be...well...insane (and 25 chars isn't all that much, now, is it?). If we don't intend to buy all the memory storage in the world, there must be another way.

So, what can we do? Well, we can...

  1. ...be prepared for collisions and make sure we can handle them.
  2. ...at least try to avoid getting collisions.

Handling collisions

Uh oh, we have a collision! What can we do? There are a number of solutions:

  • Chaining
    Chaining lets the array's elements actually be collections such as linked lists, binary trees (as is done in
    Rex Schilasky's article), or even another hash table.
    In other words, you can use the hash function to quickly resolve which collection the element is in, and then use a regular find in that collection.
  • Probing
    Probing uses slots in the array other than the one originally indexed by the hash function. For example, when searching, it'd roughly (not the complete algorithm) be something like this:
    1. Get the hash value, i
    2. while (!someStopCondition && array[i%array.size()]->
                                getKey()!=theKeySearched)
      {
        i+=someValueLargerThan1;
        ...
      }
      
      If the array's size is a prime number (it is—see the next section), the i%array.size() wraparound when i>array.size() will be quite efficient because it won't be the original hash value again until all other slots have been checked.

When handling collisions, we no longer have the (oh, so cool) instantenous access to the elements; thus, we try to avoid them.

Avoiding collisions

There are some techniques to avoid collisions:

  1. Let the array size be a prime number
    It can be mathematically proven that the probability of collisions is less if we let the hash function perform % on a prime number rather than on a non-prime number.
    Consequently, the first step to avoid collisions is to make sure the array is sized after some prime number.
  2. Spreading the values
    For example, let the hash function be smart enough to distribute its return values as evenly as possible. How to determine the best algorithm for this is beyond the scope if this article (and its author)...
  3. Increasing the array's size
    The more "free slots" in the array, the less probability of collisions. So, to avoid collisions, let the array grow proportionally and try to keep a fixed (free slots)/(array size) ratio. See
    Resizing the Array below.

Resizing the Array

As always with arrays, if there's a need for it to grow, one will have to:

  1. Allocate an entirely new array of a larger scale
  2. Move elements to the new array
  3. Discard the old array

With hash arrays, Step 2 is a bit special; we can't just copy them over. Instead, we need to use the hash function to ensure they get put in the right place. Note that the hash value of "Foo" in an array of 29 elements may be different than the hash value of "Foo" in an array of 23 elements.

The sequence of shuffling elements like this is commonly called "rehashing."

Typical rehashing:

for (unsigned int i=0;i<oldArray.size();++)
{
   if (oldArray[i]!=NULL)
   {
      unsigned int newHashValue = theHashFunction(oldArray[i]->
                                  GetKey(), newArray.size());
      newArray[newHashValue] = oldArray[i]
   }
}

The same hash function is used when the old array was dealt with. The only difference is that it gets a different array size.

Disclaimer

Because this article is designed to be a brief and non-academic introduction to the hash concept, the mathematically most correct definitions of the concept are found elsewhere.

Hashing is a widely known concept and the author makes no claims in having invented it.

Downloads

About the source code (src.zip)

The source contains two hash table stuctures, HashTableChained and HashTableProbed, that illustrate what I've just been talking about.

Making really generic structures is always somewhat tricky as we all have different demands. The tables in the provided source address this by handling just the core structure. Much of the descisions are made in other classes and these classes are given as template parameters. Thus, they can easily be modified/replaced with something more apropriate.

The arrays are arrays of pointers where a NULL value denotes a (surprise) available slot.

The hash tables' interface is somewhat inspired by std::map. They have value_type, iterator, const_iterator, find etc, mainly as its good to follow form (besides, it makes them interchangeable with the std::map as Collection parameter (see below)).

HashTableChained.h:

template <class Key, class Value,
         class MyHasher   = Hasher<Key>,
         class MyGrower   = DefaultGrower,
         class Collection = std::map<Key, Value>
         >
class HashTableChained
{ ... }
  • Key, Value
    The key and value types—nothing special about these..
  • MyHasher
    A class that works like a function (a.k.a. a function object, a.k.a. functor). It is called whenever a Key's hash value needs to be computed. See GenericHashers.h and/or the demo source for more info.
  • MyGrower
    A class that determines if the array needs to grow. It is expected to have two methods:
    • size_t getPrimeGreaterThan(size_t size) const
      That...well...should return the next prime bigger than the size parameter.
    • size_t getNewSize(size_t oldSize, size_t freeSlots) const
      Queried when new elements are to be inserted. If the returned value > oldSize, the array will be increased (and rehashed, of course) to the returned size.
  • Collection
    Because it's a chained hash table, we need to define what kind of collection should be used. By default, std::map is used.

HashTableProbed.h:
Pretty much the same as HashTableChained except that it doesn't have a Collection as a template parameter. It simply holds an array of pointers to a std::pair(Key, Value) structure.

As the name suggests, it will handle collisions by probing; in other words, go look for a free spot/key in the array.

GenericHashers.h:
A set of classes that can be used to compute the hash value.

DefaultGrower.h:
Holds the default implementation of a Grower class that holds a fairly small list of prime numbers. It will return a new, bigger, size if less than 10% of the slots are free.

The demo source (demo.zip)

TestHash.cpp
The demo is a simple console application that tests that the various hash tables work, illustrating things such as...

  • ...how to reuse the generic hasher classes for other Key types.
  • ...how to use another HashTable as Collection.
  • ...tow to use your own, customized, template parameters.

Check its source to find out more.

Enjoy!



Downloads

Comments

  • The Crucial Element In order to rule the mizuno-scene Is Pretty Simple and easy!

    Posted by Acuddence on 05/04/2013 03:20am

    Most recent questions about mizuno resolved not to mention the reasons why you should really analyze each concept on this documentation.[url=http://www.nikejpgolf.biz/]ナイキゴルフ[/url] An utter double change on nike [url=http://www.nikejpgolf.biz/nike-ゴルフボール-c-23.html]ナイキgolf[/url] Fresh new questions about mizuno replied to in addition to reasons why you have got to analyze each and every word of this specific write up. [url=http://www.nikejpgolf.biz/nike-アイアン-c-1.html]nike ゴルフ[/url] Unbiased page brings out 2 innovative new things around nike that noone is mentioning. [url=http://www.nikejpgolf.biz/nike-アイアン-c-1.html]ナイキゴルフ[/url] Some of the nike Commerce Meaning - Who likes practically nothing profit?! [url=http://www.nikejpgolf.biz/nike-ゴルフシューズ-c-15.html]nike dunk[/url] Products and developing throughout Las Vegas : mizuno basically leaves without any see you later [url=http://www.nikeyasuyi.com/]ナイキ[/url] Things and release in Las Vegas, Nevada -- nike basically leaves without any farewell [url=http://www.nikeyasuyi.com/nikeナイキRunning-c-3.html]ナイキランニング[/url] Some mizuno Smaller business Meaning - Who cares for pretty much nothing benefits?! [url=http://www.nikeyasuyi.com/nikeナイキDunk-c-9.html]nike シューズ[/url] All nike Business organisation Meaning : People who cares for zilch gains all bonuses? [url=http://www.nikeyasuyi.com/nikeナイキDunk-c-9.html]nike シューズ[/url] nike can provide new life span to the old subject- metallic standardized

    Reply
  • The Secret master the nike-market Is Kind Of Straight forward!

    Posted by Acuddence on 04/27/2013 03:55pm

    Beginner queries about mizuno resolved and in addition the reasons you will need read in detail every statement on this write up.[url=http://www.nikejpgolf.biz/]nike ゴルフ[/url] A definite double sprain on nike [url=http://www.nikejpgolf.biz/nike-ゴルフボール-c-23.html]nikegolf[/url] Different queries about nike replied and in addition the reasons you should definitely browse through each and every word in this report. [url=http://www.nikejpgolf.biz/nike-アイアン-c-1.html]ナイキ[/url] Honest summary reveals A number of new things about nike that noone is speaking about. [url=http://www.nikejpgolf.biz/nike-アイアン-c-1.html]ナイキ[/url] The most important mizuno Agency Dialogue - Which means, who likes little or nothing is the winner?? [url=http://www.nikejpgolf.biz/nike-ゴルフシューズ-c-15.html]nike dunk[/url] Things and processing in Los Angeles -- nike will leave with no thanks [url=http://www.nikeyasuyi.com/]ナイキ スニーカー[/url] Goods and development throughout Las Vegas, Nevada - mizuno has left without any kind regards [url=http://www.nikeyasuyi.com/nikeナイキRunning-c-3.html]ナイキランニング[/url] Some of the nike Marketing Talk : Consumers who cares for little or nothing benefits?!? [url=http://www.nikeyasuyi.com/nikeナイキDunk-c-9.html]nike シューズ[/url] Typically the nike Enterprise Presentation : People who cares about nothing profits?!? [url=http://www.nikeyasuyi.com/nikeナイキDunk-c-9.html]nike dunk[/url] mizuno gives all new life span to an old topic- golden standardized

    Reply
  • help...

    Posted by chungwufeiQ on 02/20/2010 09:39am

    can u simplify the codes just enaf for dummies... i'm having my report for hashing, hash tables and scatter tables... can't really understand the codes... anyway, were just gonna hash integer values...

    Reply
  • Thanks...

    Posted by Promotional Engine on 01/02/2007 12:48am

    Your easy to understand and usefull article is very good. (All of Introductions must be the same!)

    Reply
  • Added source

    Posted by PerFnurt on 08/31/2004 02:56am

    Added source for illustration

    Reply
  • Addendum: Another plain introduction to hash tables

    Posted by Legacy on 02/23/2004 12:00am

    Originally posted by: Fred

    http://epaperpress.com/sortsearch/index.html

    Reply
  • Yet again thanks

    Posted by Legacy on 02/22/2004 12:00am

    Originally posted by: Darwen

    Look, I have to say that most of the articles about hashing go into huge details about the number theory behind it.

    Which in my opinion makes them unreadable.

    Your article was concise, and neat.

    This (Hey listen up admins !) should be added to the FAQ.

    Well done again... a lovely article.

    Darwen.

    Reply
  • Slow container

    Posted by Legacy on 02/20/2004 12:00am

    Originally posted by: Caprice

    Where it can be used?

    Reply
  • Good article

    Posted by Legacy on 02/19/2004 12:00am

    Originally posted by: Darwen

    What a good article ! Should be self explanitory for most people. Except chimpanzees and oranges of course.

    Reply
Leave a Comment
  • Your email address will not be published. All fields are required.

Top White Papers and Webcasts

  • Live Event Date: October 29, 2014 @ 11:00 a.m. ET / 8:00 a.m. PT Are you interested in building a cognitive application using the power of IBM Watson? Need a platform that provides speed and ease for rapidly deploying this application? Join Chris Madison, Watson Solution Architect, as he walks through the process of building a Watson powered application on IBM Bluemix. Chris will talk about the new Watson Services just released on IBM bluemix, but more importantly he will do a step by step cognitive …

  • Agile methodologies give development and test teams the ability to build software at a faster rate than ever before. Combining DevOps with hybrid cloud architectures give teams not just the principles, but also the technology necessary to achieve their goals. By combining hybrid cloud and DevOps: IT departments maintain control, visibility, and security Dev/test teams remain agile and collaborative Organizational barriers are broken down Innovation and automation can thrive Download this white paper to …

Most Popular Programming Stories

More for Developers

Latest Developer Headlines

RSS Feeds