Using a Pair Class as Efficient Key Objects

Efficient Keys with the CPair Class

Sometimes, you need to store complex data in a key object for a map class, such as a Hashtable or a TreeMap.

I recently worked on a project where some data related to two integers. Initially, the code used a key of type String that was created by the string representation of the two key integers separated by a fixed delimiter. I wanted to see how much this would speed up with a class that specifically stores the two integers. I created a CPair<int, int> class and compared the speed with the previous implementation. The results surprised me.

The implementation of a CPair class, used as a key, is trivial. To be used as a key, a type must implement the equals() and the hashCode() methods from Object. The equals() method just compares the two ints in the pair. To implement hashCode(), the CPair class multiplies the first member of the pair with a prime and adds the second member. To choose the prime, see the following test results.

//Moderately high prime used for hashing
private final int m_radix = 2221;

/** Calculate the hash code. We cache the hash so we don't have
 *  to calculate it on each hashCode() call.
 */
public int calcHash()
{
   // TODO: Could make this overflow proof -- with large numbers,
   // xor higher and lower words
   return (m_first * m_radix + m_second);
}

/** Hash code. Object.hashCode override.
 *
 */
public int hashCode() {
   return m_hash;
}

/** Equals -- Object.equals override
 * 
 *  @return true if both first and second values in obj match this.
 */
public boolean equals(Object obj)
{
   if (null == obj)
      return false;
   if (obj instanceof CPair)
   {
      final CPair right = (CPair)obj;

      return ((right.m_first  == m_first) &&
              (right.m_second == m_second));
   }
   return false;
}

I found a slight performance advantage to using ints over longs.

The CPair class as provided is immutable and used only as a key. Thus, there are no getters or setters.

Test Results

There is test code in the CPair class to compare the performance of using a String and using the CPair class as a key object to a Hashtable. The test code is accessed through the main() method. The command line takes three arguments: two sizes and a lookup count. A Hashtable is created and objects added in a double-loop; the first loop has a number of iterations given by the first argument, and the second loop is limited by the second argument. The values in the map are Integer objects with the value of the current iteration of each of the two loops multiplied together.

More importantly, all this is done twice, once with String keys and once with CPair keys. All the logic besides the keys is identical. The third argument, the lookup count, specifies how many lookups to do on each Hashtable, as a minimum. The code will make a full iteration through each Hashtable a number of times, until it has done a number of lookups equal to or larger than the lookup count. Aside: Although in a manner of speaking, there are two dimensions to each entry, a Hashtable is one-dimensional (barring collisions, but from the interface it's always one-dimensional).

Times are measured with System.currentTimeMillis(). There are more precise mechanisms, but this gave a good ballpark and for large enough volumes, easily captured the difference in performance. I inserted gc() calls and optimized the initial heap size to avoid garbage collection during the loops. Although the reduced memory usage of CPair will result in less time spent in the garbage collector, I wanted to know the difference in lookup times.

In general, the CPair method was 30%-75% faster in setting up the Hashtables. Accessing the values in the Hashtable was a factor of 4-8 times faster. In a few test cases, I saw the CPair implementation almost ten times faster than String.

One curiosity is that CPair seemed to do worse with a small number of iterations in the first loop, and a high number of iterations in the second loop. The String class did worse with a single iteration in the first loop and a high count in the second loop. This gives a bit of insight into the distributions of the hashing algorithms of the two classes, respectively, assuming their relatively low performance occurs where there are the most collisions in the hash distribution. At its worst, the CPair class still outperformed the String class by a factor of two, and a factor of four after fine tuning.

I used the results to calibrate the factor I used to calculate the hash for the CPair class. Given the range I expected the Hashtable to operate in, I settled on a value of 2221. In the code, this is the radix value. I wanted to find the optimum value and use a final int. If you use pairs where the first value covers a wide range, or can get large, you could get by with a smaller radix, which would reduce overflows. If your Java implementation throws a runtime exception on overflows, you will want to eliminate them completely. If you use pairs where the first value is low and the second value covers a large range, you may want to use an even higher radix to reduce collisions.

If you use different maps with different weights, you can also make the radix a member variable and pass it in the constructor, or, preferably, use generics.

When to Use the Class

The pair class can be used whenever you need to store information relating to two objects. Variants can be used to store information relating to multiple objects. For instance, you can implement a triplet class to store information relating to three objects. The following two examples involve the pair class as is.

Imagine you are building an e-commerce site that displays special offers. There's a global list of special offers. Specific products may have additional discounts associated with an offer. For instance, a web site selling computer equipment may have a special offer on a USB storage card, with another discount if the user is viewing a laptop, where the storage card is considered an accessory. Thus, you will need to quickly look up any additional discount(s) associated with a product and an offer. In this example, you can cache the discounts in a hashtable using CPair<product-id, offer-id> as the key.

As another example, imagine you are building a simulation involving people. For two people, pA and pB, you store how pA is disposed towards pB (acquaintance, friend, likes, dislikes, and so forth). By using a hashtable using CPair<person-id, person-id> as the key, looking up pA's disposition towards pB is fast.

Note: The key is directional, so that CPair<pA, pB> is a different key from CPair<pB, pA>, assuming pA and pB are different.

In these examples, by storing the information in a hashtable, you get constant lookup times for fast access to the data. Hashtables give you linear lookup times in the worst case, but this only happens with an overwhelming number of objects to the number of bins. Because the library implementation of hashtable uses integers, you can't increase the number if bins, but you can fine-tune the key distribution as described above. The memory footprint of both hashtable and CPair is tiny, so overall you have a fast and lean way of storing cross-related data.

Conclusion

Creating data classes to represent keys where multiple, primitive types are involved can gain you up to an order of magnitude in lookup performance. You can use the CPair class verbatim if your data consists of two integers, or modify it if your data is more complex. You can use the test code to calibrate your hashing algorithms.

I look forward to your comments.



About the Author

Henri Hein

Granite Tower
Granite Tower Software
See also my article: Auto-Testing Internet Explorer Browser Control

Downloads

Comments

  • There are no comments yet. Be the first to comment!

Leave a Comment
  • Your email address will not be published. All fields are required.

Top White Papers and Webcasts

  • Live Event Date: October 29, 2014 @ 11:00 a.m. ET / 8:00 a.m. PT Are you interested in building a cognitive application using the power of IBM Watson? Need a platform that provides speed and ease for rapidly deploying this application? Join Chris Madison, Watson Solution Architect, as he walks through the process of building a Watson powered application on IBM Bluemix. Chris will talk about the new Watson Services just released on IBM bluemix, but more importantly he will do a step by step cognitive …

  • It's no secret what keeps CIOs up at night. Mobile, cloud, data, security, and social have become the "five imperatives," the drivers of business progress, innovation, and competitive differentiation. Business leaders around the world want to hear how other companies are succeeding. How are they applying the latest technologies? How did they get started? What outcomes are they achieving? Read this online magazine for success stories from organizations like the NBA, Pfizer, and San Jose State University as they …

Most Popular Programming Stories

More for Developers

Latest Developer Headlines

RSS Feeds