Using a Pair Class as Efficient Key Objects

Efficient Keys with the CPair Class

Sometimes, you need to store complex data in a key object for a map class, such as a Hashtable or a TreeMap.

I recently worked on a project where some data related to two integers. Initially, the code used a key of type String that was created by the string representation of the two key integers separated by a fixed delimiter. I wanted to see how much this would speed up with a class that specifically stores the two integers. I created a CPair<int, int> class and compared the speed with the previous implementation. The results surprised me.

The implementation of a CPair class, used as a key, is trivial. To be used as a key, a type must implement the equals() and the hashCode() methods from Object. The equals() method just compares the two ints in the pair. To implement hashCode(), the CPair class multiplies the first member of the pair with a prime and adds the second member. To choose the prime, see the following test results.

//Moderately high prime used for hashing
private final int m_radix = 2221;

/** Calculate the hash code. We cache the hash so we don't have
 *  to calculate it on each hashCode() call.
 */
public int calcHash()
{
   // TODO: Could make this overflow proof -- with large numbers,
   // xor higher and lower words
   return (m_first * m_radix + m_second);
}

/** Hash code. Object.hashCode override.
 *
 */
public int hashCode() {
   return m_hash;
}

/** Equals -- Object.equals override
 * 
 *  @return true if both first and second values in obj match this.
 */
public boolean equals(Object obj)
{
   if (null == obj)
      return false;
   if (obj instanceof CPair)
   {
      final CPair right = (CPair)obj;

      return ((right.m_first  == m_first) &&
              (right.m_second == m_second));
   }
   return false;
}

I found a slight performance advantage to using ints over longs.

The CPair class as provided is immutable and used only as a key. Thus, there are no getters or setters.

Test Results

There is test code in the CPair class to compare the performance of using a String and using the CPair class as a key object to a Hashtable. The test code is accessed through the main() method. The command line takes three arguments: two sizes and a lookup count. A Hashtable is created and objects added in a double-loop; the first loop has a number of iterations given by the first argument, and the second loop is limited by the second argument. The values in the map are Integer objects with the value of the current iteration of each of the two loops multiplied together.

More importantly, all this is done twice, once with String keys and once with CPair keys. All the logic besides the keys is identical. The third argument, the lookup count, specifies how many lookups to do on each Hashtable, as a minimum. The code will make a full iteration through each Hashtable a number of times, until it has done a number of lookups equal to or larger than the lookup count. Aside: Although in a manner of speaking, there are two dimensions to each entry, a Hashtable is one-dimensional (barring collisions, but from the interface it's always one-dimensional).

Times are measured with System.currentTimeMillis(). There are more precise mechanisms, but this gave a good ballpark and for large enough volumes, easily captured the difference in performance. I inserted gc() calls and optimized the initial heap size to avoid garbage collection during the loops. Although the reduced memory usage of CPair will result in less time spent in the garbage collector, I wanted to know the difference in lookup times.

In general, the CPair method was 30%-75% faster in setting up the Hashtables. Accessing the values in the Hashtable was a factor of 4-8 times faster. In a few test cases, I saw the CPair implementation almost ten times faster than String.

One curiosity is that CPair seemed to do worse with a small number of iterations in the first loop, and a high number of iterations in the second loop. The String class did worse with a single iteration in the first loop and a high count in the second loop. This gives a bit of insight into the distributions of the hashing algorithms of the two classes, respectively, assuming their relatively low performance occurs where there are the most collisions in the hash distribution. At its worst, the CPair class still outperformed the String class by a factor of two, and a factor of four after fine tuning.

I used the results to calibrate the factor I used to calculate the hash for the CPair class. Given the range I expected the Hashtable to operate in, I settled on a value of 2221. In the code, this is the radix value. I wanted to find the optimum value and use a final int. If you use pairs where the first value covers a wide range, or can get large, you could get by with a smaller radix, which would reduce overflows. If your Java implementation throws a runtime exception on overflows, you will want to eliminate them completely. If you use pairs where the first value is low and the second value covers a large range, you may want to use an even higher radix to reduce collisions.

If you use different maps with different weights, you can also make the radix a member variable and pass it in the constructor, or, preferably, use generics.

When to Use the Class

The pair class can be used whenever you need to store information relating to two objects. Variants can be used to store information relating to multiple objects. For instance, you can implement a triplet class to store information relating to three objects. The following two examples involve the pair class as is.

Imagine you are building an e-commerce site that displays special offers. There's a global list of special offers. Specific products may have additional discounts associated with an offer. For instance, a web site selling computer equipment may have a special offer on a USB storage card, with another discount if the user is viewing a laptop, where the storage card is considered an accessory. Thus, you will need to quickly look up any additional discount(s) associated with a product and an offer. In this example, you can cache the discounts in a hashtable using CPair<product-id, offer-id> as the key.

As another example, imagine you are building a simulation involving people. For two people, pA and pB, you store how pA is disposed towards pB (acquaintance, friend, likes, dislikes, and so forth). By using a hashtable using CPair<person-id, person-id> as the key, looking up pA's disposition towards pB is fast.

Note: The key is directional, so that CPair<pA, pB> is a different key from CPair<pB, pA>, assuming pA and pB are different.

In these examples, by storing the information in a hashtable, you get constant lookup times for fast access to the data. Hashtables give you linear lookup times in the worst case, but this only happens with an overwhelming number of objects to the number of bins. Because the library implementation of hashtable uses integers, you can't increase the number if bins, but you can fine-tune the key distribution as described above. The memory footprint of both hashtable and CPair is tiny, so overall you have a fast and lean way of storing cross-related data.

Conclusion

Creating data classes to represent keys where multiple, primitive types are involved can gain you up to an order of magnitude in lookup performance. You can use the CPair class verbatim if your data consists of two integers, or modify it if your data is more complex. You can use the test code to calibrate your hashing algorithms.

I look forward to your comments.



About the Author

Henri Hein

Granite Tower
Granite Tower Software
See also my article: Auto-Testing Internet Explorer Browser Control

Downloads

Comments

  • There are no comments yet. Be the first to comment!

Leave a Comment
  • Your email address will not be published. All fields are required.

Top White Papers and Webcasts

  • Wednesday, September 24, 2014 8:00 AM - 9:00 AM PDT According to a recent Forrester Research report, many companies are choosing low-code platforms over traditional programming platforms, due to the speed with which low-code apps can be assembled and tested. With customer-facing applications on the rise, traditional programming platforms simply can't keep up with the "short schedules and rapid change cycles" required to develop these applications. Check out this upcoming webinar and join Clay Richardson from …

  • Live Event Date: September 10, 2014 @ 11:00 a.m. ET / 8:00 a.m. PT Modern mobile applications connect systems-of-engagement (mobile apps) with systems-of-record (traditional IT) to deliver new and innovative business value. But the lifecycle for development of mobile apps is also new and different. Emerging trends in mobile development call for faster delivery of incremental features, coupled with feedback from the users of the app "in the wild". This loop of continuous delivery and continuous feedback is …

Most Popular Programming Stories

More for Developers

Latest Developer Headlines

RSS Feeds