Comparing Large Bodies of Text with Hash Codes

Welcome to this week's installment of .NET Tips & Techniques! Each week, award-winning Architect and Lead Programmer Tom Archer demonstrates how to perform a practical .NET programming task.

While most people think of hash codes in relation to security, hash codes are also a very fast means of comparing large text values. Using the standard Windows CryptoAPI can be very cumbersome, but the classes defined in the .NET System::Security::Cryptography namespace make hash codes, and other cryptographic functions, easier and more accessible than ever. In this article, I illustrate just how easy it is to compare two text values in a .NET application using hash codes.

Creating a hash code for a body of text is as simple as deciding which hashing algorithm you wish to use (for example, MD5, SHA1, and so forth), instantiating the appropriate .NET service provider object, and then calling that object's ComputeHash method. (All hash algorithm classes ultimately derive from the HashAlgorithm class and inherit its ComputeHash method; each derived class supplies its algorithm by overriding the protected HashCore and HashFinal methods.) Other than that, there's just the typical conversion between Byte (or Char) arrays and String objects, and you're done.

Figure 1 contains a screen capture of the demo application included with this article.

Figure 1: Simple C++ Managed Extensions example illustrating the comparison of two text (string) values using hash codes

The application uses the MD5 hash code algorithm to compare two input strings. The two fields below the two input fields are the actual hash codes. Below you'll find the code used to generate those hash codes and compare the results.

The code first uses the Encoding::ASCII->GetBytes method to convert the String values returned from the input controls to Byte arrays. An MD5CryptoServiceProvider object is then instantiated, and its ComputeHash method is called for each Byte array, producing a second Byte array that contains the hash code for the text value. The hash values are converted to String values with BitConverter::ToString, displayed on the demo dialog, and compared for equality; the result of the comparison is shown in a message box. That's it: just a few lines of code to compare two text values of virtually any length!

using namespace System::Security::Cryptography;
using namespace System::Text;

...

private: System::Void btnCompare_Click(System::Object *  sender,
                                       System::EventArgs *  e)
{
  try
  {
    // Convert the text values into Byte arrays
    Byte ba1[] = Encoding::ASCII->GetBytes(txt1->Text);
    Byte ba2[] = Encoding::ASCII->GetBytes(txt2->Text);

    MD5CryptoServiceProvider* md5csp = new MD5CryptoServiceProvider();

    // Get the hash values for each text value using ComputeHash
    Byte baHashCode1[] = md5csp->ComputeHash(ba1);
    Byte baHashCode2[] = md5csp->ComputeHash(ba2);
    
    // Convert the two hash code arrays into strings for display
    // and comparison (BitConverter::ToString yields hyphen-separated
    // hexadecimal pairs)
    txtHash1->Text = BitConverter::ToString(baHashCode1);
    txtHash2->Text = BitConverter::ToString(baHashCode2);

    // Display the results of the comparisons of the two hash codes
    MessageBox::Show(
      String::Format(S"The two values are {0}",
                     (0 == String::Compare(txtHash1->Text,
                                           txtHash2->Text)
                       ? S"the same" : S"different")));
  }
  catch(Exception* e)
  {
    MessageBox::Show(e->Message);
  }
}
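
For readers without a .NET toolchain, the same flow (text to bytes, bytes to hash, hash to display string, compare) can be sketched in standard C++. This is only an illustration under stated assumptions: standard C++ has no MD5, so the sketch substitutes 64-bit FNV-1a, which is not cryptographic and is not what the demo application uses. The names computeHash, toHexString, and sameByHash are hypothetical helpers, not .NET APIs; toHexString imitates the hyphen-separated uppercase hex layout that BitConverter::ToString produces.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in hash (assumption): 64-bit FNV-1a, NOT MD5 and not cryptographic.
// Returns the 8 hash bytes, most significant byte first.
std::vector<std::uint8_t> computeHash(const std::string& text)
{
    std::uint64_t h = 0xcbf29ce484222325ULL;      // FNV-1a 64-bit offset basis
    for (unsigned char c : text)
    {
        h ^= c;
        h *= 0x100000001b3ULL;                    // FNV 64-bit prime
    }

    std::vector<std::uint8_t> bytes(8);
    for (int i = 7; i >= 0; --i)
    {
        bytes[i] = static_cast<std::uint8_t>(h & 0xFF);
        h >>= 8;
    }
    return bytes;
}

// Format bytes as uppercase hex pairs separated by hyphens,
// for example "00-AB-1F".
std::string toHexString(const std::vector<std::uint8_t>& bytes)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i)
    {
        if (i != 0)
            out += '-';
        out += digits[bytes[i] >> 4];
        out += digits[bytes[i] & 0x0F];
    }
    return out;
}

// Compare two texts by the display strings of their hashes,
// mirroring the String::Compare call in the listing above.
bool sameByHash(const std::string& a, const std::string& b)
{
    return toHexString(computeHash(a)) == toHexString(computeHash(b));
}
```

Calling sameByHash("some text", "some text") returns true. A false result proves the inputs differ; a true result only shows that the hashes agree, which with any hash function is not a strict proof of equality.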


About the Author

Tom Archer - MSFT

I am a Program Manager and Content Strategist for the Microsoft MSDN Online team managing the Windows Vista and Visual C++ developer centers. Before being employed at Microsoft, I was awarded MVP status for the Visual C++ product. A 20+ year veteran of programming with various languages - C++, C, Assembler, RPG III/400, PL/I, etc. - I've also written many technical books (Inside C#, Extending MFC Applications with the .NET Framework, Visual C++.NET Bible, etc.) and 100+ online articles.


Comments

  • unique value

    Posted by mahmedm on 03/01/2006 12:42am

    Does it give a unique value for all strings? Say I have 50,000 words: will it return a unique hash code for every word?

  • How is this fast?

    Posted by KevinHall on 06/24/2004 02:26am

    Why wouldn't a byte-by-byte comparison (i.e. String::Compare() of the original text) be faster? With the crypto API, all the bytes must still be processed, but then there's the additional overhead of calculating the hash. Can you back up your claims with profiling data? I don't mean to come across as offensive -- I hope this comment doesn't sound that way. I am really trying to see if there is any true value in this method -- that's all. Also, there is the chance (however unlikely) that different texts could produce the same hash. This is something that should be addressed in your article.

    • This is not how to compare "large bodies of text"

      Posted by peljam on 04/21/2009 04:06pm

      I think if your goal is to compare large bodies of text String::Compare is faster and simpler as mentioned by Kevin. Your solution is the Rube Goldberg equivalent of string compares. It doesn't matter how fast the hash algorithm is. The point I believe Kevin and PhiLho made is that it isn't necessary, and only ADDS additional overhead and complication to this process. Performance wise it's probably similar to comparing both strings and also hashing them. The hashing part isn't needed. In addition PhiLho's point about having to compare them again anyway as the hash might not be unique makes your whole post irrelevant. You should name your post 'How to Hash a string'. That would be more accurate.

    • Not evasive at all

      Posted by Tom Archer on 07/01/2004 05:18pm

      My answer is not "evasive" simply because you don't agree with it. Kevin's specific question had to do with performance, which I *did* answer. Regarding your take on hash codes, I definitely disagree with you, but that has nothing to do with the performance question that was asked and answered.

    • Evasive answer?

      Posted by PhiLho on 07/01/2004 07:17am

      Sorry, but I find your answer a bit beside the point, perhaps because you think at a higher level than us...

      To answer Kevin, I believe a hash is indeed useless if you just have to compare two strings once: just iterate simultaneously byte by byte along the two strings and stop at the end or when different bytes are found.

      Now, hashing can be useful if you have a reference string, and several strings to compare: you just have to compute the hash once for the reference, you don't have to scan it over and over.

      Hashing is also useful for smaller strings, like finding if a textual key is inside an associative array (ie. an array indexed by strings).
      Here, the good side is that hash is computed once for the keys (and stored with them), and once for the string to search, and you just have to compare integer values.

      The risk of collision depends on the algorithm.
      MD5 and such may be lengthy to compute but are useful for large strings and reputed to be quite robust against the risk of collision (same value for different strings).

      Some simpler hashes, like CRC32, are faster and smaller to store, and still useful for small strings at the risk of more collisions. Here, if the hash values differ, you are sure you have different strings. If they are equal, you should compare byte by byte to be sure of the equality, which isn't so costly for small strings.

      HTH.
      Philippe Lhoste
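
The approach described in this comment (compute each hash once, store it with its text, treat differing hashes as proof of inequality, and confirm equal hashes byte by byte) might look roughly like the following in standard C++. It is a sketch under assumptions rather than anything taken from the article: 64-bit FNV-1a stands in for MD5 or CRC32 purely because it is short, and HashedText, sameText, and countMatches are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Stand-in hash (assumption): 64-bit FNV-1a, chosen only for brevity.
std::uint64_t fnv1a64(const std::string& text)
{
    std::uint64_t h = 0xcbf29ce484222325ULL;      // FNV-1a 64-bit offset basis
    for (unsigned char c : text)
    {
        h ^= c;
        h *= 0x100000001b3ULL;                    // FNV 64-bit prime
    }
    return h;
}

// Hypothetical helper: a text stored together with its hash, so the
// hash is computed once and never recomputed on later comparisons.
struct HashedText
{
    std::string text;      // declared first, so it is initialized first
    std::uint64_t hash;

    explicit HashedText(std::string t)
        : text(std::move(t)), hash(fnv1a64(text)) {}
};

// Different hashes prove the texts differ. Equal hashes are only a
// hint, so the texts are then compared byte by byte to rule out a
// collision.
bool sameText(const HashedText& a, const HashedText& b)
{
    if (a.hash != b.hash)
        return false;
    return a.text == b.text;
}

// Count the candidates equal to one reference text; the reference is
// scanned in full only when a candidate's hash matches.
std::size_t countMatches(const HashedText& reference,
                         const std::vector<HashedText>& candidates)
{
    std::size_t matches = 0;
    for (const HashedText& candidate : candidates)
    {
        if (sameText(reference, candidate))
            ++matches;
    }
    return matches;
}
```

The cost of hashing is paid once per stored text. Hashing a candidate on the fly just to compare it a single time would be slower than comparing the bytes directly, which is the objection raised earlier in this thread.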

    • Answers to questions:

      Posted by Tom Archer on 06/24/2004 11:25am

      Thanks for your feedback. Here are the answers to your questions: The hash algorithms have been optimized and tweaked for many years by people whose sole programming role is writing the fastest and most efficient algorithms possible. Can others duplicate their efforts? Of course. However, the overwhelming majority of application developers are more focused (as they should be) on writing applications - not bit-twiddling to the extent that algorithm experts are expected to. Regarding the uniqueness of hash codes, this is something I address in my "Extending MFC Applications with the .NET Framework" book. The purpose of this column of articles is to provide short tips and techniques. In other words, CodeGuru specifically asked that I keep these articles short and to the point such that a reader can read the article in just a few minutes and save themselves anywhere from a half hour to several hours of digging through documentation looking for the solution.
