Message Digests and Digital Fingerprints

1 ABSTRACT

A one-way hash function, also known as a message digest, takes an input string of arbitrary length and produces a fixed length hash. You could think of it as a digital fingerprint. It is generally thought that no two keys (e.g. pass phrases) can produce the same hash value. Therefore, when a one-way cryptographic hash is used as a password authentication method, it does not try to decrypt the hash and compare the resulting plain text but rather encrypt the user-supplied plain text (password), hash it, and then compare the hash values.

MD5 (Message Digest version 5) is a one-way hash function designed by Ron Rivest of RSA fame. After processing a message of arbitrary length, it produces a 128-bit hash as a representation of the original plain text.

2 AIM

To implement the MD5 Message-Digest algorithm based on RFC1321 in the C++ programming language using the Standard Template Library (STL). The resulting application must produce message digests, or digital fingerprints, of both text and binary files and text strings. The focus is not optimisation, but rather the integration of STL containers and algorithms into an object-oriented MD5 implementation and achieve a better understanding of how the MD5 algorithm works.

This will not simply be a copy of Ron Rivest’s C code found in [1] pasted into a class called MD5, which, unfortunately a lot of articles seem to do.

The rest of the document will refer to a program, jmd5, which is the author’s implementation of the MD5 algorithm. The prefix j simply represents the author’s initial.

3 CONVENTIONS USED IN THIS DOCUMENT

Italic: Reserved for program variables.

BOLD: Used for headings such as the ones on this page.

Constant width: Used for code snippets and examples throughout the paper.

Underlined words: Reserved for sub-headings or sections related to the headings in bold.

MD5

The MD5 algorithm was designed and first implemented in C by Ron Rivest of RSA Data Security. It is an open standard (patent and trademark free) and is documented well in RFC1321. Compared to other cryptographic algorithms, MD5 is quite compact and not a lot of code is needed to implement it.

The input message is first padded to be sure it is 64 bits short of a multiple of 512. A single ‘1’ bit is first appended and then a series of ‘0’ bits until the message is congruent to 448 modulo 512. The remaining 64 bits are used to store the message size prior to the padding. At this stage, the message should now be an exact multiple of 512. Along with the initialisation of four 32-bit constants, a set of 64 ‘magic’ numbers is constructed according to the following formula:

Where S is the set, i is a subscript, a is a function that calculates the absolute value of its input, and s calculates the sine of its input which, in this case, is i .These constants are used along with several bitwise operations to process the message in blocks resulting in a 128-bit message digest.

This C++ implementation is based heavily on the original C code by Ron Rivest. However, many design changes have been made to incorporate the C++STL. The resulting application was tested against the tool ‘md5sum’ available on the majority of Linux distributions and, although slower (C++ STL overhead), produces the same hash.

jmd5—The MD5 class

Diagram 1: The UML Class diagram of jmd5

The code associated with the above class diagram can be found in Appendix A. Other than swapping arrays for vectors, memset() for fill(), and memcpy() for copy(), and so forth, the class design tries to keep in with the framework of STL-based containers and templates.

The class constructor uses the STL template class istreambuf_iterator <char> to copy the input into a vector <char> if the input is a text or binary file and a simple copy() if the input is a text string from stdin or a time stamp:

if (file || input == "-") {
   istream *stream = &cin;

   if (input != "-")
      stream = new ifstream (input.c_str (), ios::in);

   if (!stream ->good ())
      throw ErrorMessage ("Failed to open '%s' for reading.",
                          input.c_str ());

   copy (istreambuf_iterator<char> (*stream),
         istreambuf_iterator<char> (),
         back_inserter (plaintext));

   if (!file)
      plaintext.pop_back ();
   if (stream && stream != &cin)
      delete stream;
} else
   copy (input.begin (), input.end (), back_inserter (plaintext));

The above code determines whether the input is a text string or file and whether the source (file) is a file on disk or the stdin stream. Once the input has been copied to the class attribute plaintext, jmd5 processes the plain text in blocks of 64 bytes (512 bits) using the following algorithm:

byte *data        = NULL;
const uint block  = 64;
uint total        = 0, bytes = 0, last = 0;
const uint length = plaintext.size ();
while ((total += bytes) < length) {
   if ((total + block) > length)
      bytes = length - total;
   else
      bytes = block;

   if (!data)
      data = new byte [bytes ];
   else if (bytes != last) {
      delete [] data;
      data = NULL;
      data = new byte [bytes ];
   }
   if (!data)
      throw ErrorMessage ("Failed to allocate %d bytes of heap
                           storage.", bytes);

   fill (data, data + bytes, 0);
   copy (plaintext.begin () + total, (plaintext.begin () + total)
         + bytes, data);

   update data, bytes);
   last = bytes;
}
finalise ();

The data type ‘byte’ shown above is one of two typedefs:

typedef unsigned char byte;
typedef unsigned long int word;

You will later see a mathematical formula that describes how 64 ‘magic’ numbers are generated. The C++ code that does this can be found within the init() function:

<snip>

static vector<word> magic (64);    // global static variable

</snip>

static const double p = pow (2, 32);

for (uint r = 0 ; r < magic.size () ; r++)
   magic [r ] = static_cast<word> (p * fabs (sin (r  + 1)));

The update() member function is the starting point of the actual MD5 algorithm implemented within jmd5. Update() is called for every block of data and upon exit of the while loop the function finalise() is called. It is inside this function where the padding is done (and its length appended) and the final digest is encoded according to the four internal states (32-bit words) of the class set by the transform() function. Most of the work is done by transform() and the other member functions it calls: decode(), op(), f(), g(), h(), i(), and rl(). The functions f(), g() ,h(), and i() are all bit-manipulators and rl() simply left-rotates all the bits in a word:

inline word jmd5::rl (const word &x, const word &n) const {
   return ((x << n) | (x >> (32 - n)));
}

inline word jmd5::f (const word &x, const word &y, const word &z) const {
   return x & y | ~x & z;
}

inline word jmd5::g (const word &x, const word &y, const word &z) const {
   return x & z | y & ~z;
}

inline word jmd5::h (const word &x, const word &y, const word &z) const {
   return x ^ y ^ z;
}

inline word jmd5::i (const word &x, const word &y, const word &z) const {
   return y ^ (x | ~z);
}

The transform() function calls the op() function 16 times for each of the four rounds passing as parameters the current states (each time in a different order) and two magic numbers (one being an integer constant less than 24 and the other produced by the formula described later).

inline void jmd5::op (const char &func, const word &a, const word &b,
                      const word c&, word &d, const word &x,
                      const word &s, const word &m) const {
   switch (func) {
      case 'f':
         a += f (b, c, d) + x + m;
         break;
      case 'g':
         a += g (b, c, d) + x + m;
         break;
      case 'h':
         a += h (b, c, d) + x + m;
         break;
      case 'i':
         a += i (b, c, d) + x + m;
         break;
   }
   a = rl (a, s) + b;
}

A snippet of the transform() member function:

inline void jmd5::transform (byte block [64 ]) {
   word a = state [0 ], b = state [1 ], c = state [2 ],
        d = state [3 ], x [16 ];

   decode  (block, x, 64);

   op ('f', a, b, c, d, x [ 0 ], S11, magic.at (0));   // Round 1
   ...    // 15 more calls to op() here
   op ('g', a, b, c, d, x [ 1 ], S21, magic.at (16));  // Round 2
   ...    // 15 more calls to op() here
   op ('h', a, b, c, d, x [ 5 ], S31, magic.at (32));  // Round 3
   ...    // 15 more calls to op() here
   op ('i', a, b, c, d, x [ 0 ], S41, magic.at (48));  // Round 4
   ... // 15  more  calls  to  op() here

   state [0 ] += a;
   state [1 ] += b;
   state [2 ] += c;
   state [3 ] += d;
}

The finalise() function concatenates the resulting four states to produce the 128-bit message digest. Jmd5 provides a friend function that overloads the <<(extraction) operator so std::ostream objects can print the digest similarly to other data types:

jmd5 digest ("PI ");
const double pi = 3.14159265358979323846;

std::cout << "PI is " << pi << " the MD5 hash of 'PI' is "
          << digest << endl;

The friend function is defined within the class declaration as:

friend ostream& operator << (ostream& stream, const jmd5 &digest) {
   string hash;
   char bits [3 ];

   fill (bits, bits + sizeof (bits), '\0');

   for (uint i = 0 ; i < digest.size () ; i++) {
      snprintf (bits, sizeof (bits), "%02x", digest [i ]);
      hash += bits;
   }

   return stream << hash;
};

It was originally planned to use std::hex along with std:cout; however, std::cout did not allow the same precision as the standard C function snprintf() (or printf()) and would have resulted in jmd5 being incompatible with other MD5 implementations.

The access operator, [ ], is used to access individual elements of the digest by passing it a subscript—just as you would with other STL containers:

const uint jmd5::operator [] (const uint &index) const {
   return digest [index ];
}

More by Author

Get the Free Newsletter!

Subscribe to Developer Insider for top news, trends & analysis

Must Read