A Deterministic Method of Determining a Document's Modified State

Introduction

This article addresses the problem of determining when a document has been modified. In the particular case of word processors, it appears that most programs tend to have a 'dirty' flag that is set when a user types. This seem to be the case even when a letter is typed, and then backspaced over (essentially leaving the document unchanged).

This article will demonstrate an improved method that uses a deterministic method. In addition, this article will bring together some of the author's personal favorites—MFC programming and cryptographic programming using Wei Dai's Crypto++ Library. The sample program is simply a Document/View SDI, with the View class derived from CEditView.

During recent reading, the author came across an article by Dr. Newcomer. Dr. Newcomer proposes the use of a Checksum Algorithm. Checksums are generally used to detect errors, not collisions. Because Dr. Newcomer's method employs a checksum, only 32 bits of material are used to determine the digest (a DWORD). The author prefers something a little larger. In addition, there are other algorithms that have been evaluated by the academic community and are generally believed to be collision resistant. Dr. Newcomer's algorithm may be collision resistant; the author does not know (and is not qualified to make a decision).

The method chosen is SHA or the Secure Hash Algorithm. SHA is a NIST-approved, collision-resistant hash developed by the NSA for NIST. Other hashes exist, such as MD4, MD5, and RIPE. Some popular hashes, such as MD4 and MD5, are no longer considered to be cryptographically secure. However, any collision resistant hash should suit your needs because cryptographic security is not a requirement.

Compiling and Integrating Crypto++ into the Microsoft Visual C++ Environment

Please see the related article, "Compiling and Integrating Crypto++ into the Microsoft Visual C++ Environment." This article is based upon basic assumptions presented in the previosly mentioned article. It also addresses most problems encountered from Command Line to MFC Projects (Errors C1083, C1189, LNK2001, LNK2004, and LNK2005).

The DirtyPad Program

MSDN has quite a few text editor examples—Superpad, Multipad, and WordPad to name a few. This article presents DirtyPad—a CEditView-derived class. The benefits of using a CEditView class is that the CEdit base class function GetWindowText(). One has easy and convenient access to the underlying string data for hashing.

Create a new MFC SDI application.

At Step 5, choose to use MFC as a statically linked library.

At the final screen of the Wizard (Step 6), choose to derive the View from CEditView.

In stdafx.h, add the following:

#ifdef UNICODE
#  pragma comment( linker, "/ENTRY:wWinMainCRTStartup" )
#endif

#ifdef _DEBUG
#  pragma comment( lib, "cryptlibd" )
#else
#  pragma comment( lib, "cryptlib" )
#endif

In dirtypadDoc.h, add the following protected member:

BYTE m_pcbDigest[ CryptoPP::SHA256::DIGESTSIZE ];

DIGESTSIZE is a constant defined in the SHA class, which is part of the CryptoPP namespace. In the case of SHA256, it is 256/8 = 32 bytes. If you chose a different hash, the digest may be a different size. m_pcbDigest will be updated as outlined below.

As a side note, the author tries not to pollute the global names space with symbols. So, instead of issuing a 'using namespace CryptoPP', objects are brought in as required with the scope operator. See "Migrating to Namespaces" by Herb Sutter in the October 2000 issue of Dr. Dobbs Journal for an excellent treatment of the subject.

DirtyPad will create snapshots of the document by means of a hash at strategic times during the document's life.

OnOpenDocument()

Although a new document is empty, the program needs to hash the document. An empty document does not equate to a NULL hash.

BOOL CDirtypadDoc::OnNewDocument()
{
   if (!CDocument::OnNewDocument())
      return FALSE;

   ((CEditView*)m_viewList.GetHead())->SetWindowText(NULL);

   CString szText;
   GetText( szText );

   CalculateDigest( szText, szText.GetLength(), m_pcbDigest );

   return TRUE;
}

OnSaveDocument(…)

OnSaveDocument(…) is a fairly trivial implementation: It calculates the hash of the document (saving its m_pcbDigest), and then allows the CDocument base class to perform its work.

BOOL CDirtypadDoc::OnSaveDocument(LPCTSTR lpszPathName)
{
   CString szText;
   GetText( szText );

   CalculateDigest( szText, szText.GetLength(), m_pcbDigest );

   return CDocument::OnSaveDocument(lpszPathName);
}

SaveModified()

SaveModified() is the function that will prove to be most interesting for this article. The function calculates a hash of the current document and compares it with a previously saved hash.

BOOL CDirtypadDoc::SaveModified() 
{
   BYTE pcbDigest[ CryptoPP::SHA256::DIGESTSIZE ];
   CString szText;

   if( FALSE == GetText( szText ) ){

      return CDocument::SaveModified();
   }

   if( FALSE == CalculateDigest( szText, szText.GetLength(),
                                 pcbDigest ) ){

      return CDocument::SaveModified();
   }

   if( 0 != memcmp( m_pcbDigest, pcbDigest,
                    CryptoPP::SHA256::DIGESTSIZE ) ) {

      return CDocument::SaveModified();
   }

   // Any non-zero will do
   return TRUE;
}

DirtyPad will always defer to CDocument::SaveModified() under the following two conditions:

  • The underlying CEdit string data is not available (GetText(…) returned FALSE)
  • The program fails at calculating the current document's hash

Should the above conditions not fail, the previous document's hash is compared to the current document's hash. If the hashes are the same, TRUE is returned to indicate DirtyPad handled the message. If the hashes are different, base class CDocument::SaveModified() is called. The later case presents the dialog below.

Add the following protected member function to calculate the hash of the document:

BOOL CDirtypadDoc::CalculateHash( LPCTSTR pszText, UINT nLength,
                                  BYTE *pcbDigest )
{
   try {

      // nLength is Character Count (not BYTE count)
      CryptoPP::SHA256 hash;
      hash.Update( (BYTE*)(LPCTSTR)pszText, nLength * sizeof(TCHAR) );
      hash.Final( pcbDigest );

   } catch( ... ) { bReturn = FALSE;  }

   return TRUE;
}

When hashing, if one finds the document has other data members (such as an INT) that should be included in the document's state for hashing, perform the following:

CryptoPP::SHA256 hash;

INT i = 1;
hash.Update( (UCHAR*)&i, sizeof(INT) );

One may be able to actually call CEdit::SetModify(...). However, be aware that MSDN states the following (from CEditView::GetEditCtrl(...) ):

WARNING! Using the CEdit object can change the state of the underlying Windows edit control. For example, you should not change the tab settings using the CEdit::SetTabStops function because CEditView caches these settings for use both in the edit control and in printing. Instead, use CEditView::SetTabStops.

A Note on Document Size

The author's sampling of common word processor documents revealed most were less than 50 Kb in size (although a 100 Kb file was not uncommon). In the case of Word documents, the statistic was 35 Kb with N ≈ 2800 and σ = 15 Kb. With that said, the following is offered for argumentative readers who like to offer 100 MB or 1.0 Gb files as an argument. Programming Applications for Microsoft Windows by Jeffrey Richter, Chapter 17, "Memory Mapped Files," should prove invaluable for working with a small subset of a large file. If one is modifying many parts of a large file, forgo hashing a write the entire file. At this point, the reader should consider redisigning his or her work to accomodate saving smaller file subsets.

Revisions

  • 11-14-2006 Updated Crypto++ Content Link
  • 9-19-2005 Initial Release


About the Author

Jeffrey Walton

In the past, I have worked as an IT consultant for County Government (Anne Arundel County), the Nuclear Energy Institute, the Treasury Department, and Social Security Administration as a Network Engineer and System Administrator. Primary Administration experience includes Microsoft Windows and Novell Netware, with additional exposure and familiarity with Mac and Linux OSes. Previous to the US government, I was a programmer for a small business using Microsoft Visual Languages (Basic 5.0, 6.0, and C++ 5.0, 6.0) and Scripting Languages. An undergraduate degree (BS in Computer Science) was obtained from University of Maryland, Baltimore County. Graduate work includes a Masters of Science (Computer Science) from Johns Hopkins University (expected before 2009). Training and Certifications include Microsoft, Checkpoint, and Cisco.

Downloads