A Deterministic Method of Determining a Document's Modified State

Introduction

This article addresses the problem of determining when a document has been modified. In the particular case of word processors, it appears that most programs tend to have a 'dirty' flag that is set when a user types. This seem to be the case even when a letter is typed, and then backspaced over (essentially leaving the document unchanged).

This article will demonstrate an improved method that uses a deterministic method. In addition, this article will bring together some of the author's personal favorites—MFC programming and cryptographic programming using Wei Dai's Crypto++ Library. The sample program is simply a Document/View SDI, with the View class derived from CEditView.

During recent reading, the author came across an article by Dr. Newcomer. Dr. Newcomer proposes the use of a Checksum Algorithm. Checksums are generally used to detect errors, not collisions. Because Dr. Newcomer's method employs a checksum, only 32 bits of material are used to determine the digest (a DWORD). The author prefers something a little larger. In addition, there are other algorithms that have been evaluated by the academic community and are generally believed to be collision resistant. Dr. Newcomer's algorithm may be collision resistant; the author does not know (and is not qualified to make a decision).

The method chosen is SHA or the Secure Hash Algorithm. SHA is a NIST-approved, collision-resistant hash developed by the NSA for NIST. Other hashes exist, such as MD4, MD5, and RIPE. Some popular hashes, such as MD4 and MD5, are no longer considered to be cryptographically secure. However, any collision resistant hash should suit your needs because cryptographic security is not a requirement.

Compiling and Integrating Crypto++ into the Microsoft Visual C++ Environment

Please see the related article, "Compiling and Integrating Crypto++ into the Microsoft Visual C++ Environment." This article is based upon basic assumptions presented in the previosly mentioned article. It also addresses most problems encountered from Command Line to MFC Projects (Errors C1083, C1189, LNK2001, LNK2004, and LNK2005).

The DirtyPad Program

MSDN has quite a few text editor examples—Superpad, Multipad, and WordPad to name a few. This article presents DirtyPad—a CEditView-derived class. The benefits of using a CEditView class is that the CEdit base class function GetWindowText(). One has easy and convenient access to the underlying string data for hashing.

Create a new MFC SDI application.

At Step 5, choose to use MFC as a statically linked library.

At the final screen of the Wizard (Step 6), choose to derive the View from CEditView.

In stdafx.h, add the following:

#ifdef UNICODE
#  pragma comment( linker, "/ENTRY:wWinMainCRTStartup" )
#endif

#ifdef _DEBUG
#  pragma comment( lib, "cryptlibd" )
#else
#  pragma comment( lib, "cryptlib" )
#endif

In dirtypadDoc.h, add the following protected member:

BYTE m_pcbDigest[ CryptoPP::SHA256::DIGESTSIZE ];

DIGESTSIZE is a constant defined in the SHA class, which is part of the CryptoPP namespace. In the case of SHA256, it is 256/8 = 32 bytes. If you chose a different hash, the digest may be a different size. m_pcbDigest will be updated as outlined below.

As a side note, the author tries not to pollute the global names space with symbols. So, instead of issuing a 'using namespace CryptoPP', objects are brought in as required with the scope operator. See "Migrating to Namespaces" by Herb Sutter in the October 2000 issue of Dr. Dobbs Journal for an excellent treatment of the subject.

DirtyPad will create snapshots of the document by means of a hash at strategic times during the document's life.

OnOpenDocument()

Although a new document is empty, the program needs to hash the document. An empty document does not equate to a NULL hash.

BOOL CDirtypadDoc::OnNewDocument()
{
   if (!CDocument::OnNewDocument())
      return FALSE;

   ((CEditView*)m_viewList.GetHead())->SetWindowText(NULL);

   CString szText;
   GetText( szText );

   CalculateDigest( szText, szText.GetLength(), m_pcbDigest );

   return TRUE;
}

OnSaveDocument(…)

OnSaveDocument(…) is a fairly trivial implementation: It calculates the hash of the document (saving its m_pcbDigest), and then allows the CDocument base class to perform its work.

BOOL CDirtypadDoc::OnSaveDocument(LPCTSTR lpszPathName)
{
   CString szText;
   GetText( szText );

   CalculateDigest( szText, szText.GetLength(), m_pcbDigest );

   return CDocument::OnSaveDocument(lpszPathName);
}

SaveModified()

SaveModified() is the function that will prove to be most interesting for this article. The function calculates a hash of the current document and compares it with a previously saved hash.

BOOL CDirtypadDoc::SaveModified() 
{
   BYTE pcbDigest[ CryptoPP::SHA256::DIGESTSIZE ];
   CString szText;

   if( FALSE == GetText( szText ) ){

      return CDocument::SaveModified();
   }

   if( FALSE == CalculateDigest( szText, szText.GetLength(),
                                 pcbDigest ) ){

      return CDocument::SaveModified();
   }

   if( 0 != memcmp( m_pcbDigest, pcbDigest,
                    CryptoPP::SHA256::DIGESTSIZE ) ) {

      return CDocument::SaveModified();
   }

   // Any non-zero will do
   return TRUE;
}

DirtyPad will always defer to CDocument::SaveModified() under the following two conditions:

  • The underlying CEdit string data is not available (GetText(…) returned FALSE)
  • The program fails at calculating the current document's hash

Should the above conditions not fail, the previous document's hash is compared to the current document's hash. If the hashes are the same, TRUE is returned to indicate DirtyPad handled the message. If the hashes are different, base class CDocument::SaveModified() is called. The later case presents the dialog below.

Add the following protected member function to calculate the hash of the document:

BOOL CDirtypadDoc::CalculateHash( LPCTSTR pszText, UINT nLength,
                                  BYTE *pcbDigest )
{
   try {

      // nLength is Character Count (not BYTE count)
      CryptoPP::SHA256 hash;
      hash.Update( (BYTE*)(LPCTSTR)pszText, nLength * sizeof(TCHAR) );
      hash.Final( pcbDigest );

   } catch( ... ) { bReturn = FALSE;  }

   return TRUE;
}

When hashing, if one finds the document has other data members (such as an INT) that should be included in the document's state for hashing, perform the following:

CryptoPP::SHA256 hash;

INT i = 1;
hash.Update( (UCHAR*)&i, sizeof(INT) );

One may be able to actually call CEdit::SetModify(...). However, be aware that MSDN states the following (from CEditView::GetEditCtrl(...) ):

WARNING! Using the CEdit object can change the state of the underlying Windows edit control. For example, you should not change the tab settings using the CEdit::SetTabStops function because CEditView caches these settings for use both in the edit control and in printing. Instead, use CEditView::SetTabStops.

A Note on Document Size

The author's sampling of common word processor documents revealed most were less than 50 Kb in size (although a 100 Kb file was not uncommon). In the case of Word documents, the statistic was 35 Kb with N ≈ 2800 and σ = 15 Kb. With that said, the following is offered for argumentative readers who like to offer 100 MB or 1.0 Gb files as an argument. Programming Applications for Microsoft Windows by Jeffrey Richter, Chapter 17, "Memory Mapped Files," should prove invaluable for working with a small subset of a large file. If one is modifying many parts of a large file, forgo hashing a write the entire file. At this point, the reader should consider redisigning his or her work to accomodate saving smaller file subsets.

Revisions

  • 11-14-2006 Updated Crypto++ Content Link
  • 9-19-2005 Initial Release


About the Author

Jeffrey Walton

In the past, I have worked as an IT consultant for County Government (Anne Arundel County), the Nuclear Energy Institute, the Treasury Department, and Social Security Administration as a Network Engineer and System Administrator. Primary Administration experience includes Microsoft Windows and Novell Netware, with additional exposure and familiarity with Mac and Linux OSes. Previous to the US government, I was a programmer for a small business using Microsoft Visual Languages (Basic 5.0, 6.0, and C++ 5.0, 6.0) and Scripting Languages. An undergraduate degree (BS in Computer Science) was obtained from University of Maryland, Baltimore County. Graduate work includes a Masters of Science (Computer Science) from Johns Hopkins University (expected before 2009). Training and Certifications include Microsoft, Checkpoint, and Cisco.

Downloads

Comments

  • I need your help

    Posted by allooba on 11/12/2007 06:39am

    I want to create rich text editor project. but I want to handle user input, I want to caption just 6 keys from keyboard and when user press these keys I want to print dots in screen. other thing the user can press more than one key at time. the project gaol is to simulate braille dots on screen. can you help me for this issu please? thanks. my email: alloob2@hotmail.com

    Reply
  • Limited practical usage...

    Posted by RoboTact on 09/23/2005 07:48pm

    If you indeed deal with document, command stack should handle simple cases with unchanged document (within some allowed depth at least). If it doesn't, hash may prove misleading as identical document may be represented by different internal representatin. Absence of such machinery in presence of discussed issue isn't good, and it may be stimulated by such hash hacks.

    • Re: Limited practical usage...

      Posted by jeffrey@toad.net on 09/23/2005 11:03pm

      Hi RoboTact, I'm not sure what you are stating. Can you give me an example? Thanks, Jeff

      Reply
    Reply
  • Can you explain the 'Probabilistic' aspect of the Method?

    Posted by Legacy on 05/22/2003 12:00am

    Originally posted by: Steve

    Jeff,

    Can you explain the 'Probabilistic' aspect of the Method for us? It looks like you hash the before and after documents and do a direct compare...

    Thanks,
    Steve

    • Re: Can you explain the 'Probabilistic' aspect of the Method?

      Posted by jwalto1 on 04/01/2004 05:41pm

      Hi Steve,
      
      It is probabilistic because there exists a chance
      that a document and the modified document will
      hash to the same value. I admit that it is very
      unlikely.
      
      Jeff

      Reply
    Reply
  • FIPS 140 Compliance of Crypto++

    Posted by Legacy on 05/21/2003 12:00am

    Originally posted by: Jeffrey Walton

    To those who are concerned about Wei Dai's Crypto++, and the FIPS 140 compliance...

    Per Wei:
    The testing lab received the first round of comments from NIST and is responding to them. Hopefully there will not need to be a second round, in which case we may be able to get the certificate in about a month.

    The original list message can be found on the Crypto++ mailing list, with ESMTP id h4L09e1d002743 from eskimo.com

    Jeff

    • Re: Only good for small documents

      Posted by jwalto1 on 04/01/2004 06:14pm

      Hi hmm,
      
      See Richter's 'Programming Applications for Microsoft Windows', 4th ed.,
      Chapter 17 - Memory-Mapped Files for ideas on working with larger files.
      In particular, page 620 has a section titled 'Processing a Big
      File Using Memory-Mapped Files'.
      
      Jeff

      Reply
    Reply
  • How does this perform with multi-MB documents?

    Posted by Legacy on 05/20/2003 12:00am

    Originally posted by: hmm

    .

    • Re: How does this perform with multi-MB documents?

      Posted by jwalto1 on 04/01/2004 06:12pm

      Hi hmm,
      
      See Richter's 'Programming Applications for Microsoft Windows', 4th ed.,
      Chapter 17 - Memory-Mapped Files for ideas on working with larger files.
      In particular, page 620 has a section titled 'Processing a Big
      File Using Memory-Mapped Files'.
      
      Jeff

      Reply
    Reply
Leave a Comment
  • Your email address will not be published. All fields are required.

Top White Papers and Webcasts

  • Live Event Date: October 29, 2014 @ 11:00 a.m. ET / 8:00 a.m. PT Are you interested in building a cognitive application using the power of IBM Watson? Need a platform that provides speed and ease for rapidly deploying this application? Join Chris Madison, Watson Solution Architect, as he walks through the process of building a Watson powered application on IBM Bluemix. Chris will talk about the new Watson Services just released on IBM bluemix, but more importantly he will do a step by step cognitive …

  • Companies must routinely transfer files and share data to run their business, work with partners, and speed operations. However, many find the traditional approach to file transfer lacks necessary security, is too complex and difficult to manage, does not support the levels of automation needed, and breaks down when addressing the file transfer requirements of new areas like Big Data analytics and mobile applications. This QuinStreet SmartSelect discusses how the changing business environment is making the use …

Most Popular Programming Stories

More for Developers

Latest Developer Headlines

RSS Feeds