CFileInfoArray

FCompare screenshot

Abstract



I’ve been involved in some projects that required gathering files across
directories, and this class allows just that: it gathers file information
recursively by directory and, as a bonus track, it also calculates a 32-bit
file checksum (note this is not the NT executable checksum calculated by
MapFileAndCheckSum) and a 32-bit file CRC (with borrowed code;
I didn’t feel like re-inventing the wheel, and the other option was to
review my Codification Theory notes, and I’m a bit allergic to dust).


The second part of this article presents FCompare, a sample
application that shows CFileInfo and CFileInfoArray in use.
This application does the following:



  1. Recursive search of source and target files to compare, given a
    directory and a filemask.

  2. Binary comparison of source with target files by their size,
    partial/total content, partial/total checksum or partial/total CRC.


  3. Feeds a listview with matched filenames and paths.

Updates:



1999-9-23 ATL (v1.4)


  • Corrected yet another bug in GetCRC and GetChecksum, as suggested
    (again!) by Rsbert Szucs: it ought to be 4-(dwRead & 0x3) instead of
    dwRead & 0x3 when calc’ing the padding mask.

1999-9-16 ATL (v1.3)


  • Corrected a bug in GetCRC and GetChecksum, as suggested by Rsbert Szucs:
    there was a buffer overflow, and the checksum and CRC were calc’ed for the
    last dword +1 instead of for the last dword. Instead of accessing
    buffer[dwRead +3…] it ought to access buffer[dwRead…] (shame on me! :'().

1999-9-2 ATL (v1.2)


  • Corrected a bug in Create(CString, LPARAM), as suggested by Nhycoh:
    there was some weird stuff in CFileInfo::Create(strFilePath, lparam)
    stating strFilePath.GetLength()-nBarPos instead of nBarPos+1 (I’m quite
    sure I left my head on my pillow the day I did that %-#).
  • Updated GetCRC & GetChecksum to avoid some bug cases.

1999-4-30 ATL (v1.1, Internal Release)


  • Corrected a bug when setting timers: the requested timer id was 0, and MSDN help states
    that the timer id must be greater than 0. This bug was pointed out to me by
    Javier Maura (although the timer _does_ work even if its id is 0!!!).
    The user is now also warned that progress will not be reported if SetTimer fails.

1999-4-7 ATL (v1.1, Internal Release)


  • Updated source code doc to conform to the Autoduck 2.0 standard

  • Corrected bug in CFileInfoArray::AddDir as suggested by Zhuang Yuyao:
    bIncludeDirs wasn’t used if bRecurse was false.
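
The v1.3 and v1.4 fixes above both concern the last, partial dword of a file. As a rough illustration only, here is a minimal portable sketch of a 32-bit dword-sum checksum that handles such a tail. It is not the article’s GetChecksum: the name Checksum32, the little-endian word assembly and the zero padding of the tail are all assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not the article's GetChecksum: sums a buffer as
// little-endian 32-bit words, zero-padding the final partial word.
std::uint32_t Checksum32(const unsigned char* data, std::size_t size)
{
    std::uint32_t sum = 0;
    std::size_t i = 0;

    // Whole words: 4 bytes at a time.
    for (; i + 4 <= size; i += 4) {
        const std::uint32_t word =
              static_cast<std::uint32_t>(data[i])
            | static_cast<std::uint32_t>(data[i + 1]) << 8
            | static_cast<std::uint32_t>(data[i + 2]) << 16
            | static_cast<std::uint32_t>(data[i + 3]) << 24;
        sum += word;  // unsigned overflow wraps, i.e. truncates to 32 bits
    }

    // Trailing 1..3 bytes: only (size & 3) bytes are valid; the remaining
    // 4 - (size & 3) bytes of the last word are treated as zero (assumption).
    std::uint32_t tail = 0;
    for (std::size_t k = 0; i + k < size; ++k)
        tail |= static_cast<std::uint32_t>(data[i + k]) << (8 * k);

    return sum + tail;
}
```

Reading never goes past data[size - 1], which is the kind of overrun the v1.3 note describes, and the number of padded bytes is 4 - (size & 3), the quantity named in the v1.4 note.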

Building environment


VC++ 6.0, with warning level 4.

Tested on Windows NT 4.0 and Windows 95.

Although not tested, I guess CFileInfo, CFileInfoArray and
FCompare can be safely recompiled for Unicode.

CFileInfo and CFileInfoArray


/**
* @class Stores information about a file in a way similar to MFC's CFileFind
*/
class CFileInfo {
public:
/** @access Public members */
CFileInfo();
/**
* @cmember Copy constructor
* @parm CFileInfo to copy member variables from.
*/
CFileInfo(const CFileInfo& finf);

/**
* @cmember Destructor
*/
~CFileInfo();

/**
* @cmember Initializes CFileInfo member variables.
* @parm Values to init member variables.
* @parm Path of the file the CFileInfo refers to.
* @parmopt User defined parameter.
*/
void Create(const WIN32_FIND_DATA* pwfd, const CString strPath,
LPARAM lParam=NULL);

/**
* @cmember Initializes CFileInfo member variables.
* @parm Absolute path for file or directory
* @parmopt User defined parameter.
*/
void Create(const CString strFilePath, LPARAM lParam = NULL);

/**
* @cmember Calcs 32bit checksum of file (i.e. sum of all the DWORDS
* of the file, truncated to 32bit).
* @parmopt Number of maximum bytes read for checksum calculation. This
* number is
* up-rounded to a multiple of 4 bytes (DWORD). If 0 or bigger than
* uhFileSize, checksum for all the file is calculated.
* @parmopt Force recalculation of checksum (otherwise if checksum
* has already been calculated, it isn't calculated again and previous
* calculated value is returned).
* @parmopt Flag to allow calling application to abort the calculation
* of checksum (for multithreaded applications).
* @parmopt Pointer to counter of bytes whose checksum has been calculated.
* This value is updated while checksum is being calculated, so calling
* application can view the progress of checksum calc (for multithreaded
* applications).
* Maximum value for pulCount is uhFileSize.
*/
DWORD GetChecksum(const ULONGLONG uhUpto=0, const BOOL bRecalc = FALSE,
const volatile BOOL* pbAbort=NULL, volatile ULONG* pulCount = NULL);

/**
* @cmember Calcs 32bit CRC of file contents (i.e. CRC of all the
* DWORDS of the file).
* @parmopt Number of maximum bytes read for CRC calculation. This
* number is up-rounded to a multiple of 4 bytes (DWORD). If 0 or
* bigger than uhFileSize, CRC for all the file is calculated.
* @parmopt Force recalculation of CRC (otherwise if CRC has already
* been calculated, it isn't calculated again and previous calculated
* value is returned).
* @parmopt pbAbort Flag to allow calling application to abort the
* calculation of CRC (for multithreaded applications).
* @parmopt Pointer to counter of bytes whose CRC has been calculated.
* This value is updated while CRC is being calculated, so calling
* application can view the progress of CRC calc (for multithreaded
* applications).
* Maximum value for pulCount is uhFileSize.
*/
DWORD GetCRC(const ULONGLONG dhUpto=0, const BOOL bRecalc = FALSE,
const volatile BOOL* pbAbort=NULL, volatile ULONG* pulCount = NULL);

/** @cmember File size in bytes as a DWORD value. */
DWORD GetLength(void) const { return (DWORD) m_uhFileSize; };
/** @cmember File size in bytes as an ULONGLONG value. */
ULONGLONG GetLength64(void) const { return m_uhFileSize; };

/** Get File split info (equivalent to CFileFind members) */

/**
* @cmember Gets the file drive
* @rdesc Returns C: for C:\WINDOWS\WIN.INI
*/
CString GetFileDrive(void) const;
/**
* @cmember Gets the file dir
* @rdesc Returns \WINDOWS\ for C:\WINDOWS\WIN.INI
*/
CString GetFileDir(void) const;
/** @cmember returns WIN for C:\WINDOWS\WIN.INI */
CString GetFileTitle(void) const;
/** @cmember returns .INI for C:\WINDOWS\WIN.INI */
CString GetFileExt(void) const;
/** @cmember returns C:\WINDOWS\ for C:\WINDOWS\WIN.INI */
CString GetFileRoot(void) const { return GetFileDrive() + GetFileDir(); };
/** @cmember returns WIN.INI for C:\WINDOWS\WIN.INI */
CString GetFileName(void) const { return GetFileTitle() + GetFileExt(); };
/** @cmember returns C:\WINDOWS\WIN.INI for C:\WINDOWS\WIN.INI */
const CString& GetFilePath(void) const { return m_strFilePath; }

/* Get File times info (equivalent to CFileFind members) */
/** @cmember returns creation time */
const CTime& GetCreationTime(void) const { return m_timCreation; };
/** @cmember returns last access time */
const CTime& GetLastAccessTime(void) const { return m_timLastAccess; };
/** @cmember returns last write time */
const CTime& GetLastWriteTime(void) const { return m_timLastWrite; };

/* Get File attributes info (equivalent to CFileFind members) */
/** @cmember returns file attributes */
DWORD GetAttributes(void) const
{ return m_dwAttributes; };

/** @cmember returns TRUE if the file is a directory */
BOOL IsDirectory(void) const
{ return m_dwAttributes & FILE_ATTRIBUTE_DIRECTORY; };

/** @cmember Returns TRUE if the file has archive bit set */
BOOL IsArchived(void) const
{ return m_dwAttributes & FILE_ATTRIBUTE_ARCHIVE; };

/** @cmember Returns TRUE if the file is read-only */
BOOL IsReadOnly(void) const
{ return m_dwAttributes & FILE_ATTRIBUTE_READONLY; };

/** @cmember Returns TRUE if the file is compressed */
BOOL IsCompressed(void) const
{ return m_dwAttributes & FILE_ATTRIBUTE_COMPRESSED; };

/** @cmember Returns TRUE if the file is a system file */
BOOL IsSystem(void) const
{ return m_dwAttributes & FILE_ATTRIBUTE_SYSTEM; };

/** @cmember Returns TRUE if the file is hidden */
BOOL IsHidden(void) const
{ return m_dwAttributes & FILE_ATTRIBUTE_HIDDEN; };

/** @cmember Returns TRUE if the file is temporary */
BOOL IsTemporary(void) const
{ return m_dwAttributes & FILE_ATTRIBUTE_TEMPORARY; };

/** @cmember Returns TRUE if the file is a normal file */
BOOL IsNormal(void) const { return m_dwAttributes == 0; };

LPARAM m_lParam; /** User-defined parameter */
private:
/** @access Private members */

/** @cmember Full filepath of file (directory+filename) */
CString m_strFilePath;

/** @cmember File attributes of file (as returned by FindFile()) */
DWORD m_dwAttributes;

/** @cmember Size of file. (COM states LONGLONG as hyper, so "uh" means
unsigned hyper) */
ULONGLONG m_uhFileSize;

CTime m_timCreation; /** @cmember Creation time */
CTime m_timLastAccess; /** @cmember Last Access time */
CTime m_timLastWrite; /** @cmember Last write time */

/** @cmember Checksum calculated for the first m_uhChecksumBytes bytes */
DWORD m_dwChecksum;

/** @cmember CRC calculated for the first m_uhCRCBytes bytes */
DWORD m_dwCRC;

/** @cmember Number of file bytes with CRC calc'ed (4 multiple or filesize ) */
DWORD m_uhCRCBytes;

/** @cmember Number of file bytes with Checksum calc'ed (4 multiple or filesize) */
DWORD m_uhChecksumBytes;
};

/**
* @class Allows to retrieve CFileInfos from files/directories
* in a directory
*/
class CFileInfoArray : public CArray<CFileInfo, CFileInfo&> {
public:
/** @access Public members */

/**
* @cmember Default constructor
*/
CFileInfoArray();

/**
* @cmember,menum Default values for lAddParam
*/
enum {
/** @@emem Insert CFileInfos in an unordered manner */
AP_NOSORT=0,

/** @@emem Insert CFileInfos in ascending order */
AP_SORTASCENDING=0,

/** @@emem Insert CFileInfos in descending order */
AP_SORTDESCENDING=1,

/** @@emem AP_SORTBYSIZE | Insert CFileInfos ordered by
uhFileSize (presumes array is previously ordered by uhFileSize). */
AP_SORTBYSIZE=2,

/** @@emem AP_SORTBYNAME | Insert CFileInfos ordered by
strFilePath (presumes array is previously ordered by strFilePath) */
AP_SORTBYNAME=4

};

/**
* @cmember Adds a file or all files contained in a directory to the
* CFileInfoArray
* Only "static" data for CFileInfo is filled (by default CRC and
* checksum are NOT calculated when inserting CFileInfos).
* Returns the number of CFileInfos added to the array
* @parm Name of the directory, ended in backslash.
* @parm Mask of files to add in case that strDirName is a directory
* @parm Whether to recurse or not subdirectories
* @parmopt Parameter to pass to protected member function AddFileInfo
* @parmopt Whether to add or not CFileInfos for directories
* @parmopt Pointer to a variable to signal abort of directory retrieval
* (multithreaded apps).
* @parmopt pulCount Pointer to a variable incremented each time a
* CFileInfo is added to the
* array (multithreaded apps).
* @xref
*
*/
int AddDir(const CString strDirName, const CString strMask,
const BOOL bRecurse, LPARAM lAddParam=AP_NOSORT,
const BOOL bIncludeDirs=FALSE, const volatile BOOL* pbAbort = NULL,
volatile ULONG* pulCount = NULL);

/**
* @cmember Adds a single file or directory to the CFileInfoArray.
* In case of directory, files contained in the directory are NOT
* added to the array. Returns the position in the array where
* the CFileInfo was added (-1 if it wasn't added)
* @parm Name of the file or directory to add. NOT ended with backslash.
* @parm Parameter to pass to protected member function AddFileInfo.
* @xref
*/
int AddFile(const CString strFilePath, LPARAM lAddParam);

protected:
/** @access Protected Members */

/**
* @cmember Called by AddXXXX to add a CFileInfo to the array.
* Can be overridden to:
* 1. Add only desired CFileInfos (filter)
* 2. Fill user param lParam
* 3. Change sort order/criteria
* Returns the position in the array where the CFileInfo was added
* or -1 if the CFileInfo wasn't added to the array.
* Default implementation sorts by lAddParam values and adds all
* CFileInfos (no filtering)
* @parm CFileInfo to insert in the array.
* @parm Parameter passed from AddDir function.
* @xref
*/
virtual int AddFileInfo(CFileInfo& finf, LPARAM lAddParam);
};
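
The GetFileDrive, GetFileDir, GetFileTitle and GetFileExt comments above describe how a path is split into parts. The following is a minimal portable sketch of such a split, not the class’s actual implementation: PathParts and SplitPath are hypothetical names, and keeping the leading dot in the extension is an assumption, chosen so that title plus extension rebuilds the file name.

```cpp
#include <string>

// Hypothetical result type for the sketch.
struct PathParts {
    std::string drive;  // "C:"
    std::string dir;    // "\WINDOWS\"
    std::string title;  // "WIN"
    std::string ext;    // ".INI" (leading dot kept: assumption)
};

// Splits an absolute Windows-style path into its four parts.
PathParts SplitPath(const std::string& path)
{
    PathParts parts;
    std::string rest = path;

    // Drive: a letter followed by ':' at the start of the path.
    if (rest.size() >= 2 && rest[1] == ':') {
        parts.drive = rest.substr(0, 2);
        rest.erase(0, 2);
    }

    // Directory: everything up to and including the last backslash,
    // i.e. the first bar + 1 characters.
    const std::string::size_type bar = rest.find_last_of('\\');
    if (bar != std::string::npos) {
        parts.dir = rest.substr(0, bar + 1);
        rest.erase(0, bar + 1);
    }

    // Extension: from the last dot (included) to the end of the name.
    const std::string::size_type dot = rest.find_last_of('.');
    if (dot != std::string::npos) {
        parts.ext = rest.substr(dot);
        rest.erase(dot);
    }

    parts.title = rest;
    return parts;
}
```

With this split, drive + dir corresponds to what GetFileRoot is documented to return, and title + ext to what GetFileName is documented to return.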

How to use it



I recommend reading the class header above thoroughly to get an overall view of the
classes and their methods. For further reference, you can inspect FCompare’s source code
(see second half of article).



Anyway, here goes some sample code:



This code adds all files in root directory and its subdirectories (but not directories themselves)
to the array and TRACEs them:

CFileInfoArray fia;

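// Sort flags are enumerators of the class, so qualify them with the class name
fia.AddDir(
   "C:\\",                            // Directory
   "*.*",                             // Filemask (all files)
   TRUE,                              // Recurse subdirs
   CFileInfoArray::AP_SORTBYNAME |    // Sort by name...
   CFileInfoArray::AP_SORTASCENDING,  // ...in ascending order
   FALSE                              // Don't add entries for dirs
);
TRACE("Dumping directory contents\n");
// Pass the path as an argument, not as the format string
for (int i=0;i<fia.GetSize();i++) TRACE("%s\n", (LPCTSTR) fia[i].GetFilePath());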

You can also call AddDir multiple times. This example lists the files in the root directories
(but not the subdirectories) of C:\ and D:\:

CFileInfoArray fia;

// Note both AddDir use the same sorting order and direction
fia.AddDir("C:\\", "*.*", FALSE,
 CFileInfoArray::AP_SORTBYNAME | CFileInfoArray::AP_SORTASCENDING, FALSE );

fia.AddDir("D:\\", "*.*", FALSE,
 CFileInfoArray::AP_SORTBYNAME | CFileInfoArray::AP_SORTASCENDING, FALSE );

TRACE("Dumping directory contents for C:\\ and D:\\ \n");
for (int i=0;i<fia.GetSize();i++) TRACE("%s\n", (LPCTSTR) fia[i].GetFilePath());


Or you can add individual files:

CFileInfoArray fia;

// Note both AddDir and AddFile must use the same sorting order
// and direction
fia.AddDir("C:\\WINDOWS\\", "*.*", FALSE,
 CFileInfoArray::AP_SORTBYNAME | CFileInfoArray::AP_SORTDESCENDING, FALSE );

fia.AddFile("C:\\AUTOEXEC.BAT",
 CFileInfoArray::AP_SORTBYNAME | CFileInfoArray::AP_SORTDESCENDING);

TRACE("Dumping directory contents for C:\\WINDOWS\\ and "
 "file C:\\AUTOEXEC.BAT\n");

for (int i=0;i<fia.GetSize();i++) TRACE("%s\n", (LPCTSTR) fia[i].GetFilePath());



And mix directories with individual files:

CFileInfoArray fia;

// Note both AddDir and AddFile must use the same sorting order and direction
// Note also the list of filemasks *.EXE and *.COM
fia.AddDir("C:\\WINDOWS\\", "*.EXE;*.COM", FALSE,
 CFileInfoArray::AP_SORTBYNAME | CFileInfoArray::AP_SORTDESCENDING, FALSE );

fia.AddFile("C:\\AUTOEXEC.BAT",
 CFileInfoArray::AP_SORTBYNAME | CFileInfoArray::AP_SORTDESCENDING);

// Note no trailing bar for next AddFile (we want to insert
// an entry for the directory itself, not for the files inside
// the directory)
fia.AddFile("C:\\PROGRAM FILES",
 CFileInfoArray::AP_SORTBYNAME | CFileInfoArray::AP_SORTDESCENDING);

TRACE("Dumping directory contents for C:\\WINDOWS\\, "
 "file C:\\AUTOEXEC.BAT and "
 " directory \"C:\\PROGRAM FILES\" \n");

for (int i=0;i<fia.GetSize();i++) TRACE("%s\n", (LPCTSTR) fia[i].GetFilePath());
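
The comments in the samples above insist that every AddDir and AddFile call on one array uses the same sorting order and direction. A minimal sketch of why, assuming the array locates each insertion point by searching an already ordered sequence (as the "presumes array is previously ordered" remarks in the header suggest): InsertSorted is a hypothetical helper working on a std::vector of paths, not the class’s AddFileInfo.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for sorted insertion into the array.
// Inserts `path` at its sorted position and returns that position.
// Precondition: `paths` is already ordered in the requested direction;
// if it is not, the binary search lands on an arbitrary position.
std::size_t InsertSorted(std::vector<std::string>& paths,
                         const std::string& path, bool descending)
{
    std::vector<std::string>::iterator pos = descending
        ? std::upper_bound(paths.begin(), paths.end(), path,
                           std::greater<std::string>())
        : std::upper_bound(paths.begin(), paths.end(), path);

    pos = paths.insert(pos, path);
    return static_cast<std::size_t>(pos - paths.begin());
}
```

Mixing directions between calls breaks the precondition, so later insertions may end up anywhere; that is the situation the "must use the same sorting order" notes warn about.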

Implementation details and rationale



  • I could have made CFileInfo a descendant of MFC’s CFileFind, but I
    don’t like its FindFile, FindNextFile and Close methods
    at all (I don’t need them), and CFileFind stores information as pointers, which I
    also didn’t like (see the To pointer or not to pointer discussion below about whether to
    use pointers to elements or the elements themselves as CArray’s contents).

  • I wanted it to be sort of win64 compliant, so I used Win32 API file access functions
    (when calculating checksum and CRC), which can address files with 64-bit sizes.
    I studied the possibility of going memory-mapped, but I don’t think it would be worth the effort
    (volunteers welcome).

  • Windows doesn’t seem to buffer API file access functions (at least not as
    fread does), so I wrote a few more lines in the file reading loops to get
    a little buffered access.

  • To store file sizes, instead of the API’s nFileSizeHigh & nFileSizeLow
    scheme, I used the type ULONGLONG, an MS-proprietary unsigned 64-bit integer.
    BTW, Visual C++ 6.0 doesn’t support the unsigned long long type (although it
    defines ULONGLONG for this purpose).

  • I wanted the code to be abortable, thread safe and progress-reportable, since checksum/CRC calculation and
    directory retrieval can be quite time-consuming. After aborting any of the abortable functions,
    the stored values are correct, although they can be incomplete:

    • If AddDir is aborted, some CFileInfos will be missing, but all
      the CFileInfos contained in the array are OK.
    • If GetCRC or GetChecksum is aborted, the CRC or checksum will not
      be entirely calculated, and the function returns the correct value calculated up to the moment of the abort.

    In any case, you don’t have to do anything special to call either function again and obtain
    correct results.

  • You can see quite a bunch of volatile qualifiers in the AddDir declaration.
    That’s because those parameters are meant for multithreaded applications, where they are read
    by the AddDir loop and set by another thread, so they must not be cached in a
    register.

  • I don’t think it’s strictly necessary to use any kind of safe access
    to variables shared between threads (InterlockedIncrement and the like): just don’t
    rely too much on a temporarily weird pulCount value. Still, for the
    sake of rightness, I use InterlockedExchange and InterlockedIncrement
    to update pulCount in CFileInfoArray::AddDir, CFileInfo::GetCRC
    and CFileInfo::GetChecksum.

  • Due to the volatile qualifier, the main application doesn’t need to modify
    the shared variables with thread-safe functions (InterlockedIncrement…): the
    only variable of this kind the application needs to modify is pbAbort and, due
    to its boolean nature, it is not prone to errors caused by a non-atomic modification.

  • When using MFC’s array template classes, I always think twice about whether to store pointers to
    elements or the elements themselves in the array. This time I’ve decided to store elements and not
    pointers
    because of the overhead that memory allocation produces: recursively gathering files (at least on
    my top-full HD) often involves allocating several thousand CFileInfo
    structures.

    Storing elements in the array reduces memory fragmentation and, with an appropriate
    CArray regrowing increment, it also reduces the number of calls to memory
    allocation routines (at least it is far from the one-allocation-call-per-element ratio that
    would otherwise be necessary).

    It has some inconveniences, though, for example when moving elements from place to place
    for sorting, or when inserting elements in the middle of the array: it’s almost always
    quicker to move a pointer than a structure. Another caveat that appears when dealing with
    elements instead of pointers is that when you externally reference elements by pointer (for
    example via the lParam of a listview item, as happens in the FCompare app)
    and you add new elements to the array, those references aren’t up to date anymore and you
    have to update them somehow (in FCompare I do it by rebuilding the listview).

  • Another benefit of storing elements rather than pointers in a CArray is the
    fact that elements are automagically deallocated when the array is deallocated.
    When storing pointers, a template function DeleteItems ought to be written to
    deallocate individual elements as they are removed from the array.
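
The abort-flag and progress-counter pattern described in the list above can be sketched in portable C++. Note the swapped-in technique: the class itself uses volatile parameters plus InterlockedExchange/InterlockedIncrement, whereas this sketch uses std::atomic from C++11, which Visual C++ 6.0 does not provide. SumBytes and its parameters are hypothetical names for illustration only.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical worker in the spirit of GetChecksum: sums bytes, polls an
// optional abort flag set by another thread, and publishes its progress
// through an optional counter read by another thread.
std::uint32_t SumBytes(const unsigned char* data, std::size_t size,
                       const std::atomic<bool>* abort = nullptr,
                       std::atomic<std::size_t>* count = nullptr)
{
    std::uint32_t sum = 0;
    if (count) count->store(0);

    for (std::size_t i = 0; i < size; ++i) {
        // On abort, stop and return the value calculated so far.
        if (abort && abort->load()) break;

        sum += data[i];

        // Number of bytes processed so far; never exceeds `size`.
        if (count) count->fetch_add(1);
    }
    return sum;
}
```

A UI thread would typically read the counter from a timer handler and set the flag from a Cancel button, while the worker thread runs the loop; calling the function again after an abort simply restarts the calculation.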

Sample application: FCompare


FCompare, or Binary File Compare, is an application that binary-compares a group
of files, selected recursively from a given directory and filemask.

Binary comparison can be done by comparing files’ size, CRC, checksum or contents. When
comparing by CRC, checksum and contents you can limit the number of bytes the comparison will
take into account.

Technical Features



  • MFC Dialogbox-style multithreaded application.
  • Threads can be aborted at any time.
  • Progress report of threads through WM_TIMER message, isolating worker threads
    from UI tasks as much as possible and avoiding overloading them (graphical information is
    displayed and updated at a constant time rate, not at the worker thread’s looping rate).
  • Lock of listviews to speed up element insertion and to avoid continuous redrawing while
    inserting elements.
  • Use of CTabCtrl.
  • Use of LPSTR_TEXTCALLBACK listview items.

How to use it


I think it’s pretty straightforward to use; anyway, here goes the normal procedure:


  1. Fill the Directory editbox either by typing a directory or by selecting one through the
    browse directory dialog that appears when pressing the ... button. If you want to recurse subdirectories, check the Recurse dirs checkbox.

  2. Fill File masks editbox with a semicolon separated list of filemasks, for example
    *.htm;*.html;*.shtml;*.asp to find all HTML-related files.

  3. Press Add to Source button. The files in the selected directory will be gathered
    and the Source files listview filled.

  4. Select another (or the same) directory and filemasks.

  5. Press Add to Target button. The files in the selected directory will be gathered
    and the Target files listview filled.

  6. Select a comparison method:

    • By Size: files with equal sizes will match.
    • By Checksum: files with equal checksum will match.
    • By CRC: files with equal CRC will match.
    • By Contents: files with equal contents (byte per byte) will match.

    For checksum, CRC and contents you can enter in UpTo editbox the number of bytes
    of the file that will be used to calc the value (thus speeding up the calculation). Enter 0
    to use all the bytes of the file for calculation.

  7. If you want to suppress duplicated files (files that appear in both target and source
    listviews) from appearing in the matched listview, uncheck Compare duplicates.

  8. Press Compare button.

  9. Matched files will appear in Compare tab. You can export the three lists to a file by
    pressing Export… button and selecting a file.

Implementation details and rationale




  • Using a property sheet (CPropertySheet) embedded in a bigger dialog is quite a
    pain:


    • You can’t edit property sheets as normal dialogs in MS’s Dialog Editor,
      so it’s quite an adventure to place controls in a property
      sheet that has something more than the typical three or four overlapping
      property pages.

    • Even if you could edit property sheets with the Dialog Editor, the way of
      editing property pages (creating independent dialogs) is confusing, at least for me,
      because you can’t figure out how the page will fit in the universal harmony of its
      big-brother dialog box.

    That’s why I’ve used the tab control directly, rather than the CProperty stuff.

  • Dialogbox applications generated by AppWizard don’t trigger OnIdle nor
    WM_ENTERIDLE messages. Also, PumpMessage doesn’t work properly.
    Due to that, the only way I’ve found to make a progress-report loop (even if work is being done in a
    worker thread) is to set a timer.
    The desired approach would have been to use
    OnIdle or a similar hack based on the PeekMessage /
    PumpMessage pair, but as I stated before, they don’t work for Dialogbox apps (or
    at least I haven’t been able to make them work).

  • Contents matching option is not “win64 friendly”, i.e., uses ANSI C fread
    so it can’t address 64bit-sized files.

  • I use a cute algorithm for file comparing that turns an O(n^2) “normal operation” into O(n)
    (2n to be precise). At first glance, the most obvious algorithm for file comparing is to compare
    each source file with each target file, which is O(n^2).

    As I have the arrays sorted by increasing file size, I can convert it to O(n):


    1. Let iSource and iTarget be indexes into the source[] and target[] arrays.
    2. iSource = 0, iTarget = 0
    3. while target[] and source[] have elements do
    4. if target[iTarget] = source[iSource] (same file size) there is a probable match, do further comparing
      by checksum or whatever (if you inspect FCompare’s source code, you’ll see some tricky code here to
      ensure every needed comparison is made). If the further match is positive, add to the match array.
    5. if target[iTarget]>=source[iSource] then iSource++ else iTarget++
    6. end while



    BTW, I didn’t say the algo was the top-work of computer-science, I just stated that it was cute
    (and I guess it appears somewhere in Knuth’s Art of Computer Programming series).
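
The size-matching walk described above can be sketched in portable C++ as follows. This is not FCompare’s code: MatchBySize and IndexPair are hypothetical names, the arrays hold only file sizes, and the handling of runs of equal sizes (pairing every source of a given size with every target of that size) is one possible way to make sure every needed comparison is made, which may differ from the "tricky code" mentioned in step 4.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

typedef std::pair<std::size_t, std::size_t> IndexPair;

// Returns (source index, target index) pairs whose file sizes are equal.
// Precondition: both vectors are sorted by increasing size.
std::vector<IndexPair> MatchBySize(const std::vector<unsigned long long>& source,
                                   const std::vector<unsigned long long>& target)
{
    std::vector<IndexPair> candidates;
    std::size_t iSource = 0, iTarget = 0;

    while (iSource < source.size() && iTarget < target.size()) {
        if (source[iSource] < target[iTarget]) {
            ++iSource;                       // source too small, advance it
        } else if (target[iTarget] < source[iSource]) {
            ++iTarget;                       // target too small, advance it
        } else {
            // Equal sizes: find the end of the run in each array and pair
            // every source of that size with every target of that size.
            const unsigned long long size = source[iSource];
            std::size_t sEnd = iSource, tEnd = iTarget;
            while (sEnd < source.size() && source[sEnd] == size) ++sEnd;
            while (tEnd < target.size() && target[tEnd] == size) ++tEnd;

            for (std::size_t s = iSource; s < sEnd; ++s)
                for (std::size_t t = iTarget; t < tEnd; ++t)
                    candidates.push_back(IndexPair(s, t));

            iSource = sEnd;
            iTarget = tEnd;
        }
    }
    return candidates;
}
```

Each returned pair is only a probable match; the checksum, CRC or content comparison would then be run on those pairs alone. The walk itself visits each element once, although many files of identical size naturally produce more candidate pairs to check.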

Recycling bits


For FCompare I’ve borrowed:

  • CDirDialog, the directory browsing
    not-so-common-dialog-box wrapper initially by
    Girish Bharadwaj and Lars Klose
    and later enhanced by Vladimir Kvash. BTW, I slightly modified
    DirDialog.h so that MS VC++ doesn’t complain about not using csStatusText
    and lpcsSelection in the SelChanged declaration.
  • Fancy CHyperLinkEx by Giancarlo Iovino,
    where I also commented out the unused parameter nFlag at CHyperLink::OnMouseMove.

Downloads

Download demo exe – 19 KB
Download demo source – 36 KB
Download source – 9 KB
