CRC32: Generating a checksum for a file

Environment: VC5, VC6

Introduction

Recently I wrote a program in which I wanted to generate a CRC for a given file. I did some checking on the web for sample CRC code, but found very few algorithms to help me. So I decided to learn more about CRCs and write my own code. This article describes what a CRC is, how to generate one, what CRCs can be used for, and lastly presents source code showing how it's done.

What is a CRC

CRC is an acronym for Cyclic Redundancy Checksum or Cyclic Redundancy Check (depending on who you ask). A CRC is a "digital signature" representing data. The most common CRC is CRC32, in which the "digital signature" is a 32-bit number. The data being CRC'ed can be of any length: a file, a string, or even a block of memory. As long as the data can be represented as a series of bytes, it can be CRC'ed. There is no single CRC algorithm; there can be as many algorithms as there are programmers. The ideal CRC algorithm has two key characteristics. First, if you CRC the same data more than once, you must get the same CRC every time. Second, if you CRC two different pieces of data, even data that differs by a single byte, you should get two very different CRC values. With a 32-bit CRC there are over 4 billion possible CRC values; to be exact, that's 2^32 or 4,294,967,296. With that many CRC values it's not difficult for every piece of data being CRC'ed to get a unique CRC value. However, it is possible for spurious hits to happen. In other words, two completely different pieces of data can have the same CRC. This is rare, but not so rare that it won't happen.

Why use CRCs

Most of the time CRCs are used to compare data as an integrity check. Suppose there are two files that need to be compared to determine whether they are identical. The first file is on Machine A and the other is on Machine B. Each is a rather large file (say 500 MB), and there is no network connection between the two machines. How do you compare the two files? The answer is CRC. You CRC each of the two files, which gives you two 32-bit numbers, and then compare those numbers. If the CRC values are different, you can be 100% certain that the files are not the same. If the CRC values are the same, the files are almost certainly identical: for a well-distributed 32-bit CRC, the chance that two different files happen to produce the same value is about 1 in 2^32, or roughly 1 in 4 billion. Remember, because spurious hits can happen you cannot be positive that the two files are identical. The only way to be positive they are the same is to break down and do a comparison one byte at a time. But CRCs offer a quick way to be reasonably certain that two files are identical.

How to generate CRCs

Generating CRCs is a lot like cryptography in that it involves a lot of mathematical theory. Since I don't fully understand it myself, I won't go into those details here. Instead I'll focus on how to program a CRC algorithm. Once you know how the algorithm works you should be able to write a CRC algorithm in any language on any platform.

The first part of generating CRCs is the CRC lookup table. In CRC32 this is a table of 256 specific CRC values. These values are generated from a polynomial (the computation of these values, and which polynomial to use, is part of that math stuff I'm avoiding). The next part is a CRC lookup function. This function takes two things: a single byte of data to be CRC'ed and the current CRC value. It does a lookup in the CRC table according to the byte provided, and then does some math to combine that table entry with the given CRC value, resulting in a new CRC value. The last piece needed is the actual data to be CRC'ed. The algorithm reads the first byte of data and calls the CRC lookup function, which returns the CRC value for that single byte. It then calls the lookup function with the next byte of data, passing in the previous CRC value. After the second call, the CRC value represents the CRC of the first two bytes. You keep calling the lookup function until all the bytes of the data have been processed; the resulting value is the CRC for the whole data.

Code Details

In this sample program I wanted to show that there are many different ways of generating CRCs. There are eight different CRC functions, all based on the above steps for generating CRCs. Each function differs slightly in its intended use or optimization. There are four main CRC functions, each described below. There are also two separate CRC classes, but more on that later. And lastly there are a few helper functions that CRC strings.

C++ Streams: The first function represents the simplest CRC function. The file is opened using the C++ stream classes (ifstream). This function uses nothing but standard C++ calls, so this function should compile and run using any C++ compiler on any OS.

Win32 I/O: This function is more optimized in that it uses the Win32 API for file I/O: CreateFile and ReadFile. This speeds up the processing, but by using the Win32 API the code is no longer platform independent.

Filemaps: This function uses memory mapped files to process the file. Filemaps can be used to greatly increase the speed with which files are accessed. They allow the contents of a file to be accessed as if it were in memory. No longer does the programmer need to call ReadFile and WriteFile.

Assembly: The final CRC function is one that is optimized using Intel Assembly. By hand writing the assembly code the algorithm can be optimized for speed, although at the sacrifice of being easy to read and understand.

Those are the four main CRC functions, but there are actually two versions of each. There are two classes, CCrc32Dynamic and CCrc32Static, each of which has the above four functions, for a total of eight. The only difference between the static and dynamic classes is the CRC table. With the static class, the CRC table and all the functions in the class are static. The trade-off is simple: the static class is simpler to use, but the dynamic class uses memory more efficiently because the CRC table (1K in size) is only allocated when needed.

// Using the static class is as easy as one 
//  line of code
dwErrorCode = 
     CCrc32Static::FileCrc32Assembly(m_strFilename, 
                                     dwCrc32);

// Whereas there is more involved when using the 
//   dynamic class
CCrc32Dynamic *pobCrc32Dynamic = new CCrc32Dynamic;
pobCrc32Dynamic->Init();
dwErrorCode = 
     pobCrc32Dynamic->FileCrc32Assembly(m_strFilename, 
                                        dwCrc32);
pobCrc32Dynamic->Free();
delete pobCrc32Dynamic;

Whenever you calculate a CRC you need to take into account the speed of the algorithm. Generating CRCs for files is both a CPU and a disk intensive task. Here is a table showing the time it took to CRC three different files. The columns are the different file sizes, the rows are the different CRC functions, and the table entries are in seconds. The system these numbers were captured on is a dual Pentium III at 1 GHz with a 10,000 RPM SCSI Ultra160 hard drive.

              44 KB     34 MB    5 GB
C++ Streams   0.0013    0.80     125
Win32 I/O     0.0009    0.60     85
Filemaps      0.0010    0.60     87
Assembly      0.0006    0.35     49

As expected, the C++ streams function is the slowest, followed by the Win32 I/O. However, I was very surprised to see that filemaps were not faster than the Win32 I/O; in fact, they are slower. After I thought about it some, I realized memory mapped files are designed to provide fast random access to files, but when you CRC you access the file sequentially. Thus filemaps are not faster, and the extra overhead of creating the "views" of the file is why they are slower. Filemaps do have one advantage that none of the other functions have: memory mapped files are guaranteed to be able to access files up to the maximum file size in NT, which is 2^64 bytes, or 18 exabytes. Although the Win32 I/O may handle files of this size, none of the documentation confirms this. [Note: The largest file I have CRC'ed is 40 GB, which all eight functions successfully CRC'ed, but each took over 10 minutes.]

If anyone who reads this article knows a way to improve the speed even more, please post the code or email me, especially if you know of a speed improvement for the assembly code. I don't know Intel assembly very well, so I'm sure there are further optimizations I don't know about.

Downloads

Download demo project - 8 KB
Download source - 17 KB


Comments

  • no, much better than 99%

    Posted by Paul H on 03/14/2013 11:31am

    I think it was irresponsible of Brian Friesen to write that equal checksums implies 99% probability of equal files. I assume that Brian just pulled that number out of the air. If you don't know a probability, Brian, then don't guess one! I agree with commenter msg555 that the probabilities are FAR better than Brian said. The probability would be 1 - 2^-32 for a 32-bit checksum, if we assume all byte values are equally probable. Brian should fix his article!

    Reply
  • Download CRC32 code here

    Posted by sedi on 02/15/2009 11:47pm

    http://www.cryptopp.com/ Has CRC32 and tons of other crypto and hash codes over there. Open source.

    Reply
  • Thank you for your article

    Posted by FrankHsieh on 12/05/2007 08:50pm

    Thank you! Your article, performance chart, sample code help me a lot! Thank you very much!

    Reply
  • 99% guaranteed...

    Posted by msg555 on 10/23/2005 09:47pm

    There is a 99.999999977% (assuming a good CRC algorithm) chance that two pieces of data with the same checksum are identical. I think that is good enough odds to assume the data is indeed the same. If you use a 64 bit CRC there is a 99.99999999999999999458% chance.

    Reply
  • Not RFC 1952 Compliant, thus not PKZip Compatible

    Posted by hankdane on 11/18/2004 01:58pm

    The project is useful, but I found that the algorithm, at least in Crc32Dynamic, is not PKZip compatible.  I had to fix the function CalcCrc32 to the following:
    
    [code]
    {
    //	dwCrc32 = ((dwCrc32) >> 8) ^ m_pdwCrc32Table[(byte) ^ ((dwCrc32) & 0x000000FF)];
    	dwCrc32 = m_pdwCrc32Table[(byte ^ dwCrc32) & 0x000000FF] ^ (dwCrc32 >> 8);
    }
    [/code]
    
    Old code contained in the line comment.

    • The code is RFC1952 compliant

      Posted by bfriesen on 12/15/2004 07:18pm

      I can't explain why you might have gotten different results, but I can prove the code is compliant.  In a general case (x^y)&z may not be equivalent to x^(y&z), but specifically in my algorithm it works just fine.
      
      Let's look closer at each one (I'm going to use '?' to represent bits with data):
      
      Mine:
      (byte) ^ (dwCrc32 & 0x000000FF)
      
      1.  0x000000?? ^ (0x???????? & 0x000000FF)
      2.  0x000000?? ^ (0x000000??)
      3.  0x000000??
      
      RFC1952:
      (byte ^ dwCrc32) & 0x000000FF
      
      1.  (0x000000?? ^ 0x????????) & 0x000000FF
      2.  (0x????????) & 0x000000FF
      3.  0x000000??
      
      Follow the logic yourself, it evaluates the same.

      Reply
    • RFC Web Ref Code

      Posted by hankdane on 11/22/2004 06:08pm

      See the code at the bottom of http://www.faqs.org/rfcs/rfc1952.html. That's where I got my change from. It produces different results. Notice how inside the bracket, it xor's before it and's. BTW, I don't mean my correction as a slight. Your project was useful to me, so thanks are in order. However, I still hold it isn't 100% compliant as posted.

      Reply
    • I got different results

      Posted by hankdane on 11/22/2004 05:56pm

      How come it didn't work for me then? Also, I'm pretty sure the expressions are different. (x ^ y) & z != x ^ (y & z).

      Reply
    • RFC 1952

      Posted by bfriesen on 11/19/2004 06:07pm

      Actually my code is RFC 1952 compliant. If you look carefully at the two lines of code you'll see they are semantically the same, just in a different order. They each generate the same result, which means my code is RFC 1952 compliant. I have checked my CRCs against PKZip many times and never found a discrepancy.

      Reply
    Reply
  • Simple version for small files

    Posted by Legacy on 12/02/2003 12:00am

    Originally posted by: Jamie Whitham

    Great code!  Thanks Brian :-)
    
    

    I am using this for a small settings file 300 bytes or so.
    I have this in memory before saving to disk when I quit.
    In my application power could be ditched at any point so I
    always write two copies and use CRC to check them on loading.

    Here is the code I am using. I have changed Brian's code
    as with such a small file I thought there's not much need
    to create the CRC table etc. However it will be
    inefficient on larger files.

    So you might find this simpler for smaller files....

    Cheers

    Jamie

    // returns calculated CRC32
    // pData points to data for which the checksum is needed
    // dwLen is the length of that data in bytes
    DWORD CheckSum(BYTE* pData, DWORD dwLen)
    {
        DWORD dwCrc32 = 0xFFFFFFFF;
        for (DWORD ct = 0; ct < dwLen; ct++)
        {
            // Compute the table entry for this byte on the fly
            DWORD i = (pData[ct]) ^ ((dwCrc32) & 0x000000FF);
            for (INT j = 8; j > 0; j--)
            {
                if (i & 1)
                    i = (i >> 1) ^ 0xEDB88320;
                else
                    i >>= 1;
            }
            dwCrc32 = ((dwCrc32) >> 8) ^ i;
        }
        return ~dwCrc32;
    }

    Reply
  • fast check sum asm code (home made) 16 bit easy to be made 32 bit

    Posted by Legacy on 11/25/2003 12:00am

    Originally posted by: wolf bit

    I hope that the comments will be helpful. If you find some bugs or you have time to test the performance, please mail me!
    
    Code is free ... use it whenever you need it. For something more mail me!

    Enjoy !


    unsigned short check_sum32bit(void *data_ptr, unsigned int length)
    {
        __asm {
            mov ecx, dword ptr [ebp+0Ch]  // get length
            mov ebx, dword ptr [ebp+8]    // get data_ptr

            mov edx, ecx

            shr ecx, 2      // set loop counter (length / 4)
            and edx, 0x3    // set the rest (length % 4)

            xor eax, eax    // same as mov eax,0, just better

            cmp ecx, 0      // if loop counter = 0, no loop
            je noloop

        check_sum32:        // loop over all full dwords of the given data
            add eax, ds:[ebx]
            adc eax, 0
            add ebx, 4
            loop check_sum32 // end loop
        noloop:

            sub ebx, 4      // correct last loop variable; before exit ebx+4 should be decremented again

            cmp edx, 0      // check for the rest
            je noxdata

            add ebx, edx    // !!! position the offset at the last data (could be a problem if data is less than 4 bytes)

            mov ecx, 4      // get valid bytes from the rest of the data
            sub ecx, edx

            mov edx, ds:[ebx]

        shrloop:
            shr edx, 8      // set the invalid bytes to zero
            loop shrloop
            add eax, edx    // add the rest of the data
            adc eax, 0      // add carry flag

        noxdata:
            mov ebx, eax    // fold the 32-bit checksum down to 16 bits; if you need 32 bits, make the function "unsigned int" and delete the next lines!
            shr ebx, 16
            add ax, bx
            adc ax, 0
            not ax
        }
        return;  // returned value is in the ax register
    }

    Reply
  • Reliability of CRC

    Posted by Legacy on 07/25/2003 12:00am

    Originally posted by: krishna

    I need to implement a CRC check on a file to detect whether it has been modified, comparing it with the old CRC. As it is only 99% guaranteed, how do I make sure whether the file has changed or not? One way could be to use two CRCs generated by different algorithms. Could that be useful?

    Reply
  • Great Code

    Posted by Legacy on 07/23/2003 12:00am

    Originally posted by: Jonathan

    Thanks for the useful tidbit of code, perfect for what I need!

    Reply
  • CRC Algorithms: Verification of correct implementation

    Posted by Legacy on 04/28/2003 12:00am

    Originally posted by: Imran

    Hi guys!
    I am having a little bit of a problem understanding the two implementations of source code I have. This is because the implementations I have seem quite different from the CRC methods explained in typical communication and networking books.
    I would like someone to explain these implementations to me (the reason why we assume the initial value to be ffffffff, why we AND the crc with 0x80000000, etc.). Moreover, I will appreciate it if you could point out the flaws in both algorithms.
    Also, please let me know how I can verify the output of the CRC algorithm.
    Thanks for your help
    Imran

    Here is the code:

    #define CRC_POLYNOMIAL 0xEDB88320

    int CRC_Algorithm1(int *addr)
    {
        int i, j;
        long crc = 0xFFFFFFFF;
        int carry;
        int maddr[6];

        /* Put the ethernet address into a local array */
        memcpy(maddr, addr, 6);

        /* Cycle through each character of the address */
        for (i = 0; i < 6; ++i)
        {
            /* Cycle through each bit of this character */
            for (j = 0; j < 8; ++j)
            {
                /* Update the CRC for this address */
                carry = ((crc & 0x80000000) ? 0x01 : 0x00) ^ (maddr[i] & 0x01);
                crc <<= 1;
                maddr[i] >>= 1;
                if (carry)
                {
                    crc = ((crc ^ CRC_POLYNOMIAL) | carry);
                }
            }
        }

        return (crc);
    } /* CRC_Algorithm1 */


    SECOND ALGORITHM

    static int crc32(char *s, int length)
    {
        /* indices */
        int perByte;
        int perBit;
        /* crc polynomial for Ethernet */
        const unsigned long poly = 0xedb88320;
        /* crc value - preinitialized to all 1's */
        unsigned long crc_value = 0xffffffff;

        for (perByte = 0; perByte < length; perByte++)
        {
            unsigned char c;

            c = *(s++);
            for (perBit = 0; perBit < 8; perBit++)
            {
                crc_value = (crc_value >> 1) ^ (((crc_value ^ c) & 0x01) ? poly : 0);
                c >>= 1;
            }
        }
        return crc_value;
    }

    Reply