CRC32: Generating a checksum for a file
Environment: VC5, VC6
Introduction
Recently I wrote a program in which I wanted to generate a CRC for a given file. I did some checking on the web for sample CRC code, but found very few algorithms to help me. So I decided to learn more about CRCs and write my own code. This article describes what a CRC is, how to generate them, what they can be used for, and lastly source code showing how it's done.
What is a CRC
CRC is an acronym for Cyclic Redundancy Checksum or Cyclic Redundancy Check (depending on who you ask). A CRC is a "digital signature" representing data. The most common CRC is CRC32, in which the "digital signature" is a 32-bit number. The "data" that is being CRC'ed can be any data of any length; from a file, to a string, or even a block of memory. As long as the data can be represented as a series of bytes, it can be CRC'ed. There is no single CRC algorithm, there can be as many algorithms as there are programmers. The ideal CRC algorithm has several characteristics about it. First, if you CRC the same data more than once, you must get the same CRC every time. Secondly, if you CRC two different pieces of data you should get two very different CRC values. If you CRC the same data twice, you get the same digital signature. But if you CRC data that differs (even by a single byte) then you should get two very different digital signatures. With a 32-bit CRC there are over 4 billion possible CRC values. To be exact that's 232 or 4,294,967,296. With that many CRC values it's not difficult for every piece of data being CRC'ed to get a unique CRC value. However, it is possible for spurious hits to happen. In other words two completely different pieces of data can have the same CRC. This is rare, but not so rare that it won't happen.
Why use CRCs
Most of the time CRCs are used to compare data as an integrity check. Suppose there are two files that need to be compared to determine if they are identical. The first file is on Machine A and the other file is on Machine B. Each file is a rather large file (say 500 MB), and there is no network connection between the two machines. How do you compare the two files? The answer is CRC. You CRC each of the two files, which gives you two 32-bit numbers. You then compare those 32-bit numbers to see if they are identical. If the CRC values are different, then you can be 100% guaranteed that the files are not the same. If the CRC values are the same, then you can be 99% sure that the files are the same. Remember, because spurious hits can happen you cannot be positive that the two files are identical. The only way to be positive they are the same is to break down and do a comparison one byte at a time. But CRCs offer a quick way to be reasonably certain that two files are identical.
How to generate CRCs
Generating CRCs is a lot like cryptography in that involves a lot of mathematical theories. Since I don't fully understand it myself, I won't go into a lot of those details here. Instead I'll focus on how to program a CRC algorithm. Once you know how the algorithm works you should be able to write a CRC algorithm in any language on any platform. The first part of generating CRCs is the CRC lookup table. In CRC32 this is a table of 256 specific CRC numbers. These numbers are generated by a polynomial (the computation of these numbers and what polynomial to use are part of that math stuff I'm avoiding). The next part is a CRC lookup function. This function takes two things, a single byte of data to be CRC'ed and the current CRC value. It does a lookup in the CRC table according to the byte provided, and then does some math to apply that lookup value to the given CRC value resulting in a new CRC value. The last piece needed is the actual data that is to be CRC'ed. The CRC algorithm reads the first byte of data and calls the CRC lookup function which returns the CRC value for that single byte. It then calls the CRC lookup function with the next byte of data and passes the previous CRC value. After the second call, the CRC value represents the CRC of the first two bytes. You continuously call the CRC lookup function until all the bytes of the data have been processed. The resulting value is the CRC for the whole data.
Code Details
In this sample program I wanted to show that there are many different ways of generating CRCs. There are over 8 different CRC functions, all based on the above steps for generating CRCs. Each function differs slightly in it's intended use or optimization. There are four main CRC functions, each described below. There are also two separate CRC classes, but more on that later. And lastly there are a few helper functions that CRC strings.
C++ Streams: The first function represents the simplest CRC function. The file is opened using the C++ stream classes (ifstream). This function uses nothing but standard C++ calls, so this function should compile and run using any C++ compiler on any OS.
Win32 I/O: This function is more optimized in that it uses the Win32 API for file I/O; CreateFile, and ReadFile. This will speed up the processing, but by using the Win32 API the code is no longer platform independent.
Filemaps: This function uses memory mapped files to process the file. Filemaps can be used to greatly increase the speed with which files are accessed. They allow the contents of a file to be accessed as if it were in memory. No longer does the programmer need to call ReadFile and WriteFile.
Assembly: The final CRC function is one that is optimized using Intel Assembly. By hand writing the assembly code the algorithm can be optimized for speed, although at the sacrifice of being easy to read and understand.
Those are the four main CRC functions. But there are actually two versions of each function. There are two classes, CCrc32Dynamic and CCrc32Static, each of which have the above four functions for a total of eight. The only difference between the static and dynamic classes is the CRC table. With the static class the CRC table and all the functions in the class are static. The trade off is simple. The static class is simpler to use, but the dynamic class uses memory more efficiently because the CRC table (1K in size) is only allocated when needed.
// Using the static class is as easy as one // line of code dwErrorCode = CCrc32Static::FileCrc32Assembly(m_strFilename, dwCrc32); // Whereas there is more involved when using the // dynamic class CCrc32Dynamic *pobCrc32Dynamic = new CCrc32Dynamic; pobCrc32Dynamic->Init(); dwErrorCode = pobCrc32Dynamic->FileCrc32Assembly(m_strFilename, dwCrc32); pobCrc32Dynamic->Free(); delete pobCrc32Dynamic;
Whenever you calculate a CRC you need to take into account the speed of the algorithm. Generating CRCs for files is both a CPU and a disk intensive task. Here is a table showing the time it took to CRC three different files. The columns are the different file sizes, the rows are the different CRC functions, and the table entries are in seconds. The system these numbers were captured on is a dual Pentium III at 1 GHz with a 10,000 RPM SCSI Ultra160 hard drive.
| 44 KB | 34 MB | 5 GB | |
|---|---|---|---|
| C++ Streams | 0.0013 | 0.80 | 125 |
| Win32 I/O | 0.0009 | 0.60 | 85 |
| Filemaps | 0.0010 | 0.60 | 87 |
| Assembly | 0.0006 | 0.35 | 49 |
As expected the C++ streams is the slowest function followed by the Win32 I/O. However, I was very surprised to see the filemaps was not were not faster than the Win32 I/O, in fact they are slower. After I thought about it some, I realized memory mapped files are designed to provide fast random access to files. But when you CRC you access the file sequentially. Thus filemaps are not faster, and the extra overhead of creating the "views" of the file are why it's slower. Filemaps do have one advantage that none of the other functions have. Memory mapped files are guaranteed to be able to access files up to the maximum file size in NT which is 264 or 18 exabytes. Although the Win32 I/O may handle files of this size, none of the documentation confirms this. [Note: The largest file I have CRC'ed is 40 GB, which all eight functions successfully CRC'ed, but took over 10 minutes each.]
If anyone who reads this article knows a way to improve the speed even more, please post the code or email me. Especially if you know of a speed improvement for the assembly code. I will bet there are further optimizations that can be made to the assembly code. After all I don't know Intel Assembly very well, therefore I'm sure there are optimizations I don't know about.
Downloads
Download demo project - 8 KbDownload source - 17 Kb

Comments
no, much better than 99%
Posted by Paul H on 03/14/2013 11:31amI think it was irresponsible of Brian Friesen to write that equal checksums implies 99% probability of equal files. I assume that Brian just pulled that number out of the air. If you don't know a probability, Brian, then don't guess one! I agree with commenter msg555, that the probabilities are FAR better than Brian said. The probability would be 1-2^32, for a 32 bit checksum, if we assume all byte values are equally probable. Brian should fix his article!
ReplyDownload CRC32 code here
Posted by sedi on 02/15/2009 11:47pmhttp://www.cryptopp.com/ Has CRC32 and tons of other crypto and hash codes over there. Open source.
ReplyThank you for your article
Posted by FrankHsieh on 12/05/2007 08:50pmThank you! Your article, performance chart, sample code help me a lot! Thank you very much!
Reply99% guaranteed...
Posted by msg555 on 10/23/2005 09:47pmThere is a 99.999999977% (assuming a good CRC algorithm) chance that two pieces of data with the same checksum are identicle. I think that is good enough odds to assume the data is indeed the same. If you use a 64 bit CRC there is a 99.99999999999999999458% chance.
ReplyNot RFC 1952 Compliant, thus not PKZip Compatible
Posted by hankdane on 11/18/2004 01:58pmThe project is useful, but I found that the algorithm, at least in Crc32Dynamic, is not PKZip compatible. I had to fix the function CalcCrc32 to the following: [code] { // dwCrc32 = ((dwCrc32) >> 8) ^ m_pdwCrc32Table[(byte) ^ ((dwCrc32) & 0x000000FF)]; dwCrc32 = m_pdwCrc32Table[(byte ^ dwCrc32) & 0x000000FF] ^ (dwCrc32 >> 8); } [/code] Old code contained in the line comment.-
Reply
-
-
-
ReplyThe code is RFC1952 compliant
Posted by bfriesen on 12/15/2004 07:18pmRFC Web Ref Code
Posted by hankdane on 11/22/2004 06:08pmSee the code at the bottom of http://www.faqs.org/rfcs/rfc1952.html. That's where I got my change from. It produces different results. Notice how inside the bracket, it xor's before it and's. BTW, I don't mean my correction as a slight. Your project was useful to me, so thanks are in order. However, I still hold it isn't 100% compliant as posted.
ReplyI got different results
Posted by hankdane on 11/22/2004 05:56pmHow come it didn't work for me then? Also, I'm pretty sure the expressions are different. (x ^ y) & z != x ^ (y & z).
ReplyRFC 1952
Posted by bfriesen on 11/19/2004 06:07pmActually my code is RFC 1952 compliant. If you look carefully at the two lines of code you'll see they are semantically the same, just in a different order. They each generate the same result, which means my code is RFC 1952 compliant. I have checked my CRCs against PKZip many times and never found a discrepancy.
ReplySimple version for small files
Posted by Legacy on 12/02/2003 12:00amOriginally posted by: Jamie Whitham
Replyfast check sum asm code (home made) 16 bit easy to be made 32 bit
Posted by Legacy on 11/25/2003 12:00amOriginally posted by: wolf bit
ReplyReliability of CRC
Posted by Legacy on 07/25/2003 12:00amOriginally posted by: krishna
I need to implement a CRC check on file to detect whether it has been modified. I compare it with old CRC. As it is 99% guranteed how do I make sure that the file has changed or not. One way could be that I use two CRC's generated by different algorithms. Could that be useful.
ReplyGreat Code
Posted by Legacy on 07/23/2003 12:00amOriginally posted by: Jonathan
Thanks for the useful tidbit of code, perfect for what I need!
ReplyCRC Algorithms: Verification of correct implementation
Posted by Legacy on 04/28/2003 12:00amOriginally posted by: Imran
Hi Guys!
I am having a little bit of problem with understanding the two implementations of source code i have.This is because of the fact that the implementations I have seem quite different from the CRC methods explained in a typical Communication & Networking books.
I would like someone to explain these implementations to me ( reason why we we assume initial value to be ffffffff, why we and crc with 0x80000000 etc). Moreover I will appreciate if you could point out the flaws in both algorithms.
Also please let me know how can i verify the out put of CRC algorithm.
Thanks for your help
Imran
Here is the code:
#define CRC_POLYNOMIAL 0xEDB88320
int CRC_Algorithm1(int *addr)
{
int i, j;
long crc = 0xFFFFFFFF;
int carry;
int maddr[6];
/* Put the ethernet address into a local array */
memcpy(maddr, addr, 6);
/* Cycle through each character of the address */
for (i = 0; i < 6; ++i)
{
/* Cycle through each bit of this character */
for (j = 0; j < 8; ++j)
{
/* Update the CRC for this address */
carry = ( (crc & 0x80000000) ? 0x01 : 0x00) ^ (maddr[i] & 0x01);
crc <<= 1;
maddr[i] >>= 1;
if (carry)
{
crc = ((crc ^ CRC_POLYNOMIAL) | carry);
}
}
}
return (crc);
} /* CRC_Algorithm1 */
SECOND ALGORITHM
static int crc32( char * s, int length ) {
/* indices */
int perByte;
int perBit;
/* crc polynomial for Ethernet */
const unsigned long poly = 0xedb88320;
/* crc value - preinitialized to all 1's */
unsigned long crc_value = 0xffffffff;
for ( perByte = 0; perByte < length; perByte ++ ) {
unsigned char c;
c = *(s++);
for ( perBit = 0; perBit < 8; perBit++ ) {
crc_value = (crc_value>>1)^
(((crc_value^c)&0x01)?poly:0);
c >>= 1;
}
}
return crc_value;
}
ReplyLoading, Please Wait ...