Using Two Common 'Compression' Techniques

Data compression is useful to the computer end-user. It can be found in hardware and software, and it is well integrated into modern operating systems for end-user usage. For instance, a modern file explorer may allow a pop-up menu that causes a group of files to be zipped up. Perhaps that PHP Web page you are browsing has data compression enabled to help those Web pages get to you faster on that ancient dial-up modem you can't seem to replace.

Applied data compression is a fun topic to study. The gurus of the field, Mark Nelson and Jean-Loup Gailly, published a book about data compression; I found a copy tucked away in a shelf at the local bookstore. It is called The Data Compression Book, Second Edition, and ships with C source code on a 3.5" diskette. This book completely describes the Huffman encoding technique using theory, pictures of tree data structures, and C source code. It also covers many other modern techniques such as the lempel-ziv family, jpeg, and even audio compression. It does not cover run length encoding becausr this is not really a data compression method. You'll have to get and read the book to know why that is so. I'll save the fun for you.

Applied assembly programming is not as fun as data compression, but has been known to yield a brief moment or two of joy. I found a great book on assembly program for the 486 processor called Optimizing C with Assembly Code by Peter Gulutzan and Trudy Pelzer. This is a great book for those who don't want to write a program entirely in assembly, like myself. This book teaches you how to use assembly to add, subtract, multiply, divide, goto, perform stack manipulation, and use strings. From my perspective, the goal is to know my algorithm in pseudo-code and then use the book as a reference to convert it into assembly.

Briefly, and to the point about Huffman encoding and run length encryption. Huffman encoding deals with using the frequency of data to compute the data's entropy, or number of bits it actually takes to store that data versus the bloated bits that you are using to store the data. Run length encoding, in a general way, is data stored as a packet. This packet contains the frequency of the data and the data itself.

Now, look at a simple example for Huffman compression. I have a black and white bitmap, a scanned file, that I need to fax to my lawyer so that he can keep me out of jail. But, for some strange reason, I still have a 300 baud modem and 10 minutes left on my calling card. Needlees to say, I won't be able to upload this bitmap file without running out of minutes. What do I do? I use the Huffman compression technique to reduce my file size so that I can upload it to my lawyer's office in New York City. My system locale is English, so I happen to have 256 characters in my computer's alphabet. My file only uses two of these characters. Basically, my alphabet requires that I use 8 bits per character. Using the powers of binary mathematics, I realize that black and white are 2 states, on and off, and that only requires 1 bit of data per color. 1 bit out of 8 is much better than having to use all 8 bits, and that is what the Huffman method does. It determines the alphabet of your data along with the frequency of each data element. A data element that occurs more frequently should use fewer number of bits that one that appears infrequently.

Look at a simple example for Run Length encoding. For some reason, I want to develop an animated bitmap class that I can use on my MFC buttons. Because these frames are animated, I can assume that each frame is similar to the one that came before it. I don't really want to bloat my resource space with a full bitmap image for each frame, so instead I take the difference of frame2 to frame1, frame 3 to frame 2, and so forth, and run length encode it. By taking the difference of each frame, I'm subtracting out all the similar bits and leaving those that are different. The similar bits will occur in runs which encode nicely with the run length encoding technique. The other bits can be stored using a special escape sequence that say it is not encoded, and can thus be encoded with minimal overhead.

On to the code....

The main project of this article, the quick compression library, is an ATL/COM project that creates two objects. These are the Huffman and run length encoding objects. The quick compress interface is very simple. Each of these objects provides two methods. For the Huffman object, these are Compress and Uncompress. For Run Length Encoding, these are Decode and Encode. Both of these methods expect binary data stored into a BSTR. Using a BSTR in this way allows you to make use of the system API's that will tell you how much data the BSTR stores.

The secondary project of this article, the compressor front end, is an MFC dialog application. Its sole purpose is to create instances of these objects and use them. The MFC application allows you to type in paths for input and output files and then click compress, decompress, encode, or decode. It also shows the before and after file sizes. It is rudimentaryn but encapsulates the complete idea of the library.

Note: All methods are done in-place in memory. As a coworker of mine likes to say, youre mileage may vary if you attempt to compress a 1GB file but have only 512MB of memory.

I first implemented the compression code a number of years ago, and as part of another CodeGuru article that I uploaded. In that past article, I presented a method for remote controlling one computer with another. This library is really just a thin wrapper around that code, but it solves one important problem. The library is now available to VBScript, JavaScript, or other C/C++ COM-enabled applications.

Here is a code snippet for using the compression Huffman ATL objects in an MFC application. The Run Length Encode object is used similarly. Please refer to the secondary project for a full code listing. Also, note that the quick compress type library and DLL have been added to the zip archive. You only really need to compile the MFC application (ensuring that the type library and DLL are properly located. Also, you will have to regsvr32 the DLL).

// Header file
#include <atlbase.h>
#import "QuickCompress.tlb" named_guids raw_interfaces_only

// Source file

// Initialize COM
CoInitialize(NULL);

// Create an instance of the library
QUICKCOMPRESSLib::IHuffmanPtr pHuffmanPtr;
pHuffmanPtr.CreateInstance(__uuidof(QUICKCOMPRESSLib::Huffman));

// Copy your data here --> Change {} to fit your input data
BSTR InputData = SysAllocStringByteLen(NULL,{Data Length Here});
memcpy(InputData,{Data Variable Here},{Data Length Here});

// Compress using the Huffman method
BSTR OutputData = NULL;
pHuffmanPtr->Compress(InputData,&OutputData);

// Uncompress using the Huffman method
BSTR OutputData2 = NULL;
pHuffmanPtr->Uncompress(OutputData,&OutputData2);

// Get the compressed size
DWORD dwLength = SysStringByteLen(OutputData);

// Cleanup
SysFreeString(InputData);
SysFreeString(OutputData);
SysFreeString(OutputData2);

// Clean up COM
pHuffmanPtr.Release();
CoUninitialize();


Downloads

Comments

  • There are no comments yet. Be the first to comment!

Leave a Comment
  • Your email address will not be published. All fields are required.

Top White Papers and Webcasts

  • Learn How A Global Entertainment Company Saw a 448% ROI Every business today uses software to manage systems, deliver products, and empower employees to do their jobs. But software inevitably breaks, and when it does, businesses lose money -- in the form of dissatisfied customers, missed SLAs or lost productivity. PagerDuty, an operations performance platform, solves this problem by helping operations engineers and developers more effectively manage and resolve incidents across a company's global operations. …

  • On-demand Event Event Date: December 3, 2014 Microsoft SQL Server includes a high availability option called AlwaysOn Availability Groups (AGs). In this session we will see how AGs are implemented using a Windows Server Failover Cluster and SQL Server 2014 with the Pure Storage FlashArray. We will showcase how to use Pure Storage snapshots to seed secondary database copies and then replicate to another location for backup or disaster recovery operations. Windows PowerShell and the Pure Storage PowerShell …

Most Popular Programming Stories

More for Developers

RSS Feeds