Enhance Your Dynamic Memory Allocation with an Undocumented MFC Class

Introduction

I never cease being amazed by what you can find by browsing the MFC source code files. My latest discovery is rather spectacular. It consists of a small utility class that allows you to modify how objects for a given class are dynamically allocated. This class is CFixecAlloc. It is used inside the famous CString class to allocate string buffers. Using CFixedAlloc with the existing code requires very minimal changes and yields surprising improvements in performance. This article will discuss CRT dynamic allocation overheads, usual alternatives to CRT shortcomings, the anatomy of the CFixedAlloc class, how to use CFixedAlloc in your programs, and finally, a demo application so you can see for yourself what CFixedAlloc can do for your programs.

Why Use Alternatives to malloc/free, new/delete?

It is a well-known fact that dynamic memory allocation is expensive. It incurs performance overhead both in execution time and space. If you need to be convinced, just take a peek in the malloc.c file provided with the CRT source code files. The malloc() function is a somewhat complex function that takes some time to execute. Concerning the space overhead, the subject is a bit complex. Here is an overview. Depending on the version of VC++ you are using when you compile your program, and the Windows version where your program is running, a different version of heap will be used. This is out of scope for this article, but if you are curious about how CRT chooses the heap version, you can find the function __heap_select() in the heapinit.c file. In the CRT winheap.h header file, three heap versions are defined:

  • __SYSTEM_HEAP
  • __V5_HEAP
  • __V6_HEAP

The V5 and V6 heap versions map to heap implementations private to the CRT library, whereas the system heap maps directly to the Win32 heap services (HeapCreate(), HeapAlloc(), and so forth). What is interesting about the V5 and V6 versions is that you can know the space overhead for these heaps exactly by looking at the CRT source code. For example, you can find this statement for the V6 heap:

//  add 8 bytes entry overhead and round up to next para size
sizeEntry = (intSize + 2 * (int)sizeof(int) + 
            (BYTES_PER_PARA - 1)) & ~(BYTES_PER_PARA - 1);

BYTES_PER_PARA equals 16 bytes. This is a huge overhead. Consider the case where you would request twelve bytes. The CRT would reserve 32 bytes for this request. This is more than double what you requested! Unfortunately (or fortunately; it depends on your perspective), the V5 and the V6 heap are nowadays rarely used and you cannot know for sure what the HeapAlloc() overhead is because you do not have access to this source code. However, it is a very good assumption that there is an overhead. Microsoft states this fact in the HeapCreate() documentation:

The system uses memory from the private heap to store heap support structures, so not all of the specified heap size is available to the process. For example, if the HeapAlloc function requests 64 kilobytes (K) from a heap with a maximum size of 64K, the request may fail because of system overhead.

Limitations of Other Alternatives

Some people have written memory pool classes to work around the CRT heap overhead. The excellent article from Uri Twig is an example of such a class. However, most of them require massive code change to replace all new and delete invocations to some calls to the buffer pool class member functions. Another common potential limitation ("potential" because it depends on your needs) is that they lack thread safety. CFixedAlloc addresses these limitations and it will be explained in the next section.

Anatomy of CFixedAlloc

You can start by viewing the class declaration:

class CFixedAlloc
{
// Constructors
public:
    CFixedAlloc(UINT nAllocSize, UINT nBlockSize = 64);

// Attributes
    UINT GetAllocSize() { return m_nAllocSize; }

// Operations
public:
    void* Alloc();      // return a chunk of memory of
                        // nAllocSize
    void Free(void* p); // free chunk of memory returned
                        // from Alloc
    void FreeAll();     // free everything allocated from
                        // this allocator

// Implementation
public:
    ~CFixedAlloc();

protected:
    struct CNode
    {
        CNode* pNext;   // only valid when in
                        // free list
    };

    UINT m_nAllocSize;  // size of each block from Alloc
    UINT m_nBlockSize;  // number of blocks to get at a time
    CPlex* m_pBlocks;   // linked list of blocks
                        // (is nBlocks*nAllocSize)
    CNode* m_pNodeFree; // first free node
                        // (NULL if no free nodes)
    CRITICAL_SECTION m_protect;
};

As you can see, it is a very simple class. The first thing that you can notice is that the class provides thread safety with a critical section. Next, m_AllocSize contains the object size of the class that will be using its service, and m_blockSize specifies the number of objects each fixed block of memory can contain. Both data members are set at the construction. The remaining unknown is the mysterious CPlex pointer. I will not go in to the details of this class, but just know that it is the class that does the actual CRT dynamic allocation calls. It only contains a pointer to another CPlex object to create a linked list of CPlex so its size is only 4 bytes. When allocating memory, it requests:

m_allocSize*m_blockSize+sizeof(CPlex)

If you are interested in knowing more about CPlex, I recommend you to read the book MFC Internals, which discusses CPlex in more details. With this information, you already can compare the memory overhead difference between CRT and CFixedAlloc. By using CRT, you will have the CRT overhead for each object, whereas with CFixedAlloc, it is the CRT overhead plus the size of CPlex (4 bytes) divided by the number of allocated objects. When the number of objects is big, the memory overhead is very close to zero, and as you will see with the demo, it can make a huge difference of many MBs. Now, look more closely at the two more important CFixedAlloc functions:

void* CFixedAlloc::Alloc()
{
    EnterCriticalSection(&m_protect);
    if (m_pNodeFree == NULL)
    {
        CPlex* pNewBlock = NULL;
        TRY
        {
            // add another block
            pNewBlock = CPlex::Create(m_pBlocks, 
                        m_nBlockSize, m_nAllocSize);
        }
        CATCH_ALL(e)
        {
            LeaveCriticalSection(&m_protect);
            THROW_LAST();
        }
        END_CATCH_ALL

        // chain them into free list
        CNode* pNode = (CNode*)pNewBlock->data();
        // free in reverse order to make it  easier to debug
        (BYTE*&)pNode += 
            (m_nAllocSize * m_nBlockSize) - m_nAllocSize;
        for (int i = m_nBlockSize-1; i >= 0; i--, 
                     (BYTE*&)pNode -= m_nAllocSize)
        {
            pNode->pNext = m_pNodeFree;
            m_pNodeFree = pNode;
        }
    }
    ASSERT(m_pNodeFree != NULL);  // we must have something

    // remove the first available node from the free list
    void* pNode = m_pNodeFree;
    m_pNodeFree = m_pNodeFree->pNext;

    LeaveCriticalSection(&m_protect);
    return pNode;
}

void CFixedAlloc::Free(void* p)
{
    if (p != NULL)
    {
        EnterCriticalSection(&m_protect);

        // simply return the node to the free list
        CNode* pNode = (CNode*)p;
        pNode->pNext = m_pNodeFree;
        m_pNodeFree = pNode;
        LeaveCriticalSection(&m_protect);
    }
}

In Alloc(), the code checks whether the pool of free memory is empty; if it is, it creates a new block of memory (CPlex) and sets up a list of free nodes. It is interesting to note that there is no wasted space to place node information because it is placed directly into each block. Of course, for this scheme to work, m_nAllocSize must be bigger than sizeof(CNode). This is why it is checked in the constructor:

ASSERT(nAllocSize >= sizeof(CNode));

If you want to see the CRT code, you can tell with a glimpse that CFixedAlloc is much lighter. From an algorithmic point of view, the main factor for this simplicity is because the memory block's size is fixed (thus the name of the class). In a heap that allows variable size allocation, after many allocations and deallocations, the heap gets fragmented and the heap code has to scan the heap once in a while to merge adjacent small free blocks into big ones to make an efficient use of memory. With my demo program, I got the allocation time shred by a factor of 4 to 5.

And finally, let me introduce to you macros that help you use CFixedAlloc with your classes:

// DECLARE_FIXED_ALLOC -- used in class definition
#define DECLARE_FIXED_ALLOC(class_name) \
public: \
    void* operator new(size_t size) \
    { \
        ASSERT(size == s_alloc.GetAllocSize()); \
        UNUSED(size); \
        return s_alloc.Alloc(); \
    } \
    void* operator new(size_t, void* p) \
        { return p; } \
    void operator delete(void* p) { s_alloc.Free(p); } \
    void* operator new(size_t size, LPCSTR, int) \
    { \
        ASSERT(size == s_alloc.GetAllocSize()); \
        UNUSED(size); \
        return s_alloc.Alloc(); \
    } \
protected: \
    static CFixedAlloc s_alloc \

// IMPLEMENT_FIXED_ALLOC -- used in class implementation file
#define IMPLEMENT_FIXED_ALLOC(class_name, block_size) \
CFixedAlloc class_name::s_alloc(sizeof(class_name), block_size) \

The DECLARED_FIXED_ALLOC() macro overloads the new and delete operators of your class, thus allowing you to use CFixedAlloc with absolutely no code change. With the IMPLEMENT_FIXED_ALLOC(), this is where you specify the block size.

Enhance Your Dynamic Memory Allocation with an Undocumented MFC Class

How to Use CFixedAlloc

  1. Include "fixalloc.h" in the header file that contains the class definition that you want to modify.
  2. Add the DECLARE_FIXED_ALLOC() macro in the class declaration.
  3. Add the IMPLEMENT_FIXED_ALLOC() macro in the CPP file that contains the class definition.
  4. Because CFixedAlloc is a private MFC class, you have to add an additional include directory in the compiler options to point to the MFC source code directory. (I am assuming that you installed it during Visual Studio setup. It is installed by default, I think.)
  5. Recompile.
  6. Fine tune the block size.

Now, you are done. Your class is upgraded to use the MFC fixed allocation service. One word of caution: If the define _DEBUG is present during the compilation, the CFixedAlloc macros will expand to nothing and the result will be that your class will behave as if you made no change to it. During the development of the demo program, I got this small problem because, when I chose to link with the debug version of the run-time library, the _DEBUG macro was implicitly set. Be aware of that. To figure that out, I added garbage in this block that MFC automatically adds when it creates a new file:

#ifdef _DEBUG
#define new DEBUG_NEW
#undef THIS_FILE
static char THIS_FILE[] = __FILE__;
#endif

Also, here is a small comment on the last step. It is very important that you provide a good value for the block size. If it is too big, you will be wasting memory; if it is too small, you will not get as much performance improvement as you could get. However, even if the block size value is too small, you will still reduce the number of CRT allocation calls by a factor equal to the block size that is non negligible. The ideal value for the block size is the exact number of objects that will be allocated like in the demo program, but of course it is usually not possible to know that value.

Memory deallocation

CFixedAlloc returns a node in a "free list" when an object is deleted but the "free list" is inside the same allocated CPlex. Because plexes are single linked, they will remain allocated for the entire duration of the program (until their head is deleted). In fact, CFixedAlloc is more than an allocator; it is a recycler: It allocates on need but does not release. It just keeps apart for eventual subsequent allocation requests.

This results in better performance speed in case of frequent alternate new / delete / new calls (because you will mostly reuse the system resource you already own), but may create problems in those situations when a huge amount of memory is required only for a limited time: The program will retain the memory until the head of the plexes is deleted (and because it is static ... at the end of the program) unless the user explicitly releases the memory with CFixedAlloc::FreeAll() once he is done with the memory. You can call CFixedAlloc::FreeAll() anytime to free the memory yourself if you have previously deleted all the allocated objects. Otherwise, the memory will be freed when the CFixedAlloc object is destroyed. Here is a little reminder: Global objects are created in startup code before entering WinMain(). Global objects are destroyed after the program has exited WinMain() and right before the program gets unloaded.

CFixedAlloc and inheritance

When using CFixedAlloc with the derived class, there are a few things that you must be aware of. Say you have:

class Base
{
    DECLARE_FIXED_ALLOC(Base);
};

class Child : public Base
{
};

If you do:

Child *p = new Child;

The Child::operator new() will be called and you won't be using CFixedAlloc. That being said, nothing stops you from using CFixedAlloc in a Base class *and* its derived classes.

class Base
{
    DECLARE_FIXED_ALLOC(Base);
};

class Child : public Base
{
    DECLARE_FIXED_ALLOC(Child);
};

Of course, it doesn't make sense to declare the Base class as FIXED_ALLOC if the class is abstract. One thing to be aware about using CFixedAlloc in class hierarchy is that the operator delete is static and not virtual. Consider the following example:

Base *p = new Child;
delete p;

In this case, Base::operator delete() would be called (I haven't tested it, but I'm 99% confident that this is what would happen) and that would be a serious problem. To work around that problem, here is a potential solution:

class Base
{
public:
    virtual void Destroy() {delete this;} 
    DECLARE_FIXED_ALLOC(Base);
};

class Child : public Base
{
public:
    virtual void Destroy() {delete this;} 
    DECLARE_FIXED_ALLOC(Child);
};

Base *p = new Child;
p->Destroy();

The Demo Program

The demo program is very simple. It just declares two almost identical classes: one without CFixedAlloc and the other using it. Then, a bunch of objects is created. This operation is timed, and once completed, the result is displayed. For computing the size overhead of the class without CFixedAlloc, I used the overhead you would have if your program were using the __V6_HEAP, which is very unlikely and thus the computed value might not be totally exact. However, from experience, it seems to be quite accurate.

#define ITER_NUM 1000*1024

class A
{
public:
    A( A *next ) : m_next(next) {}
    A  *m_next;
    int dummy1;
    int dummy2;
};

class B
{
public:
    B( B *next ) : m_next(next) {}
    B  *m_next;
    int dummy1;
    int dummy2;

    DECLARE_FIXED_ALLOC(B);
};

IMPLEMENT_FIXED_ALLOC(B,ITER_NUM);

void CCFixedAllocDemoDlg::OnTestTimebutton()
{
    // TODO: Add your control notification
    // handler code here
    A *curAObj, *firstAObj = NULL;
    B *curBObj, *firstBObj = NULL;
    DWORD startA, endA, startB, endB;
    register int i;
    {
        CWaitCursor wait;

        startA = GetTickCount();

        for( i = 0; i < ITER_NUM; i++ )
        {
            firstAObj = new A(firstAObj);
        }
        while( firstAObj )
        {
            curAObj = firstAObj->m_next;
            delete firstAObj;
            firstAObj = curAObj;
        }

        startB = endA = GetTickCount();

        for( i = 0; i < ITER_NUM; i++ )
        {
            firstBObj = new B(firstBObj);
        }
        while( firstBObj )
        {
            curBObj = firstBObj->m_next;
            delete firstBObj;
            firstBObj = curBObj;
        }

        endB = GetTickCount();
    }

    displayResult( endA-startA,endB-startB );
}

#define BYTES_PER_PARA 16

void CCFixedAllocDemoDlg::displayResult( DWORD timeA,
                                         DWORD timeB )
{
    TCHAR buf[1024];

    /*
     * Each object A takes 32 bytes. See article for details
     */
    int overheadA = (32-sizeof(A))*ITER_NUM;

    /*
     * First compute the allocated size then substract the
     * requested size
     */
    int overheadB = 
      (8+sizeof(B)*ITER_NUM + sizeof(CPlex) + (BYTES_PER_PARA-1))
                    & ~(BYTES_PER_PARA - 1);
    overheadB    -= sizeof(B)*ITER_NUM;

    wsprintf( buf, __TEXT("Creating and destroying %d objects\n")
                   __TEXT("without CFixedAlloc\t: %4d ms\n")
                   __TEXT("with CFixedAlloc\t: %4d ms\n")
                   __TEXT("You saved %d bytes with CFixedAlloc"),
                   ITER_NUM, timeA, timeB,
                   overheadA - overheadB );
    MessageBox(buf,__TEXT("Results"));
}

Enhance Your Dynamic Memory Allocation with an Undocumented MFC Class

Suggestions

You could make a single thread version of CFixedAlloc just by making a private copy of the class and removing the critical section stuff to further increase the performance gain. Johan Strand pointed out that the single thread support has already been added to MFC7 (VC++.NET). If you want to use the single thread version of CFixedAlloc, you can use the following macros:

// DECLARE_FIXED_ALLOC -- used in class definition
#define DECLARE_FIXED_ALLOC_NOSYNC(class_name) \
public: \
    void* operator new(size_t size) \
    { \
        ASSERT(size == s_alloc.GetAllocSize()); \
        UNUSED(size); \
        return s_alloc.Alloc(); \
    } \
    void* operator new(size_t, void* p) \
        { return p; } \
    void operator delete(void* p) { s_alloc.Free(p); } \
    void* operator new(size_t size, LPCSTR, int) \
    { \
        ASSERT(size == s_alloc.GetAllocSize()); \
        UNUSED(size); \
        return s_alloc.Alloc(); \
    } \
protected: \
    static CFixedAllocNoSync s_alloc \

// IMPLEMENT_FIXED_ALLOC_NOSYNC -- used in class
// implementation file
#define IMPLEMENT_FIXED_ALLOC_NOSYNC(class_nbame, block_size) \
CFixedAllocNoSync class_name::s_alloc(sizeof(class_name),
                                      block_size) \

There is a mistake in the IMPLEMENT_FIXED_ALLOC_NOSYNC() macro. The macro parameter class_nbame should be class_name. These macros do not appear to be used by MFC, which makes sense because there is a syntax error in the macro. That means that the no sync version has probably not been tested by the MFC team. However, once the syntax error is fixed and because the change was trivial, it should work fine. You can fix the syntax error and use the fixed fixalloc.h in your program without having to recompile MFC (the macro isn't used anywhere in MFC). I have modified the demo program so that you can test out the single thread version. All you have to do is:

  1. Fix the syntax error in your fixalloc.h file.
  2. Uncomment the following define statement.
/*
 * Uncomment the next define statement
 * to use the single thread version of
 * CFixedAlloc. Prior to recompilation,
 * you must fix a syntax error in
 * fixalloc.h. The IMPLEMENT_FIXED_ALLOC_NOSYNC()
 * macro parameter class_nbame
 * should be rename to class_name.
 *
 * Note: The single thread version is only
 * available in VC++.NET and up.
 */
//#define _USE_SINGLE_THREAD

#ifndef _USE_SINGLE_THREAD
class B
{
public:
    B( B *next ) : m_next(next) {}
    B  *m_next;
    int dummy1;
    int dummy2;

    DECLARE_FIXED_ALLOC(B);
};

IMPLEMENT_FIXED_ALLOC(B,ITER_NUM);
#else
class B
{
public:
    B( B *next ) : m_next(next) {}
    B  *m_next;
    int dummy1;
    int dummy2;

    DECLARE_FIXED_ALLOC_NOSYNC(B);
};

IMPLEMENT_FIXED_ALLOC_NOSYNC(B,ITER_NUM);
#endif

My testing of the no sync version has shown a speedup improvement of 40% over the original CFixedAlloc version. The bottom line is that you should use the no sync version if you are not developing a multithread application.

Conclusion

That is it! I hope you enjoyed this article, and if you did and found it useful, please take a few seconds to rank it. You can do so right at the bottom of the article.

Bibliography

History

  • 01-20-2006
    • Expanded the "How to use" section based on the comments from readers.
    • Expanded the "Suggestions" section based on the comments from Johan Strand.
  • 11-28-2005
    • Original article.


About the Author

Olivier Langlois

I have been in the telecomm industry doing, among other things, embedded protocol stacks. I also worked for the FAA to develop the next generation oceanic ATM system. I was (surprise, surprise) in charge ATM protocols. I got tired of working with old technologies aka UNIX, so I am now working at Quazal, best known for offering the best products and services for Multiplayer Game Development. You can contact me through my website

Downloads

Comments

  • There are no comments yet. Be the first to comment!

Leave a Comment
  • Your email address will not be published. All fields are required.

Top White Papers and Webcasts

  • On-demand Event Event Date: September 10, 2014 Modern mobile applications connect systems-of-engagement (mobile apps) with systems-of-record (traditional IT) to deliver new and innovative business value. But the lifecycle for development of mobile apps is also new and different. Emerging trends in mobile development call for faster delivery of incremental features, coupled with feedback from the users of the app "in the wild." This loop of continuous delivery and continuous feedback is how the best mobile …

  • Is your IT Automation strategy saving you money or just becoming more complex and costly? With the right unified strategy, IT Automation can pay for itself and deliver far more business value. Watch this on-demand webinar, "New Strategies to Manage IT Automation Complexity" and learn how to: Reduce costs by integrating automation for servers, middleware, networks and databases Eliminate manual and tedious IT Operations tasks with both new and existing technology Save time and money by consolidating …

Most Popular Programming Stories

More for Developers

Latest Developer Headlines

RSS Feeds