Fast, Efficient Allocator for Small Blocks of Memory

Introduction

Dynamic memory allocation is a funny thing. Most people call malloc/free without thinking about the overhead associated with them. With heap-based allocation, managing blocks of memory so that they can be reclaimed and reused entails a significant amount of bookkeeping, and that bookkeeping costs CPU cycles. The first rule when trying to write fast code is to touch the allocator as little as possible. Still, malloc/free exist for a reason; they are tools like any other and just have to be used appropriately. But can you make them cheaper to use?

A few guys and I, in an astonishing display of what too much testosterone can drive grown men to do, took up the small block challenge. The goal was to write the fastest, most "efficient" small block allocator and thereby obtain everlasting bragging rights over the others. So, okay... it wasn't just macho posturing; the reason it came up at all was that a UI kit we were all working on allocated an astounding amount of memory at an even more astounding rate, and much of it consisted of very small requests of under 1024 bytes. We just had some gentlemanly disagreements about who was bigger... err... I mean who could write a better allocator. Hence the contest, and its ground rules:

  • Two functions with the following signatures: void* alloc(long size); void free(void* p);
  • Must support allocations > 1024 bytes (speed and efficiency testing are on allocations <= 1024 bytes)
  • Must be "naturally" aligned to the next power of 2, to a maximum of an 8-byte alignment
  • Must not crash on NULL free
  • Must support allocation of 0 (return a valid ptr)
  • Must call blockalloc()/blockfree() (essentially malloc and free) for "underlying" pool allocations

The major components of the scoring were speed and efficiency, with inefficiency measured by how much memory you waste during the benchmark.

efficiency = (amount of memory requested of your allocator) / (amount of memory that you have requested from blockalloc)

Given efficiency, the score was essentially calculated as follows:

score = (time in ms) / (efficiency * efficiency * efficiency)

The lowest score wins. From this, you can see that there is a very high penalty for not being as 'efficient' as possible. In our performance and efficiency tests, my allocator beat Visual C++'s malloc/free by a factor of 25 on speed and by more than 13% on efficiency.
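To get a feel for how harsh that penalty is, plug in some made-up numbers: an allocator that finishes the benchmark in 500 ms at 0.95 efficiency scores 500 / (0.95 * 0.95 * 0.95), or roughly 583, while one that finishes in 400 ms at only 0.80 efficiency scores roughly 781 and loses despite spending 20% less time on the benchmark.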

No claim is made to establish this as the fastest possible small block allocator, although this implementation is pretty fast. I have lots of ideas on how it could be made faster and I'm sure there are implementations out there that blow this away. But, it is pretty interesting how easy it was to beat Microsoft's out-of-the-box implementation, and it serves as a good starting point for those of you out there who are interested in taking it further.

A good axiom is to make things as simple as possible and only introduce complexity as the need arises. My first couple of allocators were very complicated beasts, mostly because my focus was on the wrong problems (minimizing block header sizes and so forth). In the end, my allocator became extremely simple: essentially a fixed block allocator that manages 129 separate heaps, with each heap managing a specific fixed allocation size, starting at 4 bytes, then 8 bytes, and then in 8-byte increments up to 1024 bytes. Technically, this is a suballocator; it uses malloc and free to allocate and free larger blocks of memory, then carves those blocks up to satisfy the smaller requests. In our testing, this allocator wins by managing these small allocations more efficiently than the more general-purpose malloc/free.
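To make the size classes concrete, here is an illustrative sketch (not the allocator's actual code) of how the 129 fixed sizes can be enumerated:

long classSize(int i)      // i in [0, 128]: 129 size classes
{
   if (i == 0) return 4;   // the single 4-byte heap
   return 8L * i;          // 8, 16, 24, ..., up to 1024
}

Provided the chunks inside each block start on suitably aligned boundaries, this spacing also satisfies the alignment rule above: requests of 4 bytes or less need at most 4-byte alignment, and every larger class is a multiple of 8.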

A fixed block allocator, just like it sounds, is an allocator that can only allocate blocks of a single, fixed size. By only needing to deal with blocks of a given size, the amount of code and the complexity of the data structures needed to manage the memory are minimized; this maps directly to performance.

Allocating Memory

Take a look at the rtAllocator::alloc method:

void* rtAllocator::alloc(long ls);

The first thing that happens here is that you need to identify which of the 129 separate heaps is appropriate to satisfy the request. First, the size (ls) of the allocation is checked to see whether it is > 1024 bytes. If the request is for larger than 1024 bytes, the request is simply thrown to the general purpose allocator (malloc) because I wasn't concerned with allocations of this size. If the size of the request is <= 1024, you need to determine which of the 129 fixed size heaps should be used to satisfy the request. To do this quickly and efficiently, a lookup table of 1024 elements is used. This lookup table has been initialized with a number that indicates which of the 129 heaps should be used to satisfy the request. Look at the code for this:
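The initialization of that table isn't shown in the walkthrough, but assuming the size classes sketched above (heap 0 holding 4-byte chunks, heap N holding 8*N-byte chunks), a minimal sketch might look like the following. The helper name is hypothetical, and I size the table at 1025 entries so that index 1024 is valid; the real declaration may differ:

static int mBDIndexLookup[1025];

void initBDIndexLookup()     // hypothetical helper
{
   for (long size = 1; size <= 1024; size++)
   {
      if (size <= 4)
         mBDIndexLookup[size] = 0;                      // the 4-byte heap
      else
         mBDIndexLookup[size] = (int)((size + 7) / 8);  // round up to the next 8-byte class
   }
}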

void* rtAllocator::alloc(long ls)
{
   if (ls == 0) ls = 1;
   int bdIndex = -1;
   if (ls <= 1024) bdIndex = mBDIndexLookup[ls];

The first line handles the special case of attempting to allocate zero bytes; in this case, you simply treat it as an allocation of one byte by changing the size to one. In the next line, you initialize the index bdIndex to -1 and, if the allocation is within your target range (<= 1024), the lookup table is used with the size as the offset into the table to determine which of the 129 heaps to use.

If the allocation request was larger than 1024, the index, bdIndex, will be -1 and the request is simply passed through to the general purpose allocator, as in the following code:

if (bdIndex < 0)
{
   // Not handling blocks of this size throw to blockalloc
   INCALLOCCOUNTER(bdCount);
   return ALLOCJR_ALLOC(ls);
}
Note: The macro ALLOCJR_ALLOC is used to wrap malloc so that you can instrument and record allocation statistics. ALLOCJR_FREE is used for a similar purpose to wrap calls to the free function.
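The macros themselves aren't shown in the article; a plausible sketch, with a hypothetical byte counter, is simply a thin wrapper around malloc/free:

// Hypothetical sketch of the instrumentation wrappers
#ifdef ALLOCJR_STATS
   extern long gBlockAllocBytes;   // assumed statistics counter
   #define ALLOCJR_ALLOC(sz) (gBlockAllocBytes += (sz), malloc(sz))
#else
   #define ALLOCJR_ALLOC(sz) malloc(sz)
#endif
#define ALLOCJR_FREE(p) free(p)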

At this point in the code, you know you are handling the allocation (<=1024) and you know which of the 129 fixed-size heaps will be used to satisfy the request. So, the next thing to do is to check to see whether you already have a block with the necessary space to satisfy the request. Each heap maintains a double-linked list of free blocks (blocks that have room for at least one more allocation). If there are no free blocks, you allocate a block (using malloc) and link it into your linked list of free blocks with the following code:

if (!mFreeBlocks[bdIndex])
{
   INCBLOCKCOUNTER();
   block* b = (block*)ALLOCJR_ALLOC(
      block::getAllocSize(bd[bdIndex].fixedAllocSize,
                          bd[bdIndex].chunks));
   if (b)
   {
      b->init(bd[bdIndex].fixedAllocSize, bdIndex, bd[bdIndex].chunks);

      addBlockToArray(b);
      mFreeBlocks[bdIndex] = b;
   }
}

At this point, there should be at least one block available that can be used to satisfy the allocation request. The allocation request is then forwarded to block::alloc to allocate the memory from within the available free block. Each block has a number of chunks, with each chunk being large enough to satisfy one allocation request. In the block datastructure, a single-linked list of free chunks is maintained within the interior of the block.
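To make the code that follows easier to read, here is a rough sketch of the bookkeeping each block carries; the field names match those used in the snippets below, but the exact layout and any omitted members are assumptions on my part:

struct block
{
   long   mFixedAllocSize;   // size of every chunk in this block
   int    mBDIndex;          // which of the 129 heaps owns this block
   long   mAllocCount;       // number of chunks currently handed out
   void** mFreeChunk;        // head of the interior free-chunk list
   char*  mInitCursor;       // fence pointer: first never-used chunk
   block* mNextFreeBlock;    // links in the heap's double-linked list
   block* mPrevFreeBlock;    //   of blocks that still have free space
   // init(), getAllocSize(), alloc(), free(), isFull(), isEmpty(), ...
   // chunk storage follows the header
};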

To avoid the cost of setting up the linked list when a newly allocated block is initialized, a fence pointer (mInitCursor) is maintained that points at the first uninitialized chunk. When allocating memory, you look first at the linked list to see whether it contains any free chunks. If it does not, you check whether there are chunks left in the block that have yet to be included in the free list by examining the fence pointer. If there is room, you advance mInitCursor to the next chunk and use the chunk that it previously pointed to.

inline void* alloc()
{
   void* result;

   if (mFreeChunk)
   {
      result = mFreeChunk;
      mFreeChunk = (void**)*mFreeChunk;
   }
   else
   {
      result = mInitCursor;
      mInitCursor += mFixedAllocSize;
   }

   mAllocCount++;
   return result;
}

After you return from block::alloc, you need to see whether the block is completely full. This is done through a call to block::isFull. If it is full, you remove it from the double-linked free list so that you won't consider it when looking for space to satisfy future requests. When the block is removed from the free list, a sentinel value is assigned to the block's mNextFreeBlock pointer so that you can easily determine that the block is full. Examine the following code:

block *b = mFreeBlocks[bdIndex];

if (b->mNextFreeBlock != ALLOCJR_FULLBLOCK && b->isFull())
{
   // Unlink from freelist
   if (b->mNextFreeBlock)
   {
      b->mNextFreeBlock->mPrevFreeBlock = b->mPrevFreeBlock;
   }
   if (b->mPrevFreeBlock)
   {
      b->mPrevFreeBlock->mNextFreeBlock = b->mNextFreeBlock;
   }
   mFreeBlocks[bdIndex] = b->mNextFreeBlock;
   // special value means removed from free list
   b->mNextFreeBlock = ALLOCJR_FULLBLOCK; 
   b->mPrevFreeBlock = ALLOCJR_FULLBLOCK;
}
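isFull itself isn't listed in the article, but given the mAllocCount counter it is presumably trivial, as is the isEmpty check used later during freeing. A sketch (mChunkCount, the total number of chunks in the block, is an assumed name):

inline bool isFull() const  { return mAllocCount == mChunkCount; }
inline bool isEmpty() const { return mAllocCount == 0; }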

At this point, you have successfully allocated a chunk of the requested size. Now that you've walked through the process of allocating memory, I will walk you through the process of freeing it.

Freeing Memory

The process of freeing memory begins with a call to rtAllocator::free.

void rtAllocator::free(void* p)

The first thing this method does is check whether you've been passed a null pointer. If so, you simply return:

if (!p) return;

If the pointer is non-null, a check is made to see whether the memory is managed by you or was obtained from a pass-through to malloc. To do this, you conduct a binary search through the array of block pointers that you maintain to see whether it is one of yours. This is done through a call to the following function:

block* b = findBlockInArray(p);
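findBlockInArray isn't reproduced in the article. Because the block pointers are kept sorted by address, a sketch of the lookup could be a straightforward binary search for the block whose address range contains p (mBlockArray, mBlockCount, and totalSize() are assumed names):

block* rtAllocator::findBlockInArray(void* p)
{
   int lo = 0;
   int hi = mBlockCount - 1;
   while (lo <= hi)
   {
      int mid = (lo + hi) / 2;
      block* b = mBlockArray[mid];
      if (p < (void*)b)
         hi = mid - 1;                                    // look in the lower half
      else if (p >= (void*)((char*)b + b->totalSize()))
         lo = mid + 1;                                    // look in the upper half
      else
         return b;                                        // p lies inside this block
   }
   return NULL;                                           // not one of ours
}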

If the block is one of yours, b will be non-null; otherwise, you know that the pointer you are freeing is not yours and you pass it directly to free. If it is yours, a call to block::free is made to the block that was returned. In the block::free method, you simply return the chunk to the free list within the block so that the chunk can be reused.

inline void free(void* p)
{
   void **pp = (void**)p;
   *pp = mFreeChunk;
   mFreeChunk = (void**)p;
   mAllocCount--;
}

In the first line of this function, you coerce the pointer into a void**; this allows you to easily write the current linked-list head pointer to the front of the chunk, which is exactly what you do in the second line. In the third line, you complete inserting the chunk into the block's free list by setting the head pointer to point to the newly inserted chunk. The last thing you do in this function is decrement the counter that tracks how many allocated chunks are currently in the block. Upon returning from the call to block::free, you still have a couple more checks to do before you are done. First, you check whether the last call to block::free left the block empty. You do this as follows:

if (b->isEmpty())

If the block is now completely empty, you unlink it from the double-linked list of free blocks and return it to the system by calling free (via ALLOCJR_FREE) on the block, as follows:

if (b->isEmpty())
{
   // Unlink from freelist and return to the system
   if (b->mNextFreeBlock)
   {
      b->mNextFreeBlock->mPrevFreeBlock = b->mPrevFreeBlock;
   }
   if (b->mPrevFreeBlock)
   {
      b->mPrevFreeBlock->mNextFreeBlock = b->mNextFreeBlock;
   }
   if (mFreeBlocks[b->mBDIndex] == b) 
      mFreeBlocks[b->mBDIndex] = b->mNextFreeBlock;

   removeBlockFromArray(b);

   DECBLOCKCOUNTER();
   ALLOCJR_FREE(b);
}

If the block is not empty, you make sure that the block is included in the double-linked free block list because you now know that there is at least one free chunk available in the block. This check is made by comparing the mNextFreeBlock member against an invalid pointer constant, ALLOCJR_FULLBLOCK, which is assigned to mNextFreeBlock whenever the block becomes full. If this check succeeds, you link the block back into the list as follows:

// need to see if block is not in free list; if not, add it back
if (b->mNextFreeBlock == ALLOCJR_FULLBLOCK)
{
   b->mPrevFreeBlock = NULL;
   b->mNextFreeBlock = mFreeBlocks[b->mBDIndex];
   if (mFreeBlocks[b->mBDIndex])
      mFreeBlocks[b->mBDIndex]->mPrevFreeBlock = b;
   mFreeBlocks[b->mBDIndex] = b;
}

At this point, the memory has been reclaimed and can be reused.
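Putting it all together, using the allocator looks just like using malloc/free. A minimal usage sketch, assuming a global rtAllocator instance named gAllocator (the real project may expose it differently):

rtAllocator gAllocator;   // assumed global instance

void example()
{
   void* p = gAllocator.alloc(100);    // served from the 104-byte heap
   // ... use the memory ...
   gAllocator.free(p);                 // chunk returns to its block's free list

   void* big = gAllocator.alloc(4096); // > 1024 bytes: passed through to malloc
   gAllocator.free(big);               // recognized as not ours, passed to free
}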

Conclusion

In wrapping up, I just wanted to say that it was a lot of fun hacking on this allocator and I hope that you enjoy reading through my description of how it works. Have fun looking at the code!

The author maintains a blog at http://www.liquidthought.com/ with more programming tips, snippets, and software.


