Dynamic memory allocation is a funny thing. Most people call malloc/free without thinking about the overhead associated with it. When it comes to heap-based memory allocation, managing blocks of memory so that memory can be reclaimed and reused entails a significant amount of bookkeeping and this bookkeeping costs CPU cycles. The first rule when trying to write fast chunks of code: Touch the allocator as little as possible. But, malloc/free are there for a reason; it’s a tool like any other and just has to be used appropriately. But, can you make it cheaper to use?
A few guys and myself, in an astonishing display of what too much testosterone can drive grown men to do, took up the small block challenge. The goal was to write the fastest, most “efficient” small block allocator and the chance to obtain everlasting bragging rights over the others. So, okay… it wasn’t just macho posturing; the reason it came up at all was that it turns out that a UI kit that we were all working on allocated an astounding amount of memory at an even more astounding rate and much of it consisted of very small memory requests under 1024 bytes. We just had some gentlemanly disagreements on who was bigger… err… I mean who could write a better allocator. Hence the contest, and its ground rules:
- Two functions with the following signatures: void* alloc(long size); void free(void*p);
- Must support > 1024 allocation (speed and efficiency testing are on allocations <= 1024)
- Must be “naturally” aligned to the next power of 2, to a maximum of an 8-byte alignment
- Must not crash on NULL free
- Must support allocation of 0 (return a valid ptr)
- Must call blockalloc()/blockfree() (essentially malloc and free) for “underlying” pool allocations
The major components of the scoring were speed and efficiency, with the lack of efficiency being measured by how much waste that you have during the benchmark.
efficiency = (amount of memory requested of your allocator) / (amount of memory that you have requested from blockalloc)
Given efficiency, the score was essentially calculated as follows:
score = (time in ms) / (efficiency * efficiency * efficiency)
The lowest score wins. From this, you can see that there is a very high penalty for not being as ‘efficient’ as possible. In our tests for performance and efficiency, it shows my allocator beating Visual C’s malloc/free by 25 times on speed and more than 13% on efficiency.
No claim is made to establish this as the fastest possible small block allocator, although this implementation is pretty fast. I have lots of ideas on how it could be made faster and I’m sure there are implementations out there that blow this away. But, it is pretty interesting how easy it was to beat Microsoft’s out-of-the-box implementation and it serves as a good starting point for those of you out there are interested in taking it further.
A good axiom is to make things as simple as possible and only introduce complexity as the need arises. My first couple of allocators were very complicated beasts, mostly because my focus was on the wrong problems (for example, minimizing block headers sizes and so forth…). In the end, my allocator became extremely simple, essentially a fixed block allocator that managed 129 separate heaps with each heap managing a specific fixed allocation size starting at 4 bytes and then 8 bytes and then in 8-byte increments up to 1024 bytes. Technically, this is a suballocator; it uses malloc and free to allocate/free larger blocks of memory. This allocator manages these memory blocks and uses them to allocate smaller blocks of memory. In our testing, this allocator wins by managing these smaller allocations more efficiently than the more general purpose malloc/free.
A fixed block allocator, just like it sounds, is an allocator than can only allocate blocks of a fixed or given size. By only needing to deal with blocks of a given size, the amount of code and the complexity of the datastructures needed to manage the memory is minimized; this maps directly to performance.
Take a look at the rtAllocator::alloc method:
void* rtAllocator::alloc(long ls);
The first thing that happens here is that you need to identify which of the 129 separate heaps is appropriate to satisfy the request. First, the size (ls) of the allocation is checked to see whether it is > 1024 bytes. If the request is for larger than 1024 bytes, the request is simply thrown to the general purpose allocator (malloc) because I wasn’t concerned with allocations of this size. If the size of the request is <= 1024, you need to determine which of the 129 fixed size heaps should be used to satisfy the request. To do this quickly and efficiently, a lookup table of 1024 elements is used. This lookup table has been initialized with a number that indicates which of the 129 heaps should be used to satisfy the request. Look at the code for this:
void* rtAllocator::alloc(long ls)
if (ls == 0) ls = 1;
int bdIndex = -1;
if (ls <= 1024) bdIndex = mBDIndexLookup[ls];
The first line handles the special case of attempting to allocate zero bytes; in this case, you simply treat it as an allocation of one byte by changing the size to one. In the next line, you initialize the index bdIndex to -1 and, if the allocation is within your target range (<= 1024), the lookup table is used with the size as the offset into the table to determine which of the 129 heaps to use.
If the allocation request was larger than 1024, the index, bdIndex, will be -1 and the request is simply passed through to the general purpose allocator, as in the following code:
if (bdIndex < 0)
// Not handling blocks of this size throw to blockalloc
Note: The macro ALLOCJR_ALLOC is used to wrap malloc so that you can instrument and record allocation statistics. ALLOCJR_FREE is used for a similar purpose to wrap calls to the free function.
At this point in the code, you know you are handling the allocation (<=1024) and you know which of the 129 fixed-size heaps will be used to satisfy the request. So, the next thing to do is to check to see whether you already have a block with the necessary space to satisfy the request. Each heap maintains a double-linked list of free blocks (blocks that have room for at least one more allocation). If there are no free blocks, you allocate a block (using malloc) and link it into your linked list of free blocks with the following code:
block* b = (block*)ALLOCJR_ALLOC(
b->init(bd[bdIndex].fixedAllocSize, bdIndex, bd[bdIndex].chunks);
mFreeBlocks[bdIndex] = b;
At this point, there should be at least one block available that can be used to satisfy the allocation request. The allocation request is then forwarded to block::alloc to allocate the memory from within the available free block. Each block has a number of chunks, with each chunk being large enough to satisfy one allocation request. In the block datastructure, a single-linked list of free chunks is maintained within the interior of the block.
To avoid the cost of setting up the linked list when you initialize a newly allocated block, a fence pointer is maintained and points at the first uninitialized chunk. When allocating memory, you look first to the linked list to see whether the list contains any free chunks. If it does not, you see whether there are chunks left in the block that have yet to be included in the list of free chunks by examining the fence pointer. If there is room, you increment mInitCursor to the next chunk and utilize the chunk that was previously pointed to by mInitCursor.
inline void* alloc()
result = mFreeChunk;
mFreeChunk = (void**)*mFreeChunk;
result = mInitCursor;
mInitCursor += mFixedAllocSize;
After you return from block::alloc, you need to see whether the block is completely full. This is done through a call to block::isFull. If it is full, you remove it from the double-linked free list so that you won’t consider it when looking for space to satisfy future requests. When the block is removed from the free list, a sentinel value is assigned to the blocks mNextFreeBlock pointer so that you can easily determine that the block is full. Examine the following code:
block *b = mFreeBlocks[bdIndex];
if (b->mNextFreeBlock != ALLOCJR_FULLBLOCK && b->isFull())
// Unlink from freelist
b->mNextFreeBlock->mPrevFreeBlock = b->mPrevFreeBlock;
b->mPrevFreeBlock->mNextFreeBlock = b->mNextFreeBlock;
mFreeBlocks[bdIndex] = b->mNextFreeBlock;
// special value means removed from free list
b->mNextFreeBlock = ALLOCJR_FULLBLOCK;
b->mPrevFreeBlock = ALLOCJR_FULLBLOCK;
At this point, you have successfully allocated a block of the requested size. Now that you’ve walked through the process of allocating memory, I will now walk you through the process of freeing memory.