Improving malloc() locality

October 30th, 2007

This is cool (if you care how malloc() works, always a big if), but it requires a lot of background.

Background

One very common way to implement malloc() is to use size classes and freelists. If you already know all about this, skip this section.

Here’s a sample malloc() that uses size classes and freelists. (Note that size classes have nothing to do with the object-oriented notion of a class.)

void * malloc(size_t nbytes)
{
    /* Say you want 24 bytes.  Well, there's an allocator that
     * specifically serves up 24-byte chunks of memory.  (Actually
     * if you ask for 21, 22, or 23 bytes, you'll probably get the
     * same allocator.) This is what I mean by size classes.
     */
    FixedSizeAllocator *allocator = getAllocatorForSize(nbytes);

    /* Each allocator maintains a linked list of free buffers.
     * This is what I mean by freelists.  Usually there's
     * something in the freelist, so allocation
     * is super-fast.  But occasionally it's empty.
     */
    if (allocator->freelist == NULL)
        return allocateNewBuf(allocator);  /* (the slow path) */

    /* The super-fast path.  Just pop the first item off the freelist. */
    void *item = allocator->freelist;
    allocator->freelist = LINKED_LIST_NEXT(item);
    return item;  /* man, that was fast! */
}

/* free() is super-fast, too. */
void free(void *buf)
{
    /* First get the allocator... */
    size_t bufsize = GET_SIZE_OF_BUF(buf);  /* (usually stashed nearby) */
    FixedSizeAllocator *allocator = getAllocatorForSize(bufsize);

    /* Then just push this buffer back onto the freelist.  Zoom zoom! */
    LINKED_LIST_NEXT(item) = allocator->freelist;
    allocator->freelist = item;
}

See how simple and fast that is? Even if all the freelists are initially empty, each time you free() memory, it goes on a freelist. So pretty soon, almost every call to malloc() is just popping a ready buffer off a linked list.

I’m glossing over allocateNewBuf() here, but it doesn’t have to be complicated. Basically it just asks the OS for pages of memory in chunks of, say, 4KB; and treats it as an array of smaller chunks, returning the next chunk in the array each time it’s called.

The locality problem

It all seems pretty awesome, right? It did to me. Then one day Brendan Eich mentioned to me that this approach can lead to fragmentation.

That surprised me. After all, the point of size classes is to avoid the fragmentation you get if you try to allocate buffers of all sizes from the same heap. Google for “best-fit” or “first-fit”—the results are all about heap fragmentation.

Now, in case you haven’t met Brendan, I should explain a bit. He talks fast and he says amazing things. His speech is full of metaphors you might need to think about for a while in a quiet room and examples you’ve never heard of. The average word Brendan uses is rarer than the average word you or I use. So Brendan’s output is ludicrously high-bandwidth, often impossible to keep up with; but at least you can Google what you don’t understand. Spending a few days puzzling over what Brendan meant by some five-word off-the-cuff remark, and ultimately decompressing it into a rewarding insight, is merely typical. At least, for me. No exaggeration.

In this case, I eventually found out Brendan was referring to a different, and potentially much worse kind of heap fragmentation than the kind I was familiar with. At program startup, ten calls to malloc(24) will typically return ten buffers that are close together in memory. Maybe even right next to each other, like an array. This is by design. It provides great cache locality and page locality. But as time goes by, the application free()s buffers in no particular order, so the freelists tend to get shuffled. Eventually, ten calls to malloc(24) might return pointers into ten different pages of otherwise unused memory: the worst possible locality.

This weekend I suddenly realized that MMgc’s fast malloc-like allocator, MMgc::FixedMalloc—a piece of code I read months ago—contains a clever workaround for this problem!

FixedMalloc uses per-page freelists. Each fixed-size allocator has a linked list of pages. Each page has a freelist for the buffers in that page. malloc() always consumes one page’s freelist completely before going to another page.

I’m sure this trick is well-known in the field, but what’s a blog for, if not advertising your own ignorance?

Mozilla 2 status post

October 23rd, 2007

I just posted to mozilla.dev.platform about Mozilla 2 platform changes. In short: Mercurial is coming, Tamarin is coming, and we’re going to give garbage collection a chance to prove itself.