Speeding up Go with custom allocators

Summary: using a custom allocator I was able to speed up an allocation heavy program (binary-trees benchmark) ~4x.

Allocation are expensive.

It’s true for all programming languages.

Go has a very sophisticated garbage collector with extremely low pauses due to GC but it’s a non-moving, non-generational collector. Because of that allocations are relatively expensive and more expensive than Java’s moving, generational GC where an allocation is just bumping a pointer in nursery allocator.

This can be seen in binary-trees benchmark which has almost no computation but millions of allocations of small objects. As a result, Java implementation is about 7x faster.

I was able to speed up Go code by about 4x by using a custom allocator.

💡

Note that in engineering there is no free lunch, just tradeoffs. Java’s GC will pay for cheap allocations later in the execution when it has to copy memory from nursery (young generation) to old generation.

The benchmark builds a large binary tree composed of nodes:

type struct Node {
 int item
 left, right *Node
}

To allocate a new node we use &Node{item, left, right}.

Improving allocation-heavy code

When I said that allocation is expensive, I over-simplified.

In garbage-collected languages allocation is actually very cheap. In a good allocator it’s just a single arithmetic operation to bump a pointer, which is orders of magnitude cheaper than even the best malloc() implementation. Malloc() has to maintain data structures to keep track of allocated memory so that free() can return it back to the OS.

More complicated reality is that it’s garbage collection phase that is expensive.

The runtime runs garbage collection periodically to free unused memory.

Starting from known roots (global variables, variables on stack) it recursively scans all the allocated objects and chasing pointers to other objects. It marks all reachable objects and the rest is not referenced by any other object and can be freed. This is the “garbage” in garbage collection.

There are 2 insights we get from knowing how garbage collection works internally:

the more objects there are, the more expensive garbage collection is
the more pointers we need to chase, the more expensive GC phase is

Knowing what the problem is, we know what the solution should be:

allocate less objects (e.g. by allocating them in bulk or re-using previously allocated objects)
don’t use pointers

As it happens, the majority of the 4x speedup I got in this particular benchmark came from not using pointers

Speeding binary-trees shootout benchmark

We said that we should avoid pointers, so that garbage collector doesn’t have to chase them. The new definition of Node struct is:

type NodeId int

type struct Node {
 int item
 left, right NodeId
}

We changed left and right fields from *Node to an alias type NodeId, which is just a unique integer representing a node.

What NodeId means is up to us to define and we define it thusly: it’s an index into a []Node array. That array is the backing store (i.e. allocator) for our nodes.

When we need to allocate another node, we expand the []Node array and return the index to that node. We can trivially map NodeId to *Node by doing &array[nodeId].

Our implementation is a bit more sophisticated. In Go it’s easy to extend the array with append() but it involves memory copy. We avoid that by pre-allocating nodes in buckets and using an array of arrays for storage. The code is still relatively simple:

const nodes_per_bucket = 1024 * 1024

var (
  all_nodes [][]Node = make([][]Node, 0)
  nodes_left int = 0
  curr_node_id int = 0
)

func NodeFromId(id NodeId) *Node {
  n := int(id) - 1
  bucket := n / nodes_per_bucket
  el := n % nodes_per_bucket
  return &all_nodes[bucket][el]
}

func allocNode(item int, left, right NodeId) NodeId {
  if 0 == nodes_left {
    new_nodes := make([]Node, nodes_per_bucket, nodes_per_bucket)
    all_nodes = append(all_nodes, new_nodes)
    nodes_left = nodes_per_bucket
  }
  nodes_left -= 1
  node := NodeFromId(NodeId(curr_node_id + 1))
  node.item = item
  node.left = left
  node.right = right

  curr_node_id += 1
  return NodeId(curr_node_id)
}

Remaining changes to the code involve adding NodeFromId() call in a few places.

You can compare original to my faster version.

Another minor advantage if using integers instead of pointers in Node struct is that on 64-bit machines we use only 4 bytes for an int vs. 8 bytes for a pointer.

Drawbacks of custom allocators

The biggest drawback is that we lost ability to free objects. Memory we’ve allocated will never be returned to the OS until the program exits.

It’s not a problem in this case, since the tree only grows and the program ends when it’s done.

In different code this could be a much bigger issue. There are ways to free memory even with custom allocators but they require more complexity and evolve towards implementing a custom garbage collector at which point it might be better to go back to simple code and leave the work to built-in garbage collector.

How come Java is so much faster?

It’s a valid question: both Java and Go have garbage collectors, why is Java’s so much better on this benchmark?

I can only speculate.

Java, unlike Go, uses generational garbage collector, which has 2 arenas: one for young (newly allocated) objects (called nursery) and one for old objects.

It has been observed that most objects die young. Generational garbage collector allocates objects in nursery. Most collections only collect objects in nursery. Objects that survived collections in nursery are moved to the second arena for old objects, which is collected at a much lower rate.

Go collector is a simpler mark-and-sweep collector which has to scan all allocated objects during every collection.

Generational garbage collectors have overhead because they have to copy objects in memory and update references between objects when that happens. On the other hand they can also compact memory, improving caching and they scan a smaller number of objects during each collection.

In this particular benchmark there are many Node objects and they never die, so they are promoted to rarely collected arena for old objects and each collection is cheaper because it only looks at a small number of recently allocated Node objects.

There’s hope for Go, though. The implementors are aware that garbage collector is not as good as it could be and there is an ongoing work on implementing a better one.

A win in C++ as well

Optimizing by reducing the amount of allocations or making allocations faster is applicable to non-gc languages as well, like C and C++, because malloc()\free() are relatively slow functions.

Back in the day when I was working on Poppler, I achieved a significant ~19% speedup by improving a string class to avoid an additional allocation in 90% of the cases. I now use this trick in my C++ code e.g. in SumatraPDF code

I also managed to improve Poppler by another ~25% by using a simple, custom allocator

It’s a good trick to know.

Edna	speedy note taking app with super powers
SumatraPDF	small, fast, free PDF / ePub / comic book reader for Windows