I’ve decided to learn how the Go runtime allocates memory, and this is the first in a series of posts covering what I learned (and also the process itself: how I learned it). These posts are based on my notes for the talk I gave at GopherCon UK in 2018:

Slides are available here.

In this first post we are going to cover:

  1. Memory allocation 101
  2. How the allocator is invoked
  3. An overview of the tcmalloc algorithm
  4. An overview of the Go runtime allocator

The objective of this post is to give a broad idea of how allocation works in the runtime, without actually showing any of its code. The following posts in the series will be denser, with lots of code samples extracted from the runtime. Each of the future posts will cover a specific detail of the allocator.

Disclaimer: The code samples displayed in this series of posts were redacted to simplify the reading. The full code is available in the golang/go repository. This post is based on the 1.10 branch. In the last post of the series, we are going to cover the main changes in Go 1.11.

Disclaimer 2: I’m not a Go runtime expert, just someone curious and interested in these kinds of things. Feel free to reach out and correct me.

Disclaimer 3: Most of the content in these posts is implementation detail and is not defined in the Go Specification. There are no guarantees that any of it will stay the same, so approach all this with caution. For instance, the Go Spec says nothing about heap or stack.

0. Memory allocation 101

Before reading this post, I recommend reading my previous post on memory allocation. That post covers the basics needed to understand what we are going to cover in this series.

1. How the allocator gets invoked

In languages like C, it’s pretty obvious when the memory allocator is invoked: it happens when you call malloc (or one of its variations) and free.

We know that Go has no malloc function, and you are also not able to explicitly free memory, because the language is garbage collected. So, one may wonder how the memory allocation code gets invoked in our Go programs.

Imagine we have the following Go code:

package main

func main() {
	f()
}

//go:noinline
func f() *int {
	i := 10
	return &i
}

The f() function is marked //go:noinline to prevent the compiler from inlining the call to f(), which would optimize the function call away and make the example harder to follow.

The compiler uses a technique called escape analysis to choose whether a variable is going to be placed on the heap or on the stack. We are interested in heap allocations, and that is exactly where i is allocated. We can see that by running go build passing -gcflags "-m -m":

The -gcflags flag is a way to pass flags to the Go compiler during the build, and the "-m" flag tells the compiler to output optimization decisions (passing it twice increases verbosity). One can also use "-m=2".

$ go build -gcflags "-m -m" main.go
# command-line-arguments
./main.go:8:6: cannot inline f: marked go:noinline
./main.go:3:6: cannot inline main: non-leaf function
./main.go:10:9: &i escapes to heap
./main.go:10:9: 	from ~r0 (return) at ./main.go:10:2
./main.go:9:2: moved to heap: i

Imagine if i was allocated on the stack: &i would then be the address of a memory area in f()’s stack frame, and when f() returns, its stack frame becomes invalid memory, so main() can’t access any of it. Since f() uses pointer semantics to return the address of i, the compiler figures out that i should live on the heap, as it must be allocated in a memory area that remains valid even after f() returns.
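
For contrast, consider a hypothetical variant that returns i by value. Since main() gets its own copy and no address leaves the function, the compiler can keep i on the stack:

//go:noinline
func g() int {
	i := 10
	return i // returned by value: i can stay on g()'s stack frame
}

Building this version with the same -gcflags "-m -m" should print no "moved to heap" line for i.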

I recommend reading Language Mechanics On Escape Analysis to know more about escape analysis.

So, we know that i is going to be allocated on the heap. But how does the runtime set that up? With the compiler’s help! We can get an idea from reading the generated assembly.

The following output was redacted for brevity, showing only the relevant parts. Some resources on Go assembly are linked at the end of the post.

$ go tool compile -S main.go
...
0x001d 00029 (main.go:9)	LEAQ	type.int(SB), AX
0x0024 00036 (main.go:9)	MOVQ	AX, (SP)
0x0028 00040 (main.go:9)	PCDATA	$0, $0
0x0028 00040 (main.go:9)	CALL	runtime.newobject(SB)
...

This shows a call to runtime.newobject passing the address of the type int.

func newobject(typ *_type) unsafe.Pointer {
	return mallocgc(typ.size, typ, true)
}

mallocgc is the entry point to the runtime memory allocation algorithm. It receives the size, the type and whether or not the memory should be returned zeroed (true in this case).

There are other places where mallocgc gets called (e.g. runtime.makeslice). We are going to cover mallocgc in detail in the next posts; all we need to know for now is that it does the heap allocation work and is an implementation of the algorithm we are going to cover at a high level now.
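
As a quick illustration of one of those other paths, here is a hypothetical function whose make call compiles into a call to runtime.makeslice, which in turn calls mallocgc:

//go:noinline
func g(n int) []byte {
	// make ends up calling runtime.makeslice, which calls mallocgc
	return make([]byte, n)
}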

2. Thread Cache Malloc (tcmalloc)

As stated in the comments in the code:

The Go runtime allocator was initially based on tcmalloc but has diverged quite a bit.

The tcmalloc page, also linked in the code comments, has a great overview of how the algorithm works.

I’ll briefly summarize tcmalloc in this section. I recommend reading the referred page to get into more details, and also checking some of the benchmarks comparing tcmalloc with glibc malloc.

We can divide allocations into 2 buckets: small and large allocations. Most of the allocations are served from local thread caches, meaning that no locks are required, reducing contention and increasing the overall performance.

2.1 Small allocations

Small allocations are allocations with size <= 32kB.

  1. The requested size is rounded up to one of the 170 size classes; e.g., allocations in the range 961 to 1024 bytes are rounded up to 1024 (a sketch of this rounding follows the list).
  2. tcmalloc looks at the local thread cache to see if there are free objects available for this size class:
    1. If there are free objects, it returns the first one from the list and we are all done. This allocation required no locks, as it was served entirely from the local thread cache.
    2. If there are no free objects, fetch new objects from the central cache.
  3. The central cache checks if it has free objects:
    1. If there are free objects, return a bunch to the local cache;
    2. If there are no free objects, request a chunk of memory pages from the central heap, split it into objects for this size class and return them to the local cache.
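
A rough Go sketch of that first rounding step; the class table below is illustrative, not tcmalloc’s actual one:

// classSizes is an illustrative size class table, sorted in
// increasing order (tcmalloc's real table has about 170 entries).
var classSizes = []int{8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768}

// roundToClass returns the smallest class that fits size, or -1 for
// sizes that must be treated as large allocations.
func roundToClass(size int) int {
	for _, c := range classSizes {
		if size <= c {
			return c
		}
	}
	return -1
}

For instance, roundToClass(961) returns 1024, matching the range mentioned above.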

2.2 Large allocations

Large allocations are served directly by the central heap. They are rounded up to the page size (4K), and a run of contiguous pages is called a Span.

The central heap maintains an array of free lists. For k < 256, the kth entry is a free list of Spans that consist of k pages. The 256th entry is a free list of Spans that have length >= 256 pages.

  1. The central heap searches for a Span on the k-th free list:
    1. If one is found, return it to the user;
    2. If one is not found, look for it on the (k+1)-th free list, and so on…
  2. If a free Span is not found in any of the free lists, allocate a large chunk of memory from the OS.
  3. If the Span found, or the memory requested from the OS, is larger than the user requested, split it into multiple Spans and place them on the free lists before returning to the user (a Go sketch of this search follows the list).
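
The sketch below illustrates that search-and-split flow; the types and names are made up for this post, not tcmalloc’s:

type span struct {
	pages int
}

// freeLists[k] holds spans of exactly k pages. The special list for
// spans of >= 256 pages is elided from this sketch.
var freeLists [256][]*span

func allocSpan(k int) *span {
	for i := k; i < len(freeLists); i++ {
		if n := len(freeLists[i]); n > 0 {
			s := freeLists[i][n-1]
			freeLists[i] = freeLists[i][:n-1]
			if s.pages > k {
				// split off the leftover pages into a new span and
				// place it back on the corresponding free list
				rest := &span{pages: s.pages - k}
				freeLists[rest.pages] = append(freeLists[rest.pages], rest)
				s.pages = k
			}
			return s
		}
	}
	// nothing free: a real allocator would request a large chunk of
	// memory from the OS here and retry
	return nil
}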

3. Overview of the Go Runtime Allocator

3.1 Allocations

We can divide Go heap allocations into 3 groups (as opposed to the 2 groups in tcmalloc). Each of them is going to be covered in a separate post in this series.

  • Small allocations (<= 32 kB)

Small allocations in Go are pretty similar to tcmalloc’s. A request flows from the most local to the most central structure:

mcache -> mcentral -> mheap

  • Tiny Allocations (< 16 B and no pointers)

These follow the same path as small allocations, but are stored, managed and freed as 16 B memory blocks by the mcache.

  • Large Allocations (> 32 kB)

These are served directly by the mheap, first by checking its free lists and asking for more memory from the OS when needed.

3.2 Data structures

In this section we are going to get a high level view of how each of the mentioned data structures operate.

The following diagram displays the main components that are going to be described later:

      +------------+ +------------------------+
+---+ |            | |        mheap           |
| P | |   mcache   | |                        |
+---+ |            | |  +----------+          |
      +------------+ |  | mcentral |          |
                     |  +----------+          |
                     |  +----------+          |
                     |  | mcentral |          |
                     |  +----------+          |
                     |       .                |
                     |       .                |
                     |       .                |
      +------------+ |  +----------+          |
+---+ |            | |  | mcentral |          |
| P | |   mcache   | |  +----------+          |
+---+ |            | |                        |
      +------------+ +------------------------+

We can see that each logical processor (P) has its own mcache, which is the local thread cache. There is a single mheap, which plays the part of the central heap. Surprisingly, there are multiple mcentral objects: one for each of the size classes. This means that even when requesting a span from an mcentral we can reduce contention by only acquiring the lock for the requested size class. Goroutines on other processors that also need to allocate from an mcentral, but are requesting objects of a different size class, won’t find that mcentral locked.

Thread Cache (mcache)

The thread cache is implemented by the mcache structure. There is one mcache for each P (logical processor).

        +--------------------+                          
        |       mcache       |                          
        |                    |                          
        |          +-------+ |                          
        |     8 B  | mspan | |                          
        |          +-------+ |                          
        |          +-------+ |                          
        |    16 B  | mspan | |                          
        |          +-------+ |                          
        |              .     |                          
        |              .     |                          
        |              .     |                          
        |          +-------+ |                          
        |   32 kB  | mspan | |                          
        |          +-------+ |                          
        |                    |                          
        |   +-------------+  |                          
        |   | tiny block  |  |                          
        |   +-------------+  |                          
        +--------------------+           

The mcache maintains an mspan of each size class (for objects from 8 bytes up to 32 kB). Each of these spans is able to allocate multiple objects of that size. If the span for a given class is full, the mcache requests a new one from the corresponding mcentral structure for that size class.

Since it is a per-P structure, there is no need to hold locks when allocating from the mcache, which is an interesting optimization.
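
A minimal sketch of that fast path, with made-up types (the real mcache and mspan are far more involved):

type spanSketch struct {
	free []uintptr // addresses of the free object slots in the span
}

type mcacheSketch struct {
	spans map[int]*spanSketch // one span per size class
}

// refillFromMcentral is a stub: the real code locks the mcentral for
// this size class and hands back a span that has free slots.
func refillFromMcentral(class int) *spanSketch {
	return &spanSketch{free: []uintptr{0x1000, 0x2000}}
}

func (c *mcacheSketch) alloc(class int) uintptr {
	s := c.spans[class]
	if s == nil || len(s.free) == 0 {
		// the span for this class is missing or full: ask the
		// mcentral for a fresh one
		s = refillFromMcentral(class)
		c.spans[class] = s
	}
	// pop a free slot: no locks needed, the mcache is per-P
	p := s.free[len(s.free)-1]
	s.free = s.free[:len(s.free)-1]
	return p
}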

The mcache also maintains a block that is used for tiny allocations (a concept that is not in tcmalloc) of objects that contain no pointers. We are going to cover how this works in a post about tiny allocation in Go.

Central free list (mcentral)

The central free list is implemented by mcentral. There is one mcentral for each span size class, and they are stored inside the mheap struct, which we will cover soon.

+-------------------------+                        
|        mcentral         |                        
|                         |                        
|            +----------+ |    +-----+      +-----+
|     empty  |mSpanList | |--->|mspan|<---->|mspan|
|            +----------+ |    +-----+      +-----+
|                         |                        
|            +----------+ |    +-----+      +-----+
|   nonempty |mSpanList | |--->|mspan|<---->|mspan|
|            +----------+ |    +-----+      +-----+
|       +-------+         |                        
|       | lock  |         |                        
|       +-------+         |                        
+-------------------------+                         

Each mcentral holds a lock that needs to be acquired when allocating from it, since mcentral is shared by all Ps.

It maintains two doubly linked lists (of type mSpanList): one for spans that contain free objects, and one for spans that don’t contain any free objects or that are already in an mcache.

When an mcache requests a new span, the mcentral traverses both lists looking for spans that are available and can be handed to the mcache.

If there is no span available, the mcentral requests a new span from the central heap (mheap) and returns it to the mcache.

Central Heap (mheap)

The central heap is implemented by the mheap struct. Large objects (> 32 kB) are allocated directly from it. Since this structure is shared by all threads, allocating from it incurs locking. This is the largest structure in the allocator, so we are going to break it down.

+----------------------------------------------------+  
|                        mheap                       |  
|                                                    |  
|+-----------------------+ +-----------------------+ |  
||         free          | |         busy          | |  
||           +---------+ | |           +---------+ | |  
||   1 page  |mSpanList| | |   1 page  |mSpanList| | |  
||           +---------+ | |           +---------+ | |  
||          .            | |          .            | |  
||          .            | |          .            | |  
||          .            | |          .            | |  
||           +---------+ | |           +---------+ | |  
|| 127 pages |mSpanList| | | 127 pages |mSpanList| | |  
||           +---------+ | |           +---------+ | |  
||                       | |                       | |  
||           +---------+ | |           +---------+ | |  
||128+ pages | mTreap  | | |128+ pages |mSpanList| | |  
||           +---------+ | |           +---------+ | |  
|+-----------------------+ +-----------------------+ |  
|                     +---------+                    |  
|                     |  lock   |                    |  
|                     +---------+                    |  
+----------------------------------------------------+  

The mheap, as in tcmalloc, maintains an array of free linked lists. The i-th item in the array contains runs of pages that consist of i pages.

The mheap also maintains an mTreap with runs of pages that are larger than 127 pages. This used to be a linked list, but it was changed in this commit due to performance on large heaps.

A treap is a probabilistically balanced binary tree that offers O(log n) behavior for inserting and removing spans. The tree is kept sorted by span size, with the span’s base address as a secondary sorting key.
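
In sketch form, that ordering could be expressed like this (npages and startAddr mirror real mspan fields, covered later in this post; the function itself is an illustration, not the runtime’s code):

// spanLess compares spans first by size and then by base address.
func spanLess(a, b *mspan) bool {
	if a.npages != b.npages {
		return a.npages < b.npages
	}
	return a.startAddr < b.startAddr
}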

When asked for n pages of memory, the mheap traverses its free linked lists looking for a span of n or more pages. If there are no free spans, the mheap asks the OS for more memory. On Linux, it does so via the mmap(2) syscall, asking for at least 1 MB. Contiguous free spans are subject to coalescing and are placed on the corresponding free lists.

The mheap maintains a lookup table from virtual page IDs to spans. An area large enough to map every page that can be allocated is reserved upon initialization. This lookup table can be used to find the span responsible for a given page, and it is the “central array” described on tcmalloc’s page.

If the span is allocated, all pages map to the span itself.

       +--------|--------|--------|--------+
 spans | page 1 | page 2 | page 3 | page 4 |
       +---|----|------|-|--|-----|----|---+
           |           |    |          |    
           |           |    |          |    
           |           |    |          |    
           |           |    |          |    
           |           |    |          |    
           |           |    |          |    
           |         +-v----v---+      |    
           +--------->  mspan   <------+    
                     +----------+           

If the span is not allocated, its first and last pages point to the span. The rest point to an arbitrary span.

       +--------|--------|--------|--------+
 spans | page 1 | page 2 | page 3 | page 4 |
       +---|----|--------|--------|----|---+
           |                           |    
           |                           |    
           |                           |    
           |                           |    
           |                           |    
           |                           |    
           |         +----------+      |    
           +--------->  mspan   <------+    
                     +----------+           

Pages that have never been allocated are nil entries in this lookup table.
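
In sketch form, the lookup boils down to indexing the table by page number. This is simplified from the runtime: spans, arenaStart and pageSize stand in for the mheap’s actual fields, and bounds checks are elided:

// spanOf returns the span responsible for the page containing p.
func spanOf(p uintptr) *mspan {
	return spans[(p-arenaStart)/pageSize]
}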

The mheap also maintains:

  1. The arena, which is the range of addresses used for the Go heap. This is used by the mheap to know when to request more memory from the OS and at what virtual address the new mapping needs to start;
  2. A bitmap used by the garbage collector to know what is still live and can’t be collected;
  3. A bunch of fixed-size allocators that are used to manually allocate the structures used internally by the allocator.

Spans (mspan)

A span is of type mspan and represents a run of contiguous pages managed by the mheap. This is the object that travels between the mheap and the caches (mcentral and mcache).

Every span has a startAddr field which, as the name says, holds the address of its first byte, and an npages field, which holds the number of pages the span consists of.

For small allocations, meaning the span belongs to one of the given size classes, we can imagine the span as a container for nelems objects of that size. The mcache checks whether the span still has room for another object of that size.

For large allocations (> 32 kB), the span holds a single object, and a pointer to its base address (startAddr) is returned by mallocgc.
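
Putting together the fields mentioned in this section, a trimmed-down view of the struct could look like this (the real mspan has many more fields):

type mspan struct {
	startAddr uintptr // address of the first byte of the span
	npages    uintptr // number of pages the span holds
	nelems    uintptr // number of objects of the size class that fit
	// ... plus size class info, allocation bitmaps, list links, etc.
}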

Wrapping Up

I think this is enough for a first, introductory post. In the next post, we are going to start exploring the runtime code and see how it handles large object allocations.

References

  1. Go internals, Chapter 1: Go assembly
  2. Language Mechanics On Escape Analysis
  3. tcmalloc